Posted by Abh*_*ole

Reading from a Postgres JDBC table into Spark is slow

I am trying to load about 1M rows from a PostgreSQL database into Spark. Reading them with Spark takes about 10 seconds, while the same query through the psycopg2 driver takes about 2 seconds. I am using PostgreSQL JDBC driver version 42.0.0:

from pyspark.sql import SparkSession

def _loadFromPostGres(name):
    # dbname is assumed to be defined elsewhere in the poster's script
    url_connect = "jdbc:postgresql:" + dbname
    properties = {"user": "postgres", "password": "postgres"}
    # A plain read.jdbc with no partitioning options runs as a single partition
    df = SparkSession.builder.getOrCreate().read.jdbc(
        url=url_connect, table=name, properties=properties)
    return df

df = _loadFromPostGres("""
    (SELECT "seriesId", "companyId", "userId", "score"
     FROM user_series_game
     WHERE "companyId" = 655124304077004298) AS user_series_game""")

# measure is the poster's timing helper; see the sketch below
print(measure(lambda: len(df.collect())))

The output is:

--- 10.7214591503 seconds ---
1076131
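The post never shows the `measure` helper. A minimal sketch consistent with the `--- N seconds ---` output format might look like this (the implementation is assumed, not the poster's):

def measure(fn):
    import time
    # Hypothetical timing helper inferred from the printed output format
    start = time.time()
    result = fn()
    print("--- %s seconds ---" % (time.time() - start))
    return result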

Using psycopg2:

import psycopg2

conn = psycopg2.connect(conn_string)  # conn_string is defined elsewhere
cur = conn.cursor()

def _exec():
    cur.execute("""SELECT "seriesId", "companyId", "userId", "score"
        FROM user_series_game
        WHERE "companyId" = 655124304077004298""")
    return cur.fetchall()

print(measure(lambda: len(_exec())))
cur.close()
conn.close()

The output is:

--- 2.27961301804 seconds --- …
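For context on the Spark side: `read.jdbc` as called above pulls the whole result through a single partition, so a common lever is to split the read across executors and raise the JDBC fetch size. A sketch, assuming `userId` is a numeric, roughly uniform column; the column choice and bounds below are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = """(SELECT "seriesId", "companyId", "userId", "score"
            FROM user_series_game
            WHERE "companyId" = 655124304077004298) AS user_series_game"""

# Partitioned JDBC read: Spark issues numPartitions queries in parallel,
# each covering a slice of `column` between lowerBound and upperBound.
df = spark.read.jdbc(
    url="jdbc:postgresql:" + dbname,  # dbname as in the post
    table=query,
    column="userId",       # assumed numeric and reasonably uniform
    lowerBound=0,          # hypothetical range bounds for userId
    upperBound=10000000,
    numPartitions=8,
    properties={
        "user": "postgres",
        "password": "postgres",
        "fetchsize": "10000",  # rows per JDBC round trip; the default is small
    },
)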

postgresql jdbc apache-spark pyspark spark-dataframe

6 votes · 1 answer · 4384 views