I am trying to load about 1M rows from a PostgreSQL database into Spark. Using Spark it takes about 10 seconds, but loading the same query through the psycopg2 driver takes about 2 seconds. I am using PostgreSQL JDBC driver version 42.0.0.
def _loadFromPostGres(name):
    url_connect = "jdbc:postgresql:" + dbname
    properties = {"user": "postgres", "password": "postgres"}
    df = SparkSession.builder.getOrCreate().read.jdbc(
        url=url_connect, table=name, properties=properties)
    return df

df = _loadFromPostGres("""
    (SELECT "seriesId", "companyId", "userId", "score"
     FROM user_series_game
     WHERE "companyId"=655124304077004298) as user_series_game""")

print measure(lambda: len(df.collect()))
The output is:
--- 10.7214591503 seconds ---
1076131
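The `measure` helper used above is not shown in the post; a minimal timing wrapper along these lines (a hypothetical reconstruction, the original may differ) would produce the `--- N seconds ---` output format seen here:

```python
import time

def measure(fn):
    # Time a zero-argument callable, print the elapsed wall-clock time
    # in the "--- N seconds ---" format used above, and pass through
    # the callable's return value.
    start = time.time()
    result = fn()
    print("--- %s seconds ---" % (time.time() - start))
    return result
```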
Using psycopg2:
import psycopg2

conn = psycopg2.connect(conn_string)
cur = conn.cursor()

def _exec():
    cur.execute("""SELECT "seriesId", "companyId", "userId", "score"
                   FROM user_series_game
                   WHERE "companyId"=655124304077004298""")
    return cur.fetchall()

print measure(lambda: len(_exec()))
cur.close()
conn.close()
The output is:
--- 2.27961301804 seconds ---
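For context on the Spark side of the comparison: the JDBC data source exposes documented options such as `fetchsize` (rows pulled per round-trip) and `partitionColumn`/`lowerBound`/`upperBound`/`numPartitions` (parallel partitioned reads) that affect how rows are fetched. A small sketch assembling such an option map; the bound values and the choice of `userId` as partition column are illustrative placeholders, not from the original post:

```python
def jdbc_read_options(url, table, partition_column,
                      num_partitions, fetchsize=10000):
    # Build the option map understood by Spark's "jdbc" data source.
    # fetchsize controls how many rows each round-trip fetches;
    # the partition options split the scan across num_partitions tasks.
    return {
        "url": url,
        "dbtable": table,
        "fetchsize": str(fetchsize),
        "partitionColumn": partition_column,
        "lowerBound": "0",          # placeholder range bounds
        "upperBound": "1000000",
        "numPartitions": str(num_partitions),
    }
```

It would be used as `spark.read.format("jdbc").options(**opts).load()`, with `user` and `password` supplied alongside.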