Posted by gwy*_*842

PySpark DataFrame.show() runs slowly

New to Spark here. I read a table (about 2 million rows) from MySQL into a Spark DataFrame via JDBC in PySpark, and tried to show the first 10 rows:

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.master("local[4]").appName("test_log_processing").getOrCreate()
url = "jdbc:mysql://localhost:3306"
table = "test.fakelog"
properties = {"user": "myUser", "password": "********"}
df = spark_session.read.jdbc(url, table, properties=properties)
df.cache()
df.show(10)  # never prints any results; runs very slowly and consumes 90%+ CPU
spark_session.stop()
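For context on where the time can go: `spark_session.read.jdbc(url, table, properties=...)` opens a single JDBC connection, and `cache()` forces the whole ~2-million-row table to be pulled before `show(10)` can print anything. `DataFrameReader.jdbc` also accepts `column`, `lowerBound`, `upperBound`, and `numPartitions` to parallelize the scan, assuming the table has a numeric indexed column (here hypothetically called `id`; the question does not show the schema). The sketch below illustrates in plain Python roughly how such bounds get sliced into one WHERE clause per parallel connection; it mirrors the idea, not Spark's exact internals:

```python
# Hypothetical partitioned read (needs a numeric column, e.g. "id"):
# df = spark_session.read.jdbc(
#     url, table,
#     column="id", lowerBound=1, upperBound=2_000_000, numPartitions=4,
#     properties=properties,
# )

def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Roughly how a (lowerBound, upperBound, numPartitions) spec is
    turned into one WHERE clause per parallel JDBC connection.
    Assumes num_partitions >= 2; illustrative only."""
    stride = (upper - lower) // num_partitions
    bound = lower + stride
    preds = []
    for i in range(num_partitions):
        if i == 0:
            # First slice also picks up NULL keys.
            preds.append(f"{column} < {bound} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last slice is open-ended at the top.
            preds.append(f"{column} >= {bound - stride}")
        else:
            preds.append(f"{column} >= {bound - stride} AND {column} < {bound}")
        bound += stride
    return preds

for p in jdbc_partition_predicates("id", 1, 2_000_001, 4):
    print(p)
# prints:
# id < 500001 OR id IS NULL
# id >= 500001 AND id < 1000001
# id >= 1000001 AND id < 1500001
# id >= 1500001
```

Note that the bounds only control how the scan is split; rows outside [lowerBound, upperBound] are still read, just all by the edge partitions.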

Here is the console log:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[Stage …

Tags: python, apache-spark, pyspark

7 votes · 1 answer · 8029 views
