How to find the size of a DataFrame in PySpark (in MB)?

Question


Ara*_*ndh 18 scala dataframe apache-spark pyspark databricks

How do I find the size of a DataFrame in PySpark (in MB)?

df = spark.read.json("/Filestore/tables/test.json") I want to know the size of df, or of test.json.

Answer 1

L C*_* Co 11

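A late answer, but since Google brought me here first, I thought I would add this answer based on the comment by user @hiryu.

This has been tested and works for me. It requires caching, so it is probably best kept to notebook development.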

# Need to cache the table (and force the cache to happen)
df.cache()
df.count() # force caching

# need to access hidden parameters from the `SparkSession` and `DataFrame`
catalyst_plan = df._jdf.queryExecution().logical()
size_bytes = spark._jsparkSession.sessionState().executePlan(catalyst_plan).optimizedPlan().stats().sizeInBytes()

# always try to remember to free cached data once finished
df.unpersist()

print("Total table size: ", convert_size_bytes(size_bytes))


You need to access the hidden _jdf and _jsparkSession attributes. Because the Python objects do not expose these attributes directly, IntelliSense will not show them.
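For reuse outside a single notebook cell, the steps above could be wrapped in a small helper. This is only a sketch: the name estimate_cached_size_bytes is hypothetical, it relies on the same underscore-prefixed internals (_jdf, _jsparkSession), which are not public API and may behave differently across Spark versions, and the try/finally is an addition so the cache is released even if the lookup fails.

```python
def estimate_cached_size_bytes(spark, df):
    """Return Catalyst's size estimate (in bytes) for df, caching it temporarily.

    Hypothetical helper: `spark` is a SparkSession and `df` a DataFrame.
    """
    df.cache()
    try:
        df.count()  # run an action so the cache is actually materialized
        # Same internal (non-public) calls as in the answer above
        plan = df._jdf.queryExecution().logical()
        stats = (
            spark._jsparkSession.sessionState()
            .executePlan(plan)
            .optimizedPlan()
            .stats()
        )
        return int(stats.sizeInBytes())
    finally:
        # Release the cached data even if one of the calls above raised
        df.unpersist()
```

Note that the helper always calls unpersist(), so it would also drop a cache the caller had set up on df beforehand; it is probably best used on DataFrames that are not already cached.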

Bonus:

My convert_size_bytes function looks like this:

def convert_size_bytes(size_bytes):
    """
    Converts a size in bytes to a human-readable string using binary (1024-based) units.
    """
    import math
    import sys

    if not isinstance(size_bytes, int):
        size_bytes = sys.getsizeof(size_bytes)

    if size_bytes == 0:
        return "0B"

    size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
    i = int(math.floor(math.log(size_bytes, 1024)))
    p = math.pow(1024, i)
    s = round(size_bytes / p, 2)
    return "%s %s" % (s, size_name[i])
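Since the question asks for the size in MB specifically, a fixed-unit conversion may be all that is needed. A minimal sketch, assuming 1 MB means 1024 * 1024 bytes (the same base the function above uses); the name bytes_to_mb and the ndigits parameter are hypothetical:

```python
def bytes_to_mb(size_bytes, ndigits=2):
    """Convert a byte count to megabytes, taking 1 MB = 1024 * 1024 bytes."""
    return round(size_bytes / (1024 * 1024), ndigits)


# Example: a 5 MiB byte count
print(bytes_to_mb(5 * 1024 * 1024))
```

If decimal megabytes are wanted instead, the divisor would be 1000 * 1000.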


Answer 2

Rap*_*oth 1

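In general, this is not easy. You can:

Use org.apache.spark.util.SizeEstimator
Use an approach that involves caching, see /sf/answers/3467031991/
Use df.inputFiles() together with another API to get the file sizes directly (I used the Hadoop FileSystem API for this; see "how to get file size"). Note that this only works if the DataFrame has not been filtered or aggregated.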
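The third option could look roughly like this in Python. This is a sketch under a strong assumption: the input files are on the local filesystem, so os.path.getsize can stand in for the Hadoop FileSystem API mentioned above. Files on HDFS, DBFS, or object storage would need that storage's own API instead. The function name total_input_size_mb is hypothetical.

```python
import os
from urllib.parse import unquote, urlparse


def total_input_size_mb(input_files):
    """Sum the on-disk size of local input files, in MB (1 MB = 1024 * 1024 bytes).

    `input_files` is a list of paths or file:// URIs, such as the list
    returned by df.inputFiles().
    """
    total_bytes = 0
    for uri in input_files:
        parsed = urlparse(uri)
        # Assumption: only plain paths and file:// URIs are handled here
        if parsed.scheme not in ("", "file"):
            raise ValueError("only local files are handled in this sketch: %s" % uri)
        total_bytes += os.path.getsize(unquote(parsed.path))
    return total_bytes / (1024 * 1024)
```

In PySpark versions that provide DataFrame.inputFiles(), the call would be total_input_size_mb(df.inputFiles()). The result is the size of the source files on disk, not of the DataFrame in memory, and it says nothing useful once the DataFrame has been filtered or aggregated.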

This is Scala. The question asks about Python (PySpark). (3 upvotes)
