Making a histogram from a Spark DataFrame column

Question

use*_*014 6 python pandas apache-spark apache-spark-sql pyspark

I am trying to make a histogram from a column of a DataFrame that looks like

DataFrame[C0: int, C1: int, ...]

If I want a histogram of column C1, how should I go about it?

Some things I have tried are

df.groupBy("C1").count().histogram()
df.C1.countByValue()

Neither of these works, because of a mismatch in data types.

Answer 1

zer*_*323 12

You can use the histogram_numeric Hive UDAF:

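import random

from pyspark.sql import HiveContext

random.seed(323)

# `sc` is assumed to be an existing SparkContext (e.g. from the pyspark shell)
sqlContext = HiveContext(sc)
n = 3  # Number of buckets
df = sqlContext.createDataFrame(
    sc.parallelize(enumerate(random.random() for _ in range(1000))),
    ["id", "v"]
)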

hists = df.selectExpr("histogram_numeric({0}, {1})".format("v", n))

hists.show(1, False)
## +------------------------------------------------------------------------------------+
## |histogram_numeric(v,3)                                                              |
## +------------------------------------------------------------------------------------+
## |[[0.2124888140177466,415.0], [0.5918851340384337,330.0], [0.8890271451209697,255.0]]|
## +------------------------------------------------------------------------------------+

You can also extract the column of interest and use the histogram method of the RDD:

df.select("v").rdd.flatMap(lambda x: x).histogram(n)
## ([0.002028109534323752,
##  0.33410233677189705,
##  0.6661765640094703,
##  0.9982507912470436],
## [327, 326, 347])

It may be worth noting that `histogram_numeric` does not guarantee evenly spaced bins, which came as a surprise to me anyway. (3 upvotes)
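To make "evenly spaced" concrete: when given a bucket count, `RDD.histogram(n)` splits the range between the minimum and the maximum into `n` equal-width buckets. Below is a minimal plain-Python sketch of that bucketing, for illustration only. The name `equal_width_histogram` is hypothetical and this is not Spark's implementation.

```python
def equal_width_histogram(values, n):
    """Count values in n equal-width buckets spanning [min, max]."""
    values = list(values)
    lo, hi = min(values), max(values)
    if lo == hi:
        # Degenerate range: everything falls into a single bucket.
        return [lo, hi], [len(values)]
    width = (hi - lo) / n
    # n left edges plus the exact maximum as the final right edge.
    edges = [lo + i * width for i in range(n)] + [hi]
    counts = [0] * n
    for v in values:
        # The last bucket is closed on the right, so the maximum is clamped into it.
        idx = min(int((v - lo) / width), n - 1)
        counts[idx] += 1
    return edges, counts


print(equal_width_histogram(range(51), 2))  # ([0.0, 25.0, 50], [25, 26])
```

The return shape mirrors the `(boundaries, counts)` pair shown in the RDD output above: one more boundary than there are counts.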

Answer 2

lan*_*nok 11

What worked for me is

df.groupBy("C1").count().rdd.values().histogram()

I had to convert to an RDD because I found the histogram method in the pyspark.RDD class, but not in the spark.SQL module.

Does this approach let you set the bin size? (2 upvotes)
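On the bin-size question: as far as I know, `RDD.histogram` takes a `buckets` argument that can be either a bucket count or a sorted list of boundaries, so passing explicit boundaries such as `[0, 5, 25, 50]` fixes the bin widths. The plain-Python sketch below mimics that behaviour so it can be tried without a cluster. Each bucket includes its lower bound and excludes its upper bound, except the last one, which includes both. `histogram_with_edges` is a hypothetical name, not a Spark API.

```python
from bisect import bisect_right


def histogram_with_edges(values, edges):
    """Count values in buckets defined by a sorted list of boundaries."""
    if len(edges) < 2 or list(edges) != sorted(set(edges)):
        raise ValueError("edges must be sorted, unique, and have at least two entries")
    counts = [0] * (len(edges) - 1)
    for v in values:
        # In this sketch, values outside the overall range are ignored.
        if v < edges[0] or v > edges[-1]:
            continue
        # bisect_right finds the bucket; the top boundary is clamped into the last one.
        idx = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[idx] += 1
    return list(edges), counts


print(histogram_with_edges(range(51), [0, 5, 25, 50]))  # ([0, 5, 25, 50], [5, 20, 26])
```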

Answer 3

Bri*_*lie 8

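The pyspark_dist_explore package mentioned by @Chris van den Berg is quite nice. If you would rather not add another dependency, you can use the following code to plot a simple histogram.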

import matplotlib.pyplot as plt
# Show histogram of the 'C1' column
bins, counts = df.select('C1').rdd.flatMap(lambda x: x).histogram(20)

# This is a bit awkward but I believe this is the correct way to do it 
plt.hist(bins[:-1], bins=bins, weights=counts)
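
The `plt.hist` call works because each left edge stands in for a single sample, weighted by the count of its bucket. An alternative, sketched here on the assumption that `bins` and `counts` have the shape returned by `RDD.histogram` (one more boundary than counts), is to compute bar positions and widths explicitly. `bar_geometry` is a hypothetical helper.

```python
def bar_geometry(edges, counts):
    """Turn (edges, counts) into left edges, widths and heights for a bar chart."""
    if len(edges) != len(counts) + 1:
        raise ValueError("expected exactly one more edge than counts")
    lefts = list(edges[:-1])
    # Width of each bar is the distance between consecutive boundaries.
    widths = [right - left for left, right in zip(edges[:-1], edges[1:])]
    return lefts, widths, list(counts)


lefts, widths, heights = bar_geometry([0, 5, 25, 50], [5, 20, 26])
print(lefts, widths, heights)  # [0, 5, 25] [5, 20, 25] [5, 20, 26]

# With matplotlib available, the bars could then be drawn with:
#   plt.bar(lefts, heights, width=widths, align="edge")
```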

Answer 4

Ass*_*son 4

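Suppose the values in C1 are between 1 and 1000 and you want a histogram with 10 bins. You can do something like:

from pyspark.sql.functions import floor

df.withColumn("bins", floor(df.C1 / 100)).groupBy("bins").count()

The floor is needed because / on Spark columns gives a fractional result, so without it nearly every value would end up in its own group. If your binning is more complex, you can write a UDF for it (and at worst you may need to analyze the column first, e.g. with describe or some other method).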
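
The same fixed-width binning idea can be tried in plain Python, where `//` is floor division. This is only an illustration with made-up sample values, and `fixed_width_bins` is a hypothetical name. Note that with a width of 100 the value 1000 lands in an eleventh bin (index 10), so shift or clamp the index if exactly 10 bins are required.

```python
from collections import Counter


def fixed_width_bins(values, width):
    """Map each value to its bin index via floor division and count per bin."""
    return Counter(v // width for v in values)


sample = [1, 99, 100, 250, 1000]  # made-up values in the 1-1000 range
print(sorted(fixed_width_bins(sample, 100).items()))
# [(0, 2), (1, 1), (2, 1), (10, 1)]
```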

Asked: 9 years, 11 months ago
Viewed: 30012 times
Last activity: 7 years, 1 month ago