I just tried running countDistinct over a window and got this error:
AnalysisException: u'Distinct window functions are not supported: count(distinct color#1926)
Is there a way to do a count distinct over a window in pyspark?
Here is some example code:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
#function to calculate number of seconds from number of days
days = lambda i: i * 86400
df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00", "orange"),
                            (13, "2017-03-15T12:27:18+00:00", "red"),
                            (25, "2017-03-18T11:27:18+00:00", "red")],
                           ["dollars", "timestampGMT", "color"])
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
#create window by casting timestamp to long (number of seconds)
w = (Window.orderBy(F.col("timestampGMT").cast('long')).rangeBetween(-days(7), 0))
df = df.withColumn('distinct_color_count_over_the_last_week', F.countDistinct("color").over(w))
df.show()
Here is the output I would like to see:
+-------+--------------------+------+---------------------------------------+
|dollars| timestampGMT| …
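For reference, a workaround sketch (not part of the original question): Spark rejects distinct aggregates over windows, but collect_set does work as a window function, so the size of the collected set gives the distinct count. This assumes the same df as above:

from pyspark.sql.window import Window
from pyspark.sql import functions as F

# function to calculate number of seconds from number of days
days = lambda i: i * 86400

w = (Window.orderBy(F.col("timestampGMT").cast('long')).rangeBetween(-days(7), 0))

# collect the distinct colors seen in the window, then count them
df = df.withColumn('distinct_color_count_over_the_last_week',
                   F.size(F.collect_set("color").over(w)))
df.show()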
I have a dataset consisting of a timestamp column and a dollars column. I would like to find the average dollar amount per week, ending at each row's timestamp. I was initially looking at the pyspark.sql.functions.window function, but that bins the data by week.
Here is an example:
%pyspark
import datetime
from pyspark.sql import functions as F
df1 = sc.parallelize([(17, "2017-03-11T15:27:18+00:00"),
                      (13, "2017-03-11T12:27:18+00:00"),
                      (21, "2017-03-17T11:27:18+00:00")]).toDF(["dollars", "datestring"])
df2 = df1.withColumn('timestampGMT', df1.datestring.cast('timestamp'))
w = df2.groupBy(F.window("timestampGMT", "7 days")).agg(F.avg("dollars").alias('avg'))
w.select(w.window.start.cast("string").alias("start"), w.window.end.cast("string").alias("end"), "avg").collect()
This results in two records:
| start                 | end                   | avg  |
|-----------------------|-----------------------|------|
| '2017-03-16 00:00:00' | '2017-03-23 00:00:00' | 21.0 |
| '2017-03-09 00:00:00' | '2017-03-16 00:00:00' | 15.0 |
The window function bins the time series data rather than performing a rolling average.
Is there a way to perform a rolling average where I'd get back a weekly average for each row, with the time period ending at the row's timestampGMT?
Edit:
Zhang's answer is close to what I want, but not exactly what I'd like to see.
Here's a better example to show what I'm trying to get at:
%pyspark
from pyspark.sql import functions as F
from pyspark.sql.window import Window  # needed for Window.partitionBy below
df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00"),
                            (13, "2017-03-15T12:27:18+00:00"),
                            (25, "2017-03-18T11:27:18+00:00")],
                           ["dollars", "timestampGMT"])
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
df = df.withColumn('rolling_average', F.avg("dollars").over(Window.partitionBy(F.window("timestampGMT", …
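For comparison, a sketch of a rangeBetween approach (not the original poster's final code; it assumes the df defined just above): ordering by the timestamp cast to seconds and averaging over the trailing seven days yields a per-row rolling average instead of fixed weekly buckets:

from pyspark.sql.window import Window
from pyspark.sql import functions as F

days = lambda i: i * 86400

# trailing 7-day window ending at each row's timestamp
w = (Window.orderBy(F.col("timestampGMT").cast('long')).rangeBetween(-days(7), 0))

df = df.withColumn('rolling_average', F.avg("dollars").over(w))
df.show()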
I'm trying to convert a column of GMT timestamp strings to a column of timestamps in the Eastern time zone. I want to account for daylight saving time.
My timestamp string column looks like this:
'2017-02-01T10:15:21+00:00'
I figured out how to convert the string column to a timestamp in EST:
from pyspark.sql import functions as F
df2 = df1.withColumn('datetimeGMT', df1.myTimeColumnInGMT.cast('timestamp'))
df3 = df2.withColumn('datetimeEST', F.from_utc_timestamp(df2.datetimeGMT, "EST"))
But the times don't change with daylight saving time. Is there another function or something that can account for daylight saving time when converting the timestamps?
Edit: I think I figured it out. In the from_utc_timestamp call above, I needed to use "America/New_York" instead of "EST":
df3 = df2.withColumn('datetimeET', F.from_utc_timestamp(df2.datetimeGMT, "America/New_York"))
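A quick sanity check of the fix (a sketch, assuming a SparkSession named spark): a winter timestamp should shift by five hours and a summer one by four:

from pyspark.sql import functions as F

df = spark.createDataFrame([("2017-02-01T10:15:21+00:00",),
                            ("2017-07-01T10:15:21+00:00",)],
                           ["gmt_string"])
df = df.withColumn('datetimeGMT', F.col('gmt_string').cast('timestamp'))
df = df.withColumn('datetimeET', F.from_utc_timestamp(df.datetimeGMT, "America/New_York"))

# the February row lands at 05:15:21 (UTC-5), the July row at 06:15:21 (UTC-4)
df.show(truncate=False)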
I've been trying to set a title when using Holoviews and Bokeh. I'm overlaying 3 plots on top of each other. The code looks like this:
%%opts Curve [width=900 height=400 show_grid=True tools=['hover'] finalize_hooks=[apply_formatter]]
%%opts Curve (color=Cycle('Category20'))
%%opts Overlay [ legend_position='bottom' ] Curve (muted_alpha=0.5 muted_color='black' )
actual_curve = hv.Curve(df_reg_test, 'date', 'y', label='Actual')
existing_curve = hv.Curve(df_reg_test, 'date', 'Forecast Calls', label='Existing Forecast')
xgb_curve = hv.Curve(df_reg_test, 'date', 'xgb_pred', label='New Forecast')
actual_curve * existing_curve * xgb_curve
As you can see, the labels for the individual curves show up fine in the legend, but I don't get a title at the top of the plot.
How do I set the title manually?
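One approach that should work (a sketch, not from the original post; the exact option depends on the HoloViews version, and the title string here is just an example): set the title on the composed Overlay itself:

# HoloViews 1.12+: pass the title directly as an option on the overlay
overlay = actual_curve * existing_curve * xgb_curve
overlay.opts(title='Actual vs. Forecast Calls')

# older releases exposed this as a plot option instead, e.g.
# %%opts Overlay [title_format='Actual vs. Forecast Calls']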
I'm trying to use the graphframes package in pyspark in a Jupyter Notebook on AWS EMR (using Sagemaker and sparkmagic). When creating the EMR cluster in the AWS console, I tried adding a configuration option:
[{"classification":"spark-defaults", "properties":{"spark.jars.packages":"graphframes:graphframes:0.7.0-spark2.4-s_2.11"}, "configurations":[]}]
But I still got an error when trying to use the graphframes package in my pyspark code in the Jupyter notebook.
Here is my code (it's from the graphframes example):
# Create a Vertex DataFrame with unique ID column "id"
v = spark.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = spark.createDataFrame([
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
], ["src", "dst", …
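For reference, a workaround sketch (assuming sparkmagic's %%configure magic is available in the notebook): classification settings made at cluster creation don't always reach the Livy session, so the package can be requested per notebook session instead, in a cell run before any Spark code:

%%configure -f
{"conf": {"spark.jars.packages": "graphframes:graphframes:0.7.0-spark2.4-s_2.11"}}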
I created a presto cluster using AWS EMR. I'm using all the default configurations. I want to write a python script on the master node to push queries to presto and get the results.
I found the PyHive library, but I can't figure out what to put in the connection string:
from pyhive import presto # or import hive
cursor = presto.connect('localhost').cursor()
statement = 'SELECT * FROM my_awesome_data LIMIT 10'
cursor.execute(statement)
my_results = cursor.fetchall()
I think localhost is probably right, since I'm running the script on the master node of the presto cluster, but I get an error:
OperationalError: Unexpected status code 404
b'<!DOCTYPE html><html><head><title>Apache Tomcat/8.0.45 - Error report</title> ... </head><body><h1>HTTP Status 404 - /v1/statement</h1><div class="line"></div><p><b>type</b> Status report</p><p><b>message</b> …
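For reference, a sketch of a likely fix (assuming an EMR default install, where the Presto coordinator listens on port 8889 rather than PyHive's default of 8080, which would explain another web app answering with a Tomcat 404):

from pyhive import presto

# point PyHive at the Presto coordinator port used by EMR
cursor = presto.connect(host='localhost', port=8889).cursor()
cursor.execute('SELECT * FROM my_awesome_data LIMIT 10')
my_results = cursor.fetchall()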