Posts by mas*_*-g3

"Resolved attribute(s) missing" when performing a join in pySpark

I have the following two pySpark dataframes:

> df_lag_pre.columns
['date','sku','name','country','ccy_code','quantity','usd_price','usd_lag','lag_quantity']

> df_unmatched.columns
['alt_sku', 'alt_lag_quantity', 'country', 'ccy_code', 'name', 'usd_price']

Now I want to join them on their common columns, so I try the following:

> df_lag_pre.join(df_unmatched, on=['name','country','ccy_code','usd_price'])

And I get the following error message:

AnalysisException: u'resolved attribute(s) price#3424 missing from country#3443,month#801,price#808,category#803,subcategory#804,page#805,date#280,link#809,name#806,quantity#807,ccy_code#3439,sku#3004,day#802 in operator !EvaluatePython PythonUDF#<lambda>(ccy_code#3439,price#3424), pythonUDF#811: string;'

Some of the columns that show up in this error, such as price, are part of another dataframe from which df_lag was built. I can't find any information on how to interpret this message, so any help would be greatly appreciated.

apache-spark pyspark spark-dataframe

13
votes
1
answer
7329
views

Multiple aggregation criteria on a pySpark Dataframe

I have a pySpark dataframe that looks like this:

+-------------+----------+
|          sku|      date|
+-------------+----------+
|MLA-603526656|02/09/2016|
|MLA-603526656|01/09/2016|
|MLA-604172009|02/10/2016|
|MLA-605470584|02/09/2016|
|MLA-605502281|02/10/2016|
|MLA-605502281|02/09/2016|
+-------------+----------+

I want to group by sku and then compute the minimum and maximum dates. If I do this:

df_testing.groupBy('sku') \
    .agg({'date': 'min', 'date':'max'}) \
    .limit(10) \
    .show()

The behavior is the same as in Pandas: I only get the sku and max(date) columns. In Pandas I would usually do the following to get the results I want:

df_testing.groupBy('sku') \
    .agg({'day': ['min','max']}) \
    .limit(10) \
    .show()

But this doesn't work in pySpark, and I get a java.util.ArrayList cannot be cast to java.lang.String error. Could anyone point me to the correct syntax?

Thanks.

apache-spark pyspark spark-dataframe

13
votes
1
answer
10k
views

Boosting spark.yarn.executor.memoryOverhead

I'm trying to run a (py)Spark job on EMR that processes a large amount of data. Currently my job fails with the following error message:

Reason: Container killed by YARN for exceeding memory limits.
5.5 GB of 5.5 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.

So I googled how to do this and found that I should pass the spark.yarn.executor.memoryOverhead parameter with the --conf flag. I'm doing it like this:

aws emr add-steps\
--cluster-id %s\
--profile EMR\
--region us-west-2\
--steps Name=Spark,Jar=command-runner.jar,\
Args=[\
/usr/lib/spark/bin/spark-submit,\
--deploy-mode,client,\
/home/hadoop/%s,\
--executor-memory,100g,\
--num-executors,3,\
--total-executor-cores,1,\
--conf,'spark.python.worker.memory=1200m',\
--conf,'spark.yarn.executor.memoryOverhead=15300',\
],ActionOnFailure=CONTINUE" % (cluster_id,script_name)\

But when I re-run the job it keeps giving me the same error message, 5.5 GB of 5.5 GB physical memory used, which means my memory was not increased. Any hints on what I'm doing wrong?
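One thing worth checking (an assumption about the cause, based on the command above): spark-submit treats everything after the application script path as arguments *to the script*, so options placed after `/home/hadoop/%s` are silently ignored by Spark. A reordered sketch of the step, with a hypothetical cluster ID and script name in place of the `%s` placeholders:

```shell
# All spark-submit options must come BEFORE the application script;
# anything after the script path is passed to the script itself.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --profile EMR \
  --region us-west-2 \
  --steps "Name=Spark,Jar=command-runner.jar,Args=[\
/usr/lib/spark/bin/spark-submit,\
--deploy-mode,client,\
--executor-memory,10g,\
--num-executors,3,\
--conf,spark.python.worker.memory=1200m,\
--conf,spark.yarn.executor.memoryOverhead=15300,\
/home/hadoop/my_script.py],ActionOnFailure=CONTINUE"
```

Also note that --executor-memory plus memoryOverhead must fit within the YARN container limit of the instance type, so values like 100g will not be granted on an r3.xlarge.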

EDIT

Here are the details of how I initially created the cluster:

aws emr create-cluster\
--name "Spark"\
--release-label emr-4.7.0\
--applications Name=Spark\
--bootstrap-action Path=s3://emr-code-matgreen/bootstraps/install_python_modules.sh\
--ec2-attributes KeyName=EMR2,InstanceProfile=EMR_EC2_DefaultRole\
--log-uri s3://emr-logs-zerex\
--instance-type r3.xlarge\
--instance-count 4\
--profile EMR\ …

amazon-web-services amazon-emr emr apache-spark pyspark

10
votes
2
answers
20k
views

Spark failing because of "stop() called from shutdown hook"

I'm having the following problem when running Spark on AWS EMR. While doing a join on a table to filter certain IDs out, Spark suddenly dies with the stdout file reporting the following:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.

The command I'm executing where it's breaking is running fine when I run it on local mode on my machine (on a much smaller dataset), and looks like the following:

sampled_data = data_df \
                .join(sample_ids, data_df.entity_id == sample_ids.entity_id, 'inner') \ …

emr apache-spark pyspark spark-dataframe

7
votes
0
answers
6758
views

Sorting the output of tally/count (dplyr)

This should be easy, but I can't find a straightforward way to do it. My dataset looks like this:

                DisplayName Nationality Gender Startyear
1           Alfred H. Barr, Jr.    American   Male      1929
2               Paul C\216zanne      French   Male      1929
3                  Paul Gauguin      French   Male      1929
4              Vincent van Gogh       Dutch   Male      1929
5         Georges-Pierre Seurat      French   Male      1929
6            Charles Burchfield    American   Male      1929
7                Charles Demuth    American   Male      1929
8             Preston Dickinson    American   Male      1929
9              Lyonel Feininger    American   Male      1929
10 George Overbury ("Pop") Hart    American   Male      1929
...

I want to group by DisplayName and Gender and get a count for each name (they are repeated multiple times in the list, with different year information).

The following two commands give the same output, but they don't sort by the count output "n". Any ideas on how to achieve this?

artists <- data %>%
  filter(!is.na(Gender) & Gender …
Run Code Online (Sandbox Code Playgroud)
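In dplyr the missing step is to arrange on the count column in descending order after counting. For illustration only, the same group-count-sort pattern in Python/pandas (a hypothetical analogue, not the dplyr answer itself):

```python
import pandas as pd

# Small hypothetical stand-in for the artists dataset above.
data = pd.DataFrame({
    "DisplayName": ["Paul Gauguin", "Paul Gauguin", "Charles Demuth"],
    "Gender": ["Male", "Male", "Male"],
})

# Group, count, then sort descending on the count column -- the pandas
# equivalent of dplyr's count(...) followed by arrange(desc(n)).
artists = (data
           .groupby(["DisplayName", "Gender"])
           .size()
           .reset_index(name="n")
           .sort_values("n", ascending=False))
```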

r dplyr

2
votes
1
answer
3459
views