I have the following two pySpark dataframes:
> df_lag_pre.columns
['date','sku','name','country','ccy_code','quantity','usd_price','usd_lag','lag_quantity']
> df_unmatched.columns
['alt_sku', 'alt_lag_quantity', 'country', 'ccy_code', 'name', 'usd_price']
Now I want to join them on their common columns, so I try the following:
> df_lag_pre.join(df_unmatched, on=['name','country','ccy_code','usd_price'])
And I get the following error message:
AnalysisException: u'resolved attribute(s) price#3424 missing from country#3443,month#801,price#808,category#803,subcategory#804,page#805,date#280,link#809,name#806,quantity#807,ccy_code#3439,sku#3004,day#802 in operator !EvaluatePython PythonUDF#<lambda>(ccy_code#3439,price#3424), pythonUDF#811: string;'
Some of the columns shown in this error (e.g. price) are part of another dataframe that df_lag was built from. I can't find any information on how to interpret this message, so any help would be greatly appreciated.
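For what it's worth, "resolved attribute(s) ... missing" errors like this usually mean the query plan is still carrying column references (note the `#3424`-style ids) from an earlier DataFrame in the lineage, here apparently via a Python UDF that referenced price. One workaround I've seen is to force fresh attribute references by re-selecting every column before the join. A minimal sketch, assuming plain PySpark DataFrames; the `refresh` helper name is mine, not from the post:

```python
def refresh(df):
    """Re-select every column under its own name, so the resulting
    DataFrame carries fresh attribute references (breaks stale lineage)."""
    return df.select([df[c].alias(c) for c in df.columns])

# Hypothetical usage before the failing join:
# joined = refresh(df_lag_pre).join(refresh(df_unmatched),
#                                   on=['name', 'country', 'ccy_code', 'usd_price'])
```

Whether this fixes the error depends on where the stale UDF reference comes from, but it is a cheap first thing to try.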
I have a pySpark dataframe that looks like this:
+-------------+----------+
| sku| date|
+-------------+----------+
|MLA-603526656|02/09/2016|
|MLA-603526656|01/09/2016|
|MLA-604172009|02/10/2016|
|MLA-605470584|02/09/2016|
|MLA-605502281|02/10/2016|
|MLA-605502281|02/09/2016|
+-------------+----------+
I'd like to group by sku and then compute the minimum and maximum date. If I do this:
df_testing.groupBy('sku') \
.agg({'date': 'min', 'date':'max'}) \
.limit(10) \
.show()
the behavior is the same as in Pandas: I only get the sku and max(date) columns. In Pandas I would normally do the following to get the results I want:
df_testing.groupBy('sku') \
.agg({'day': ['min','max']}) \
.limit(10) \
.show()
But this doesn't work in pySpark, and I get a java.util.ArrayList cannot be cast to java.lang.String error. Could anyone point me to the correct syntax?
Thanks.
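As an aside, the first snippet only ever returns max(date) because a Python dict literal cannot hold the key 'date' twice: the second entry silently overwrites the first before Spark ever sees it. A quick pure-Python check:

```python
# Duplicate keys in a dict literal: the last value wins, so the
# {'date': 'min', 'date': 'max'} passed to agg() is just {'date': 'max'}.
aggs = {'date': 'min', 'date': 'max'}
print(aggs)  # {'date': 'max'}
```

The list-valued form (`{'day': ['min', 'max']}`) is Pandas syntax. In PySpark the usual way to get several aggregates on one column is to pass column expressions instead of a dict, e.g. `df_testing.groupBy('sku').agg(F.min('date'), F.max('date'))` with `from pyspark.sql import functions as F`.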
I'm trying to run a (py)Spark job on EMR that processes a large amount of data. Currently my job fails with the following error message:
Reason: Container killed by YARN for exceeding memory limits.
5.5 GB of 5.5 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.
So I googled how to do that and found that I should pass the spark.yarn.executor.memoryOverhead parameter with the --conf flag. I'm doing it like this:
aws emr add-steps\
--cluster-id %s\
--profile EMR\
--region us-west-2\
--steps Name=Spark,Jar=command-runner.jar,\
Args=[\
/usr/lib/spark/bin/spark-submit,\
--deploy-mode,client,\
/home/hadoop/%s,\
--executor-memory,100g,\
--num-executors,3,\
--total-executor-cores,1,\
--conf,'spark.python.worker.memory=1200m',\
--conf,'spark.yarn.executor.memoryOverhead=15300',\
],ActionOnFailure=CONTINUE" % (cluster_id,script_name)\
But when I rerun the job it keeps giving me the same error message, 5.5 GB of 5.5 GB physical memory used, which means the memory didn't increase. Any hints on what I'm doing wrong?
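Two observations that may help pin this down. First, spark-submit treats everything after the application script as arguments to the script itself, and in the Args list above the --conf and --executor-memory flags come after /home/hadoop/%s, so they are likely never reaching Spark at all; that would explain why the limit stays at 5.5 GB. Second, the 5.5 GB figure itself is consistent with Spark's defaults: YARN kills a container when heap plus overhead exceeds its allocation, and the default overhead is max(384 MB, 10% of executor memory). A small helper (my own sketch, not from the job) to see what YARN is asked for per executor:

```python
def yarn_container_request_mb(executor_memory_mb, overhead_mb=None):
    """Memory YARN must grant one executor: heap plus off-heap overhead.
    Spark's default overhead is max(384 MB, 10% of executor memory)."""
    if overhead_mb is None:
        overhead_mb = max(384, executor_memory_mb // 10)
    return executor_memory_mb + overhead_mb

print(yarn_container_request_mb(5120))         # 5632 MB, i.e. the 5.5 GB in the error
print(yarn_container_request_mb(5120, 15300))  # 20420 MB, if the override took effect
```

So a 5g executor with default overhead lands exactly on 5.5 GB; if the memoryOverhead=15300 override were actually applied, the container request would look very different.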
Edit
Here are the details of how I originally created the cluster:
aws emr create-cluster\
--name "Spark"\
--release-label emr-4.7.0\
--applications Name=Spark\
--bootstrap-action Path=s3://emr-code-matgreen/bootstraps/install_python_modules.sh\
--ec2-attributes KeyName=EMR2,InstanceProfile=EMR_EC2_DefaultRole\
--log-uri s3://emr-logs-zerex\
--instance-type r3.xlarge\
--instance-count 4\
--profile EMR\ …

I'm having the following problem when running Spark on AWS EMR. While doing a join on a table to filter certain IDs out, Spark suddenly dies with the stdout file reporting the following:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
The command that breaks runs fine when I execute it in local mode on my machine (on a much smaller dataset), and looks like the following:
sampled_data = data_df \
.join(sample_ids, data_df.entity_id == sample_ids.entity_id, 'inner') \ …

This should be easy, but I can't find a straightforward way to do it. My dataset looks like this:
DisplayName Nationality Gender Startyear
1 Alfred H. Barr, Jr. American Male 1929
2 Paul C\216zanne French Male 1929
3 Paul Gauguin French Male 1929
4 Vincent van Gogh Dutch Male 1929
5 Georges-Pierre Seurat French Male 1929
6 Charles Burchfield American Male 1929
7 Charles Demuth American Male 1929
8 Preston Dickinson American Male 1929
9 Lyonel Feininger American Male 1929
10 George Overbury ("Pop") Hart American Male 1929
...
I want to group by DisplayName and Gender and get the count for each name (they are repeated multiple times in the list, with different year information).
The following two commands give the same output, but neither sorts the output by the count n. Any ideas on how to achieve this?
artists <- data %>%
filter(!is.na(Gender) & Gender …