When I run my parsing code on a 1 GB dataset, it completes without any errors. But when I try 25 GB of data at once, I run into failures. I am trying to understand how to avoid them; any suggestions or ideas would be appreciated.
The errors vary:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
org.apache.spark.shuffle.FetchFailedException: Failed to connect to ip-xxxxxxxx
org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer{file=/mnt/yarn/nm/usercache/xxxx/appcache/application_1450751731124_8446/blockmgr-8a7b17b8-f4c3-45e7-aea8-8b0a7481be55/08/shuffle_0_224_0.data, offset=12329181, length=2104094}
Cluster details:
YARN: 8 nodes
Total cores: 64
Memory: 500 GB
Spark version: 1.5
spark-submit command:
spark-submit --master yarn-cluster \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--executor-memory 4g \
--driver-memory 16g \
--num-executors 50 \
--deploy-mode cluster \
--executor-cores 1 \
--class my.parser \
myparser.jar \
-input xxx \
-output xxxx \
One of the stack traces:
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:460)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:456)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at …

How do I read a partitioned Parquet file into a DataFrame with a condition on the partitions?
This works fine:
val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=25/*")
The partitions go from day=1 to day=30. Is it possible to read something like (day = 5 to 6) or day=5,day=6?
val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=??/*")
If I put * it gives me all 30 days of data, which is too big.
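Not part of the original question, but a hedged sketch of two common approaches, written here with the PySpark 1.5 API (the question uses Scala, where the same read.parquet and filter calls exist). The base path is taken from the question; the app name is made up.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="read-partitions")   # hypothetical app name
sqlContext = SQLContext(sc)

base = "file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10"

# Option 1: pass the wanted partition directories explicitly;
# read.parquet accepts more than one path.
df_days = sqlContext.read.parquet(base + "/day=5", base + "/day=6")

# Option 2: read the partitioned root and filter on the partition columns;
# partition discovery turns year/month/day into columns Spark can prune on.
root = "file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD"
df_range = sqlContext.read.parquet(root).filter(
    "year = 2015 AND month = 10 AND day >= 5 AND day <= 6")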
I am getting the following error while creating a Hive table:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com/facebook/fb303/FacebookService$Iface
Hadoop version: **hadoop-1.2.1**
Hive version: **hive-0.12.0**
Hadoop path: /home/hadoop_test/data/hadoop-1.2.1
Hive path: /home/hadoop_test/data/hive-0.12.0
I have copied hive-*.jar, jline-*.jar and antlr-runtime*.jar from hive-0.12.0/lib to hadoop-1.2.1/lib.
How do I do the following in a list comprehension?
test = [["abc", 1],["bca",2]]
result = []
for x in test:
    if x[0] == 'abc':
        result.append(x)
    else:
        pass
result
Out[125]: [['abc', 1]]
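For reference (not part of the original post): the attempts below fail because a comprehension has no slot for pass; a minimal sketch of the usual filter form, which simply keeps the matching items, would be:

test = [["abc", 1], ["bca", 2]]

# The `if` goes after the `for` clause and acts as a filter, so no `else` is needed.
result = [x for x in test if x[0] == 'abc']
print(result)   # [['abc', 1]]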
Attempt 1:
[x if (x[0] == 'abc') else pass for x in test]
File "<ipython-input-127-d0bbe1907880>", line 1
[x if (x[0] == 'abc') else pass for x in test]
^
SyntaxError: invalid syntax
Attempt 2:
[x if (x[0] == 'abc') else None for x in test]
Out[126]: [['abc', 1], None]
Attempt 3:
[x if (x[0] == 'abc') for x in test]
File …

In the AWS Batch Job queues dashboard, it shows the job counts for every job status, failed and succeeded, over the last 24 hours. Is it possible to reset these counters to zero?
I am using PySpark 1.5.2. After issuing a .collect() command I get the UserWarning Please install psutil to have better support with spilling.
Why does this warning appear?
How do I install psutil?
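As a hedged note (not from the post): psutil is an ordinary PyPI package, so pip install psutil should satisfy the warning; on a cluster it would have to be installed on every node that runs Python workers, not only on the driver.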
I need the statistics that go into a box plot drawn with Pandas (a box plot created from a DataFrame), i.e. Quartile1, Quartile2, Quartile3, the lower whisker value, the upper whisker value, and the outliers. I tried the following to draw the boxplot:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 5), columns=['A', 'B', 'C', 'D', 'E'])
df.boxplot(return_type='both')
Is there a way to get these values without calculating them manually?
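Not from the original post: a minimal sketch that computes the same numbers straight from the DataFrame, assuming the conventional 1.5 × IQR whisker rule that matplotlib's boxplot uses by default (box_stats is a made-up helper name):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 5), columns=['A', 'B', 'C', 'D', 'E'])

def box_stats(s):
    # Quartiles of a single column.
    q1, q2, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Whiskers: the most extreme points still within 1.5 * IQR of the box.
    lower_whisker = s[s >= q1 - 1.5 * iqr].min()
    upper_whisker = s[s <= q3 + 1.5 * iqr].max()
    # Everything beyond the whiskers counts as an outlier.
    outliers = s[(s < lower_whisker) | (s > upper_whisker)].tolist()
    return {'q1': q1, 'median': q2, 'q3': q3,
            'lower_whisker': lower_whisker, 'upper_whisker': upper_whisker,
            'outliers': outliers}

stats = {col: box_stats(df[col]) for col in df.columns}

Alternatively, the dict of matplotlib lines returned by return_type='dict' (or the lines part of 'both') should expose the same values through each line's get_ydata().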
I am trying to find out whether the Date falls within the PromoInterval in the DataFrame.
print dset1
Store Date PromoInterval
1760 2 2013-05-04 Jan,Apr,Jul,Oct
1761 2 2013-05-03 Jan,Apr,Jul,Oct
1762 2 2013-05-02 Jan,Apr,Jul,Oct
1763 2 2013-05-01 Jan,Apr,Jul,Oct
1764 2 2013-04-30 Jan,Apr,Jul,Oct
def func(a, b):
    y = b.split(",")
    z = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun', 7: 'Jul',
         8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}
    return (z[a] in y)
dset1.apply(func, axis=1, args = (dset1['Date'].dt.month, dset1['PromoInterval']) )
I run into the following error:
dset1.apply(func, axis=1, args=(dset1['Date'].dt.month, dset1['PromoInterval']))
('func() takes exactly 2 arguments (3 given)', u'occurred at index 1760')
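A hedged sketch (not from the post): with axis=1, apply already hands func one whole row, so the extra args tuple is exactly what produces the third argument. Letting the function pull Date and PromoInterval out of the row avoids that; in_promo is a made-up name.

import calendar

def in_promo(row):
    # calendar.month_abbr[5] == 'May', matching tokens like 'Jan,Apr,Jul,Oct'
    return calendar.month_abbr[row['Date'].month] in row['PromoInterval'].split(',')

dset1['InPromo'] = dset1.apply(in_promo, axis=1)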
Dataset:
{'Date': {1760: Timestamp('2013-05-04 00:00:00'),
1761: Timestamp('2013-05-03 00:00:00'),
1762: Timestamp('2013-05-02 00:00:00'),
1763: Timestamp('2013-05-01 00:00:00'),
1764: Timestamp('2013-04-30 00:00:00')},
'PromoInterval': {1760: 'Jan,Apr,Jul,Oct',
1761: …

I wrote a DataFrame out as a Parquet file. Now I want to read that file with Hive, using the metadata from the Parquet file.
Output of the Parquet write:
_SUCCESS
_common_metadata
_metadata
part-r-00000-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
part-r-00001-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
part-r-00002-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
part-r-00003-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
Hive table:
CREATE TABLE testhive
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'/home/gz_files/result';
FAILED: SemanticException [Error 10043]: Either list of columns or a custom serializer should be specified
How can I infer the metadata from the Parquet file?
If I open _common_metadata, it contains the following:
PAR1LHroot
%TSN%
%TS%
%Etype%
)org.apache.spark.sql.parquet.row.metadata?{"type":"struct","fields":[{"name":"TSN","type":"string","nullable":true,"metadata":{}},{"name":"TS","type":"string","nullable":true,"metadata":{}},{"name":"Etype","type":"string","nullable":true,"metadata":{}}]}
Or how do I parse the metadata file?
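Not part of the original question, but a hedged sketch of one way to avoid parsing _common_metadata by hand, using the PySpark 1.5 API mentioned elsewhere on this page: let Spark read the Parquet footer itself, then either copy the printed schema into the CREATE TABLE column list or have Spark register the table in the Hive metastore directly. The app name is made up.

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="parquet-schema")   # hypothetical app name
sqlContext = HiveContext(sc)

df = sqlContext.read.parquet("/home/gz_files/result")
df.printSchema()   # should show TSN, TS and Etype as nullable strings

# Option: let Spark create the Hive table itself, so no column list is written by hand.
df.write.saveAsTable("testhive")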
I am trying to append the file name to each record in the file. I thought that if the RDD were an Array this would be easy to do.
Any help with converting the RDD type, or with solving this problem some other way, would be greatly appreciated!
With a (String, String) type:
scala> myRDD.first()(1)
<console>:24: error: (String, String) does not take parameters
       myRDD.first()(1)
With an Array[String]:
scala> myRDD.first()(1)
scala> res1: String = abcdefgh
My function:
import scala.util.matching.Regex

def appendKeyToValue(x: Array[Array[String]]) {
  for (i <- 0 to (x.length - 1)) {
    val key = x(i)(0)
    val pattern = new Regex("\\.")
    val key2 = pattern.replaceAllIn(key, "|")   // the post has `key1` here, which is not defined anywhere
    val tempvalue = x(i)(1)
    val finalval = tempvalue.split("\n")
    for (ab <- 0 to (finalval.length - 1)) {
      // build the key-prefixed record (the stray English sentence in the post belonged to the question text)
      val result = key2 + "|" + finalval(ab)
    }
  }
}