I get the following error when I try to write a dataset to a Parquet file:
18/11/05 06:25:43 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 84 in stage 1.0 failed 4 times, most recent failure: Lost task 84.3 in stage 1.0 (TID 989, ip-10-253-194-207.nonprd.aws.csp.net, executor 4): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
However, when I call dataset.show() I can view the data. I'm not sure where to look for the root cause.
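A decodeToInt call failing on a PlainValuesDictionary$PlainBinaryDictionary usually points to a schema mismatch: the column is stored as string/binary in (some of) the source Parquet files while the plan expects an int, and show() may still succeed because it only touches a few rows. A minimal sketch of one way to narrow this down, assuming a hypothetical column named id is the offender (the column name and paths are illustrative, not taken from the log; static import of org.apache.spark.sql.functions.col is assumed):

// Disabling the vectorized Parquet reader often turns the opaque
// UnsupportedOperationException into a clearer type error.
spark.conf().set("spark.sql.parquet.enableVectorizedReader", "false");

// Cast the suspect column explicitly to the type the rest of the job
// expects before writing (the column name "id" is only an example).
Dataset<Row> fixed = dataset.withColumn("id", col("id").cast("int"));
fixed.write().mode(SaveMode.Overwrite).parquet("/path/to/output");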
I have the following Map:
HashMap<String, String> map1= new HashMap<String, String>();
map1.put("1", "One");
map1.put("2", "Two");
map1.put("3", "Three");
I have a list numbers containing ["1","2","3"], and I need to do the following:
List<String> spelling = new ArrayList<>();
for (String num : numbers) {
    if (map1.containsKey(num)) {
        spelling.add(map1.get(num));
    }
}
How can I write the above code using lambda expressions / the Streams API?
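One way to express the same loop with the Streams API (a sketch, assuming java.util.stream.Collectors is imported):

// Keep only the numbers that exist as keys, then map them to their
// spelled-out values, preserving the order of the numbers list.
List<String> spelling = numbers.stream()
        .filter(map1::containsKey)
        .map(map1::get)
        .collect(Collectors.toList());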
I'm trying to write the following condition:
if(javaList.contains("aaa")||javaList.contains("abc")||javaList.contains("abc")) {
//do something
}
How can I do this in a better way?
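One option is to collect the values to check into a list and test them with anyMatch, so adding a new value only means extending that list. A sketch, assuming imports of java.util.Arrays, java.util.Collections and java.util.List; the values mirror the ones in the question:

List<String> wanted = Arrays.asList("aaa", "abc");

// Streams version: true if javaList contains at least one of the values.
if (wanted.stream().anyMatch(javaList::contains)) {
    // do something
}

// Equivalent without streams: disjoint is false when the lists share an element.
if (!Collections.disjoint(javaList, wanted)) {
    // do something
}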
I'm trying to write a dataset's result to a single CSV file using Java:
dataset.write().mode(SaveMode.Overwrite).option("header",true).csv("C:\\tmp\\csvs");
But it times out and the file is never written. It throws org.apache.spark.SparkException: Job aborted. The error is:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 13.0 failed 1 times, most recent failure: Lost task 0.0 in stage 13.0 (TID 16, localhost): java.io.IOException: (null) entry in command string: null chmod 0644 C:\tmp\12333333testSpark\_temporary\0\_temporary\attempt_201712282255_0013_m_000000_0\part-r-00000-229fd1b6-ffb9-4ba1-9dc9-89dfdbd0be43.csv
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:770)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:225)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209)
at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:307)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:296)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:398)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:132)
at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CSVRelation.scala:200)
at …
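The "(null) entry in command string: null chmod" failure on Windows is commonly associated with a missing winutils.exe / HADOOP_HOME setup rather than with the write itself. A sketch under that assumption (winutils.exe installed at C:\hadoop\bin, a path chosen only for illustration), which also coalesces to a single partition so only one CSV part file is produced:

// Assumption: winutils.exe lives in C:\hadoop\bin. Set this before the
// SparkSession is created (or set the HADOOP_HOME environment variable).
System.setProperty("hadoop.home.dir", "C:\\hadoop");

// coalesce(1) makes Spark write a single part file inside C:\tmp\csvs.
dataset.coalesce(1)
        .write()
        .mode(SaveMode.Overwrite)
        .option("header", true)
        .csv("C:\\tmp\\csvs");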
I'm trying to group a dataset and aggregate it by summing the expend column. But this doesn't work on a plain Dataset, because groupBy gives me a RelationalGroupedDataset. How can I achieve the following on a plain Dataset?
dataset.select(col("col1"), col("col2"), col("expend")).groupBy(col("col1"), col("col2"), col("expend")).agg(sum("expend"))
The equivalent SQL query looks like:
select col1,col2,SUM(expend) from table group by col1,col2
When I try this code, the column gets duplicated: dataset.columns() gives me [col1, col2, expend, expend]. Is this the right approach?
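A sketch of the same aggregation in the Dataset API (assuming static imports of col and sum from org.apache.spark.sql.functions): grouping only by col1 and col2, not by expend, matches the SQL and avoids the duplicated expend column:

// Group by col1 and col2 only; alias the sum back to "expend" so the
// result has a single column of that name.
Dataset<Row> result = dataset
        .groupBy(col("col1"), col("col2"))
        .agg(sum("expend").alias("expend"));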
I'm trying to connect to a DB from the spark-shell using a script in a Scala file. The connection script picks up the password from another location, and that password gets printed in the spark-shell console. I just want to avoid that.
The Scala code looks like this:
val config=Map("driver"->"drivername","url"->"dburl","user"->"username","password"->"741852963");
When this code is loaded in the spark-shell, it is also echoed back in the console. I'd like this particular part not to be printed in the spark console. How can I achieve this?
I have a scenario where my Spark dataset has 24 columns; I group by the first 22 and sum the last two.
I then removed the group by from the query, so all 24 columns are now selected. The initial count of the dataset was 79,304.
After removing the group by, the count increased to 138,204, which is understandable since the group by was dropped.
But I don't understand why the Parquet file's size was initially 2.3 MB and then dropped to 1.5 MB. Can anyone help me understand this?
And the size doesn't shrink every time: I have a similar case where the 22-column count was 35,298,226 before and 59,874,208 after removing the group by, and there the size grew from 466.5 MB to 509.8 MB.