Posts by user Yas*_*rma

How to pull all data from subdirectories into Hive

My data is organized in directories in a specific layout (shown below), and I want to add it to a Hive table. I want to add all the data under the 2012 directory. All the names below are directory names; the innermost directories (level 3) hold the actual data files. Is there a way to select the data directly, without changing this directory structure? Any pointers are appreciated.

/2012/
|
|---------2012-01
            |---------2012-01-01
            |---------2012-01-02
            |...
            |...
            |---------2012-01-31
|
|---------2012-02
            |---------2012-02-01
            |---------2012-02-02
            |...
            |...
            |---------2012-02-28
|
|---------2012-03
|...
|...
|---------2012-12

No luck with these queries so far:

CREATE EXTERNAL TABLE sampledata
(datestr string, id string, locations string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/path/to/data/2012/*/*'; 

CREATE EXTERNAL TABLE sampledata
(datestr string, id string, locations string)
partitioned by (ystr string, ymstr string, ymdstr string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

ALTER TABLE sampledata
ADD 
PARTITION (ystr ='2012') 
LOCATION '/path/to/data/2012/';
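For completeness, the partitioned-table route would in principle require one ADD PARTITION statement per leaf directory. The statement below is an illustrative sketch following the layout above (the exact paths are placeholders, not from the original post):

```sql
-- Hypothetical example: registering a single leaf directory as a partition.
ALTER TABLE sampledata ADD
  PARTITION (ystr='2012', ymstr='2012-01', ymdstr='2012-01-01')
  LOCATION '/path/to/data/2012/2012-01/2012-01-01/';
-- ...and so on for every day-level directory, which quickly becomes unwieldy
-- for a year's worth of daily directories.
```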

Solution: this one small setting solved my problem. Adding it to the question in case it benefits others:

SET mapred.input.dir.recursive=true;
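A fuller sketch of that fix, assuming the table is pointed at the top-level directory (the second property is my addition; on some Hive versions both settings are needed for subdirectory reads):

```sql
-- Let the input format recurse into subdirectories
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;  -- may also be required, depending on Hive version

CREATE EXTERNAL TABLE sampledata
(datestr string, id string, locations string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/path/to/data/2012/';  -- top directory only; no wildcards needed
```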

hive partition

8 votes · 1 answer · 8999 views

Using the Apache Pig rank function

I'm using the Pig 0.11.0 rank function to generate a rank for every id in my data. I need the data ranked in a specific way: the rank should reset and start from 1 for each new id.

Is this possible with the rank function directly? Any hints would be appreciated.

Data:

id,rating
X001, 9
X001, 9
X001, 8
X002, 9
X002, 7
X002, 6
X002, 5
X003, 8
X004, 8
X004, 7
X004, 7
X004, 4

Using the rank function like: op = rank data by id, rating;

I get this output:

rank,id,rating
1, X001, 9
1, X001, 9
2, X001, 8
3, X002, 9
4, X002, 7
5, X002, 6
6, X002, 5
7, X003, 8
8, X004, 8
9, X004, 7
9, X004, 7
10, X004, 4

Desired output:

rank,id,rating
1, X001, 9
1, X001, 9
2, X001, …
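For what it's worth, one way to get a per-id dense rank is the piggybank Over/Stitch pattern. This is a sketch only: Over and Stitch ship in piggybank releases newer than Pig 0.11, and the load path and schema here are assumptions, not from the original question:

```pig
DEFINE Over   org.apache.pig.piggybank.evaluation.Over('int');
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch;

data = LOAD 'ratings.csv' USING PigStorage(',') AS (id:chararray, rating:int);
grp  = GROUP data BY id;
op   = FOREACH grp {
         sorted = ORDER data BY rating DESC;
         -- Over ranks within the group; Stitch glues the rank back onto each row
         GENERATE FLATTEN(Stitch(sorted, Over(sorted.rating, 'dense_rank')));
       };
```

Because the ranking runs inside each GROUP BY id bag, the rank restarts at 1 for every new id, and ties share the same rank, matching the desired output above.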

apache-pig

6 votes · 1 answer · 9905 views

Spark append mode for partitioned text files fails with SaveMode.Append - IOException: File already exists

A simple write of a partitioned text file fails:

dataDF.write.partitionBy("year", "month", "date").mode(SaveMode.Append).text("s3://data/test2/events/")

Exception -

16/07/06 02:15:05 ERROR datasources.DynamicPartitionWriterContainer: Aborting task.
java.io.IOException: File already exists:s3://path/1839dd1ed38a.gz
 at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:614)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:913)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:894)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:791)
 at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:177)
 at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
 at org.apache.spark.sql.execution.datasources.text.TextOutputWriter.<init>(DefaultSource.scala:156)
 at org.apache.spark.sql.execution.datasources.text.TextRelation$$anon$1.newInstance(DefaultSource.scala:125)
 at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:129)
 at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.newOutputWriter$1(WriterContainer.scala:424)
 at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:356)
 at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
 at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
 at org.apache.spark.scheduler.Task.run(Task.scala:89)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
16/07/06 02:15:05 INFO output.DirectFileOutputCommitter: Nothing to clean up on abort since there are no temporary files written
16/07/06 02:15:05 ERROR datasources.DynamicPartitionWriterContainer: Task attempt attempt_201607060215_0004_m_001709_3 aborted.
16/07/06 …
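The stack trace shows DirectFileOutputCommitter, so the collision is most likely a retried or speculative task attempt re-creating a file that a first attempt already wrote directly to S3 (direct committers skip the temporary-directory rename that would normally absorb retries). A workaround sketch, assuming a Spark 1.6-era setup like the one in the trace:

```scala
// Disable speculative duplicate task attempts, e.g. via spark-submit:
//   --conf spark.speculation=false
// And fall back to the standard file output committer instead of the
// direct committer, so retries write to a temp dir and rename on commit:
sc.hadoopConfiguration.set(
  "mapreduce.fileoutputcommitter.algorithm.version", "2")

dataDF.write
  .partitionBy("year", "month", "date")
  .mode(SaveMode.Append)
  .text("s3://data/test2/events/")
```

Which committer is actually in force depends on the EMR configuration, so treat the property names above as starting points rather than a definitive fix.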

apache-spark spark-dataframe

3 votes · 1 answer · 1397 views