I organize my data in directories using the layout shown below, and I want to load it into a Hive table. I want to add all the data under the 2012 directory. All the names below are directory names; the innermost (third-level) directories hold the actual data files. Is there a way to select this data directly without changing the directory structure? Any pointers are appreciated.
/2012/
|
|---------2012-01
|---------2012-01-01
|---------2012-01-02
|...
|...
|---------2012-01-31
|
|---------2012-02
|---------2012-02-01
|---------2012-02-02
|...
|...
|---------2012-02-28
|
|---------2012-03
|...
|...
|---------2012-12
Queries tried so far, with no luck:
CREATE EXTERNAL TABLE sampledata
(datestr string, id string, locations string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/path/to/data/2012/*/*';
CREATE EXTERNAL TABLE sampledata
(datestr string, id string, locations string)
partitioned by (ystr string, ymstr string, ymdstr string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
ALTER TABLE sampledata
ADD
PARTITION (ystr ='2012')
LOCATION '/path/to/data/2012/';
Solution: this one small parameter fixed my problem. Adding it to the question in case it benefits others:
SET mapred.input.dir.recursive=true;
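For reference, what this flag changes is which files Hive's input format will pick up under the table's LOCATION: without it only files directly in that directory are read, with it the date subdirectories are descended into as well. A minimal Python sketch of that difference (the miniature directory layout here is made up for illustration):

```python
import os
import tempfile

# Build a miniature version of the /2012/<month>/<day>/ layout.
root = tempfile.mkdtemp()
for month in ("2012-01", "2012-02"):
    for day in ("01", "02"):
        d = os.path.join(root, month, f"{month}-{day}")
        os.makedirs(d)
        with open(os.path.join(d, "data.txt"), "w") as f:
            f.write("2012|id1|loc1\n")

# Non-recursive view: only files sitting directly under LOCATION (none here,
# since the data lives two levels down) -- this is why the table read nothing.
top_level_files = [e for e in os.listdir(root)
                   if os.path.isfile(os.path.join(root, e))]

# Recursive view: what mapred.input.dir.recursive=true lets Hive see.
all_files = [os.path.join(dirpath, name)
             for dirpath, _, names in os.walk(root)
             for name in names]

print(len(top_level_files))  # 0
print(len(all_files))        # 4
```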
I'm using the Pig 0.11.0 rank function to generate a rank for each id in my data, but I need the ranking done in a specific way: the rank should reset and start again from 1 for every new id.
Is this possible directly with the rank function? Any tips would be appreciated.
Data:
id,rating
X001, 9
X001, 9
X001, 8
X002, 9
X002, 7
X002, 6
X002, 5
X003, 8
X004, 8
X004, 7
X004, 7
X004, 4
Using the rank function like: op = RANK data BY id;
I get this output:
rank,id,rating
1, X001, 9
1, X001, 9
2, X001, 8
3, X002, 9
4, X002, 7
5, X002, 6
6, X002, 5
7, X003, 8
8, X004, 8
9, X004, 7
9, X004, 7
10, X004, 4
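What I actually want is a dense rank computed within each id group rather than over the whole relation. Just to pin down the expected numbers (this is plain Python as an illustration of the desired semantics, not a Pig solution):

```python
from itertools import groupby

# Sample data from the question, already sorted by id and descending rating.
rows = [("X001", 9), ("X001", 9), ("X001", 8),
        ("X002", 9), ("X002", 7), ("X002", 6), ("X002", 5),
        ("X003", 8),
        ("X004", 8), ("X004", 7), ("X004", 7), ("X004", 4)]

ranked = []
for _id, group in groupby(rows, key=lambda r: r[0]):
    rank, prev = 0, None          # rank restarts at 1 for every new id
    for _, rating in group:
        if rating != prev:        # equal ratings share a rank (dense rank)
            rank += 1
            prev = rating
        ranked.append((rank, _id, rating))

for row in ranked:
    print(row)
```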
Desired output:
rank,id,rating
1, X001, 9
1, X001, 9
2, X001, …

A simple attempt at writing a partitioned text file is failing.
dataDF.write.partitionBy("year", "month", "date").mode(SaveMode.Append).text("s3://data/test2/events/")
Exception:
16/07/06 02:15:05 ERROR datasources.DynamicPartitionWriterContainer: Aborting task.
java.io.IOException: File already exists:s3://path/1839dd1ed38a.gz
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:614)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:913)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:894)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:791)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:177)
at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
at org.apache.spark.sql.execution.datasources.text.TextOutputWriter.<init>(DefaultSource.scala:156)
at org.apache.spark.sql.execution.datasources.text.TextRelation$$anon$1.newInstance(DefaultSource.scala:125)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:129)
at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.newOutputWriter$1(WriterContainer.scala:424)
at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:356)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/07/06 02:15:05 INFO output.DirectFileOutputCommitter: Nothing to clean up on abort since there are no temporary files written
16/07/06 02:15:05 ERROR datasources.DynamicPartitionWriterContainer: Task attempt attempt_201607060215_0004_m_001709_3 aborted.
16/07/06 …
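For context on what the write is attempting: partitionBy("year", "month", "date") routes each row into a key=value directory tree under the output path, so two task attempts writing the same partition file (e.g. a retry after a failed attempt, with a direct output committer that leaves no temporary files behind) would collide exactly as the "File already exists" error suggests. A sketch of that path scheme (the row values here are illustrative, not from my job):

```python
def partition_path(base, row, cols=("year", "month", "date")):
    """Mimic Spark's partitionBy key=value directory layout (illustrative only)."""
    return "/".join([base] + [f"{c}={row[c]}" for c in cols])

row = {"year": 2016, "month": 7, "date": 6, "payload": "event"}
print(partition_path("s3://data/test2/events", row))
# s3://data/test2/events/year=2016/month=7/date=6
```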