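I need to ingest large JSON files whose records may span multiple lines (not files); this depends entirely on how the data provider writes them.
Elephant-Bird assumes LZO compression, which I know the data provider will not use.
The DZone article http://java.dzone.com/articles/hadoop-practice assumes that each JSON record sits on a single line.
Any ideas, other than squishing the JSON down (the files will be huge), on how to split the files correctly so that the JSON does not break?
Edit: lines, not files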
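To make the boundary problem concrete, here is a minimal sketch in plain Java (no Hadoop classes) of one way to find record boundaries when objects span several lines: track brace depth and skip braces that appear inside string literals. The class name `JsonRecordSplitter` and the method `splitRecords` are made up for illustration and are not part of Elephant-Bird or Hadoop. It also assumes the scan starts at a known record boundary; a reader that starts in the middle of an HDFS split cannot know whether it is inside a string, so it would need an extra resynchronization step that is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an Elephant-Bird or Hadoop class.
class JsonRecordSplitter {

    /** Returns every top-level JSON object found in text, in order of appearance. */
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        int depth = 0;            // current nesting level of '{' ... '}'
        int start = -1;           // index where the current top-level object began
        boolean inString = false; // true while inside a JSON string literal
        boolean escaped = false;  // true right after a backslash inside a string

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inString) {
                // Braces inside strings must not change the depth.
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
                continue;
            }
            if (c == '"') {
                inString = true;
            } else if (c == '{') {
                if (depth == 0) {
                    start = i;
                }
                depth++;
            } else if (c == '}' && depth > 0) {
                depth--;
                if (depth == 0) {
                    // A complete top-level object ends here, newlines included.
                    records.add(text.substring(start, i + 1));
                    start = -1;
                }
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String input = "{\"a\": 1,\n \"b\": \"x}y\"}\n{\"c\": {\"d\": 2}}";
        for (String record : splitRecords(input)) {
            System.out.println(record.replace("\n", " "));
        }
    }
}
```

The same depth-tracking idea could sit inside a custom `RecordReader`, but how to pick a safe starting point within each split is the hard part and remains open in this sketch.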
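For one of my projects, I want to analyze about 2 TB of Protobuf objects. I want to use these objects in a Pig script via the "Elephant Bird" library. However, it is not clear to me how to write the files to HDFS so that the ProtobufPigLoader class can consume them.
This is what I have:
Pig script: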
register ../fs-c/lib/*.jar -- this includes the elephant bird library
register ../fs-c/*.jar
raw_data = load 'hdfs://XXX/fsc-data2/XXX*' using com.twitter.elephantbird.pig.load.ProtobufPigLoader('de.pc2.dedup.fschunk.pig.PigProtocol.File');
Import tool (excerpt):
def getWriter(filenamePath: Path) : ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File] = {
  val conf = new Configuration()
  val fs = FileSystem.get(filenamePath.toUri(), conf)
  val os = fs.create(filenamePath, true)
  val writer = new ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File](os, classOf[de.pc2.dedup.fschunk.pig.PigProtocol.File])
  return writer
}

val writer = getWriter(new Path(filename))
val builder = de.pc2.dedup.fschunk.pig.PigProtocol.File.newBuilder()
writer.write(builder.build)
writer.finish()
writer.close()
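The import tool runs fine. My problems are with ProtobufPigLoader: I cannot use the hadoop-lzo compression library, and without a fix (see here) ProtobufPigLoader does not work. The problem I run into is that DUMP raw_data; returns Unable to open …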
I have Hadoop 2.2 installed on my system and I want to use the Elephant-Bird jar. Running "mvn package" gives the following error.
Error:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.3.2:compile (default-compile) on project elephant-bird-core: Compilation failure: Compilation failure:
[ERROR] /usr/lib/hadoop/elephant_bird/core/target/generated-sources/thrift/com/twitter/elephantbird/thrift/test/TestListInList.java: [9,39] error: package org.apache.commons.lang3.builder does not exist
[ERROR] /usr/lib/hadoop/elephant_bird/core/target/generated-sources/thrift/com/twitter/elephantbird/thrift/test/TestListInList.java: [10,31] error: package org.apache.thrift.scheme does not exist
[ERROR] /usr/lib/hadoop/elephant_bird/core/target/generated-sources/thrift/com/twitter/elephantbird/thrift/test/TestListInList.java: [11,31] error: package org.apache.thrift.scheme does not exist
[ERROR] /usr/lib/hadoop/elephant_bird/core/target/generated-sources/thrift/com/twitter/elephantbird/thrift/test/TestListInList.java: [12,31] error: package org.apache.thrift.scheme does not exist
[ERROR] /usr/lib/hadoop/elephant_bird/core/target/generated-sources/thrift/com/twitter/elephantbird/thrift/test/TestListInList.java: [14,31] error: package org.apache.thrift.scheme does not exist
[ERROR] /usr/lib/hadoop/elephant_bird/core/target/generated-sources/thrift/com/twitter/elephantbird/thrift/test/TestListInList.java: [15,33] error: cannot find symbol
[ERROR] package org.apache.thrift.protocol
[ERROR] /usr/lib/hadoop/elephant_bird/core/target/generated-sources/thrift/com/twitter/elephantbird/thrift/test/TestListInList.java: [20,0] error: package org.apache.thrift.server.AbstractNonblockingServer does …