Tag: elephantbird

JSON objects span multiple lines: how to split the input in Hadoop

I need to ingest large JSON files whose records may span multiple lines (not files); this depends entirely on how the data provider writes them.

Elephant-Bird assumes LZO compression, which I know the data provider will not be using.

The DZone article http://java.dzone.com/articles/hadoop-practice assumes that each JSON record sits on a single line.

Any ideas, other than squishing the JSON (the files will be huge), on how to split the files properly so that no JSON record is broken?

Edit: lines, not files
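To make the requirement concrete, here is a minimal sketch of finding record boundaries by tracking brace depth instead of relying on newlines. It is not taken from Elephant-Bird or from the DZone article; the class name `JsonRecordScanner` and its methods are hypothetical. It assumes every record is a JSON object (`{...}`) and that scanning starts outside any string literal.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: pulls complete top-level JSON objects out of a character stream. */
public class JsonRecordScanner {

    /** Returns the next complete {...} object, or null if the input ends first. */
    public static String nextRecord(Reader in) throws IOException {
        StringBuilder record = new StringBuilder();
        int depth = 0;            // current brace nesting level
        boolean inString = false; // true while inside a JSON string literal
        boolean escaped = false;  // true right after a backslash inside a string
        int c;
        while ((c = in.read()) != -1) {
            char ch = (char) c;
            if (depth == 0 && ch != '{') {
                continue; // skip newlines, commas, brackets between records
            }
            record.append(ch);
            if (inString) {
                // Braces inside strings must not change the depth.
                if (escaped) {
                    escaped = false;
                } else if (ch == '\\') {
                    escaped = true;
                } else if (ch == '"') {
                    inString = false;
                }
            } else if (ch == '"') {
                inString = true;
            } else if (ch == '{') {
                depth++;
            } else if (ch == '}') {
                depth--;
                if (depth == 0) {
                    return record.toString(); // record is complete
                }
            }
        }
        return null; // a trailing partial record is dropped
    }

    /** Convenience wrapper for in-memory text. */
    public static List<String> allRecords(String text) {
        List<String> records = new ArrayList<>();
        try (Reader in = new StringReader(text)) {
            String rec;
            while ((rec = nextRecord(in)) != null) {
                records.add(rec);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

This only covers a single sequential stream. Inside a Hadoop job, a custom RecordReader would also have to decide where the first record of a split begins and read past the end of the split to finish the last one. Starting in the middle of a file is the hard part, because a `{` found there might sit inside a string.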

java json hadoop elephantbird

7
votes
1
answer
3725
views

Writing data that can be read by Elephant Bird's ProtobufPigLoader

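For one of my projects I want to analyze about 2 TB of Protobuf objects. I want to consume these objects in a Pig script through the Elephant Bird library. However, it is not clear to me how to write a file to HDFS so that the ProtobufPigLoader class can read it.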

This is what I have:

Pig script:

  -- lib/ includes the Elephant Bird library
  register ../fs-c/lib/*.jar
  register ../fs-c/*.jar
  raw_data = load 'hdfs://XXX/fsc-data2/XXX*' using com.twitter.elephantbird.pig.load.ProtobufPigLoader('de.pc2.dedup.fschunk.pig.PigProtocol.File');

Import tool (excerpt):

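import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import com.twitter.elephantbird.mapreduce.io.ProtobufBlockWriter

// Creates the target file and wraps its output stream in Elephant Bird's block writer.
def getWriter(filenamePath: Path): ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File] = {
  val conf = new Configuration()
  val fs = FileSystem.get(filenamePath.toUri(), conf)
  val os = fs.create(filenamePath, true) // true = overwrite an existing file
  new ProtobufBlockWriter[de.pc2.dedup.fschunk.pig.PigProtocol.File](os, classOf[de.pc2.dedup.fschunk.pig.PigProtocol.File])
}

val writer = getWriter(new Path(filename))
val builder = de.pc2.dedup.fschunk.pig.PigProtocol.File.newBuilder()
writer.write(builder.build)
writer.finish() // flush the last, partially filled block
writer.close()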

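The import tool runs fine. I had some problems with ProtobufPigLoader because I cannot use the hadoop-lzo compression library, and without a fix (see here) ProtobufPigLoader does not work. The problem I am stuck on is that DUMP raw_data; returns Unable to open …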

hadoop apache-pig elephantbird

5
votes
0
answers
1002
views

Elephant Bird mvn package error

I have Hadoop 2.2 installed on my system and want to use the Elephant Bird jar. Running "mvn package" fails with the following error.

Error:


[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.3.2:compile (default-compile) on project elephant-bird-core: Compilation failure: Compilation failure:
[ERROR] /usr/lib/hadoop/elephant_bird/core/target/generated-sources/thrift/com/twitter/elephantbird/thrift/test/TestListInList.java:    [9,39] error: package org.apache.commons.lang3.builder does not exist
[ERROR] /usr/lib/hadoop/elephant_bird/core/target/generated-sources/thrift/com/twitter/elephantbird/thrift/test/TestListInList.java:    [10,31] error: package org.apache.thrift.scheme does not exist
[ERROR] /usr/lib/hadoop/elephant_bird/core/target/generated-sources/thrift/com/twitter/elephantbird/thrift/test/TestListInList.java:    [11,31] error: package org.apache.thrift.scheme does not exist
[ERROR] /usr/lib/hadoop/elephant_bird/core/target/generated-sources/thrift/com/twitter/elephantbird/thrift/test/TestListInList.java:    [12,31] error: package org.apache.thrift.scheme does not exist
[ERROR] /usr/lib/hadoop/elephant_bird/core/target/generated-sources/thrift/com/twitter/elephantbird/thrift/test/TestListInList.java:    [14,31] error: package org.apache.thrift.scheme does not exist
[ERROR] /usr/lib/hadoop/elephant_bird/core/target/generated-sources/thrift/com/twitter/elephantbird/thrift/test/TestListInList.java:    [15,33] error: cannot find symbol
[ERROR] package org.apache.thrift.protocol
[ERROR] /usr/lib/hadoop/elephant_bird/core/target/generated-sources/thrift/com/twitter/elephantbird/thrift/test/TestListInList.java:    [20,0] error: package org.apache.thrift.server.AbstractNonblockingServer does …

java hadoop apache-pig maven elephantbird

5
votes
1
answer
1508
views

Tag statistics

elephantbird ×3

hadoop ×3

apache-pig ×2

java ×2

json ×1

maven ×1