Tag: hadoop

Unix sort produces the wrong output

I'm trying to test the mapper and reducer functions of a Hadoop Streaming job like this:

    cat data.txt | python mapper.py | sort | python reducer.py

But the sorted mapper output is incorrect.

he the  1
i       1
i dog   1
i like  1
i'm     1
i'm rob 1
i'm the 1
i the   1 ### this should be after "i like 1" ###
lazy    1

I had other people test this on their machines, and with the exact same mapper function and command line they get the correct output. So something seems to be wrong with my Unix sort.

In case it helps:

echo $TERM
> vt100 

Any suggestions on things to try or settings to change would be much appreciated. Thanks
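For context, the usual cause of this exact symptom: Hadoop Streaming's shuffle compares keys byte by byte, while `sort` under a UTF-8 locale collates with punctuation and whitespace folded, which yields the ordering shown above. Forcing the C locale usually makes the local test match the cluster. A sketch using three of the keys from the question:

```shell
# three of the mapper's keys, sorted both ways
printf "i the\ni'm\ni like\n" | sort            # locale-dependent order
printf "i the\ni'm\ni like\n" | LC_ALL=C sort   # byte order: i like / i the / i'm

# so a like-for-like local test of the streaming job would be:
#   cat data.txt | python mapper.py | LC_ALL=C sort | python reducer.py
```

In byte order the space (0x20) sorts before the apostrophe (0x27), which is why `i the` lands before `i'm` on the cluster but not under a collating locale.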

python unix sorting hadoop

0 votes · 1 answer · 120 views

Can I read a Hadoop file line by line?

I want to read records line by line from the Hadoop file system on a Unix box:

Example -

while read line
do
    echo "input record " $line
    # some other logic I have here....
done < /user/want/to/read/from/hadoop/part00

The snippet above gives the error -

**: cannot open [No such file or directory]**

How can I read from Hadoop using Unix tools?
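For reference, HDFS paths are not visible to plain shell redirection, which is why the `< /user/...` form fails. The `hadoop fs -cat` command streams an HDFS file to stdout, which the loop can then read. A minimal sketch, assuming the `hadoop` CLI is on the PATH and keeping the path from the question:

```shell
# stream the HDFS file through the loop instead of
# redirecting from a (nonexistent) local path
hadoop fs -cat /user/want/to/read/from/hadoop/part00 | while read -r line; do
    echo "input record $line"
    # ...some other logic here...
done
```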

unix linux shell hadoop

0 votes · 1 answer · 2961 views

String to map conversion in Hive

I have a table with four columns.

C1    C2    C3    C4
--------------------
x1    y1    z1    d1
x2    y2    z2    d2

Now I want to convert it to a map data type with key and value pairs and load it into a separate table.

create table test
(
   level map<string,string>
)
row format delimited
COLLECTION ITEMS TERMINATED BY '&'
map keys terminated by '=';

Now I load the data using the SQL below.

insert overwrite table test
select str_to_map(concat('level1=',c1,'&','level2=',c2,'&','level3=',c3,'&','level4=',c4)) from input;

A select query on the table:

select * from test;
{"level1":"x1","level2":"y1","level3":"z1","level4":"d1=\\"}
{"level1":"x2","level2":"y2","level3":"z2","level4":"d2=\\"}

I don't understand why I get the extra "=\\" in the last value.

I double-checked the data, but the problem persists.

Can you help?

hadoop hive map

0 votes · 1 answer · 10k views

How to move data between Hadoop clusters

I have 2 Hadoop clusters running on VMs. How can I move HDFS data between these clusters? I could scp the data on HDFS, but what about the metadata that sits on the data nodes? Thanks
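For reference, `hadoop distcp` is the standard tool for this: it runs a MapReduce job that copies files between clusters in parallel, and the destination namenode rebuilds its own metadata as the files land, so nothing needs to be copied off the data nodes by hand. A sketch, with placeholder namenode hostnames and ports:

```shell
# copy /data from cluster1's HDFS into cluster2's HDFS;
# run on a node that can reach both namenodes
hadoop distcp hdfs://namenode1:8020/data hdfs://namenode2:8020/data
```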

hadoop

0 votes · 1 answer · 679 views

Using the RANK OVER function in Hive

I'm trying to run this query in Hive to return only the top 10 URLs that occur most often in the adimpression table.

select
        ranked_mytable.url,
        ranked_mytable.cnt

from
        ( select iq.url, iq.cnt, rank() over (partition by iq.url order by iq.cnt desc) rnk
        from
                ( select url, count(*) cnt
                from store.adimpression ai
                        inner join zuppa.adgroupcreativesubscription agcs
                                on agcs.id = ai.adgroupcreativesubscriptionid
                        inner join zuppa.adgroup ag
                                on ag.id = agcs.adgroupid
                where ai.datehour >= '2014-05-15 00:00:00'
                        and ag.siteid = 1240
                group by url
                ) iq
        ) ranked_mytable

where
      ranked_mytable.rnk <= 10

order by
        ranked_mytable.url,
        ranked_mytable.rnk desc

;

Unfortunately I get an error message:

FAILED: SemanticException [Error 10002]: Line 26:23 Invalid column reference …

hadoop hive partitioning rank

0 votes · 2 answers · 50k views

Hadoop benchmarking / performance testing

I want to run benchmarks and performance tests on my Hadoop cluster. I know hadoop-mapreduce*test*.jar and hadoop-mapreduce-examples*.jar contain many programs used for benchmarking.

Is there documentation for these tests that details each test and its performance measurements? Also, after running a test, are there reference values available to compare the results against?

Thanks.
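For what it's worth, the most commonly cited benchmarks from those jars are TeraGen/TeraSort (examples jar) and TestDFSIO (tests jar). A sketch of typical invocations; the exact jar file names and output paths vary by distribution and version, so treat them as placeholders:

```shell
# generate 100 million 100-byte rows (~10 GB), then sort them
hadoop jar hadoop-mapreduce-examples.jar teragen 100000000 /bench/tera-in
hadoop jar hadoop-mapreduce-examples.jar terasort /bench/tera-in /bench/tera-out

# HDFS read/write throughput test from the tests jar
hadoop jar hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
```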

benchmarking hadoop performance-testing

0 votes · 1 answer · 8088 views

Loop script only executes once - Bash

I have the following bash script:

#!/bin/bash
cat /etc/hadoop/conf.my_cluster/slaves | \
while read CMD; do
    ssh -o StrictHostKeyChecking=no ubuntu@$CMD "sudo service hadoop-0.20-mapreduce-tasktracker restart"
    ssh -o StrictHostKeyChecking=no ubuntu@$CMD "sudo service hadoop-hdfs-datanode restart"
    echo $CMD
done

/etc/hadoop/conf.my_cluster/slaves holds the IPs of 5 slave machines. The datanodes cannot communicate with the jobtracker, so the workaround is to restart them. The output is:

ubuntu@domU-12-31-39-07-D6-DE:~$ ./test.sh 
Warning: Permanently added '54.211.5.233' (ECDSA) to the list of known hosts.
 * Stopping Hadoop tasktracker: 
stopping tasktracker
 * Starting Hadoop tasktracker: 
starting tasktracker, logging to /var/log/hadoop-0.20-mapreduce/hadoop-hadoop-tasktracker-domU-12-31-39-06-8A-27.out
Warning: Permanently added '54.211.5.233' (ECDSA) to the list of known hosts.
 * Stopping Hadoop datanode: …
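For context, the classic cause of a `while read` loop over `ssh` stopping after the first host is that `ssh` reads from the same stdin as the loop and swallows the remaining lines of the slaves file; `ssh -n` (or redirecting the inner command's stdin from `/dev/null`) prevents that. A self-contained demonstration, with `cat` standing in for `ssh`:

```shell
# the inner command reads stdin -> it eats lines "b" and "c",
# so the loop runs only once
printf "a\nb\nc\n" | while read -r host; do
    cat > /dev/null              # stands in for ssh, which also slurps stdin
    echo "iteration for $host"
done

# redirect the inner command's stdin (ssh -n does the same)
# -> all three iterations run
printf "a\nb\nc\n" | while read -r host; do
    cat < /dev/null > /dev/null
    echo "iteration for $host"
done
```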

bash hadoop

0 votes · 1 answer · 150 views

Error starting the namenode in hadoop 2.4.1

When I try to start dfs using the following command:

start-dfs.sh

I get an error saying:

14/07/03 11:03:21 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable Starting namenodes on [OpenJDK 64-Bit Server VM
warning: You have loaded library
/usr/local/hadoop/lib/native/libhadoop.so.1.0.0 which might have
disabled stack guard. The VM will try to fix the stack guard now. It's
highly recommended that you fix the library with 'execstack -c
<libfile>', or link it with '-z noexecstack'. localhost] sed: -e
expression #1, char 6: unknown option to `s' Server: ssh: …
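Note that the stack-guard warning itself names the two fixes: clear the executable-stack flag on the native library, or relink it with `-z noexecstack`. The first is a one-liner (library path taken from the warning above; requires the `execstack` tool to be installed):

```shell
execstack -c /usr/local/hadoop/lib/native/libhadoop.so.1.0.0
```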

hadoop jvm-arguments hadoop2

0 votes · 1 answer · 7792 views

Writing to Hive from MapReduce (initializing HCatOutputFormat)

I wrote an MR job that is supposed to load data from HBase and dump it into Hive. Connecting to HBase works fine, but when I try to save the data into the HIVE table I get the following error message:

 Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.JavaMain], main() threw exception, org.apache.hive.hcatalog.common.HCatException : 2004 : HCatOutputFormat not initialized, setOutput has to be called
  org.apache.oozie.action.hadoop.JavaMainException: org.apache.hive.hcatalog.common.HCatException : 2004 : HCatOutputFormat not initialized, setOutput has to be called
  at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:58)
  at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:38)
  at org.apache.oozie.action.hadoop.JavaMain.main(JavaMain.java:36)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:226)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
  Caused by: org.apache.hive.hcatalog.common.HCatException : 2004 : HCatOutputFormat not initialized, setOutput has …

hadoop hive mapreduce

0 votes · 1 answer · 1488 views

EMR - problem running a Pig script from S3

I'm trying to run a Pig script on EMR like this:

pig -f s3://bucket-name/loadData.pig

But it fails with the error:

ERROR 2999: Unexpected internal error. null

java.lang.NullPointerException
  at org.apache.pig.impl.io.FileLocalizer.fetchFilesInternal(FileLocalizer.java:778)
  at org.apache.pig.impl.io.FileLocalizer.fetchFiles(FileLocalizer.java:746)
  at org.apache.pig.PigServer.registerJar(PigServer.java:458)
  at org.apache.pig.tools.grunt.GruntParser.processRegister(GruntParser.java:433)
  at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:445)
  at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
  at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
  at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
  at org.apache.pig.Main.run(Main.java:479)
  at org.apache.pig.Main.main(Main.java:159)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:187)

loadData.pig looks like:

A = load '/ajasing/input/input.txt' USING PigStorage('\t', '-noschema');
store A into '/ajasing/output1444/input1444.txt';

I'm running Pig version 0.11.1, hadoop version 1.0.3 and AMI version 2.4.6.

If I run this Pig script locally, i.e. by copying it onto the EMR cluster and running it there, it works fine. But if the script source is s3, it fails with the above error.

Please tell me what's wrong here.

hadoop amazon-s3 apache-pig amazon-emr

0 votes · 1 answer · 1241 views