I'm trying to test the mapper and reducer functions of a Hadoop streaming job with:
cat data.txt | python mapper.py | sort | python reducer.py
But the sorted output of the mapper is wrong:
he the 1
i 1
i dog 1
i like 1
i'm 1
i'm rob 1
i'm the 1
i the 1 ### this should be after "i like 1" ###
lazy 1
I had other people test this on their machines, and with the exact same mapper function and command line they get the correct output. So something seems to be wrong with my Unix sort.
In case it helps:
echo $TERM
> vt100
Any suggestions on what to try or configure differently would be much appreciated. Thanks.
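One difference worth ruling out (an assumption on my part, since the locale isn't shown above): sort collates according to LC_ALL/LC_COLLATE, and many UTF-8 locales ignore apostrophes and spaces when comparing, which is exactly the order shown ("i'm the" before "i the"), while plain byte order puts "i the 1" right after "i like 1". A minimal sketch of the comparison:

# locale-dependent sort (apostrophes/spaces may be ignored during collation)
sort data.txt

# plain byte-order sort, as produced under the C/POSIX locale
LC_ALL=C sort data.txt

# the test pipeline with the locale pinned for the sort step
cat data.txt | python mapper.py | LC_ALL=C sort | python reducer.py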
I want to read records line by line from the Hadoop file system on a Unix box:
Example:
while read line
do
echo "input record " $line
###some other logic i have here....
done < /user/want/to/read/from/hadoop/part00
The above snippet fails with the error:
: cannot open [No such file or directory]
How can I read from Hadoop using Unix tools?
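For context, /user/want/to/read/from/hadoop/part00 is an HDFS path, so the shell's < redirection cannot open it as a local file. A minimal sketch of the kind of workaround I have in mind, streaming the file to stdout with hadoop fs -cat and reading from the pipe (the path is the same placeholder as above):

hadoop fs -cat /user/want/to/read/from/hadoop/part00 | while read line
do
    echo "input record " $line
    # ...same per-record logic as before...
done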
I have a table with four columns:
C1 C2 C3 C4
--------------------
x1 y1 z1 d1
x2 y2 z2 d2
Now I want to convert it into a map data type with key/value pairs and load it into a separate table:
create table test
(
level map<string,string>
)
row format delimited
COLLECTION ITEMS TERMINATED BY '&'
map keys terminated by '=';
Now I load the data using the SQL below:
insert overwrite table test
select str_to_map(concat('level1=',c1,'&','level2=',c2,'&','level3=',c3,'&','level4=',c4)) from input;
A select query on the table:
select * from test;
{"level1":"x1","level2":"y1","level3":"z1","level4":"d1=\\"}
{"level1":"x2","level2":"y2","level3":"z2","level4":"d2=\\"}
I don't understand why I'm getting the extra "=\\" in the last value. I've double-checked the data, but the problem persists. Can you help?
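For reference, and as an assumption about what the working query should look like: str_to_map takes explicit delimiters, str_to_map(text, delimiter1, delimiter2), where delimiter1 splits the text into pairs and delimiter2 splits each pair into key and value. With the string built above, the delimiters would need to be '&' and '=', i.e. something like:

insert overwrite table test
select str_to_map(
         concat('level1=',c1,'&','level2=',c2,'&','level3=',c3,'&','level4=',c4),
         '&', '=')
from input;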
I have 2 Hadoop clusters running on VMs. How can I move HDFS data between these clusters? I can scp the data that sits on the datanodes, but what about the metadata? Thanks.
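A sketch of the usual tool for this, hadoop distcp, which copies between clusters through the normal HDFS write path, so the destination NameNode builds its own metadata as part of the copy (the hostnames and the default NameNode port 8020 are placeholders):

hadoop distcp hdfs://namenode-a:8020/path/to/data hdfs://namenode-b:8020/path/to/data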
I'm trying to run this query in Hive to return only the top 10 URLs that occur most often in the adimpression table:
select
ranked_mytable.url,
ranked_mytable.cnt
from
( select iq.url, iq.cnt, rank() over (partition by iq.url order by iq.cnt desc) rnk
from
( select url, count(*) cnt
from store.adimpression ai
inner join zuppa.adgroupcreativesubscription agcs
on agcs.id = ai.adgroupcreativesubscriptionid
inner join zuppa.adgroup ag
on ag.id = agcs.adgroupid
where ai.datehour >= '2014-05-15 00:00:00'
and ag.siteid = 1240
group by url
) iq
) ranked_mytable
where
ranked_mytable.rnk <= 10
order by
ranked_mytable.url,
ranked_mytable.rnk desc
;
Unfortunately I get an error message:
FAILED: SemanticException [Error 10002]: Line 26:23 Invalid column reference …

I want to run benchmarks and performance tests on my Hadoop cluster. I know that hadoop-mapreduce*test*.jar and hadoop-mapreduce-examples*.jar contain many programs for benchmarking.
Is there documentation for these tests that describes each test in detail and what it measures? And after running a test, are there reference values available to compare the results against?

Thanks.
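Two commonly used entry points, as a sketch (the exact jar names and flag syntax vary between Hadoop versions and distributions, so treat the paths below as assumptions):

# HDFS I/O throughput benchmark: write 10 files of 128 MB each
hadoop jar hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 128MB

# TeraGen/TeraSort: generate 10 million 100-byte rows, then sort them
hadoop jar hadoop-mapreduce-examples-*.jar teragen 10000000 /tera/in
hadoop jar hadoop-mapreduce-examples-*.jar terasort /tera/in /tera/out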
I have the following bash script:
#!/bin/bash
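# Read the slave IPs (one per line) and restart the MapReduce
# tasktracker and the HDFS datanode on each host over ssh.
# (Caveat: ssh without -n inherits the loop's stdin and can swallow
# the remaining lines of the slaves file, ending the loop early.)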
cat /etc/hadoop/conf.my_cluster/slaves | \
while read CMD; do
ssh -o StrictHostKeyChecking=no ubuntu@$CMD "sudo service hadoop-0.20-mapreduce-tasktracker restart"
ssh -o StrictHostKeyChecking=no ubuntu@$CMD "sudo service hadoop-hdfs-datanode restart"
echo $CMD
done
/etc/hadoop/conf.my_cluster/slaves holds the IPs of the 5 slave machines. The datanodes stop communicating with the jobtracker, and the workaround is to restart them. The output is:
ubuntu@domU-12-31-39-07-D6-DE:~$ ./test.sh
Warning: Permanently added '54.211.5.233' (ECDSA) to the list of known hosts.
* Stopping Hadoop tasktracker:
stopping tasktracker
* Starting Hadoop tasktracker:
starting tasktracker, logging to /var/log/hadoop-0.20-mapreduce/hadoop-hadoop-tasktracker-domU-12-31-39-06-8A-27.out
Warning: Permanently added '54.211.5.233' (ECDSA) to the list of known hosts.
* Stopping Hadoop datanode: …

When I try to start DFS with the following command:
start-dfs.sh
I get an error saying:
14/07/03 11:03:21 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable Starting namenodes on [OpenJDK 64-Bit Server VM
warning: You have loaded library
/usr/local/hadoop/lib/native/libhadoop.so.1.0.0 which might have
disabled stack guard. The VM will try to fix the stack guard now. It's
highly recommended that you fix the library with 'execstack -c
<libfile>', or link it with '-z noexecstack'. localhost] sed: -e
expression #1, char 6: unknown option to `s' Server: ssh: …

I wrote an MR script that is supposed to load data from HBase and dump it into Hive. Connecting to HBase works fine, but when I try to save the data into the Hive table, I get the following error message:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.JavaMain], main() threw exception, org.apache.hive.hcatalog.common.HCatException : 2004 : HCatOutputFormat not initialized, setOutput has to be called
org.apache.oozie.action.hadoop.JavaMainException: org.apache.hive.hcatalog.common.HCatException : 2004 : HCatOutputFormat not initialized, setOutput has to be called
at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:58)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:38)
at org.apache.oozie.action.hadoop.JavaMain.main(JavaMain.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:226)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: org.apache.hive.hcatalog.common.HCatException : 2004 : HCatOutputFormat not initialized, setOutput has …

I'm trying to run a Pig script on EMR, like:
pig -f s3://bucket-name/loadData.pig
But it fails with the error:
ERROR 2999: Unexpected internal error. null

java.lang.NullPointerException
        at org.apache.pig.impl.io.FileLocalizer.fetchFilesInternal(FileLocalizer.java:778)
        at org.apache.pig.impl.io.FileLocalizer.fetchFiles(FileLocalizer.java:746)
        at org.apache.pig.PigServer.registerJar(PigServer.java:458)
        at org.apache.pig.tools.grunt.GruntParser.processRegister(GruntParser.java:433)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:445)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
        at org.apache.pig.Main.run(Main.java:479)
        at org.apache.pig.Main.main(Main.java:159)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
loadData.pig looks like:
A = load '/ajasing/input/input.txt' USING PigStorage('\t', '-noschema');
store A into '/ajasing/output1444/input1444.txt';
I'm running Pig version 0.11.1, Hadoop version 1.0.3, and AMI version 2.4.6.

If I execute the script locally, i.e. by copying it onto the EMR cluster and running it from there, it works fine. But when the script is sourced from S3, it fails with the error above.

Please tell me what's wrong here.