Question: When I submit a job to my Hadoop 2.2.0 cluster, it does not show up in the job tracker, but the job completes successfully. I can see the output, and it runs correctly, printing output while it runs.
I have tried multiple options, but the job tracker never sees the job. If I run a streaming job with Hadoop 2.2.0 it shows up in the task tracker, but when I submit it through the hadoop-client API it does not appear in the job tracker. I am looking at the UI on port 8088 to verify the job.
Environment: OS X Mavericks, Java 1.6, Hadoop 2.2.0 single-node cluster, Tomcat 7.0.47
Code:
try {
    configuration.set("fs.defaultFS", "hdfs://127.0.0.1:9000");
    configuration.set("mapred.jobtracker.address", "localhost:9001");
    Job job = createJob(configuration);
    job.waitForCompletion(true);
} catch (Exception e) {
    logger.log(Level.SEVERE, "Unable to execute job", e);
}
return null;
etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
I have a sequence file that is the output of a Hadoop MapReduce job. In this file the data is written as key-value pairs, and the value itself is a map. I want to read the value back as a Map object so that I can process it further.
Configuration config = new Configuration();
Path path = new Path("D:\\OSP\\sample_data\\data\\part-00000");
SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config);
WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
Writable value = (Writable) reader.getValueClass().newInstance();
long position = reader.getPosition();
while (reader.next(key, value))
{
    System.out.println("Key is: " + key + " value is: " + value + "\n");
}
Program output: Key is: [this is the key] value is: {abc=839177, xyz=548498, lmn=2, pqr=1}
Here I get the value as a string, but I want it as a Map object.
I am parsing access logs generated by Apache, Nginx, and Darwin (a video streaming server), and aggregating statistics for each delivered file by date / referrer / user agent.
Huge amounts of logs are generated every hour, and that volume is likely to increase dramatically in the near future, so processing this kind of data in a distributed fashion with Amazon Elastic MapReduce sounds reasonable.
Right now I have a mapper and a reducer ready to process my data, and I have tested the whole process end to end, doing everything manually by following the countless Amazon EMR tutorials available on the Internet.
What should I do next? What is the best way to automate this process?
I think this topic could be very useful for anyone trying to process access logs with Amazon Elastic MapReduce who cannot find good material and/or best practices.
UPD: just to clarify the final question here:
What are the best practices for log processing backed by Amazon Elastic MapReduce?
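One possible way to automate this (only a sketch: it uses the boto3 Python SDK, which the question does not mention, and every bucket name, path, role, and instance type below is a placeholder) is to launch a transient EMR cluster that runs the streaming step and shuts itself down when it finishes. A scheduler such as cron can then invoke such a script every hour.

# Sketch only: launch a transient EMR cluster that runs one Hadoop streaming step.
# Assumes boto3 is installed and AWS credentials are configured; all S3 paths,
# instance types, and IAM role names are placeholders, not values from the question.
import boto3

emr = boto3.client('emr', region_name='us-east-1')

response = emr.run_job_flow(
    Name='access-log-aggregation',
    LogUri='s3://my-bucket/emr-logs/',            # hypothetical bucket
    ReleaseLabel='emr-5.36.0',
    Instances={
        'MasterInstanceType': 'm5.xlarge',
        'SlaveInstanceType': 'm5.xlarge',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': False,     # terminate once the step completes
    },
    Steps=[{
        'Name': 'aggregate-access-logs',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',          # standard way to run hadoop-streaming on EMR
            'Args': [
                'hadoop-streaming',
                '-files', 's3://my-bucket/code/mapper.py,s3://my-bucket/code/reducer.py',
                '-mapper', 'mapper.py',
                '-reducer', 'reducer.py',
                '-input', 's3://my-bucket/raw-logs/',
                '-output', 's3://my-bucket/aggregated/',
            ],
        },
    }],
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print(response['JobFlowId'])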
I have the following situation.
I have a cluster of 3 machines, with the configuration below.
Master
Usage of /: 91.4% of 74.41GB
MemTotal: 16557308 kB
MemFree: 723736 kB
Slave 01
Usage of /: 52.9% of 29.76GB
MemTotal: 16466220 kB
MemFree: 5320860 kB
Slave 02
Usage of /: 19.0% of 19.84GB
MemTotal: 16466220 kB
MemFree: 6173564 kB
hadoop/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/work/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. …

I want to read a list from a file in my Hadoop streaming job. Here is my simple mapper.py:
#!/usr/bin/env python
import sys
import json

def read_file():
    id_list = []
    # read ids from a file
    f = open('../user_ids','r')
    for line in f:
        line = line.strip()
        id_list.append(line)
    return id_list

if __name__ == '__main__':
    id_list = set(read_file())
    # input comes from STDIN (standard input)
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        line = json.loads(line)
        user_id = line['user']['id']
        if str(user_id) in id_list:
            print '%s\t%s' % (user_id, line)
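For context, a common way to make a side file like user_ids available to every map task (this is an assumption about the setup, not something stated in the question) is to ship it with the streaming job via the -file (or -files) option; Hadoop then places it in each task's working directory, so the mapper opens it by its basename instead of a relative path like '../user_ids'. A minimal sketch of that read, under that assumption:

# Sketch: read the id list from the task's working directory, assuming the file
# was shipped with the streaming job (e.g. "-file user_ids" on the command line).
import sys

def read_ids(path='user_ids'):
    ids = set()
    with open(path) as f:          # 'user_ids' sits next to mapper.py in the task dir
        for line in f:
            line = line.strip()
            if line:
                ids.add(line)
    return ids

if __name__ == '__main__':
    wanted = read_ids()
    for line in sys.stdin:
        sys.stdout.write(line)     # placeholder: the real filtering logic goes here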
Here is my reducer.py:
#!/usr/bin/env python
from operator import itemgetter
import …

According to http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/, the formula for determining the number of tasks running concurrently per node is:
min (yarn.nodemanager.resource.memory-mb / mapreduce.[map|reduce].memory.mb,
yarn.nodemanager.resource.cpu-vcores / mapreduce.[map|reduce].cpu.vcores) .
However, with these parameters set as follows (for a cluster of c3.2xlarge instances):
yarn.nodemanager.resource.memory-mb = 14336
mapreduce.map.memory.mb = 2048
yarn.nodemanager.resource.cpu-vcores = 8
mapreduce.map.cpu.vcores = 1,
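Plugging these numbers into the formula above gives min(14336 / 2048, 8 / 1) = min(7, 8) = 7 concurrent map containers per node.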
Yet I find that I can only run 4 tasks concurrently per node, when the formula says it should be 7. What is going on here?
I am running Hadoop 2.4.0 on AMI 3.1.0.
Tags: amazon-web-services, elastic-map-reduce, hadoop-streaming, hadoop-yarn, hadoop2
I am trying to read input from sys.stdin. This is a Hadoop map-reduce program. The input file is in txt format. A preview of the dataset:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
62 257 2 879372434
286 1014 5 879781125
200 222 5 876042340
210 40 3 891035994
224 29 3 888104457
303 785 3 879485318
122 387 5 879270459
194 274 2 879539794
291 1042 4 874834944
The code I have been trying - …
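Since the code itself is cut off here, the following is only a generic sketch of what a streaming mapper for data of this shape might look like; the column meaning (user id, item id, rating, timestamp, as in the MovieLens data set) is an assumption, not something stated above.

#!/usr/bin/env python
# Generic sketch of a streaming mapper for whitespace-separated, 4-column input.
# The column layout (user id, item id, rating, timestamp) is assumed, not given.
import sys

for line in sys.stdin:
    fields = line.strip().split()
    if len(fields) != 4:
        continue                   # skip malformed lines
    user_id, item_id, rating, timestamp = fields
    # emit the item id as the key and the rating as the value, tab-separated
    print '%s\t%s' % (item_id, rating)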
I have Hadoop installed and working correctly: I ran the word count example and it ran fine. Now I am trying to move on and work through a more realistic example. My example is done as example 2 on this website (average salary per department), and I am using the same code from the site together with this data.
mapper.py
#!usr/bin/Python
# mapper.py
import csv
import sys

reader = csv.reader(sys.stdin, delimiter=',')
writer = csv.writer(sys.stdout, delimiter='\t')

for row in reader:
    agency = row[3]
    annualSalary = row[5][1:].strip()
    print '{0}\t{1}'.format(agency, annualSalary)
reducer.py
#!usr/bin/Python
# reducer.py
import csv
import sys

agency_salary_sum = 0
current_agency = None
n_occurences = 0

for row in sys.stdin:
    data_mapped = row.strip().split("\t")
    if len(data_mapped) != 2:
        # Something has gone wrong. Skip this line.
        continue
    agency, salary = data_mapped
    try: salary = float(salary)
    except: continue …

In my work I need to parse many sets of historical logs. Individual customers (there are thousands of them) may have hundreds of log subdirectories broken down by date. For example:
Each individual log set may itself be five or six levels deep and contain thousands of files.
I therefore want individual map tasks to handle the job of walking a subdirectory: simply enumerating the individual files is part of my distributed-computing problem!
Unfortunately, when I try to pass a directory that contains only log subdirectories to Hadoop, it complains that I cannot pass those subdirectories to my mapper. (Again, I have written my mapper to accept subdirectories as input):
$ hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" -input file:///mnt/logs/Customer_Name/ -file mapper.sh -mapper "mapper.sh" -file reducer.sh -reducer "reducer.sh" -output .
[ . . . ]
12/04/10 12:48:35 ERROR security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:SIMPLE) cause:java.io.IOException: Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
12/04/10 12:48:35 ERROR streaming.StreamJob: Error Launching job : Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
Streaming Command Failed!
[cloudera@localhost ~]$
$ hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" -input file:///mnt/logs/Customer_Name/ -file mapper.sh -mapper "mapper.sh" -file reducer.sh -reducer "reducer.sh" -output . …

When I run "hadoop job -status xxx", the output includes the following lines:
Rack-local map tasks=124
Data-local map tasks=6
What is the difference between Rack-local map tasks and Data-local map tasks?