vkr*_*ris 8 java apache hadoop
How are these sequence files generated? I came across this link about sequence files:
http://wiki.apache.org/hadoop/SequenceFile
Are they written using the default Java serializer? And how do I read a sequence file?
Lev*_*ich 16
Sequence files are generated by MapReduce tasks and can be used as a common format for transferring data between MapReduce jobs.
You can read them as follows:
Configuration config = new Configuration();
Path path = new Path(PATH_TO_YOUR_FILE);
SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config);
// Instantiate key/value holders of the types recorded in the file header.
WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
Writable value = (Writable) reader.getValueClass().newInstance();
// next() refills the same key and value objects for every record.
while (reader.next(key, value)) {
    // process key and value here
}
reader.close();
You can also generate sequence files yourself using SequenceFile.Writer.
The classes used in the example above are imported as follows:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
They are included in the hadoop-core Maven dependency:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>1.2.1</version>
</dependency>
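On the serialization part of the question: the reading code casts keys and values to Writable, and as far as I know that is the default mechanism for sequence files, not Java's built-in java.io.Serializable. A Writable writes its own fields to a DataOutput and refills itself from a DataInput, which is why the read loop can reuse a single key and a single value object. The sketch below imitates that contract with plain java.io only. SimpleWritable, UrlScore and WritableSketch are hypothetical names made up for illustration, not Hadoop classes, and the byte layout is not the real SequenceFile format (no header, sync markers or compression).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WritableSketch {

    /** Hypothetical stand-in mirroring the two methods of Hadoop's Writable. */
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Hypothetical record type: a URL plus a score. */
    static class UrlScore implements SimpleWritable {
        String url = "";
        float score;

        UrlScore() { }

        UrlScore(String url, float score) {
            this.url = url;
            this.score = score;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(url);      // 2-byte length prefix followed by the string bytes
            out.writeFloat(score);  // 4 bytes
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            url = in.readUTF();
            score = in.readFloat();
        }

        @Override
        public String toString() {
            return url + " " + score;
        }
    }

    /** Serializes the records back to back into one byte array. */
    static byte[] writeAll(List<UrlScore> records) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (UrlScore record : records) {
                record.write(out);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads every record, refilling one reusable instance each time. */
    static List<String> readAll(byte[] data) {
        List<String> seen = new ArrayList<>();
        UrlScore reusable = new UrlScore();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                reusable.readFields(in);         // overwrites the previous record's fields
                seen.add(reusable.toString());   // copy out before the next refill
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        byte[] data = writeAll(Arrays.asList(
                new UrlScore("https://example.org/a", 1.0f),
                new UrlScore("https://example.org/b", 0.5f)));
        for (String line : readAll(data)) {
            System.out.println(line);
        }
    }
}
```

Because the same object is refilled on every iteration, anything you want to keep beyond the current record has to be copied out first, as readAll does with toString(). The same caution applies to the key and value objects in the reader.next(key, value) loop.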
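Thanks to Lev Khomich's answer, my problem is solved.
However, that solution has been deprecated for a while, and the new API offers more features and is easier to use.
See the source code of hadoop.io.SequenceFile. With the new API, reading looks like this: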
Configuration config = new Configuration();
Path path = new Path("/Users/myuser/sequencefile");
// Option-based constructor; replaces the deprecated (FileSystem, Path, Configuration) form.
SequenceFile.Reader reader = new SequenceFile.Reader(config, SequenceFile.Reader.file(path));
WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance();
Writable value = (Writable) reader.getValueClass().newInstance();
while (reader.next(key, value)) {
    System.out.println(key);
    System.out.println(value);
    System.out.println("------------------------");
}
reader.close();
As extra information, here is sample output from running it against a data file generated by the Nutch injector:
------------------------
https://wiki.openoffice.org/wiki/Ru/FAQ
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun Apr 13 16:12:59 MDT 2014
Modified time: Wed Dec 31 17:00:00 MST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
------------------------
https://www.bankhapoalim.co.il/
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun Apr 13 16:12:59 MDT 2014
Modified time: Wed Dec 31 17:00:00 MST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
Thanks!