I'm processing data from a set of files that contain a datestamp as part of the filename. The data within the files does not contain the datestamp. I'd like to process the filename and add it to one of the data structures in the script. Is there a way to do that in Pig Latin (an extension of PigStorage, maybe?), or do I need to preprocess all of the files with Perl or the like beforehand?
I imagine something like the following:
-- Load two fields from file, then generate a third from the filename
rawdata = LOAD '/directory/of/files/' USING PigStorage AS (field1:chararray, field2:int, field3:filename);
-- Reformat the filename into a datestamp
annotated = FOREACH rawdata GENERATE
    REGEX_EXTRACT(field3, '.*-(20\\d{6})-.*', 1) AS datestamp,
    field1, field2;
Note the special "filename" datatype in the LOAD statement. It seems like it would have to happen there, since once the data has been loaded it's too late to get back to the source filename.
use*_*487 14
You can use PigStorage by specifying -tagsource, as follows:
A = LOAD 'input' using PigStorage(',','-tagsource');
B = foreach A generate INPUT_FILE_NAME;
The first field in each tuple will contain the input path (INPUT_FILE_NAME),
according to the API doc: http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html
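Putting this together with the question's goal, a sketch might look like the following. Note this is untested: the schema names and the regex for an 8-digit datestamp are assumptions, and with -tagsource the filename is prepended, so it must come first in the schema.

A = LOAD '/directory/of/files/' USING PigStorage(',', '-tagsource')
    AS (filepath:chararray, field1:chararray, field2:int);
annotated = FOREACH A GENERATE
    REGEX_EXTRACT(filepath, '.*-(20\\d{6})-.*', 1) AS datestamp,
    field1, field2;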
Rom*_*ain 13
The Pig wiki has PigStorageWithInputPath as an example; it carries the filename in an additional chararray field:
Example:
A = load '/directory/of/files/*' using PigStorageWithInputPath()
    as (field1:chararray, field2:int, field3:chararray);
UDF
// Note that there are several versions of Path and FileSplit. These are intended:
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

public class PigStorageWithInputPath extends PigStorage {
    Path path = null;

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        super.prepareToRead(reader, split);
        // Remember which input file this split came from
        path = ((FileSplit) split.getWrappedSplit()).getPath();
    }

    @Override
    public Tuple getNext() throws IOException {
        Tuple myTuple = super.getNext();
        if (myTuple != null)
            // Append the source path as an extra trailing field
            myTuple.append(path.toString());
        return myTuple;
    }
}
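For completeness, registering and using the loader to derive the datestamp might look like this (untested; the jar name is hypothetical and the regex assumes the 8-digit datestamp from the question):

REGISTER myudfs.jar;
A = LOAD '/directory/of/files/*' USING PigStorageWithInputPath()
    AS (field1:chararray, field2:int, field3:chararray);
annotated = FOREACH A GENERATE
    REGEX_EXTRACT(field3, '.*-(20\\d{6})-.*', 1) AS datestamp,
    field1, field2;

Here field3 holds the full input path, since getNext() appends it after the file's own two columns.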