How can I incorporate the current input filename into my Pig Latin script?

Kev*_*ink 13 apache-pig

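I am processing data from a set of files whose names contain a datestamp. The data inside the files does not contain the datestamp. I would like to process the filename and add it to one of the data structures in my script. Is there a way to do this within Pig Latin (perhaps an extension to PigStorage?), or do I need to preprocess all of the files beforehand with Perl or something similar?

I envision something like the following: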

-- Load two fields from file, then generate a third from the filename
rawdata = LOAD '/directory/of/files/' USING PigStorage AS (field1:chararray, field2:int, field3:filename);

-- Reformat the filename into a datestamp
annotated = FOREACH rawdata GENERATE
  REGEX_EXTRACT(field3, '.*-(20\d{6})-.*', 1) AS datestamp,
  field1, field2;

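Note the special "filename" datatype in the LOAD statement. It seems this would have to happen there, because once the data has been loaded it is too late to get back to the source filename.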

use*_*487 14

You can use PigStorage with the -tagsource option, as follows:

A = LOAD 'input' using PigStorage(',','-tagsource'); 
B = foreach A generate INPUT_FILE_NAME; 

The first field in each tuple will contain the input path (INPUT_FILE_NAME).

This is according to the API doc: http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html


Rom*_*ain 13

The Pig wiki has an example, PigStorageWithInputPath, which puts the filename in an additional chararray field:

A = load '/directory/of/files/*' using PigStorageWithInputPath() 
    as (field1:chararray, field2:int, field3:chararray);

UDF

import java.io.IOException;

// Note that there are several versions of Path and FileSplit. These are intended:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;

public class PigStorageWithInputPath extends PigStorage {
   Path path = null;

   @Override
   public void prepareToRead(RecordReader reader, PigSplit split) {
       super.prepareToRead(reader, split);
       path = ((FileSplit)split.getWrappedSplit()).getPath();
   }

   @Override
   public Tuple getNext() throws IOException {
       Tuple myTuple = super.getNext();
       if (myTuple != null)
          myTuple.append(path.toString());
       return myTuple;
   }
}
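
Because the loader appends the full path as a string, the datestamp still has to be pulled out of it. The following is only a minimal sketch in plain Java of that extraction step, not part of the wiki example. The class name DatestampFromPath, the method extractDatestamp, and the sample file name are hypothetical. It assumes the naming scheme implied by the question: an 8-digit date starting with "20" and delimited by hyphens.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class DatestampFromPath {
    // Assumed naming scheme: "-20YYMMDD-" somewhere in the file name.
    private static final Pattern DATESTAMP = Pattern.compile("-(20\\d{6})-");

    // Returns the datestamp found in the file-name part of the path,
    // or null if the name does not follow the assumed scheme.
    static String extractDatestamp(String path) {
        // Look only at the last path component so that directory names
        // containing digits cannot produce a false match.
        String fileName = path.substring(path.lastIndexOf('/') + 1);
        Matcher m = DATESTAMP.matcher(fileName);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample path is made up; prints 20120315
        System.out.println(extractDatestamp(
            "hdfs://namenode/directory/of/files/events-20120315-part1.log"));
    }
}
```

The same pattern could be applied in the script itself with REGEX_EXTRACT on the appended field. Doing it in Java is only worthwhile if the loader should emit the datestamp directly instead of the whole path.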