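In the code below, does renaming the fields after the join hurt the script's running time? Is it optimized away in Pig, or does it really go through every record?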
-- tables A: (f1, f2, id) and B: (g1, g2, id) to be joined by id
C = JOIN A BY id, B by id;
C = FOREACH C GENERATE A::f1 AS f1, A::f2 AS f2, B::id AS id, B::g1 AS g1, B::g2 AS g2;
Does the FOREACH statement go through every record of C? If so, is there a way to optimize it?
Thanks.
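Don't bother optimizing it. Renaming the fields may add a slight overhead, but it won't trigger an additional Map/Reduce job. The field projection happens in the reducer, right after the JOIN.
Consider the two scripts below, each followed by the Map Reduce plan that explain produces for it.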
A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);
C = join A by id, B by id;
store C into 'output';
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-30
Map Plan
Union[tuple] - scope-31
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
| | |
| | Project[bytearray][2] - scope-21
| |
| |---A: New For Each(false,false,false)[bag] - scope-7
| | |
| | Project[bytearray][0] - scope-1
| | |
| | Project[bytearray][1] - scope-3
| | |
| | Project[bytearray][2] - scope-5
| |
| |---A: Load(hdfs://location/first:PigStorage) - scope-0
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
| |
| Project[bytearray][2] - scope-23
|
|---B: New For Each(false,false,false)[bag] - scope-15
| |
| Project[bytearray][0] - scope-9
| |
| Project[bytearray][1] - scope-11
| |
| Project[bytearray][2] - scope-13
|
|---B: Load(hdfs://location/second:PigStorage) - scope-8--------
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
|
|---POJoinPackage(true,true)[tuple] - scope-32--------
Global sort: false
----------------
A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);
C = join A by id, B by id;
C = foreach C generate A::f1 as f1, -- This
A::f2 as f2, -- section
B::id as id, -- is
B::g1 as g1, -- different
B::g2 as g2; --
store C into 'output';
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-41
Map Plan
Union[tuple] - scope-42
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-20
| | |
| | Project[bytearray][2] - scope-21
| |
| |---A: New For Each(false,false,false)[bag] - scope-7
| | |
| | Project[bytearray][0] - scope-1
| | |
| | Project[bytearray][1] - scope-3
| | |
| | Project[bytearray][2] - scope-5
| |
| |---A: Load(hdfs://location/first:PigStorage) - scope-0
|
|---C: Local Rearrange[tuple]{bytearray}(false) - scope-22
| |
| Project[bytearray][2] - scope-23
|
|---B: New For Each(false,false,false)[bag] - scope-15
| |
| Project[bytearray][0] - scope-9
| |
| Project[bytearray][1] - scope-11
| |
| Project[bytearray][2] - scope-13
|
|---B: Load(hdfs://location/second:PigStorage) - scope-8--------
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
|
|---C: New For Each(false,false,false,false,false)[bag] - scope-37
| |
| Project[bytearray][0] - scope-27
| |
| Project[bytearray][1] - scope-29
| |
| Project[bytearray][5] - scope-31
| |
| Project[bytearray][3] - scope-33
| |
| Project[bytearray][4] - scope-35
|
|---POJoinPackage(true,true)[tuple] - scope-43--------
Global sort: false
----------------
The difference is in the Reduce plan. Without the rename:
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-27
|
|---POJoinPackage(true,true)[tuple] - scope-32--------
Global sort: false
With the rename:
Reduce Plan
C: Store(hdfs://location/output:org.apache.pig.builtin.PigStorage) - scope-38
|
|---C: New For Each(false,false,false,false,false)[bag] - scope-37
| |
| Project[bytearray][0] - scope-27
| |
| Project[bytearray][1] - scope-29
| |
| Project[bytearray][5] - scope-31
| |
| Project[bytearray][3] - scope-33
| |
| Project[bytearray][4] - scope-35
|
|---POJoinPackage(true,true)[tuple] - scope-43--------
Global sort: false
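The indices in the Project operators are positions in the joined tuple, which carries all six fields in the order A::f1, A::f2, A::id, B::g1, B::g2, B::id. As a rough sketch (the alias D is introduced here only for illustration), the same rename could be written with positional references, which makes the mapping to Project[0], [1], [5], [3], [4] explicit:

```pig
A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);
C = join A by id, B by id;
-- Positions in the joined tuple:
--   $0 = A::f1, $1 = A::f2, $2 = A::id, $3 = B::g1, $4 = B::g2, $5 = B::id
-- A::id ($2) is dropped; B::id ($5) is kept as id.
D = foreach C generate $0 as f1, $1 as f2, $5 as id, $3 as g1, $4 as g2;
store D into 'output';
```

The named form (A::f1 as f1, and so on) is usually easier to read and maintain. The positional form is shown only to line up with the plan.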
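In short, there are other things in your script you can optimize before you need to worry about the rename. Since the join has to go through every record anyway, the rename is just a cheap additional step.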
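To check this against your own script, you can ask Pig for the plan directly. A minimal sketch, assuming the same input paths 'first' and 'second' as above and run from the Grunt shell or a script file:

```pig
A = load 'first' using PigStorage() as (f1, f2, id);
B = load 'second' using PigStorage() as (g1, g2, id);
C = join A by id, B by id;
C = foreach C generate A::f1 as f1, A::f2 as f2, B::id as id, B::g1 as g1, B::g2 as g2;
-- Show the schema of C after the rename
describe C;
-- Print the logical, physical, and Map Reduce plans for C;
-- a single "MapReduce node" means the rename added no extra job
explain C;
```

If the Map Reduce plan still shows a single MapReduce node, with the extra For Each sitting inside the Reduce plan, the rename is riding along with the join rather than adding a job.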