Posts by Aja*_*pta

How do I remove duplicate columns after a JOIN in Pig?

Say I JOIN two relations as follows:

-- part looks like:
-- 1,5.3
-- 2,4.9
-- 3,4.9

-- original looks like:
-- 1,Anju,3.6,IT,A,1.6,0.3
-- 2,Remya,3.3,EEE,B,1.6,0.3
-- 3,akhila,3.3,IT,C,1.3,0.3

jnd = JOIN part BY $0, original BY $0;

The output will be:

1,5.3,1,Anju,3.6,IT,A,1.6,0.3
2,4.9,2,Remya,3.3,EEE,B,1.6,0.3
3,4.9,3,akhila,3.3,IT,C,1.3,0.3

Notice that $0 appears twice in each tuple. For example:

1,5.3,1,Anju,3.6,IT,A,1.6,0.3
^     ^
|-----|

I can remove the duplicate key manually by doing:

jnd = foreach jnd generate $0,$1,$3,$4 ..;

Is there a way to remove it dynamically? Something like remove(the duplicate key joiner).
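Pig's JOIN has no built-in option to drop the duplicated key column, so a projection is still required; range projection (available since Pig 0.9) at least avoids enumerating every column by hand. A sketch reusing the relations above (the alias `noDup` is illustrative):

```pig
jnd = JOIN part BY $0, original BY $0;
-- $0 and $1 come from part; $2 is original's duplicate copy of the key.
-- "$3 .." projects column 3 through the last column, however many there are.
noDup = FOREACH jnd GENERATE $0, $1, $3 ..;
```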

java hadoop join apache-pig

9 votes
1 answer
4561 views

Get all nodes connected to a node in Apache Spark GraphX

Assume we are given the following input in Apache GraphX:

Vertex RDD:

val vertexArray = Array(
  (1L, "Alice"),
  (2L, "Bob"),
  (3L, "Charlie"),
  (4L, "David"),
  (5L, "Ed"),
  (6L, "Fran")
)

Edge RDD:

val edgeArray = Array(
  Edge(1L, 2L, 1),
  Edge(2L, 3L, 1),
  Edge(3L, 4L, 1),
  Edge(5L, 6L, 1)
)

I need all the nodes connected to a given node in Apache Spark GraphX, i.e. the component of each node:

1,[1,2,3,4]
5,[5,6]
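GraphX's built-in `connectedComponents` operator computes exactly this: it labels every vertex with the smallest vertex id in its component. A sketch, assuming a SparkContext `sc` is already available (variable names are illustrative):

```scala
import org.apache.spark.graphx._

// Build the graph from the arrays above (sc is an existing SparkContext).
val graph = Graph(sc.parallelize(vertexArray), sc.parallelize(edgeArray))

// connectedComponents tags each vertex with the smallest vertex id
// in its component, which serves as the component's label.
val cc = graph.connectedComponents().vertices  // RDD[(VertexId, VertexId)]

// Invert to (componentLabel, vertexId) and collect the members of each component.
val components = cc.map { case (id, label) => (label, id) }
                   .groupByKey()
                   .mapValues(_.toList.sorted)

components.collect().foreach(println)
// expected: (1,List(1, 2, 3, 4)) and (5,List(5, 6))
```

Because the smallest id in each component becomes the label, the collected pairs match the desired output `1,[1,2,3,4]` and `5,[5,6]`.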

scala graph apache-spark spark-graphx

6 votes
1 answer
2389 views

Tag statistics

apache-pig ×1

apache-spark ×1

graph ×1

hadoop ×1

java ×1

join ×1

scala ×1

spark-graphx ×1