Apache Spark and Python lambdas

jho*_*ith 5 python apache-spark

I have the following code:

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

I copied this example from http://spark.apache.org/examples.html

I can't understand this code, particularly the keywords

  1. flatMap,
  2. map, and
  3. reduceByKey

Can someone explain in plain English what is going on?

aar*_*man 14

map is the simplest: it performs the given operation on every element of the sequence and returns the resulting sequence (very similar to foreach). flatMap is the same thing, except that instead of returning exactly one element per input element, it is allowed to return a sequence (which can be empty). Here's an answer explaining the difference between map and flatMap. Finally, reduceByKey takes an aggregation function (meaning it takes two arguments of the same type and returns that type; it should also be commutative and associative, or you will get inconsistent results), which is used to aggregate every V for each K in your sequence of (K, V) pairs.
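A quick way to see the difference between map and flatMap is in plain Python (a local sketch of the semantics, not Spark itself; the RDD methods behave analogously, just lazily and distributed):

```python
from itertools import chain

lines = ["hello world", "hello spark"]

# map: exactly one output element per input element -> a list of lists
mapped = list(map(lambda line: line.split(" "), lines))
# [['hello', 'world'], ['hello', 'spark']]

# flatMap: each input element may produce several output elements,
# and the results are flattened into one sequence
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))
# ['hello', 'world', 'hello', 'spark']
```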

Example*:

reduce(lambda a, b: a + b, [1, 2, 3, 4])

This says: aggregate the whole list with +, so it will do

1 + 2 = 3  
3 + 3 = 6  
6 + 4 = 10  
final result is 10
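You can check this arithmetic with Python's own functools.reduce, which works the same way on a local list:

```python
from functools import reduce

total = reduce(lambda a, b: a + b, [1, 2, 3, 4])
# (((1 + 2) + 3) + 4) = 10
print(total)  # 10
```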

reduceByKey is the same thing, except it performs a reduce for each unique key.
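In plain Python, that amounts to grouping the (key, value) pairs by key and reducing each group with the same aggregation function (again a local sketch of the semantics, not how Spark implements it internally):

```python
from functools import reduce
from collections import defaultdict

pairs = [("hello", 1), ("world", 1), ("hello", 1)]

# group the values by key
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# reduce each group separately with the aggregation function
counts = {key: reduce(lambda a, b: a + b, values)
          for key, values in groups.items()}
# {'hello': 2, 'world': 1}
```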


So to explain it with your example:

file = spark.textFile("hdfs://...")  # open the text file; each element of the RDD is one line of the file
counts = (file
          .flatMap(lambda line: line.split(" "))  # flatMap is needed here to return every word (separated by a space) in the line
          .map(lambda word: (word, 1))            # map each word to a value of 1 so they can be summed
          .reduceByKey(lambda a, b: a + b))       # get an RDD of the count of every unique word by adding up all the 1's from the last step
counts.saveAsTextFile("hdfs://...")  # save the result to HDFS
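The whole pipeline can be mimicked in plain Python without a cluster (a sketch to show the data flow only; Spark performs the same steps, but distributed across machines):

```python
from itertools import chain
from collections import defaultdict

lines = ["to be or not to be"]  # stand-in for the lines of the HDFS file

# flatMap: split every line into words and flatten into one sequence
words = chain.from_iterable(line.split(" ") for line in lines)

# map: pair each word with a 1
pairs = ((word, 1) for word in words)

# reduceByKey: sum the 1's per unique word
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```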

As for why words are counted this way: the MapReduce programming paradigm is highly parallelizable, so it can scale to computations over terabytes or even petabytes of data.


I don't use Python, so tell me if I got anything wrong.


alf*_*sin 5

See the inline comments:

file = spark.textFile("hdfs://...")  # opens a file
counts = (file
          .flatMap(lambda line: line.split(" "))  # iterate over the lines, split each line by space (into words)
          .map(lambda word: (word, 1))            # for each word, create the tuple (word, 1)
          .reduceByKey(lambda a, b: a + b))       # go over the tuples "by key" (first element) and sum the second elements
counts.saveAsTextFile("hdfs://...")

A more detailed explanation of reduceByKey can be found here.