jho*_*ith 5 python apache-spark
I have the following code:
file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
I copied this example from http://spark.apache.org/examples.html
I can't understand this code, especially the keywords flatMap, map, and reduceByKey.
Can someone explain in plain English what is going on?
aar*_*man 14
map is the simplest: it applies the given operation to every element of the sequence and returns the resulting sequence (very similar to foreach). flatMap is the same thing, but instead of returning exactly one element per input element, it is allowed to return a sequence (which can be empty), and the results are flattened. There is an answer here explaining the difference between map and flatMap. Finally, reduceByKey takes an aggregate function (meaning it takes two arguments of the same type and returns that type; it should also be commutative and associative, otherwise you will get inconsistent results), which is used to aggregate every V for each K in your sequence of (K, V) pairs.
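To make the map/flatMap difference concrete, here is a minimal plain-Python sketch (not Spark itself; the two-line input is made up for illustration):

lines = ["hello world", "hello spark"]

# map-style: one result per input line, so you get a list of lists
mapped = [line.split(" ") for line in lines]
print(mapped)        # [['hello', 'world'], ['hello', 'spark']]

# flatMap-style: the per-line results are flattened into one sequence of words
flat_mapped = [word for line in lines for word in line.split(" ")]
print(flat_mapped)   # ['hello', 'world', 'hello', 'spark']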
Example*:
from functools import reduce  # a builtin in Python 2; Python 3 needs this import

reduce(lambda a, b: a + b, [1, 2, 3, 4])
This says: aggregate the whole list with +, so it will do
1 + 2 = 3
3 + 3 = 6
6 + 4 = 10
final result is 10
reduceByKey is the same thing, except it performs the reduce separately for each unique key.
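As a rough plain-Python picture of that (the pairs below are made up, and real Spark does this in parallel across partitions):

pairs = [("hello", 1), ("world", 1), ("hello", 1)]

counts = {}
for key, value in pairs:
    # the reduce function (here, addition) is applied to values that share a key
    counts[key] = counts.get(key, 0) + value

print(counts)   # {'hello': 2, 'world': 1}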
So, to explain it in terms of your example:
# open the text file; each element of the RDD is one line of the file
file = spark.textFile("hdfs://...")

counts = (file
    .flatMap(lambda line: line.split(" "))  # flatMap is needed here to return every word (separated by a space) in the line, flattened into one RDD of words
    .map(lambda word: (word, 1))            # map each word to a value of 1 so they can be summed
    .reduceByKey(lambda a, b: a + b))       # get an RDD of the count of every unique word by aggregating (adding up) all the 1's from the previous step

counts.saveAsTextFile("hdfs://...")         # save the result onto HDFS
As for why word counting is done this way: the MapReduce programming paradigm is highly parallelizable, so it scales to computations over terabytes or even petabytes of data.
I don't use Python, so tell me if I got something wrong.
See the inline comments:
file = spark.textFile("hdfs://...")             # open a file
counts = (file
    .flatMap(lambda line: line.split(" "))      # iterate over the lines, split each line by space (into words)
    .map(lambda word: (word, 1))                # for each word, create the tuple (word, 1)
    .reduceByKey(lambda a, b: a + b))           # go over the tuples "by key" (first element) and sum the second elements
counts.saveAsTextFile("hdfs://...")
A more detailed explanation of reduceByKey can be found here.
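If you want to try the whole pipeline without an HDFS cluster, here is a minimal local sketch, assuming pyspark is installed; the input lines are made up for illustration and stand in for the HDFS file:

from pyspark import SparkContext

# a local SparkContext; "local" runs everything in-process on your machine
sc = SparkContext("local", "WordCount")

# an in-memory RDD standing in for the HDFS text file
file = sc.parallelize(["hello world", "hello spark"])

counts = (file
    .flatMap(lambda line: line.split(" "))   # split lines into words
    .map(lambda word: (word, 1))             # pair each word with 1
    .reduceByKey(lambda a, b: a + b))        # sum the 1's per word

print(counts.collect())   # e.g. [('hello', 2), ('world', 1), ('spark', 1)]

sc.stop()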