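I am new to Apache Beam, and I want to compute the mean and standard deviation of a large dataset.
Given a .csv file with rows of the form "A,B", where A and B are integers, this is basically what I have so far.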
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.textio import ReadFromText


class Split(beam.DoFn):
    def process(self, element):
        # Each line looks like "A,B"; convert both fields to int so the
        # numeric combiners below receive numbers rather than strings.
        A, B = element.split(',')
        return [('A', int(A)), ('B', int(B))]


with beam.Pipeline(options=PipelineOptions()) as p:
    # parse the rows
    rows = (p
            | ReadFromText('data.csv')
            | beam.ParDo(Split()))

    # calculate the mean
    avgs = (rows
            | beam.CombinePerKey(
                beam.combiners.MeanCombineFn()))

    # calculate the stdv per key
    # ???

    std | beam.io.WriteToText('std.out')
I would like to do something like this:
class SquaredDiff(beam.DoFn):
    def process(self, element):
        A = element[0][1]
        B = element[1][1]
        return [('A', A …
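
One possible way to get the per-key standard deviation in a single pass, instead of a second step that subtracts the mean, is a custom combiner. The sketch below is not taken from the question: the class name `MeanStdFn` and the accumulator layout `(count, mean, m2)` are assumptions made for illustration. It implements the four methods that `beam.CombineFn` expects and merges partial results with the standard parallel variance update, so it computes the population standard deviation. The `try`/`except` around the import is only there so the arithmetic can be checked without Beam installed.

```python
import math

try:
    import apache_beam as beam
    _Base = beam.CombineFn
except ImportError:
    # Fallback so the accumulator logic can be tried without Beam installed.
    _Base = object


class MeanStdFn(_Base):
    """Hypothetical combiner: accumulator is (count, mean, m2), where m2 is
    the running sum of squared deviations from the mean."""

    def create_accumulator(self):
        return (0, 0.0, 0.0)

    def add_input(self, acc, x):
        # Welford's update for a single new value.
        n, mean, m2 = acc
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        return (n, mean, m2)

    def merge_accumulators(self, accs):
        # Combine partial results computed on different workers.
        n, mean, m2 = 0, 0.0, 0.0
        for n_b, mean_b, m2_b in accs:
            if n_b == 0:
                continue
            total = n + n_b
            delta = mean_b - mean
            mean += delta * n_b / total
            m2 += m2_b + delta * delta * n * n_b / total
            n = total
        return (n, mean, m2)

    def extract_output(self, acc):
        n, mean, m2 = acc
        if n == 0:
            return (float('nan'), float('nan'))
        # Population standard deviation; divide by (n - 1) for the sample one.
        return (mean, math.sqrt(m2 / n))
```

Assuming the values emitted by `Split` are numeric, something like `rows | beam.CombinePerKey(MeanStdFn())` should then yield one `(key, (mean, std))` pair per key, which could be written out with `beam.io.WriteToText`.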