小编str*_*dmc的帖子

如何计算Apache Beam中的标准偏差

我是Apache Beam的新手,我想计算大型数据集的平均值和标准偏差.

给定一个"A,B"形式的.csv文件,其中A,B是整数,这基本上就是我所拥有的.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.textio import ReadFromText

class Split(beam.DoFn):
    def process(self, element):
        A, B = element.split(',')
        return [('A', A), ('B', B)]

with beam.Pipeline(options=PipelineOptions()) as p:
     # parse the rows
     rows = (p
             | ReadFromText('data.csv')
             | beam.ParDo(Split()))

     # calculate the mean
     avgs = (rows
             | beam.CombinePerKey(
                 beam.combiners.MeanCombineFn()))

     # calculate the stdv per key
     # ???

     std >> beam.io.WriteToText('std.out')
Run Code Online (Sandbox Code Playgroud)

我想做点什么:

class SquaredDiff(beam.DoFn):
    def process(self, element):
        A = element[0][1]
        B = element[1][1]
        return [('A', A …
Run Code Online (Sandbox Code Playgroud)

python apache-beam

4
推荐指数
1
解决办法
359
查看次数

标签 统计

apache-beam ×1

python ×1