Posts by Imr*_*ali

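How to remove outliers from a DataFrame using IQR?

I have a DataFrame with a lot of columns (around 100 features). I want to apply the interquartile range (IQR) method and remove the outliers from the DataFrame.

I am following the approach from this Stack Overflow link.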

But the problem is that none of the methods there works correctly for me.

When I try it like this:

Q1 = stepframe.quantile(0.25)
Q3 = stepframe.quantile(0.75)
IQR = Q3 - Q1
((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))).sum()

it gives me this:

((stepframe < (Q1 - 1.5 * IQR)) | (stepframe > (Q3 + 1.5 * IQR))).sum()
Out[35]: 
Day                      0
Col1                     0
Col2                     0
col3                     0
Col4                     0
Step_Count            1179
dtype: int64

I just want to know what I should do next so that all the outliers are removed from the DataFrame.

If I use this:

def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 …
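
One possible next step after computing the boolean outlier mask above is to drop every row that is flagged in at least one column. The following is a minimal sketch, not the asker's or the linked answer's method: the function name `remove_outliers_iqr`, the `k` parameter, and the demo data are hypothetical, and restricting the check to numeric columns is an assumption.

```python
import pandas as pd


def remove_outliers_iqr(df, k=1.5):
    """Return df without rows that have an IQR outlier in any numeric column.

    Hypothetical helper for illustration; k=1.5 is the multiplier used in the question.
    """
    numeric = df.select_dtypes(include="number")  # assumption: only numeric columns are checked
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Same mask as in the question: True where a cell lies outside the IQR fences.
    is_outlier = (numeric < (q1 - k * iqr)) | (numeric > (q3 + k * iqr))
    # Keep only the rows where no column is flagged.
    return df[~is_outlier.any(axis=1)]


# Made-up demo data: one extreme Step_Count value.
demo = pd.DataFrame({
    "Day": [1, 2, 3, 4, 5, 6],
    "Step_Count": [100, 110, 105, 95, 102, 5000],
})
cleaned = remove_outliers_iqr(demo)
print(cleaned)
```

Because comparisons with NaN evaluate to False, rows containing missing values are kept by this sketch. With around 100 features, dropping a row when any single column is flagged can remove a large share of the data, so it may be worth checking how many rows survive.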


dataframe python-3.x pandas iqr

Imr*_*ali

2018-05-22

6
Score

1
Answers

20k
Views

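How to read CSV files from a Google Cloud Storage bucket in Dataflow, combine them, apply some transformations to the DataFrame, and then dump the result into BigQuery?

I have to write a Dataflow job in Python that reads two different .csv files from GCS, performs a join operation, applies transformations to the joined result, and finally sends it to a BigQuery table.

I am very new to this. After a lot of research, I learned that all of these pipeline operations can be done with Apache Beam. I finally found a template, but I am still confused at several points in it.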

import logging
import os

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from apache_beam.pipeline import PipelineOptions


os.environ["GOOGLE_APPLICATION_CREDENTIALS"]='auth_file.json'


class DataTransformation:
    """A helper class that translates a CSV into a format BigQuery will accept."""

    def __init__(self):
        dir_path = os.path.dirname(os.path.realpath(__file__))
        # Here we read the output schema from a json file.  This is used to specify the types
        # of data we are writing to BigQuery.
        self.schema = os.path.join(dir_path, 'resources',
                                   'gs://wahtch_dog_dataflow/schema.json') …
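
The read, join, transform, and write steps described in the question could be laid out roughly as follows. This is a sketch under stated assumptions, not the template the asker found: the column names (`id`, `name`, `steps`), the helper names, and the arguments to `run` are all hypothetical. The join uses `CoGroupByKey` on keyed rows instead of DataFrames.

```python
import csv


def parse_csv_line(line, columns):
    """Parse one CSV line into a dict keyed by the given column names."""
    values = next(csv.reader([line]))
    return dict(zip(columns, values))


def key_by(row, key):
    """Turn a row dict into a (key, row) pair so it can be grouped by key."""
    return row[key], row


def inner_join(grouped):
    """Yield one merged dict per matching (left, right) pair for a key."""
    _, sides = grouped
    for left in sides["left"]:
        for right in sides["right"]:
            yield {**left, **right}


def to_bq_row(row):
    """Example transformation: cast the hypothetical 'steps' column to int."""
    return {**row, "steps": int(row["steps"])}


def run(left_path, right_path, table, argv=None):
    # Imported here so the helpers above can be tested without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def read_keyed(p, label, path, columns):
        # Read a CSV file (skipping its header), parse each line, key it by 'id'.
        return (
            p
            | f"Read{label}" >> beam.io.ReadFromText(path, skip_header_lines=1)
            | f"Parse{label}" >> beam.Map(parse_csv_line, columns)
            | f"Key{label}" >> beam.Map(key_by, "id")
        )

    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        left = read_keyed(p, "Left", left_path, ["id", "name"])    # hypothetical columns
        right = read_keyed(p, "Right", right_path, ["id", "steps"])  # hypothetical columns
        (
            {"left": left, "right": right}
            | "Join" >> beam.CoGroupByKey()
            | "Merge" >> beam.FlatMap(inner_join)
            | "Transform" >> beam.Map(to_bq_row)
            | "Write" >> beam.io.WriteToBigQuery(
                table,
                schema="id:STRING,name:STRING,steps:INTEGER",  # assumed schema
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```

Calling `run` with two `gs://` paths and a `project:dataset.table` string would start the pipeline. Runner, project, and temp-location settings would come from `argv` and are not shown here.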


python google-cloud-platform google-cloud-dataflow apache-beam

Imr*_*ali

2020-06-26

5
Score

1
Answers

5131
Views