Posts by Moh*_*OUI

pandas groupby without turning the grouped-by columns into the index

The default behavior of pandas groupby is to turn the grouped-by columns into the index and remove them from the DataFrame's list of columns. For example, suppose I have a DataFrame with these columns:

col1|col2|col3|col4

If I apply a groupby over the columns col2 and col3 like this:

df.groupby(['col2','col3']).sum()

The DataFrame df no longer contains the columns ['col2','col3'] in its column list; they are automatically turned into the index of the resulting DataFrame.

My question is: how do I perform a groupby on a column and keep that column in the DataFrame?
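
For illustration, a minimal sketch (using a made-up toy DataFrame, not the original data) of the two usual ways to keep the grouping columns as regular columns:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': ['a', 'a', 'b', 'b'],
                   'col3': ['x', 'y', 'x', 'y'],
                   'col4': [10, 20, 30, 40]})

# Option 1: tell groupby not to move the keys into the index
out = df.groupby(['col2', 'col3'], as_index=False).sum()

# Option 2: group as usual, then move the index back into columns
out = df.groupby(['col2', 'col3']).sum().reset_index()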

python dataframe pandas

49 votes · 4 answers · 40k views

How to detect invalid utf8 unicode/binary in a text file

I need to detect corrupted text files where there are invalid (non-ASCII) utf-8, Unicode, or binary characters.

�>t�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ …

What I have tried:

iconv -f utf-8 -t utf-8 -c file.csv 

This converts the file from utf-8 encoding to utf-8 encoding, and -c is for skipping invalid utf-8 characters. Yet in the end those illegal characters still get printed. Is there any other solution in bash on Linux, or in another language?
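
As an alternative, a minimal Python sketch, assuming the goal is just to report which lines are not valid UTF-8 rather than to convert the file:

# Print the line number and the decoding error for every line that fails strict UTF-8 decoding.
with open("file.csv", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode("utf-8", errors="strict")
        except UnicodeDecodeError as err:
            print("line %d: invalid UTF-8 (%s)" % (lineno, err))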

linux bash utf-8 character-encoding

43 votes · 5 answers · 40k views

Spark yarn cluster vs client - how to choose which one to use?

The Spark documentation has the following paragraph describing the difference between yarn-client and yarn-cluster:

There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

I assume both options exist for a reason. If so, how do you choose which one to use?

Please use facts to back up your answer so that this question and its answers meet Stack Overflow's requirements.

There are a few similar questions on Stack Overflow, but they focus on the differences between the two approaches rather than on when one approach is more suitable than the other.

hadoop-yarn apache-spark

21 votes · 2 answers · 20k views

Overcoming the "GraphDef cannot be larger than 2GB" limit in TensorFlow

I am using TensorFlow's ImageNet-trained model to extract the features of the last pooling layer as representation vectors for a new dataset of images.

The model predicts a new image as follows:

python classify_image.py --image_file new_image.jpeg 

I edited the main function so that it takes a folder of images, returns the predictions for all of them at once, and writes the feature vectors to a csv file. This is how I did it:

def main(_):
  maybe_download_and_extract()
  #image = (FLAGS.image_file if FLAGS.image_file else
  #         os.path.join(FLAGS.model_dir, 'cropped_panda.jpg'))
  #edited to take a directory of image files instead of a single file
  if FLAGS.data_folder:
    images_folder=FLAGS.data_folder
    list_of_images = os.listdir(images_folder)
  else: 
    raise ValueError("Please specify image folder")

  with open("feature_data.csv", "wb") as f:
    feature_writer = csv.writer(f, delimiter='|')

    for image in list_of_images:
      print(image) 
      current_features = run_inference_on_image(images_folder+"/"+image)
      feature_writer.writerow([image]+current_features)

It works fine for about 21 images, but then crashes with the following error:

  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1912, in as_graph_def
    raise ValueError("GraphDef cannot be larger than 2GB.")
ValueError: GraphDef …
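
For context, a rough sketch of the usual cause and workaround, assuming the script is structured like the stock classify_image.py (where run_inference_on_image re-imports the graph on every call, so the graph grows with every image): import the graph and open the session once, then run the pooling tensor for each image. Names such as create_graph(), 'pool_3:0' and 'DecodeJpeg/contents:0' are taken from the stock script and may differ in an edited version.

import tensorflow as tf

create_graph()  # import the Inception GraphDef exactly once

with tf.Session() as sess:
    pool3 = sess.graph.get_tensor_by_name('pool_3:0')
    for image in list_of_images:
        image_data = tf.gfile.FastGFile(images_folder + "/" + image, 'rb').read()
        features = sess.run(pool3, {'DecodeJpeg/contents:0': image_data})
        feature_writer.writerow([image] + list(features.flatten()))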

python tensorflow

15 votes · 1 answer · 20k views

Select everything except a list of columns in a pandas DataFrame

Is it possible to select the negation of a given list of columns from a pandas DataFrame? For example, say I have the following DataFrame:

T1_V2  T1_V3 T1_V4 T1_V5 T1_V6 T1_V7 T1_V8
1     15      3      2     N     B     N         
4     16     14      5     H     B     N            
1     10     10      5     N     K     N  

I want all of the columns except column T1_V6. I would normally do it like this:

df = df[["T1_V2","T1_V3","T1_V4","T1_V5","T1_V7","T1_V8"]]

My question is whether there is a way to do this more along the lines of

df = df[!["T1_V6"]]
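
For illustration, a minimal sketch of a few common idioms for dropping a column instead of listing every column to keep:

# Drop the unwanted column(s) by name
df = df.drop(columns=["T1_V6"])   # pandas >= 0.21
df = df.drop("T1_V6", axis=1)     # older pandas versions

# Or select by set difference on the column index (note: this sorts the columns)
df = df[df.columns.difference(["T1_V6"])]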

python pandas

12 votes · 1 answer · 5,947 views

Apply MinMaxScaler to multiple columns in PySpark

I want to apply PySpark's MinMaxScaler to multiple columns of a PySpark DataFrame df. So far, I only know how to apply it to a single column, e.g. x.

from pyspark.ml.feature import MinMaxScaler
import pandas as pd  # needed for pd.DataFrame below

pdf = pd.DataFrame({'x': range(3), 'y': [1, 2, 5], 'z': [100, 200, 1000]})
df = spark.createDataFrame(pdf)  # spark: an existing SparkSession

scaler = MinMaxScaler(inputCol="x", outputCol="x")
scalerModel = scaler.fit(df)
scaledData = scalerModel.transform(df)

What if I have 100 columns? Is there any way to do min-max scaling on many columns in PySpark?

Update:

Also, how do I apply MinMaxScaler to integer or double values? It raises the following error:

java.lang.IllegalArgumentException: requirement failed: Column length must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually int.
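
One possible approach, sketched under the assumption that the columns are plain numeric types: MinMaxScaler operates on a Vector column (which is also why feeding it a raw int or double column raises the error above), so each column is first wrapped into a one-element vector with VectorAssembler and then scaled, one assembler/scaler pair per column.

from pyspark.ml import Pipeline
from pyspark.ml.feature import MinMaxScaler, VectorAssembler

columns = ["x", "y", "z"]  # or df.columns to cover all 100 columns
assemblers = [VectorAssembler(inputCols=[c], outputCol=c + "_vec") for c in columns]
scalers = [MinMaxScaler(inputCol=c + "_vec", outputCol=c + "_scaled") for c in columns]
pipeline = Pipeline(stages=assemblers + scalers)
scaledData = pipeline.fit(df).transform(df)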

python apache-spark-sql pyspark

11 votes · 2 answers · 7,071 views

pandas groupby with sum() on a large csv file?

I have a big file (around 19 GB) that I would like to load into memory to perform an aggregation over some columns.

The file looks like this:

id, col1, col2, col3, 
1 ,  12 , 15 , 13 
2 ,  18 , 15 , 13 
3 ,  14 , 15 , 13 
3 ,  14 , 185 , 213 

Please note that, after loading it into a DataFrame, I aggregate using the columns (id, col1); also note that these keys may be repeated consecutively a few times, for example:

3 ,  14 , 15 , 13 
3 ,  14 , 185 , 213 

For small files, the following script does the job:

import pandas as pd
data = pd.read_csv("data_file", delimiter=",")
data = data.reset_index(drop=True).groupby(["id","col1"], as_index=False).sum()

However, for large files I need to use chunksize when reading the csv file to limit the number of rows loaded into memory:

import pandas as pd
data = pd.read_csv("data_file", delimiter=",", chunksize=1000000)
data = data.reset_index(drop=True).groupby(["id","col1"], as_index=False).sum()

In the latter case, a problem arises when rows with the same (id, col1) end up in different chunks. How should I handle that?

Edit

As pointed out by @EdChum, a potential workaround is to append the per-chunk groupby results to a new csv, read that back in and aggregate again, repeating until the size of the df no longer changes. …
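
A minimal sketch of the chunked approach, assuming the per-group partial sums fit in memory: aggregate each chunk separately, concatenate the partial results, then group once more so that keys split across chunks are merged.

import pandas as pd

chunks = pd.read_csv("data_file", delimiter=",", chunksize=1000000)
partials = [chunk.groupby(["id", "col1"], as_index=False).sum() for chunk in chunks]
data = pd.concat(partials).groupby(["id", "col1"], as_index=False).sum()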

python pandas

9 votes · 2 answers · 3,118 views

scikit-learn refit / partial fit option in classifiers

I would like to know whether there is any option in sklearn classifiers to fit a model using some hyperparameters and, after changing a few of them, refit the model while saving on computation (fitting) cost.

Let us say a logistic regression is fit with C=1e5 (logreg=linear_model.LogisticRegression(C=1e5)) and we change only C to C=1e3. I would like to save some computation because only one parameter is changed.
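
A minimal sketch of one possibility, assuming warm_start is an acceptable reading of "refit with less work" (the data and make_classification are placeholders): with warm_start=True the coefficients of the previous fit are used to initialise the next fit, so refitting after changing C can converge faster. It has no effect with the liblinear solver.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

logreg = LogisticRegression(C=1e5, warm_start=True, solver="lbfgs", max_iter=1000)
logreg.fit(X, y)

logreg.set_params(C=1e3)  # change only the regularisation strength
logreg.fit(X, y)          # this fit starts from the previous coefficients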

parameters machine-learning scikit-learn logistic-regression

8 votes · 1 answer · 4,575 views

Assigning a Timestamp value to a pandas Series creates an int

In Python, with pandas:

from datetime import datetime
import pandas as pd
g = pd.Series(dict(a=5, b=datetime(2018, 1, 1)))
g['datetime'] = pd.Timestamp('2018-01-02')

g returns:

a                             5
b           2018-01-01 00:00:00
datetime    1514851200000000000
dtype: object

Does anyone know why the Timestamp is converted to its int value here, and how to avoid the problem and correctly append a Timestamp to the Series?
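
A sketch of one possible workaround (an assumption, not a confirmed fix): build the Series with the Timestamp already in it, so the element assignment that coerces the value to an integer is avoided.

from datetime import datetime
import pandas as pd

# Include the Timestamp at construction time instead of assigning it afterwards.
g = pd.Series({'a': 5,
               'b': datetime(2018, 1, 1),
               'datetime': pd.Timestamp('2018-01-02')})
print(g)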

python pandas

8 votes · 1 answer · 424 views

Error when accessing OData with a number as a filter

I am trying to access an OData source, specifically an analytic view exposed by SAP HANA through an OData service. I apply a filter containing a number on the OData request, but I get an error related to the way that number is used:

  "Operator 'eq' incompatible with operand types 'Edm.Decimal' and 'Edm.String'

This is how I access the resource:

 analyticView.xsodata/analyticView?$select=AMOUNT_SOLD,FAMILY_NAME&$filter=SALE_PRICE%20eq%20'323.7'&$format=json

I also tried removing the quotes from the number:

analyticView.xsodata/analyticView?$select=AMOUNT_SOLD,FAMILY_NAME&$filter=SALE_PRICE%20eq%20323.7&$format=json

But then I get this error:

"Operator 'eq' incompatible with operand types 'Edm.Decimal' and 'Edm.Double'."

Could you check what the problem is and how to work around it?
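
One guess, not verified against this particular service: XSOData implements OData V2, where Edm.Decimal literals carry an m suffix, so the filter might need to be written as follows.

analyticView.xsodata/analyticView?$select=AMOUNT_SOLD,FAMILY_NAME&$filter=SALE_PRICE%20eq%20323.7m&$format=json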

odata hana sapui5

7 votes · 1 answer · 4,270 views