The default behavior of pandas groupby is to turn the groupby columns into the index and remove them from the DataFrame's list of columns. For example, suppose I have a DataFrame with these columns:
col1|col2|col3|col4
If I apply a groupby on columns col2 and col3 like this:
df.groupby(['col2','col3']).sum()
then the DataFrame df no longer has ['col2','col3'] in its column list. They are automatically turned into the index of the resulting DataFrame.
My question is: how can I perform a groupby on a column and still keep that column in the DataFrame?
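For illustration, a minimal sketch of the behavior described above and of the as_index=False option that keeps the keys as regular columns (the toy DataFrame below is made up):

import pandas as pd

# toy frame, just to illustrate the behavior described above
df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': ['a', 'a', 'b', 'b'],
                   'col3': ['x', 'y', 'x', 'y'],
                   'col4': [10, 20, 30, 40]})

# default: col2/col3 become the index of the result and disappear as columns
grouped_default = df.groupby(['col2', 'col3']).sum()

# as_index=False keeps the groupby keys as ordinary columns
grouped_flat = df.groupby(['col2', 'col3'], as_index=False).sum()
# equivalent alternative: df.groupby(['col2', 'col3']).sum().reset_index()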
I need to detect corrupted text files that contain invalid (non-ASCII) UTF-8, Unicode, or binary characters, such as:
�>t�ï¿ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿ï¿½ï¿½ï¿½ ... (a long run of invalid/mojibake bytes from the corrupted file, truncated here)
What I have tried:
iconv -f utf-8 -t utf-8 -c file.csv
This converts the file from UTF-8 encoding to UTF-8 encoding, with -c meant to skip invalid UTF-8 characters. Yet those illegal characters still end up being printed. Is there another solution, in bash on Linux or in another language?
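As one sketch of the "another language" route, a small Python script that flags files whose raw bytes do not decode as strict UTF-8 (file names are taken from the command line; this only detects the problem, it does not clean the file):

import sys

def is_valid_utf8(path):
    # read the raw bytes and attempt a strict UTF-8 decode;
    # any invalid byte sequence raises UnicodeDecodeError
    with open(path, 'rb') as fh:
        data = fh.read()
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

if __name__ == '__main__':
    for name in sys.argv[1:]:
        print(name, 'ok' if is_valid_utf8(name) else 'INVALID UTF-8')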
The Spark documentation has the following paragraph describing the difference between yarn-client and yarn-cluster mode:
There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
I assume there is a reason for having two options. If so, how do you choose which one to use?
Please back up your answer with facts so that this question and its answers meet Stack Overflow's requirements.
There are a few similar questions on Stack Overflow, but they focus on the difference between the two approaches rather than on when one is more appropriate than the other.
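For context, the mode is normally selected with the --deploy-mode flag of spark-submit; the class and jar names below are placeholders:

# driver runs inside the YARN application master on the cluster
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar

# driver runs in the local client process; the AM only negotiates resources
spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar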
I am using TensorFlow's ImageNet-trained model to extract the features of the last pooling layer as representation vectors for a new dataset of images.
The model predicts a new image like this:
python classify_image.py --image_file new_image.jpeg
I edited the main function so that it takes a folder of images, returns predictions for all of them at once, and writes the feature vectors to a CSV file. This is how I did it:
def main(_):
  maybe_download_and_extract()
  #image = (FLAGS.image_file if FLAGS.image_file else
  #         os.path.join(FLAGS.model_dir, 'cropped_panda.jpg'))
  # edit to take a directory of image files instead of a single file
  if FLAGS.data_folder:
    images_folder = FLAGS.data_folder
    list_of_images = os.listdir(images_folder)
  else:
    raise ValueError("Please specify image folder")

  with open("feature_data.csv", "wb") as f:
    feature_writer = csv.writer(f, delimiter='|')
    for image in list_of_images:
      print(image)
      current_features = run_inference_on_image(images_folder + "/" + image)
      feature_writer.writerow([image] + current_features)
It works for about 21 images but then crashes with the following error:
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1912, in as_graph_def
raise ValueError("GraphDef cannot be larger than 2GB.")
ValueError: GraphDef …
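A plausible cause, not confirmed from the snippet alone, is that the inference graph gets rebuilt on every call inside the loop, so nodes keep accumulating until the serialized GraphDef hits the 2GB protobuf limit. Below is a rough sketch of building the graph and session once and reusing them, assuming classify_image.py's create_graph() helper and the usual Inception-v3 tensor names (adjust the names if they differ):

import tensorflow as tf  # TF 1.x API, as used by classify_image.py

create_graph()  # assumption: helper from classify_image.py that loads the frozen model once
with tf.Session() as sess:
    pool3 = sess.graph.get_tensor_by_name('pool_3:0')  # last pooling layer
    for image in list_of_images:
        with tf.gfile.FastGFile(images_folder + "/" + image, 'rb') as img_f:
            image_data = img_f.read()
        features = sess.run(pool3, {'DecodeJpeg/contents:0': image_data})
        feature_writer.writerow([image] + list(features.flatten()))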
Is it possible to select the negation of a given list of columns from a pandas DataFrame? For example, say I have the following DataFrame:
T1_V2 T1_V3 T1_V4 T1_V5 T1_V6 T1_V7 T1_V8
1 15 3 2 N B N
4 16 14 5 H B N
1 10 10 5 N K N
I want all of the columns except T1_V6. I would normally do it like this:
df = df[["T1_V2","T1_V3","T1_V4","T1_V5","T1_V7","T1_V8"]]
My question is whether there is a way to do this more along the lines of:
df = df[!["T1_V6"]]
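Python lists have no ! operator, but a couple of standard pandas idioms give the same effect; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'T1_V2': [1, 4, 1], 'T1_V6': ['N', 'H', 'N'], 'T1_V8': ['N', 'N', 'N']})

# drop the unwanted column(s); returns a new DataFrame
out1 = df.drop('T1_V6', axis=1)
# or, in newer pandas: df.drop(columns=['T1_V6'])

# or select with a boolean mask over the column index
out2 = df.loc[:, ~df.columns.isin(['T1_V6'])]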
I want to apply PySpark's MinMaxScaler to multiple columns of a PySpark DataFrame df. So far I only know how to apply it to a single column, e.g. x:
import pandas as pd
from pyspark.ml.feature import MinMaxScaler
pdf = pd.DataFrame({'x':range(3), 'y':[1,2,5], 'z':[100,200,1000]})
df = spark.createDataFrame(pdf)
scaler = MinMaxScaler(inputCol="x", outputCol="x")
scalerModel = scaler.fit(df)
scaledData = scalerModel.transform(df)
What if I have 100 columns? Is there a way to do min-max scaling over many columns in PySpark?
Update:
Also, how do I apply MinMaxScaler to integer or double values? It raises the following error:
java.lang.IllegalArgumentException: requirement failed: Column length must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually int.
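For what it's worth, MinMaxScaler expects a Vector column, which is also why it complains about a plain int/double column above. Below is a sketch that wraps each column with VectorAssembler and scales it in one Pipeline (the column list is illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.feature import MinMaxScaler, VectorAssembler

cols_to_scale = ["x", "y", "z"]  # in practice, e.g. all numeric columns of df

stages = []
for c in cols_to_scale:
    # MinMaxScaler needs a vector input, so wrap each column first
    stages.append(VectorAssembler(inputCols=[c], outputCol=c + "_vec"))
    stages.append(MinMaxScaler(inputCol=c + "_vec", outputCol=c + "_scaled"))

scaled_df = Pipeline(stages=stages).fit(df).transform(df)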
I have a large file (around 19GB) that I want to load into memory to perform aggregations over some columns.
The file looks like this:
id, col1, col2, col3,
1 , 12 , 15 , 13
2 , 18 , 15 , 13
3 , 14 , 15 , 13
3 , 14 , 185 , 213
Note that after loading it into a DataFrame I aggregate by the columns (id, col1); also note that these keys may be repeated consecutively a few times, for example:
3 , 14 , 15 , 13
3 , 14 , 185 , 213
For a small file, the following script does the job:
import pandas as pd
data = pd.read_csv("data_file", delimiter=",")
data = data.reset_index(drop=True).groupby(["id","col1"], as_index=False).sum()
However, for the large file I need to use chunksize when reading the CSV to limit the number of rows loaded into memory:
import pandas as pd
data = pd.read_csv("data_file", delimiter=",", chunksize=1000000)
data = data.reset_index(drop=True).groupby(["id","col1"], as_index=False).sum()
In the latter case, there is a problem if rows with the same (id, col1) end up split across different chunks. How should I handle that?
Edit
As @EdChum pointed out, there is a potential workaround: append the groupby result of each chunk to a new CSV, then read it back and aggregate again, repeating until the size of the df no longer changes. …
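A minimal sketch of that chunk-then-re-aggregate idea (file name and chunk size are placeholders; it works for sum because partial sums can be combined):

import pandas as pd

chunks = pd.read_csv("data_file", delimiter=",", chunksize=1000000)

# aggregate each chunk separately, then aggregate the partial results once
# more so that (id, col1) groups split across chunks are merged correctly
partials = [chunk.groupby(["id", "col1"], as_index=False).sum() for chunk in chunks]
data = pd.concat(partials).groupby(["id", "col1"], as_index=False).sum()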
I would like to know whether sklearn classifiers have any option to fit with some hyperparameters and then, after changing a few of them, refit the model while saving computation (fitting) cost.
Say a logistic regression is fit with C=1e5 (logreg = linear_model.LogisticRegression(C=1e5)), and then we change only C to C=1e3. I would like to save some computation, since only one parameter was changed.
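One relevant option is the warm_start flag, sketched below: it reuses the previous coefficients as the starting point of the next fit (whether it actually saves time depends on the solver; the dataset and parameter values are illustrative):

from sklearn import linear_model
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# warm_start=True keeps the fitted coefficients and uses them to initialize
# the next call to fit() instead of starting from scratch
logreg = linear_model.LogisticRegression(C=1e5, warm_start=True, max_iter=1000)
logreg.fit(X, y)

logreg.set_params(C=1e3)  # change only the regularization strength
logreg.fit(X, y)          # refit, starting from the previous solution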
In Python, with pandas:
from datetime import datetime
import pandas as pd

g = pd.Series(dict(a = 5, b =datetime(2018, 1,1)))
g['datetime'] = pd.Timestamp('2018-01-02')
g then yields:
a 5
b 2018-01-01 00:00:00
datetime 1514851200000000000
dtype: object
Does anyone know why the Timestamp is converted to its int value here, and how to avoid the problem and correctly append a Timestamp to the Series?
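A sketch of one possible workaround, not necessarily the only one: build the new entry as its own Series and concatenate, instead of assigning the Timestamp item-by-item into the object-dtype Series:

from datetime import datetime
import pandas as pd

g = pd.Series(dict(a=5, b=datetime(2018, 1, 1)))

# concatenating a separate one-element Series keeps the Timestamp object
extra = pd.Series({'datetime': pd.Timestamp('2018-01-02')})
g = pd.concat([g, extra])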
I am trying to access an OData provider, specifically an analytic view exposed by SAP HANA through an OData service. I apply a filter containing a number in the OData query, but I get an error about how the number is supplied:
"Operator 'eq' incompatible with operand types 'Edm.Decimal' and 'Edm.String'
This is how I access the resource:
analyticView.xsodata/analyticView?$select=AMOUNT_SOLD,FAMILY_NAME&$filter=SALE_PRICE%20eq%20'323.7'&$format=json
I also tried removing the quotes from the number:
analyticView.xsodata/analyticView?$select=AMOUNT_SOLD,FAMILY_NAME&$filter=SALE_PRICE%20eq%20323.7&$format=json
But then I get this error:
"Operator 'eq' incompatible with operand types 'Edm.Decimal' and 'Edm.Double'."
Could you help me figure out what the problem is and how to solve it?
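One more variant worth trying, hedged because it depends on the OData version HANA exposes here: OData V2 writes Edm.Decimal literals with an M suffix (e.g. 323.7M), so the filter could be expressed as:

analyticView.xsodata/analyticView?$select=AMOUNT_SOLD,FAMILY_NAME&$filter=SALE_PRICE%20eq%20323.7M&$format=json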