Understanding this Pandas script

Question

AEA*_*AEA 2 python comments numpy python-2.7 pandas

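I was given this code, which groups data into histogram-style bins. I have been trying to understand this pandas script so that I can edit, manipulate and reproduce it. I have added comments to the parts I understand.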

Code

import numpy as np
import pandas as pd


column_names = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6',
                'col7', 'col8', 'col9', 'col10', 'col11'] #names to be used as column labels.  If no names are specified then columns can be referred to by number, e.g. df[0], df[1] etc.

df = pd.read_csv('data.csv', header=None, names=column_names) #header= None means there are no column headings in the  csv file

df.ix[df.col11 == 'x', 'col11']=-0.08 #trick so that 'x' rows will be grouped into a category >-0.1 and <= -0.05.  This will allow all of col11 to be treated as numbers

bins = np.arange(-0.1, 1.0, 0.05) #bins to put col11 values in.  >-0.1 and <=-0.05 will be our special 'x' rows, >-0.05 and <=0 will capture all the '0' values.
labels = np.array(['%s:%s' % (x, y) for x, y in zip(bins[:-1], bins[1:])]) #create labels for the bins
labels[0] = 'x' #change first bin label to 'x'
labels[1] = '0' #change second bin label to '0'

df['col11'] = df['col11'].astype(float) #convert col11 to numbers so we can do math on them


df['bin'] = pd.cut(df['col11'], bins=bins, labels=False) # make another column 'bin' and put in an integer representing what bin the number falls into. Later we'll map the integer to the bin label


df.set_index('bin', inplace=True, drop=False, append=False) #groupby is meant to run faster with an index

def count_ones(x):
    """aggregate function to count values that equal 1"""
    return np.sum(x==1)

dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
dfg.index = labels[dfg.index]

dfg.ix['x',('col11', 'mean')]='N/A'
print(dfg)
dfg.to_csv('new.csv')


The part I am really struggling to understand is this section:

def count_ones(x):
    """aggregate function to count values that equal 1"""
    return np.sum(x==1)

dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
dfg.index = labels[dfg.index]

dfg.ix['x',('col11', 'mean')]='N/A'
print(dfg)
dfg.to_csv('new.csv')


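If anyone could comment this script I would be very grateful. Also feel free to correct or add to my comments (they are my assumptions so far and may be wrong). I hope this is not off-topic for SO. I will happily award a 50-point bounty to any user who can help me.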

Answer 1

rtr*_*ker 8

I'll try to explain my code, since it uses a few tricks.

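I use df as shorthand for a pandas DataFrame.
I use dfg as the name for my grouped df.
Let me build up the expression dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
- The code dfg = df[['bin','col7','col11']] says: take the columns named 'bin', 'col7' and 'col11' from my DataFrame df.
- Now that I have the 3 columns I am interested in, I want to group by the values in the 'bin' column. This is done with dfg = df[['bin','col7','col11']].groupby('bin'). I now have groups of data, i.e. all records in bin #1, all records in bin #2, and so on.
- I now want to apply some aggregate functions to the records in each of my bin groups (an aggregate function is something like sum, mean or count).
- I want to apply three aggregate functions to the records in each bin: the mean of 'col11', the number of records in each bin, and the number of records in each bin where 'col7' equals 1. The mean is easy; numpy already has a function to calculate it. If I were only taking the mean of 'col11' I would write: dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean]}). The number of records is also easy; Python's built-in len function gives the number of items in a sequence. So I now have dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [len]}). I could not think of an existing function that counts the number of ones in an array (it has to work on an array of values), so I defined my own: count_ones.
- Now I'll deconstruct the count_ones function. The variable x passed to the function is always one-dimensional and array-like: in our case, all the 'col7' values that fall in bin #1, then all the 'col7' values that fall in bin #2, and so on. The code x==1 creates a boolean (True/False) array the same size as x. An entry in the boolean array is True if the corresponding value in x equals 1, and False otherwise. Because Python treats True as 1, summing the boolean array gives the count of values that equal 1. Now that I have my count_ones function, I apply it to 'col7': dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
- You can see that the syntax of .agg is .agg({'column_name_to_apply_to': [list_of_function_names_to_apply]})
- With boolean arrays you can build all sorts of condition combinations: (x == 6) | (x == 3) means 'x equals 6 or x equals 3'. The 'and' operator is &. Always put () around each condition.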
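The steps above can be tried on a small table. This is a minimal sketch, not the original script: the toy values are made up, and the string 'mean' is used in place of np.mean (an assumption made here because recent pandas versions warn when NumPy reducers are passed to agg).

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV columns.
df = pd.DataFrame({
    "bin":   [0, 0, 1, 1, 1, 3],
    "col7":  [1, 0, 1, 1, 2, 1],
    "col11": [-0.08, -0.08, 0.0, 0.0, 0.0, 0.07],
})

def count_ones(x):
    """Aggregate function: count the values in x that equal 1."""
    # x == 1 is a boolean array; summing it counts the True entries.
    return int((x == 1).sum())

# One row per bin that actually occurs; columns are (column, function) pairs.
dfg = df[["bin", "col7", "col11"]].groupby("bin").agg(
    {"col11": ["mean"], "col7": [count_ones, len]}
)
print(dfg)

# Combining conditions: parenthesise each one, use | for "or" and & for "and".
x = df["col7"]
ones_or_twos = int(((x == 1) | (x == 2)).sum())
print(ones_or_twos)
```

Bin 2 never occurs in the toy data, so it has no row in the result; that detail matters for the relabelling step described next.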
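Now for dfg.index = labels[dfg.index]. In dfg, because I grouped by 'bin', the index (or row label) of each row of grouped data (i.e. my dfg.index) is the bin number: 0, 1, 2, ... The expression labels[dfg.index] uses fancy indexing of a numpy array. labels[0] gives me the first label and labels[3] gives me the 4th label. With a normal Python list you can use a slice such as labels[0:3], which gives labels 0, 1 and 2. With numpy arrays we can go a step further and index with a list of values or another array, so labels[np.array([0,2,4])] gives me labels 0, 2 and 4. By using labels[dfg.index] I am requesting the labels that correspond to the bin numbers. Basically, I am changing my bin numbers into bin labels. I could have done that on the original data, but that would be thousands of rows; by doing it after the group by I only do it on 21 rows or so. Note that I cannot simply write dfg.index = labels, because some of my bins might be empty and would therefore not appear in the grouped data.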
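A small sketch of that relabelling, using a shortened, made-up set of bin edges and a plain array (present_bins, a hypothetical name) in the role of dfg.index:

```python
import numpy as np

# Labels built the same way as in the script, but for only four bins.
bins = np.array([-0.1, -0.05, 0.0, 0.05, 0.1])
labels = np.array(['%s:%s' % (lo, hi) for lo, hi in zip(bins[:-1], bins[1:])])
labels[0] = 'x'   # first bin holds the 'x' rows
labels[1] = '0'   # second bin holds the zeros

# Suppose bin 2 was empty, so only bins 0, 1 and 3 survive the groupby.
present_bins = np.array([0, 1, 3])

# Fancy indexing picks exactly the labels for the bins that are present.
print(labels[present_bins])
```

Assigning all four labels directly would fail here, because the grouped result has only three rows.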
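Now the dfg.ix['x',('col11', 'mean')]='N/A' part. Remember that I did df.ix[df.col11 == 'x', 'col11']=-0.08 so that all my invalid data would be treated as a number and placed in the first bin. After applying the group by and the aggregate functions, the mean of the 'col11' values in my first bin will be -0.08 (because all of those values are -0.08). I know this is not correct: every -0.08 actually indicates that the original value was x, and you cannot take the mean of x. So I manually set it to N/A. That is, dfg.ix['x',('col11', 'mean')]='N/A' means: in dfg, where the index (or row) is 'x' and the column is 'col11 mean', set the value to 'N/A'. The ('col11', 'mean') is, I believe, how pandas names the columns when aggregating, i.e. after I do .agg({'col11': [np.mean]}), I refer to the resulting aggregate column as ('column_name', 'aggregate_function_name').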
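For readers on a current pandas, here is a hedged sketch of the same assignment. The .ix indexer was deprecated in pandas 0.20 and removed in 1.0; .loc is the label-based replacement. The toy frame is made up, and casting the column to object first is an assumption made here so that a string can be stored alongside the float means.

```python
import pandas as pd

# Hypothetical data already labelled by bin, with 'x' rows carrying the -0.08 sentinel.
df = pd.DataFrame({"bin": ["x", "x", "0"], "col11": [-0.08, -0.08, 0.0]})
dfg = df.groupby("bin").agg({"col11": ["mean"]})

# Aggregating with a list of functions gives two-level column names (tuples).
print(dfg.columns.tolist())

# The mean column holds floats, so cast it to object before storing a string.
dfg[("col11", "mean")] = dfg[("col11", "mean")].astype(object)

# Row label 'x', column ('col11', 'mean'): overwrite the meaningless mean.
dfg.loc["x", ("col11", "mean")] = "N/A"
print(dfg)
```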

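The motivation for all of this is: convert all the data to numbers so that I can use the power of Pandas, then, after processing, manually change any values that I know are garbage. Let me know if you need any more explanation.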
