I am aggregating my Pandas dataframe data. Specifically, I want the mean and the sum of amount, grouped by the tuple of [origin, type]. To get the mean and sum I tried the following:
import numpy as np
import pandas as pd
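# Assumption (not in the original post): `groupbyvars` is not defined in this
# excerpt; from the description above it is presumably the origin/type pair.
groupbyvars = ['origin', 'type']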
result = data.groupby(groupbyvars).agg({'amount': [ pd.Series.sum, pd.Series.mean]}).reset_index()
My problem is that the amount column contains NaNs, which causes result from the code above to contain a lot of NaN means and sums.
I know that both pd.Series.sum and pd.Series.mean have skipna=True by default, so why am I still getting NaNs here?
I also tried this, which obviously does not work:
data.groupby(groupbyvars).agg({'amount': [ pd.Series.sum(skipna=True), pd.Series.mean(skipna=True)]}).reset_index()
Edit:
Following @Korem's suggestion, I also tried using partial, as follows:
from functools import partial

s_na_mean = partial(pd.Series.mean, skipna=True)
data.groupby(groupbyvars).agg({'amount': [np.nansum, s_na_mean]}).reset_index()
but got this error:
error: 'functools.partial' object has no attribute '__name__'
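For context, agg() tries to read a __name__ attribute off each aggregation function in order to label the output columns, which functools.partial objects do not carry. A minimal sketch of a workaround, assuming the same data and groupbyvars as above (this is not from the original post), is to pass a plain named function instead:

# A named function has a __name__, so agg() can label the output column.
def na_mean(x):
    return x.mean(skipna=True)

result = data.groupby(groupbyvars).agg({'amount': [np.nansum, na_mean]}).reset_index()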
I am using PostgreSQL 9.1.2 and I have a basic table, shown below, in which I record an entry's survival status as a boolean (Survival) along with the number of days survived (Survival(Days)).
I have manually added a new column called 1-yr Survival, and I now want to fill in this column's value for every entry in the table, conditional on that entry's values in the Survival and Survival(Days) columns. Once that is done, the database table will look like this:
Survival Survival(Days) 1-yr Survival
---------- -------------- -------------
Dead 200 NO
Alive - YES
Dead 1200 YES
The pseudocode for filling in the conditional 1-yr Survival values would be something like:
ALTER TABLE mytable ADD COLUMN "1-yr Survival" text
for each row
if ("Survival" = Dead & "Survival(Days)" < 365) then Update "1-yr Survival" = NO
else Update "1-yr Survival" = YES
end
I believe this is a basic operation, but I have not found the PostgreSQL syntax to do it. Some search results suggest adding a trigger, but I am not sure that is what I need; I think my case is much simpler. Any help/suggestions would be appreciated.
This is an extension of a question about returning the rows of a matrix that satisfy a condition in R. Say I have the matrix:
     one two three four
[1,]   1   6    11   16
[2,]   2   7    12   17
[3,]   3   8    11   18
[4,]   4   9    11   19
[5,]   5  10    15   20
[6,]   1   6    15   20
[7,]   5   7    12   20
I want to return, as fast as possible, all rows where matrix$two == 7 AND matrix$three == 12. This is the way I know how:
out <- mat[mat$two == 7,]
final_out <- out[out$three == 12, ]
Obviously there should be a way to get final_out in a single line, something like final_out <- which(mat$two == 7 && mat$three == 12), that is faster and more concise than the two lines of code above.
What is the fastest R code for this multi-condition matrix query?
I have an incomplete dataframe, incomplete_df, shown below. I want to impute the missing amounts with the average amount of the corresponding id. If the average for that particular id is itself NaN (see id=4), I want to use the overall average.
Here are the sample data and my highly inefficient solution:
import pandas as pd
import numpy as np
incomplete_df = pd.DataFrame({'id': [1,2,3,2,2,3,1,1,1,2,4],
'type': ['one', 'one', 'two', 'three', 'two', 'three', 'one', 'two', 'one', 'three','one'],
'amount': [345,928,np.NAN,645,113,942,np.NAN,539,np.NAN,814,np.NAN]
}, columns=['id','type','amount'])
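# Assumption: `means` used in the loop below is not defined in this excerpt;
# presumably it is the per-id average amount, along these lines:
means = incomplete_df.groupby('id')[['amount']].mean()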
# Forrest Gump Solution
for idx in incomplete_df.index[np.isnan(incomplete_df.amount)]:  # loop through all rows with amount = NaN
    cur_id = incomplete_df.loc[idx, 'id']
    if cur_id in means.index:
        incomplete_df.loc[idx, 'amount'] = means.loc[cur_id]['amount']  # average amount of that specific id.
    else:
        incomplete_df.loc[idx, …

I have a Pandas DataFrame as shown below.
df
A B
date_time
2014-07-01 06:03:59.614000 62.1250 NaN
2014-07-01 06:03:59.692000 62.2500 NaN
2014-07-01 06:13:34.524000 62.2500 241.0625
2014-07-01 06:13:34.602000 62.2500 241.5000
2014-07-01 06:15:05.399000 62.2500 241.3750
2014-07-01 06:15:05.399000 62.2500 241.2500
2014-07-01 06:15:42.004000 62.2375 241.2500
2014-07-01 06:15:42.082000 62.2375 241.3750
2014-07-01 06:15:42.082000 62.2375 240.2500
I want to change this to a regular 1-minute frequency, but I get the following error:
new = df.asfreq('1Min')
>>error: cannot reindex from a duplicate axis
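(As an aside, a quick sanity check that the index really does contain repeated timestamps, not from the original post:)

# Count index entries that repeat an earlier timestamp.
df.index.duplicated().sum()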
Now, I understand why this happens. Since my time granularity is high (milliseconds) but irregular, I get multiple readings per minute, even per second. So I tried to collapse these millisecond readings into minutes and drop the duplicates, as follows.
# try to convert the index to minutes and drop duplicates
df['index'] = df.index
df['minute_index']= df['index'].apply( lambda x: x.strftime('%Y-%m-%d %H:%M'))
df.drop_duplicates(cols = 'minute_index', …

I know about statsmodels.tools.tools.ECDF, but since computing an empirical cumulative distribution function (ECDF) is very simple, and I want to minimize the dependencies in my project, I would like to code it by hand.
For a given list() / np.array() / Pandas.Series, the ECDF of each element can be computed as given on Wikipedia:
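In symbols, for a sample of n observations x_1, ..., x_n, the definition referenced above is

\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{ x_i \le t \}

i.e. the fraction of observations that are less than or equal to t.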

I have the Pandas DataFrame dfser below, and I want to get the ECDF of its values column. I have also written two one-liner solutions for it.
Is there a faster way? Speed is important in my application.
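As a point of reference, a direct quadratic rendering of the definition above on a toy array looks like the sketch below; it is only a baseline, not one of the one-liners mentioned above, and the data are made up:

import numpy as np

vals = np.array([0.2, 0.5, 0.5, 0.9])
# ECDF at each element = fraction of observations <= that element.
ecdf = np.array([(vals <= v).mean() for v in vals])
# ecdf -> array([0.25, 0.75, 0.75, 1.  ])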
# Note that in my case indices are unique identifiers so I cannot reset them.
import numpy as np
import pandas as pd
# all indices are unique, but there may be duplicate measurement values (that belong to different indices).
dfser = pd.DataFrame({'group':['a','b','b','a','d','c','e','e','c','a','b','d','d','c','d','e','e','a'],
'values':[2.01899E-06, 1.12186E-07, 8.97467E-07, 2.91257E-06, 1.93733E-05,
0.00017889, 0.000120963, 4.27643E-07, 3.33614E-07, 2.08352E-12,
1.39478E-05, 4.28255E-08, 9.7619E-06, 8.51787E-09, 1.28344E-09,
3.5063E-05, 0.01732035, 2.08352E-12]}, …

I have the following code, with which I can compute the volume-weighted average price in three lines of Pandas code.
import numpy as np
import pandas as pd
from pandas.io.data import DataReader
import datetime as dt
df = DataReader(['AAPL'], 'yahoo', dt.datetime(2013, 12, 30), dt.datetime(2014, 12, 30))
df['Cum_Vol'] = df['Volume'].cumsum()
df['Cum_Vol_Price'] = (df['Volume'] * (df['High'] + df['Low'] + df['Close'] ) /3).cumsum()
df['VWAP'] = df['Cum_Vol_Price'] / df['Cum_Vol']
As an exercise, I am trying to find a way to code this without using cumsum(), i.e. to compute the VWAP in a single pass through the column. I tried the line below, using .apply(). The logic is there, but the problem is that I cannot store a value computed at row n for use at row n+1. How do you approach this in pandas? Just use an external tuple or dictionary to hold the cumulative values temporarily?
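As an aside, a minimal sketch of that external-dictionary idea might look like the following (names such as state and running_vwap are only illustrative, and this is separate from the attempt further below):

state = {'cum_vol': 0.0, 'cum_vol_price': 0.0}

def running_vwap(row):
    # Update the running totals held in the external dict, then return the VWAP so far.
    typical_price = (row['High'] + row['Low'] + row['Close']) / 3
    state['cum_vol'] += row['Volume']
    state['cum_vol_price'] += row['Volume'] * typical_price
    return state['cum_vol_price'] / state['cum_vol']

df['VWAP_apply'] = df.apply(running_vwap, axis=1)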
df['Cum_Vol']= np.nan
df['Cum_Vol_Price'] = np.nan
# calculate running cumulatives by apply - assume df row index is 0 to N
df['Cum_Vol'] = …

I have a large dataframe (e.g. 15k objects) in which every row is an object and the columns are the objects' numeric features. It has the following form:
df = pd.DataFrame({ 'A' : [0, 0, 1],
'B' : [2, 3, 4],
'C' : [5, 0, 1],
'D' : [1, 1, 0]},
columns= ['A','B', 'C', 'D'], index=['first', 'second', 'third'])
I want to compute the pairwise distances between all objects (rows), and I read that scipy's pdist() function is a good solution thanks to its computational efficiency. I can simply call:
res = pdist(df, 'cityblock')
res
>> array([ 6., 8., 4.])
and see that the res array contains the distances in the following order: [first-second, first-third, second-third].
My question is how to get this in matrix, DataFrame or (less ideally) dict format, so that I know exactly which pair each distance value belongs to, like this:
first second third
first 0 - -
second 6 0 -
third 8 4 0
Finally, I think it might be convenient to have the distance matrix as a pandas DataFrame, because I could then apply sorting and ranking operations to each row (e.g. finding the N objects closest to first).
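As an aside, one possible way to label the condensed distances is scipy's squareform, which expands the condensed vector back into a full square matrix that can then be wrapped in a DataFrame indexed by the original row labels (a sketch, not necessarily the fastest option):

from scipy.spatial.distance import pdist, squareform

# Expand the condensed distance vector and label it with the original row index.
dist_df = pd.DataFrame(squareform(res), index=df.index, columns=df.index)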
I have two Pandas dataframes, habitat_family and habitat_species. I want to populate habitat_species based on the classification in lookupMap and the values in habitat_family:
import pandas as pd
import numpy as np
species = ['tiger', 'lion', 'mosquito', 'ladybug', 'locust', 'seal', 'seabass', 'shark', 'dolphin']
families = ['mammal','fish','insect']
lookupMap = {'tiger':'mammal', 'lion':'mammal', 'mosquito':'insect', 'ladybug':'insect', 'locust':'insect',
'seal':'mammal', 'seabass':'fish', 'shark':'fish', 'dolphin':'mammal' }
habitat_family = pd.DataFrame({'id': range(1,11),
'mammal': [101,123,523,562,546,213,562,234,987,901],
'fish' : [625,254,929,827,102,295,174,777,123,763],
'insect': [345,928,183,645,113,942,689,539,789,814]
}, index=range(1,11), columns=['id','mammal','fish','insect'])
habitat_species = pd.DataFrame(0.0, index=range(1,11), columns=species)
# My highly inefficient solution:
for id in habitat_family.index: # loop through habitat id's
    for spec …

I have a dataframe as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id' : range(1,9),
'code' : ['one', 'one', 'two', 'three',
'two', 'three', 'one', 'two'],
'colour': ['black', 'white','white','white',
'black', 'black', 'white', 'white'],
'amount' : np.random.randn(8)}, columns= ['id','code','colour','amount'])
I want to be able to group the ids by code and colour, and then sort them within each group with respect to amount. I know how to groupby():
df.groupby(['code','colour']).head(5)
                    id   code colour    amount
code  colour
one   black  0       1    one  black -0.117307
      white  1       2    one  white  1.653216
             6       7    one  white  0.817205
three black  5       6  three  black  0.567162 …
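As an aside, one way to get every (code, colour) group ordered by amount is a single lexicographic sort (a sketch, not the original code; sort_values assumes a reasonably recent pandas, older versions spell it DataFrame.sort):

# Sort by the group keys first, then by amount inside each group.
df_sorted = df.sort_values(['code', 'colour', 'amount'])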