I have a data frame that looks like this:
Out[14]:
impwealth indweight
16 180000 34.200
21 384000 37.800
26 342000 39.715
30 1154000 44.375
31 421300 44.375
32 1210000 45.295
33 1062500 45.295
34 1878000 46.653
35 876000 46.653
36 925000 53.476
I want to compute the weighted median of the column `impwealth` using the frequency weights `indweight`. My pseudocode looks like this:
# Sort `impwealth` in ascending order
df.sort_values('impwealth', inplace=True)
# Find the 50th percentile weight, P
P = df['indweight'].sum() * (.5)
# Find the first row at which the cumulative `indweight` reaches P
i = df.index[df['indweight'].cumsum() >= P][0]
# The …

I have a Python script that cleans a large panel dataset (2,000,000+ observations) and performs basic statistical calculations on it.
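I found that some of these tasks are better suited to Stata, so I wrote a do-file with the necessary commands. I would therefore like to run the .do file from within my Python code. How can I call a .do file from Python?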
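One way to approach this is to launch Stata in batch mode through Python's `subprocess` module. The sketch below is only an illustration under stated assumptions: the executable name `stata` and the file name `analysis.do` are placeholders (installations often expose `stata-mp` or `stata-se` instead, or require a full path), and the `-b do` form is the batch invocation used on Unix-like systems.

```python
import subprocess
from pathlib import Path


def build_stata_command(do_file, stata_exe="stata"):
    """Return the argument list for running a do-file in Stata's batch mode.

    `stata_exe` is an assumption: the executable name or full path depends
    on the Stata edition and on how it was installed.
    """
    # On Unix-like systems, `-b do <file>` runs the do-file in batch mode
    # and writes the output to a .log file in the working directory.
    return [stata_exe, "-b", "do", str(Path(do_file))]


def run_do_file(do_file, stata_exe="stata"):
    """Run the do-file; raises CalledProcessError on a non-zero exit code."""
    cmd = build_stata_command(do_file, stata_exe)
    return subprocess.run(cmd, check=True)
```

Calling `run_do_file("analysis.do")` would then block until Stata finishes. On Windows the batch switch is `/e` rather than `-b`, so the argument list would need adjusting. The exit code may not reflect errors raised inside the do-file, so it is worth inspecting the generated log as well.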

I have a data frame `d` with about 100,000,000 rows and 3 columns. It looks like this:
import pandas as pd
In [17]: d = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e'], 'val': [1, 2, 3, 4, 5], 'n': [34, 22, 95, 86, 44]})
In [18]: d.set_index(['id', 'val'], inplace = True)
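I also have another data frame whose `id` and `val` values I want to keep in `d`. There are about 600,000 combinations of `id` and `val` that I want to keep: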
In [20]: keep = pd.DataFrame({'id':['a', 'b'], 'val' : [1, 2]})
I have tried doing it like this:
In [21]: keep.set_index(['id', 'val'], inplace = True)
In [22]: d.loc[d.index.isin(keep.index), :]
Out [22]:
n
id val
a 1 34
b 2 22 …

I have a data frame that looks like this:
In [9]: d = pd.DataFrame({'place': ['home', 'home', 'home', 'home', 'office', 'office', 'office', 'home', 'office', 'home', 'office', 'home', 'office', 'home'], 'person': ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c'], 'other_stuff': ['f', 'g', 'd', 'q', 'w', 'r', 's', 't', 'u', 'v', 'w', 'l', 'm', 'n']})
In [7]: d
place other_stuff person
0 home f a
1 home g a
2 home d a
3 home q a
4 office w a
5 office r a …