import numpy as np
import pandas as pd
dic = {'A': [np.nan, 4, np.nan, 4], 'B': [9, 2, 5, 3], 'C': [0, 0, 5, 3]}
df = pd.DataFrame(dic)
df
Suppose I have the following data:
     A  B  C
0  NaN  9  0
1  4.0  2  0
2  NaN  5  5
3  4.0  3  3
I want to select the rows where column A is NaN and replace the values of column B in those rows with np.nan, as shown below.
     A    B  C
0  NaN  NaN  0
1  4.0  2.0  0
2  NaN  NaN  5
3  4.0  3.0  3
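I tried df[df.A.isna()]["B"] = np.nan, but it did not work.
According to this page, I should select the data with df.iloc. The problem is that if df has many rows, I cannot select the data by typing in the indices.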
Option 1
You were actually very close. Use pd.Series.isnull on A and assign to B with df.loc:
df.loc[df.A.isnull(), 'B'] = np.nan
df
     A    B  C
0  NaN  NaN  0
1  4.0  2.0  0
2  NaN  NaN  5
3  4.0  3.0  3
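The difference between the chained attempt from the question and the single df.loc call can be sketched roughly as below. This is only an illustration, not part of the original answer. Column B is created as float here (an assumption, so that writing NaN needs no dtype change), and the warning filter is there only to silence the warning pandas emits for the chained pattern. The names chained_changed and loc_changed are made up for the example.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 4, np.nan, 4],
                   'B': [9.0, 2.0, 5.0, 3.0],  # float on purpose (assumption)
                   'C': [0, 0, 5, 3]})

# Chained indexing: df[mask] builds a new object, so the assignment
# lands on that temporary and df itself is left untouched.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about this pattern
    df[df.A.isna()]["B"] = np.nan
chained_changed = bool(df.B.isna().any())

# Single .loc call: the row mask and the column label are resolved
# together, so the assignment is applied to df directly.
df.loc[df.A.isna(), "B"] = np.nan
loc_changed = bool(df.B.isna().any())

print(chained_changed, loc_changed)
```

If B is an integer column, as in the question, the df.loc assignment has to change its dtype to float. Depending on the pandas version this may warn or be rejected, so casting B to float first is the cautious route.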
Option 2
Use np.where:
df['B'] = np.where(df.A.isnull(), np.nan, df.B)
df
     A    B  C
0  NaN  NaN  0
1  4.0  2.0  0
2  NaN  NaN  5
3  4.0  3.0  3
Use mask, or where with the inverted condition. Both replace with NaN by default:
df['B'] = df.B.mask(df.A.isnull())
df['B'] = df.B.where(df.A.notnull())
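That mask and where are mirror images of each other can be checked with a small sketch like this one. B is made float here (an assumption, to keep dtypes out of the picture), and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 4, np.nan, 4],
                   'B': [9.0, 2.0, 5.0, 3.0],  # float on purpose (assumption)
                   'C': [0, 0, 5, 3]})

cond = df['A'].isnull()

# mask: replace the values where the condition is True
masked = df['B'].mask(cond)

# where: keep the values where the condition is True,
# so the inverted condition gives the same result
kept = df['B'].where(~cond)

same = masked.equals(kept)
print(same)
```

Both calls return a new Series and leave df alone, which is why the answer assigns the result back to df['B'].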
numpy.where works very similarly, except that you define both outputs:
df['B'] = np.where(df.A.isnull(), np.nan, df.B)
print(df)
     A    B  C
0  NaN  NaN  0
1  4.0  2.0  0
2  NaN  NaN  5
3  4.0  3.0  3
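One thing worth knowing about the numpy.where variant is that it hands back a plain NumPy array rather than a Series. A minimal sketch, assuming a float B column and using made-up variable names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 4, np.nan, 4],
                   'B': [9.0, 2.0, 5.0, 3.0]})  # float on purpose (assumption)

# np.where(condition, value_if_true, value_if_false) picks element-wise
result = np.where(df['A'].isnull(), np.nan, df['B'])

# The result is a plain ndarray: it carries no index
is_plain_array = isinstance(result, np.ndarray)

# Assigning the array back matches rows by position, which is fine here
# because it was computed from the same DataFrame in the same row order.
df['B'] = result
print(is_plain_array)
print(df)
```

Since the array has no index, this approach relies on the row order not changing between computing the array and assigning it.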
Timings:
dic = {'A': [np.nan, 4, np.nan, 4], 'B': [9, 2, 5, 3], 'C': [0, 0, 5, 3]}
df = pd.DataFrame(dic)
df = pd.concat([df] * 10000, ignore_index=True)
In [61]: %timeit df['B'] = np.where(df.A.isnull(), np.nan, df.B)
The slowest run took 7.57 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 405 µs per loop
In [62]: %timeit df['B'] = df.B.mask(df.A.isnull())
The slowest run took 70.14 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 3.54 ms per loop
In [63]: %timeit df['B'] = df.B.where(df.A.notnull())
The slowest run took 5.65 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.04 ms per loop
In [65]: %timeit df.B += df.A * 0
The slowest run took 12.44 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 913 µs per loop
In [67]: %timeit df.loc[df.A.isnull(), 'B'] = np.nan
The slowest run took 4.56 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.88 ms per loop
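For anyone without IPython, a comparison along the same lines can be sketched with the standard timeit module. The helper names and the candidates dict are made up for this example, B is float by assumption, and the numbers will differ by machine and pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({'A': [np.nan, 4, np.nan, 4],
                     'B': [9.0, 2.0, 5.0, 3.0],  # float on purpose (assumption)
                     'C': [0, 0, 5, 3]})
big = pd.concat([base] * 10000, ignore_index=True)


def with_np_where(df):
    df['B'] = np.where(df['A'].isnull(), np.nan, df['B'])


def with_mask(df):
    df['B'] = df['B'].mask(df['A'].isnull())


def with_where(df):
    df['B'] = df['B'].where(df['A'].notnull())


def with_loc(df):
    df.loc[df['A'].isnull(), 'B'] = np.nan


candidates = {'np.where': with_np_where, 'mask': with_mask,
              'where': with_where, 'loc': with_loc}

timings = {}
for name, func in candidates.items():
    frame = big.copy()  # every candidate works on its own copy
    runs = timeit.repeat(lambda: func(frame), number=10, repeat=3)
    timings[name] = min(runs) / 10  # best average seconds per call

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} {seconds * 1e3:.3f} ms per call")
```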
Since my peers have already taken the reasonable options...
df.B += df.A * 0
df
     A    B  C
0  NaN  NaN  0
1  4.0  2.0  0
2  NaN  NaN  5
3  4.0  3.0  3
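The trick works because NaN propagates through arithmetic. A small sketch of what happens, with one value of A swapped for inf (my own change, to show a caveat) and B made float by assumption:

```python
import numpy as np
import pandas as pd

# Row 2 holds inf instead of NaN to illustrate the caveat below
df = pd.DataFrame({'A': [np.nan, 4, np.inf, 4],
                   'B': [9.0, 2.0, 5.0, 3.0]})  # float on purpose (assumption)

# NaN * 0 is NaN and finite * 0 is 0, so adding A * 0 to B turns B into
# NaN where A is NaN and leaves the other rows unchanged.
# Caveat: inf * 0 is also NaN, so an infinite value in A blanks B as well.
offset = df['A'] * 0
df['B'] = df['B'] + offset
print(df)
```

So the one-liner is only equivalent to the isnull-based options when A contains no infinite values.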