Python pandas DataFrame: filling NaN values

use*_*430 6 python random nan dataframe pandas

I am trying to fill the NaN values in a DataFrame with values drawn from a standard normal distribution. This is my current code:

 sqlStatement = "select * from sn.clustering_normalized_dataset"
 df = psql.frame_query(sqlStatement, cnx)
 data=df.pivot("user","phrase","tfw")
 dfrand = pd.DataFrame(data=np.random.randn(data.shape[0],data.shape[1]))
 data[np.isnan(data)] = dfrand[np.isnan(data)]

After pivoting, the DataFrame "data" looks like this:

phrase      aaron  abbas  abdul       abe  able  abroad       abu     abuse  \
user                                                                          
14233664      NaN    NaN    NaN       NaN   NaN     NaN       NaN       NaN   
52602716      NaN    NaN    NaN       NaN   NaN     NaN       NaN       NaN   
123456789     NaN    NaN    NaN       NaN   NaN     NaN       NaN       NaN   
500158258     NaN    NaN    NaN       NaN   NaN     NaN       NaN       NaN   
517187571     0.4    NaN    NaN  0.142857     1     0.4  0.181818       NaN  

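However, I need to replace each NaN with a new random value. So I created a new DataFrame that contains only random values (dfrand) and then tried to replace the missing values (NaN) with the values from dfrand at the corresponding positions. Unfortunately, it does not work. Although the expression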

 np.isnan(data)

returns a DataFrame of True and False values, the expression

  dfrand[np.isnan(data)]

returns only NaN values, so the whole trick does not work. Any idea what is wrong?

tnk*_*epp 5

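Three thousand columns is not that many. How many rows do you have? You can always create a random DataFrame of the same size and do a logical replacement (the size of your DataFrame will determine whether this is feasible).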

If you know the size of your DataFrame:

import pandas as pd
import numpy as np

# create random dummy dataframe (rows and cols are the known dimensions of your data)
dfrand = pd.DataFrame(data=np.random.randn(rows,cols))

# import "real" dataframe
data = pd.read_csv(etc.) # or however you choose to read it in

# replace nans
data[np.isnan(data)] = dfrand[np.isnan(data)]

If you do not know the size of your DataFrame, just shift things around a little:

import pandas as pd
import numpy as np



# import "real" dataframe
data = pd.read_csv(etc.) # or however you choose to read it in

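# create random dummy dataframe with the same index and columns as data,
# so that boolean indexing lines up by label
dfrand = pd.DataFrame(data=np.random.randn(data.shape[0],data.shape[1]), index=data.index, columns=data.columns)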

# replace nans
data[np.isnan(data)] = dfrand[np.isnan(data)]
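If building a second full-size DataFrame is a concern, the replacement can also be done on the underlying NumPy array, drawing only as many random values as there are missing cells. This is a minimal sketch, not part of the original answer; the small frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for "data".
data = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, 6.0]})

# Work on a float copy of the raw values so the original frame is untouched.
values = data.to_numpy(dtype=float, copy=True)
mask = np.isnan(values)  # True where a value is missing

# Draw exactly one standard-normal value per missing cell.
values[mask] = np.random.randn(int(mask.sum()))

# Rebuild a frame with the original labels.
filled = pd.DataFrame(values, index=data.index, columns=data.columns)
```

Because this works purely by position, it does not depend on how the row and column labels of the two objects match up.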

EDIT, per the last comment from "user": "dfrand[np.isnan(data)] returns NaN only."

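Right! That is exactly what you want. In my solution I have: data[np.isnan(data)] = dfrand[np.isnan(data)]. Translated, this means: take the randomly generated values from dfrand that correspond to the NaN locations in "data" and insert them into "data" where "data" is NaN. An example will help: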

a = pd.DataFrame(data=np.random.randint(0,100,(10,3)))
a[0][5] = np.nan

In [32]: a
Out[33]: 
    0   1   2
0   2  26  28
1  14  79  82
2  89  32  59
3  65  47  31
4  29  59  15
5 NaN  58  90
6  15  66  60
7  10  19  96
8  90  26  92
9   0  19  23

# define randomly-generated dataframe, much like what you are doing, and replace NaN's
b = pd.DataFrame(data=np.random.randint(0,100,(10,3)))

In [39]: b
Out[39]: 
    0   1   2
0  92  21  55
1  65  53  89
2  54  98  97
3  48  87  79
4  98  38  62
5  46  16  30
6  95  39  70
7  90  59   9
8  14  85  37
9  48  29  46


a[np.isnan(a)] = b[np.isnan(a)]

In [38]: a
Out[38]: 
    0   1   2
0   2  26  28
1  14  79  82
2  89  32  59
3  65  47  31
4  29  59  15
5  46  58  90
6  15  66  60
7  10  19  96
8  90  26  92
9   0  19  23

As you can see, all NaNs have been replaced with randomly generated values based on the indices of the NaN values.
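One caveat about this example: a and b both use the default 0..n-1 row and column labels, so their labels match. Indexing a DataFrame with a boolean DataFrame aligns on labels, which would explain why dfrand[np.isnan(data)] returns only NaN in the question: there, data is labeled by user and phrase while dfrand has default integer labels. A minimal sketch of the difference follows; the small frame is made up for illustration and is not the asker's real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the pivoted frame: labeled rows and columns.
data = pd.DataFrame(
    [[np.nan, 0.4], [1.0, np.nan]],
    index=pd.Index([14233664, 517187571], name="user"),
    columns=pd.Index(["aaron", "abe"], name="phrase"),
)
mask = data.isna()

# Default integer labels: nothing lines up with the mask, so every cell is NaN.
unaligned = pd.DataFrame(np.random.randn(*data.shape))
all_nan = unaligned[mask].isna().all().all()

# Same labels as data: the mask selects the random values at the NaN positions.
dfrand = pd.DataFrame(
    np.random.randn(*data.shape), index=data.index, columns=data.columns
)
filled = data.copy()
filled[mask] = dfrand[mask]
```

Passing index=data.index and columns=data.columns when constructing dfrand is the only change needed to make the original one-line replacement work on a labeled frame.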


acu*_*ner 0

You could try something like this, assuming you are dealing with a Series:

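# select the column (a Series) whose NaNs should be replaced
ser = data['column_with_nulls_to_replace']
# index labels of the missing entries
index = ser[ser.isnull()].index
# one standard-normal draw per missing entry, as a Series aligned on those labels
rand = pd.Series(np.random.randn(len(index)), index=index)
ser.update(rand)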
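Whether an in-place update of a column pulled out of a DataFrame is reflected back in that DataFrame may depend on the pandas version and its copy-on-write settings. A sketch that sidesteps the question is to assign through .loc on the frame itself. The frame and the column name "tfw" below are placeholders for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; "tfw" stands in for 'column_with_nulls_to_replace'.
data = pd.DataFrame({"tfw": [0.4, np.nan, 1.0, np.nan]})

col = "tfw"
missing = data[col].isna()  # boolean Series marking the NaN rows

# Assign one standard-normal draw per missing row, directly on the frame.
data.loc[missing, col] = np.random.randn(int(missing.sum()))
```

Looping this over data.columns would fill a whole frame one column at a time.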