Pio*_*dal 15 csv int nan missing-data pandas
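When saving a Pandas DataFrame to csv, some integers are converted to floats. It happens in columns that are stored as float because they contain missing values (np.nan).
Is there a simple way to avoid it? (Especially an automatic one: I often deal with many columns of various data types.)
For example: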
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,2],[3,np.nan],[5,6]],
                  columns=["a","b"],
                  index=["i_1","i_2","i_3"])
df.to_csv("file.csv")
yields
,a,b
i_1,1,2.0
i_2,3,
i_3,5,6.0
What I would like to get is
,a,b
i_1,1,2
i_2,3,
i_3,5,6
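EDIT: I am fully aware of "Support for integer NA" in the Pandas "Caveats and Gotchas" documentation. The question is what a good workaround would be (especially when there are many other columns of various types and I do not know in advance which "integer" columns have missing values).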
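To make the cause concrete, here is a minimal check on the same frame as in the question. It only inspects dtypes and the CSV text; the variable names `dtypes` and `csv_text` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, np.nan], [5, 6]],
                  columns=["a", "b"],
                  index=["i_1", "i_2", "i_3"])

# Column "a" has no missing values and stays an integer column.
# Column "b" contains np.nan, which NumPy integers cannot hold,
# so pandas stores the whole column as float64.
dtypes = df.dtypes
print(dtypes)

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv()
print(csv_text)
```

The `2.0` and `6.0` in the file come from the float64 storage of column `b`, not from `to_csv` itself.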
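Using float_format = '%.12g' inside the to_csv function solved a similar problem for me. It keeps the decimals of legitimate floats, up to 12 significant digits, but drops them on ints that were coerced to float by the presence of NaN: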
In [4]: df
Out[4]:
     a    b
i_1  1    2.0
i_2  3    NaN
i_3  5.9  6.0
In [5]: df.to_csv('file.csv', float_format = '%.12g')
The output is:
,a,b
i_1,1,2
i_2,3,
i_3,5.9,6
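The approach relies on the `%g` format dropping trailing zeros, so a float that holds a whole number is written without a decimal part. A small sketch of that behaviour follows; the large value `1234567890123.0` is an illustrative assumption, picked to show where 12 significant digits stop being enough.

```python
import numpy as np
import pandas as pd

# %g strips trailing zeros, so whole-number floats lose the ".0"
print('%.12g' % 2.0)   # 2
print('%.12g' % 5.9)   # 5.9

df = pd.DataFrame([[1, 2], [3, np.nan], [5.9, 6]],
                  columns=["a", "b"],
                  index=["i_1", "i_2", "i_3"])

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv(float_format='%.12g')
print(csv_text)

# Caveat: values with more than 12 significant digits are rounded
# and switch to exponent notation.
big = '%.12g' % 1234567890123.0
print(big)
```

If a column holds large identifiers, a wider format such as `'%.17g'` may be a safer choice, at the cost of noisier output for ordinary floats.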
This code snippet does what you want and should be relatively efficient at doing it.
import numpy as np
import pandas as pd

EPSILON = 1e-9


def _lost_precision(s):
    """
    The total amount of precision lost over Series `s`
    during conversion to int64 dtype
    """
    try:
        # abs() keeps negative and positive fractional parts from cancelling out
        return (s - s.fillna(0).astype(np.int64)).abs().sum()
    except (ValueError, TypeError):
        # Non-numeric columns (e.g. strings) cannot be converted at all
        return np.nan


def _nansafe_integer_convert(s):
    """
    Convert Series `s` to an object type with `np.nan`
    represented as an empty string ""
    """
    if _lost_precision(s) < EPSILON:
        # Here's where the magic happens
        # (plain `object` is used because `np.object` was removed in NumPy 1.24)
        as_object = s.fillna(0).astype(np.int64).astype(object)
        as_object[s.isnull()] = ""
        return as_object
    else:
        return s


def nansafe_to_csv(df, *args, **kwargs):
    """
    Write `df` to a csv file, allowing for missing values
    in integer columns

    Uses `_lost_precision` to test whether a column can be
    converted to an integer data type without losing precision.
    Missing values in integer columns are represented as empty
    fields in the resulting csv.
    """
    # Return the result so that calling without a path yields the csv text
    return df.apply(_nansafe_integer_convert).to_csv(*args, **kwargs)
We can test it with a simple DataFrame that should cover all the bases:
In [75]: df = pd.DataFrame([[1,2, 3.1, "i"],[3,np.nan, 4.0, "j"],[5,6, 7.1, "k"]],
                           columns=["a","b", "c", "d"],
                           index=["i_1","i_2","i_3"])
In [76]: df
Out[76]:
     a    b    c  d
i_1  1    2  3.1  i
i_2  3  NaN  4.0  j
i_3  5    6  7.1  k
In [77]: nansafe_to_csv(df, 'deleteme.csv', index=False)
which produces the following csv file:
a,b,c,d
1,2,3.1,i
3,,4.0,j
5,6,7.1,k
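On pandas versions that ship nullable integer dtypes (an assumption about your environment, since they were added after this answer was written), a shorter route may be `DataFrame.convert_dtypes()`, which converts float columns whose values are all whole numbers to the `Int64` extension dtype. This is a sketch of that alternative on the same test frame, not a drop-in replacement for `nansafe_to_csv`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3.1, "i"], [3, np.nan, 4.0, "j"], [5, 6, 7.1, "k"]],
                  columns=["a", "b", "c", "d"],
                  index=["i_1", "i_2", "i_3"])

# convert_dtypes() picks NA-aware dtypes for each column; "b" holds only
# whole numbers plus a missing value, so it becomes nullable Int64.
converted = df.convert_dtypes()
print(converted.dtypes)

# Missing values are written as empty fields (the default na_rep is "").
csv_text = converted.to_csv(index=False)
print(csv_text)
```

Unlike the object-column approach, the converted frame keeps numeric dtypes, so it can still be used for arithmetic before it is written out.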