Converting columns in a pandas DataFrame to float when they contain NaN values

Question

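I am working with data using pandas and Python 3.4, and I have a problem with one particular CSV file. pandas normally reads columns as float even when they contain NaN values, but here it reads them as strings and I don't know why. My CSV file looks like this: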

Date        RR        TN        TX
08/10/2015  0         10.5      19.5
09/10/2015  0         5.5       20
10/10/2015  0         5         24
11/10/2015  0.5       7         24.5
12/10/2015  3         12        23
...
27/04/2017  \xc2\xa0  \xc2\xa0  \xc2\xa0
28/04/2017  \xc2\xa0  \xc2\xa0  \xc2\xa0
29/04/2017  \xc2\xa0  \xc2\xa0  \xc2\xa0
30/04/2017  \xc2\xa0  \xc2\xa0  \xc2\xa0
01/05/2017  \xc2\xa0  \xc2\xa0  \xc2\xa0
02/05/2017  \xc2\xa0  \xc2\xa0  \xc2\xa0
03/05/2017  \xc2\xa0  \xc2\xa0  \xc2\xa0
04/05/2017  \xc2\xa0  \xc2\xa0  \xc2\xa0

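The problem is that I cannot convert the columns to float because of the NaN values at the end. I need them as float because I am trying to compute TN + TX.

Here is what I have tried so far: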

When reading the file:

dfs[code] = pd.read_csv(path, sep=';', index_col=0, parse_dates=True, encoding='ISO-8859-1', dtype=float)

I also tried:

dtype = {
    'TN': np.float,
    'TX': np.float
}
dfs[code] = pd.read_csv(path, sep=';', index_col=0, parse_dates=True, encoding='ISO-8859-1', dtype=dtype)

Otherwise, to perform the addition itself, I also tried:

tn = dfs[code]['TN'].astype(float)
tx = dfs[code]['TX'].astype(float)
formatted_dfs[code] = tn + tx

But I always get the same error:

ValueError: could not convert string to float.

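I know I could do this row by row, testing whether each value is NaN, but I'm pretty sure there is an easier way. Do you know how to do it, or do I really have to go row by row? Thanks.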

Answer 1

Mic*_*ith 5

If you let pandas detect the dtypes itself, you avoid the ValueError and can see the underlying problem:

In [4]: df = pd.read_csv(path, sep=';', index_col=0, parse_dates=True, low_memory=False)
In [5]: df
Out[5]:
Empty DataFrame
Columns: []
Index: [08/10/2015  0   10.5    19.5, 09/10/2015  0   5.5 20, 10/10/2015  0   5   24, 11/10/2015  0.5 7   24.5, 12/10/2015  3   12  23, 27/04/2017  \xc2\xa0   \xc2\xa0   \xc2\xa0, 28/04/2017  \xc2\xa0   \xc2\xa0   \xc2\xa0, 29/04/2017  \xc2\xa0   \xc2\xa0   \xc2\xa0, 30/04/2017  \xc2\xa0   \xc2\xa0   \xc2\xa0, 01/05/2017  \xc2\xa0   \xc2\xa0   \xc2\xa0, 02/05/2017  \xc2\xa0   \xc2\xa0   \xc2\xa0, 03/05/2017  \xc2\xa0   \xc2\xa0   \xc2\xa0, 04/05/2017  \xc2\xa0]

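It looks like you specified the separator ';' by mistake: your file is whitespace-delimited. Since it contains no semicolons, each entire line is read into the index and no data columns are left.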
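A quick way to confirm this kind of separator mismatch is to look at the shape and the first index entry of the parsed frame. The snippet below is only a sketch: `raw` is a made-up two-row sample in the same layout as the question's file, not the real data.

```python
import io

import pandas as pd

# Hypothetical sample in the same whitespace-delimited layout as the question.
raw = (
    "Date RR TN TX\n"
    "08/10/2015 0 10.5 19.5\n"
    "09/10/2015 0 5.5 20\n"
)

# With sep=';' no line contains the separator, so each line is one field,
# and index_col=0 moves that single field into the index.
wrong = pd.read_csv(io.StringIO(raw), sep=";", index_col=0)

print(wrong.shape)     # two rows, zero data columns
print(wrong.index[0])  # the whole first data line as a single string
```

A frame with rows but zero columns, whose index entries look like complete lines of the file, is a strong hint that the separator is wrong.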

First, try reading the file with the correct delimiter:

df = pd.read_csv(path, delim_whitespace=True, index_col=0, parse_dates=True, low_memory=False)

Now, some rows have incomplete data. A conceptually simple solution is to try converting each value to float and to replace it with np.nan when that fails.

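import numpy as np

def f(x):
    # Return x as a float, or NaN when it cannot be parsed as a number.
    try:
        # The original post used np.float, an alias of the builtin float
        # that was removed in NumPy 1.24.
        return float(x)
    except (TypeError, ValueError):
        return np.nan

df["TN"] = df["TN"].apply(f)
df["TX"] = df["TX"].apply(f)

print(df.dtypes)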

This returns, as desired:

RR     object
TN    float64
TX    float64
dtype: object
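As a possible alternative to the element-wise helper, pandas offers `pd.to_numeric` with `errors="coerce"`, which converts a whole column at once and turns unparseable entries into NaN. The sketch below is not part of the original answer. It assumes a made-up sample (`raw`) in which a non-breaking space ("\xa0") stands in for each missing value, uses `sep=r"\s+"` (documented as equivalent to `delim_whitespace=True`, which newer pandas releases deprecate), and passes `dayfirst=True` because the dates in the question are written DD/MM/YYYY.

```python
import io

import pandas as pd

# Hypothetical sample mirroring the question's file; "\xa0" is a
# non-breaking space standing in for a missing value.
raw = (
    "Date RR TN TX\n"
    "08/10/2015 0 10.5 19.5\n"
    "09/10/2015 0 5.5 20\n"
    "27/04/2017 \xa0 \xa0 \xa0\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s+",        # fields are separated by runs of whitespace
    index_col=0,
    parse_dates=True,
    dayfirst=True,     # dates in the sample are DD/MM/YYYY
)

# errors="coerce" replaces anything that is not a number with NaN,
# so every converted column ends up as float64.
numeric = df[["RR", "TN", "TX"]].apply(pd.to_numeric, errors="coerce")

# NaN propagates through the addition for the incomplete rows.
total = numeric["TN"] + numeric["TX"]

print(numeric.dtypes)
print(total)
```

This also converts RR, which the helper-based version above leaves as object, and it avoids calling a Python function once per cell.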
