Pandas csv - cleaning up data in the wrong columns

Question

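I am working with a dataset in which some rows are missing one column, and the subsequent columns are shifted into the position of the missing one, so it can look like this: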

              date    tap     time    count
0         20160730     on     02:30   415.0
1         20160730     on     02:30    18.0
2         20160730     on     02:30    24.0
3         20160730     on     02:30    31.0
4         20160730     on     13:30    64.0
...            ...    ...       ...     ...
169549    20170101  23:45        29     NaN
169550    20170101  23:45        34     NaN
169551    20170101  23:45        43     NaN
169552    20170101  23:45        42     NaN
169553    20170101  23:45        60     NaN


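Note that in the last 5 rows, the value for 'time' sits in the 'tap' column and the value for 'count' sits in the 'time' column. This happens throughout the dataset, not only in the last few rows.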

I am trying to write a function that does this:

for each item in the 'tap' column
   if item is neither 'on' nor 'off', then
      the value of the 'count' column in that row takes on the value of the 'time' column
      the value of the 'time' column in that row takes on the value of the 'tap' column
      the value of the 'tap' column in that row is replaced by a string "N/A"


So the end result would hopefully look like this:

              date   tap    time    count
0         20160730    on    02:30   415.0
1         20160730    on    02:30    18.0
2         20160730    on    02:30    24.0
3         20160730    on    02:30    31.0
4         20160730    on    13:30    64.0
...            ...   ...      ...     ...
169549    20170101   N/A    23:45      29
169550    20170101   N/A    23:45      34
169551    20170101   N/A    23:45      43
169552    20170101   N/A    23:45      42
169553    20170101   N/A    23:45      60


So far I have only loaded the csv file...

import pandas as pd 

df = pd.read_csv('data.csv', dtype={
    'date': str,
    'tap': str,
    'time': str,
    'count': float})


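I am sure I am missing something very simple, but I have spent hours on Google and just cannot find the right syntax for this. Please let me know how to make this work.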

Answer 1

jez*_*ael 5

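Use DataFrame.shift with a condition built by Series.isin. Convert all the columns to strings first so that mismatched dtypes do not cause values to be lost (as in the last column):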

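# True for rows whose 'tap' value is valid
m = df['tap'].isin(['on','off'])
cols = ['tap','time','count']
# On the other rows, move each value one column to the right
df.loc[~m, cols] = df.loc[~m, cols].astype(str).shift(axis=1)
df['count'] = df['count'].astype(int)
print(df)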
            date  tap   time  count
0       20160730   on  02:30    415
1       20160730   on  02:30     18
2       20160730   on  02:30     24
3       20160730   on  02:30     31
4       20160730   on  13:30     64
169549  20170101  NaN  23:45     29
169550  20170101  NaN  23:45     34
169551  20170101  NaN  23:45     43
169552  20170101  NaN  23:45     42
169553  20170101  NaN  23:45     60

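For readers who want to try the shift approach end to end, here is a minimal self-contained sketch. It is not part of the original answer: the inline sample data is hypothetical, every column is read as text (an assumption that differs from the `dtype` mapping in the question), and the repaired 'tap' cells are filled with the string "N/A" as the question requested.

```python
import io

import pandas as pd

# Hypothetical sample mirroring the question's layout: the last two rows
# lack the 'tap' field, so their values sit one column too far left.
raw = io.StringIO(
    "date,tap,time,count\n"
    "20160730,on,02:30,415\n"
    "20160730,off,13:30,64\n"
    "20170101,23:45,29,\n"
    "20170101,23:45,34,\n"
)

# Read everything as text so no value is coerced before the repair.
df = pd.read_csv(raw, dtype=str)

cols = ["tap", "time", "count"]
bad = ~df["tap"].isin(["on", "off"])

# Shift the three columns one position to the right on the bad rows only.
df.loc[bad, cols] = df.loc[bad, cols].shift(axis=1)

# 'tap' is now missing on the repaired rows; label it as the question asked.
df["tap"] = df["tap"].fillna("N/A")
# Convert 'count' to numbers only after the realignment is done.
df["count"] = pd.to_numeric(df["count"])
print(df)
```

Reading the columns as strings up front means the shifted values never have to be written into a float column, so the numeric conversion happens once, after the rows are aligned.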

If you prefer to assign the columns directly instead of shifting:

import numpy as np

m = df['tap'].isin(['on','off'])
# to_numpy() drops the labels so the values are assigned by position
df.loc[~m, ['time','count']] = df.loc[~m, ['tap','time']].to_numpy()
df.loc[~m, 'tap'] = np.nan
df['count'] = df['count'].astype(int)
print(df)
            date  tap   time  count
0       20160730   on  02:30    415
1       20160730   on  02:30     18
2       20160730   on  02:30     24
3       20160730   on  02:30     31
4       20160730   on  13:30     64
169549  20170101  NaN  23:45     29
169550  20170101  NaN  23:45     34
169551  20170101  NaN  23:45     43
169552  20170101  NaN  23:45     42
169553  20170101  NaN  23:45     60

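The direct-assignment idea can also be wrapped in a function, which is what the question asked for. The sketch below is one possible arrangement and not the answerer's code: the name `fix_shifted_rows` and the sample frame are hypothetical, and it uses `Series.mask` in place of `.loc` assignment so that each new column is computed from the untouched input.

```python
import pandas as pd


def fix_shifted_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which rows lacking a valid 'tap' are realigned."""
    out = df.copy()
    bad = ~out["tap"].isin(["on", "off"])
    # Build the new columns from the original frame so that no assignment
    # reads a value that an earlier assignment already overwrote.
    out["count"] = pd.to_numeric(df["count"].mask(bad, df["time"]))
    out["time"] = df["time"].mask(bad, df["tap"])
    out["tap"] = df["tap"].mask(bad, "N/A")
    return out


# Hypothetical two-row sample: one good row and one shifted row.
sample = pd.DataFrame(
    {
        "date": ["20160730", "20170101"],
        "tap": ["on", "23:45"],
        "time": ["02:30", "29"],
        "count": [415.0, float("nan")],
    }
)
fixed = fix_shifted_rows(sample)
print(fixed)
```

Because the function works on a copy, the input frame stays unchanged, which makes it easy to compare the rows before and after the repair.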
