How do I replace values in pandas?

Question

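I am trying to group the 23 distinct labels in the second-to-last column of "KDDTest+.csv" into four groups. Note that I deleted the last column of the csv before doing this.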

I read the .csv file using

df = pd.read_csv('KDDTrain+.csv', header=None, names = col_names)

where

col_names = ["duration","protocol_type","service","flag","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]

If I print the first 5 rows of the DataFrame, this is the output (note the "label" column):

Using print(df.head(5)):

   duration protocol_type  ... dst_host_srv_rerror_rate    label
0         0           tcp  ...                     0.00   normal
1         0           udp  ...                     0.00   normal
2         0           tcp  ...                     0.00  neptune
3         0           tcp  ...                     0.01   normal
4         0           tcp  ...                     0.00   normal

I have tried these two grouping methods, based on what I found online:

Method 1:

df.replace(to_replace = ['ipsweep.', 'portsweep.', 'nmap.', 'satan.'], value = 'probe', inplace = True)
df.replace(to_replace = ['ftp_write.', 'guess_passwd.', 'imap.', 'multihop.', 'phf.', 'spy.', 'warezclient.', 'warezmaster.'], value = 'r2l', inplace = True)
df.replace(to_replace = ['buffer_overflow.', 'loadmodule.', 'perl.', 'rootkit.'], value = 'u2r', inplace = True)
df.replace(to_replace = ['back.', 'land.' , 'neptune.', 'pod.', 'smurf.', 'teardrop.'], value = 'dos', inplace = True)

Method 2:

df['label'] = df['label'].replace(['ipsweep.', 'portsweep.', 'nmap.', 'satan.'], 'probe',regex=True)
df['label'] = df['label'].replace(['ftp_write.', 'guess_passwd.', 'imap.', 'multihop.', 'phf.', 'spy.', 'warezclient.', 'warezmaster.'], 'r2l',regex=True)
df['label'] = df['label'].replace(['buffer_overflow.', 'loadmodule.', 'perl.', 'rootkit.'], 'u2r',regex=True)
df['label'] = df['label'].replace(['back.', 'land.' , 'neptune.', 'pod.', 'smurf.', 'teardrop.'], 'dos',regex=True)

However, this is still the output when I print the first 5 rows of the DataFrame:

After replacing, first 5 rows of df: 

   duration protocol_type  ... dst_host_srv_rerror_rate    label
0         0           tcp  ...                     0.00   normal
1         0           udp  ...                     0.00   normal
2         0           tcp  ...                     0.00  neptune
3         0           tcp  ...                     0.01   normal
4         0           tcp  ...                     0.00   normal

I expected the label column in row 2 to read "dos" instead of "neptune", but that did not happen.

What am I doing wrong? Any help is appreciated.

Answer 1

Paw*_*erg 1

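With a to_replace value of "neptune." and regex = True, you are telling pandas to look for "neptune" followed by any single additional character (e.g., "neptuneX" or "neptune!"). Since that extra character is not there, nothing matches and the label is not replaced. Instead, you can use just "neptune", or "neptune.?" to allow 0 or 1 extra characters, or "neptune.*" to allow 0 or more extra characters.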
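
To make the regex behavior concrete, here is a minimal sketch on a small made-up Series (the values are illustrative, not taken from the KDD file). It only shows how the trailing dot changes what matches when regex=True:

```python
import pandas as pd

# Illustrative labels: one without and one with a trailing dot
labels = pd.Series(["normal", "neptune", "neptune.", "smurf"])

# As a regex, "neptune." needs one character after "neptune",
# so the bare "neptune" label is left untouched
with_dot = labels.replace("neptune.", "dos", regex=True)

# "neptune.?" makes that extra character optional, so both forms match
optional_dot = labels.replace("neptune.?", "dos", regex=True)

print(with_dot.tolist())      # ['normal', 'neptune', 'dos', 'smurf']
print(optional_dot.tolist())  # ['normal', 'dos', 'dos', 'smurf']
```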

Without regex = True, you are telling pandas to look for the literal string "neptune.".
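
One way to apply this to the grouping in the question, sketched under the assumption that the labels in the file carry no trailing dot (as the printed output suggests): drop the dots and do a literal replacement through a single label-to-group dictionary. The small DataFrame and the names groups and label_to_group are hypothetical stand-ins for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the "label" column of the question's DataFrame
df = pd.DataFrame({"label": ["normal", "neptune", "ipsweep", "rootkit", "guess_passwd"]})

# Group names and members taken from the question, without the trailing dots
groups = {
    "probe": ["ipsweep", "portsweep", "nmap", "satan"],
    "r2l": ["ftp_write", "guess_passwd", "imap", "multihop", "phf", "spy",
            "warezclient", "warezmaster"],
    "u2r": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
    "dos": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
}

# Invert to a flat {label: group} mapping
label_to_group = {label: group for group, members in groups.items() for label in members}

# Literal (non-regex) replacement of whole values; unmapped labels such as "normal" stay as they are
df["label"] = df["label"].replace(label_to_group)

print(df["label"].tolist())  # ['normal', 'dos', 'probe', 'u2r', 'r2l']
```

Restricting the replacement to df["label"] also avoids touching other columns that might happen to contain the same strings.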
