Substring of an entire column in a pandas DataFrame

Question

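I have a pandas DataFrame "df". It has several columns, and I need to take a substring of one of them. Let's say the column is named "col". I can run a "for" loop like the one below to take the substring of that column: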

for i in range(0,len(df)):
  df.iloc[i].col = df.iloc[i].col[:9]

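But I would like to know whether there is an option that avoids the "for" loop and operates on the column directly. I have a large amount of data, and processing it this way would take a long time.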

Answer 1

ayh*_*han 73

Use str.slice:

df.col = df.col.str.slice(0, 9)

You can also index the str accessor with [], which uses slicing under the hood:

df.col = df.col.str[:9]
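
A minimal sketch of the two spellings side by side, assuming a small made-up frame (the sample values are illustrative and not taken from the question). It also shows that a string shorter than the slice comes back unchanged and that a missing value stays missing.

```python
import pandas as pd

# Hypothetical sample data; the question does not show the real frame.
df = pd.DataFrame({"col": ["abcdefghijkl", "short", None]})

sliced = df["col"].str.slice(0, 9)   # explicit method call
indexed = df["col"].str[:9]          # same slice, written as indexing on the accessor

# Both spellings give the same result, element by element.
assert sliced.equals(indexed)

print(sliced.tolist()[:2])       # ['abcdefghi', 'short']
print(sliced.isna().tolist())    # [False, False, True]
```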

This gives me the dreaded "SettingWithCopyWarning". (4 upvotes)
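
On that warning: it usually appears when df was itself produced by filtering or slicing another DataFrame, so pandas cannot tell whether the assignment is meant to reach the original. A hedged sketch of one common way around it, assuming a hypothetical parent frame "raw" and a "keep" filter column that are not part of the question: take an explicit copy before assigning.

```python
import pandas as pd

# Hypothetical parent frame and filter column, for illustration only.
raw = pd.DataFrame({
    "col": ["abcdefghijkl", "mnopqrstuvwx", "yz"],
    "keep": [True, True, False],
})

# .copy() makes df an independent frame, so assigning to it is unambiguous.
df = raw[raw["keep"]].copy()
df["col"] = df["col"].str[:9]

print(df["col"].tolist())    # ['abcdefghi', 'mnopqrstu']
print(raw["col"].tolist())   # the parent frame is left unchanged
```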

Answer 2

小智 16

If the column does not hold strings, convert it with astype first:

df['col'] = df['col'].astype(str).str[:9]
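
A small runnable sketch of the conversion, assuming a hypothetical integer column (the values are made up). Without astype(str), the .str accessor raises an AttributeError on a numeric column.

```python
import pandas as pd

# Hypothetical integer column; .str cannot be used on it directly.
df = pd.DataFrame({"col": [1234567890123, 42]})

# Convert to strings first, then keep at most the first 9 characters.
df["col"] = df["col"].astype(str).str[:9]

print(df["col"].tolist())   # ['123456789', '42']
```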

Answer 3

Gon*_*ica 12

Since the OP's DataFrame is not known exactly, we can create one to use as a test.

import pandas as pd

df = pd.DataFrame({'col': {0: '2020-12-08', 1: '2020-12-08', 2: '2020-12-08', 3: '2020-12-08', 4: '2020-12-08', 5: '2020-12-08', 6: '2020-12-08', 7: '2020-12-08', 8: '2020-12-08', 9: '2020-12-08'}})

[Out]:
          col
0  2020-12-08
1  2020-12-08
2  2020-12-08
3  2020-12-08
4  2020-12-08
5  2020-12-08
6  2020-12-08
7  2020-12-08
8  2020-12-08
9  2020-12-08

Suppose we want to store the result in the same DataFrame df, in a new column named col_substring, keeping only the first 4 characters. There are several options.

Option 1

Using pandas.Series.str

df['col_substring'] = df['col'].str[:4]

[Out]:

          col col_substring
0  2020-12-08          2020
1  2020-12-08          2020
2  2020-12-08          2020
3  2020-12-08          2020
4  2020-12-08          2020
5  2020-12-08          2020
6  2020-12-08          2020
7  2020-12-08          2020
8  2020-12-08          2020
9  2020-12-08          2020

Option 2

Using pandas.Series.str.slice, as follows

df['col_substring'] = df['col'].str.slice(0, 4)

[Out]:

          col col_substring
0  2020-12-08          2020
1  2020-12-08          2020
2  2020-12-08          2020
3  2020-12-08          2020
4  2020-12-08          2020
5  2020-12-08          2020
6  2020-12-08          2020
7  2020-12-08          2020
8  2020-12-08          2020
9  2020-12-08          2020

Or like this

df['col_substring'] = df['col'].str.slice(stop=4)

Option 3

Using a custom lambda function

df['col_substring'] = df['col'].apply(lambda x: x[:4])

[Out]:

          col col_substring
0  2020-12-08          2020
1  2020-12-08          2020
2  2020-12-08          2020
3  2020-12-08          2020
4  2020-12-08          2020
5  2020-12-08          2020
6  2020-12-08          2020
7  2020-12-08          2020
8  2020-12-08          2020
9  2020-12-08          2020

Option 4

Using a custom lambda function with a regular expression (with re)

import re

df['col_substring'] = df['col'].apply(lambda x: re.findall(r'^.{4}', x)[0])

[Out]:

          col col_substring
0  2020-12-08          2020
1  2020-12-08          2020
2  2020-12-08          2020
3  2020-12-08          2020
4  2020-12-08          2020
5  2020-12-08          2020
6  2020-12-08          2020
7  2020-12-08          2020
8  2020-12-08          2020
9  2020-12-08          2020
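
Note that re.findall(...)[0] raises an IndexError for any string shorter than four characters, because the pattern finds no match. As an alternative that is not part of the answer above, Series.str.extract applies the same pattern to the whole column and leaves a missing value where nothing matches. A minimal sketch, with a deliberately short sample value added as an assumption:

```python
import pandas as pd

# "abc" is a made-up value, deliberately shorter than four characters.
df = pd.DataFrame({"col": ["2020-12-08", "abc"]})

# expand=False returns a Series when the pattern has a single capture group.
df["col_substring"] = df["col"].str.extract(r"^(.{4})", expand=False)

print(df["col_substring"].tolist()[0])        # 2020
print(df["col_substring"].isna().tolist())    # [False, True]
```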

Option 5

Using numpy.vectorize

import numpy as np

df['col_substring'] = np.vectorize(lambda x: x[:4])(df['col'])

[Out]:

          col col_substring
0  2020-12-08          2020
1  2020-12-08          2020
2  2020-12-08          2020
3  2020-12-08          2020
4  2020-12-08          2020
5  2020-12-08          2020
6  2020-12-08          2020
7  2020-12-08          2020
8  2020-12-08          2020
9  2020-12-08          2020

Notes:

The ideal solution depends on the use case, the constraints, and the DataFrame.
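
One way to choose between the options for a given dataset is to check that they agree and then time them on representative data. The sketch below is only a rough harness under assumptions: the 10,000-row synthetic column and the run count are arbitrary, and the timings will vary by machine and pandas version, so no numbers are quoted here.

```python
import re
import timeit

import numpy as np
import pandas as pd

# Synthetic test column (an assumption; substitute real data).
df = pd.DataFrame({"col": ["2020-12-08"] * 10_000})

# The five options from this answer, each wrapped so it can be called repeatedly.
options = {
    "str[:4]": lambda: df["col"].str[:4],
    "str.slice": lambda: df["col"].str.slice(0, 4),
    "apply": lambda: df["col"].apply(lambda x: x[:4]),
    "apply + re": lambda: df["col"].apply(lambda x: re.findall(r"^.{4}", x)[0]),
    "np.vectorize": lambda: np.vectorize(lambda x: x[:4])(df["col"]),
}

# Check that every option produces the same values before comparing speed.
results = {name: list(fn()) for name, fn in options.items()}
assert all(r == results["str[:4]"] for r in results.values())

# Time each option; the printed values depend on the environment.
for name, fn in options.items():
    seconds = timeit.timeit(fn, number=5)
    print(f"{name:>12}: {seconds:.4f} s for 5 runs")
```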

Archived:	9 years, 9 months ago
Views:	82190
Last active:	7 years, 3 months ago