I have an .xlsx file that I open with this code:
import pandas as pd
df = pd.read_excel(open('file.xlsx','rb'))
df['Description'].head()
I get the following result, which looks fine.
ID | Description
:----- | :-----------------------------
0 | Some Description with no hash
1 | Text with #one hash
2 | Text with #two #hashes
Now I want to create a new column that keeps only the words starting with #, like this:
ID | Description | Only_Hash
:----- | :----------------------------- | :-----------------
0 | Some Description with no hash | NaN
1 | Text with #one hash | #one
2 | Text with #two #hashes | #two #hashes
I am able to count / separate the rows that contain a #:
descriptionWithHash = df['Description'].str.contains('#').sum()
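For reference, here is a minimal sketch of the count/separate step on the sample data from the question. The in-memory DataFrame stands in for file.xlsx, and the names has_hash, with_hash and without_hash are only illustrative. Passing regex=False treats '#' as a literal character and na=False counts missing descriptions as "no hash"; both are assumptions about what is wanted here.

```python
import pandas as pd

# Sample data copied from the question; in practice df comes from pd.read_excel.
df = pd.DataFrame({
    "ID": [0, 1, 2],
    "Description": [
        "Some Description with no hash",
        "Text with #one hash",
        "Text with #two #hashes",
    ],
})

# Boolean mask: True where the description contains a literal '#'.
has_hash = df["Description"].str.contains("#", regex=False, na=False)

# Count the matching rows (True counts as 1 when summed).
n_with_hash = int(has_hash.sum())

# Separate the rows using the mask and its negation.
with_hash = df[has_hash]
without_hash = df[~has_hash]

print(n_with_hash)  # prints 2
```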
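But now I want to create the column as described above. What is the easiest way to do that?
Regards!
PS: The question should display the data as tables, but I can't figure out why they render incorrectly!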
You can use str.findall with str.join:
df['new'] = df['Description'].str.findall(r'(#\w+)').str.join(' ')
print(df)
ID Description new
0 0 Some Description with no hash
1 1 Text with #one hash #one
2 2 Text with #two #hashes #two #hashes
To get NaN for rows with no match:
import numpy as np
df['new'] = df['Description'].str.findall(r'(#\w+)').str.join(' ').replace('', np.nan)
print(df)
ID Description new
0 0 Some Description with no hash NaN
1 1 Text with #one hash #one
2 2 Text with #two #hashes #two #hashes
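The findall/join approach can be tried without the Excel file. The following is a self-contained sketch under a few assumptions: the DataFrame is built in memory from the sample rows in the question, the helper name extract_hashtags is hypothetical, and the result column is named Only_Hash as in the question's desired output.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; in practice df comes from pd.read_excel.
df = pd.DataFrame({
    "ID": [0, 1, 2],
    "Description": [
        "Some Description with no hash",
        "Text with #one hash",
        "Text with #two #hashes",
    ],
})


def extract_hashtags(descriptions: pd.Series) -> pd.Series:
    """Return the '#word' tokens of each string joined by spaces, NaN if none."""
    # findall gives one list of matches per row; join collapses each list to a string.
    joined = descriptions.str.findall(r"#\w+").str.join(" ")
    # Rows with no match end up as empty strings; turn those into NaN.
    return joined.replace("", np.nan)


df["Only_Hash"] = extract_hashtags(df["Description"])
print(df)
```

Note that \w+ only matches letters, digits and underscores, so a tag such as #two-part would be cut at the hyphen. Whether that is acceptable depends on the real data.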
In [126]: df.join(df.Description
...: .str.extractall(r'(\#\w+)')
...: .unstack(-1)
...: .T.apply(lambda x: x.str.cat(sep=' ')).T
...: .to_frame(name='Hash'))
Out[126]:
ID Description Hash
0 0 Some Description with no hash NaN
1 1 Text with #one hash #one
2 2 Text with #two #hashes #two #hashes
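A shorter variant of the extractall idea, offered only as a sketch and not as the answerer's code: instead of unstack and a double transpose, group the matches by the original row index and join them. It relies on index alignment during assignment to leave NaN in rows with no match. The sample DataFrame and the column name Hash are taken from the posts above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [0, 1, 2],
    "Description": [
        "Some Description with no hash",
        "Text with #one hash",
        "Text with #two #hashes",
    ],
})

# extractall returns one row per match, indexed by (original row, match number);
# column 0 holds the first (and only) capture group.
matches = df["Description"].str.extractall(r"(#\w+)")[0]

# Join all matches that belong to the same original row.
joined = matches.groupby(level=0).agg(" ".join)

# Assignment aligns on the index, so rows without any match become NaN.
df["Hash"] = joined
print(df)
```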