ADJ*_*ADJ 33 python text pandas
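I have a Pandas DataFrame in which one column contains text. I want to get a list of the unique words that appear across the whole column (whitespace is the only delimiter).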
import pandas as pd
r1=['My nickname is ft.jgt','Someone is going to my place']
df=pd.DataFrame(r1,columns=['text'])
The output should look like this:
['my','nickname','is','ft.jgt','someone','going','to','place']
Getting the counts as well wouldn't hurt, but it isn't required.
Bou*_*oud 55
Use a set to build the collection of unique elements.
First do some clean-up on the df to get lowercased, split strings:
df['text'].str.lower().str.split()
Out[43]:
0 [my, nickname, is, ft.jgt]
1 [someone, is, going, to, my, place]
Each list in this column can be passed to the set.update function to collect the unique values. Use apply to do so:
results = set()
df['text'].str.lower().str.split().apply(results.update)
print(results)  # a set is unordered, so the element order may vary
{'someone', 'ft.jgt', 'my', 'is', 'to', 'going', 'place', 'nickname'}
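If the order of first appearance matters (as in the expected output in the question), a set will not preserve it. A minimal sketch of one alternative, assuming a pandas version that provides Series.explode (0.25 or later); the name unique_words is only illustrative:

```python
import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']
df = pd.DataFrame(r1, columns=['text'])

# Lowercase, split on whitespace, then flatten the lists into one word per row.
words = df['text'].str.lower().str.split().explode()

# unique() keeps the order in which each word first appears.
unique_words = words.unique().tolist()
print(unique_words)
```

The same `words` Series can be reused if counts are wanted later, since it holds one word per row.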
Ofi*_*ael 24
Use collections.Counter:
>>> from collections import Counter
>>> r1=['My nickname is ft.jgt','Someone is going to my place']
>>> Counter(" ".join(r1).split(" ")).items()
[('Someone', 1), ('ft.jgt', 1), ('My', 1), ('is', 2), ('to', 1), ('going', 1), ('place', 1), ('my', 1), ('nickname', 1)]
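Note that the snippet above counts 'My' and 'my' separately. A small sketch of a case-insensitive variant, assuming Python 3.7+ so that the Counter keeps insertion order; word_counts and unique_words are illustrative names:

```python
from collections import Counter

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']

# Lowercase each line and split on any run of whitespace before counting.
word_counts = Counter(word for line in r1 for word in line.lower().split())

# The keys of the Counter are the unique words, in order of first appearance.
unique_words = list(word_counts)

print(unique_words)
print(word_counts.most_common(2))
```

This gives both the unique list and the counts from a single pass over the text.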
cwh*_*and 19
If you want to do it starting from the DataFrame:
import pandas as pd
r1=['My nickname is ft.jgt','Someone is going to my place']
df=pd.DataFrame(r1,columns=['text'])
df.text.apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis=0)
My 1
Someone 1
ft.jgt 1
going 1
is 2
my 1
nickname 1
place 1
to 1
dtype: float64
If you want more flexible tokenization, use nltk and its tokenize module.
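The nltk route is not shown here. As a dependency-free stand-in (not nltk itself), a regular expression passed to Series.str.findall gives a similar kind of control over what counts as a token. A rough sketch, where token_pattern is a hypothetical pattern chosen to mimic whitespace splitting:

```python
import pandas as pd

r1 = ['My nickname is ft.jgt', 'Someone is going to my place']
df = pd.DataFrame(r1, columns=['text'])

# Hypothetical token pattern: any run of non-whitespace characters.
# Swap in a different regex to change how the text is tokenized.
token_pattern = r'\S+'

# One token per row, lowercased, then counted per distinct token.
tokens = df['text'].str.lower().str.findall(token_pattern).explode()
counts = tokens.value_counts()
print(counts)
```

Counting on the flattened Series also avoids building an intermediate frame of per-row counts, which is where the NaN filling and the float64 sums in the output above come from.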
Building on @Ofir Israel's answer, here is a version specific to Pandas:
from collections import Counter
result = Counter(" ".join(df['text'].values.tolist()).split(" ")).items()
result
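This gives you what you need: it converts the values of the text column to a list, joins them, splits on spaces, and counts the occurrences.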
uniqueWords = list(set(" ".join(r1).lower().split(" ")))
count = len(uniqueWords)