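I have a Pandas dataframe with a single text column. I want to count which phrases occur most often in this column. For example, from the text below you can see that phrases such as "a very good movie" and "last night" appear many times. I think there is a way to define n-grams, for example phrases between 3 and 5 words long, but I don't know how to do it.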
import pandas as pd
text = ['this is a very good movie that we watched last night',
'i have watched a very good movie last night',
'i love this song, its amazing',
'what should we do if he asks for it',
'movie last night was amazing',
'a very nice song was played',
'i would like to se a good show',
'a good show was on tv last night']
df = pd.DataFrame({"text":text})
print(df)
So my goal is to rank the phrases (3 to 5 words long) by how often they occur.
First split the text in a list comprehension and flatten the words into vals, then create the ngrams, pass them to a Series, and finally use Series.value_counts:
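from nltk import ngrams
# Flatten every row into one list of words (n-grams can therefore span two rows)
vals = [y for x in df['text'] for y in x.split()]
# Phrase lengths to count
n = [3, 4, 5]
# Build all 3-, 4- and 5-grams and count how often each one occurs
a = pd.Series([y for x in n for y in ngrams(vals, x)]).value_counts()
print(a)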
(a, good, show) 2
(movie, last, night) 2
(a, very, good) 2
(last, night, i) 2
(a, very, good, movie) 2
..
(should, we, do) 1
(a, very, nice, song, was) 1
(asks, for, it, movie, last) 1
(this, song,, its, amazing, what) 1
(i, have, watched, a) 1
Length: 171, dtype: int64
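Because all rows are flattened into one word list, some counted phrases straddle two rows, such as (last, night, i) in the output above. If that is not wanted, the n-grams can be built row by row instead. The following is a minimal sketch using only the standard library. The helper row_ngrams is a hypothetical name introduced here, not part of nltk or pandas, and the plain list text stands in for df['text'].

```python
from collections import Counter

# Same sample sentences as in the question, kept as a plain list here.
text = ['this is a very good movie that we watched last night',
        'i have watched a very good movie last night',
        'i love this song, its amazing',
        'what should we do if he asks for it',
        'movie last night was amazing',
        'a very nice song was played',
        'i would like to se a good show',
        'a good show was on tv last night']

def row_ngrams(tokens, n):
    """Return all n-grams (as tuples) from one list of tokens (hypothetical helper)."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Build n-grams one row at a time, so no phrase spans two different rows.
counts = Counter(
    ' '.join(gram)
    for sentence in text
    for n in (3, 4, 5)
    for gram in row_ngrams(sentence.split(), n)
)

print(counts.most_common(5))
```

With this variant, phrases such as "last night i" are no longer counted, while "a very good movie" still has a count of 2.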
Or, if the tuples should be joined with spaces:
n = [3,4,5]
a = pd.Series([' '.join(y) for x in n for y in ngrams(vals, x)]).value_counts()
print (a)
last night i 2
a good show 2
a very good movie 2
very good movie 2
movie last night 2
..
its amazing what should 1
watched last night i have 1
to se a 1
very good movie last night 1
a very nice song was 1
Length: 171, dtype: int64
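Note that x.split() keeps punctuation attached to words, which is why "song," shows up as a token in the tuple output. If that matters, the words could be extracted with a regular expression before building the n-grams. This is only a sketch: tokenize is a hypothetical helper, and the pattern assumes lowercase English words, so digits and other characters are dropped.

```python
import re

def tokenize(sentence):
    """Lowercase a sentence and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())

print(tokenize('i love this song, its amazing'))
```

The result of tokenize(x) could then replace x.split() in the list comprehension that builds vals.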
Another idea is to use Counter:
from nltk import ngrams
from collections import Counter
vals = [y for x in df['text'] for y in x.split()]
c = Counter([' '.join(y) for x in [3,4,5] for y in ngrams(vals, x)])
df1 = pd.DataFrame({'ngrams': list(c.keys()),
'count': list(c.values())})
print (df1)
ngrams count
0 this is a 1
1 is a very 1
2 a very good 2
3 very good movie 2
4 good movie that 1
.. ... ...
166 show a good show was 1
167 a good show was on 1
168 good show was on tv 1
169 show was on tv last 1
170 was on tv last night 1
[171 rows x 2 columns]
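The DataFrame above lists the n-grams in the order they were first seen, not by frequency. Since the goal is a ranking, Counter.most_common can return the highest counts directly. In the sketch below, a small hand-written Counter stands in for c, with counts taken from the output above.

```python
from collections import Counter

# Stand-in for the Counter `c` built above, using a few of its entries.
c = Counter({'this is a': 1, 'is a very': 1,
             'a very good': 2, 'very good movie': 2})

# most_common(k) returns the k highest counts as (ngram, count) pairs,
# ordered from most to least frequent.
top = c.most_common(2)
print(top)
```

Alternatively, df1.sort_values('count', ascending=False) gives the same ranking while keeping the DataFrame form.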