1 python extract n-gram pandas trigram
Below is the input data frame I have.
id description
1 **must watch avoid** **good acting**
2 average movie bad acting
3 good movie **acting good**
4 pathetic avoid
5 **avoid watch must**
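I want to extract n-grams, i.e. bigrams, trigrams, and 4-word grams, from the frequently used words in the phrases. If we tokenize the phrases into words, we can find n-grams even when the frequent words appear in a different order (i.e. if the frequent words in one phrase are "good movie" and in a second phrase they are "movie good", we can extract the bigram as "good movie"). A sample of what I expect is shown below: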
ngram frequency
must watch 2
acting good 2
must watch avoid 2
average 1
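As we can see, the frequent words in the first sentence are "must watch", and in the last sentence we have "watch must", i.e. the order of the frequent words has changed. So the bigram "must watch" is extracted with a frequency of 2.
I need to extract n-grams/bigrams from the frequently used words in the phrases.
How can I achieve this with a Python data frame? Any help is greatly appreciated.
Thanks!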
import pandas as pd
from collections import Counter
from itertools import chain
data = [
{"sentence": "Run with dogs, or shoes, or dogs and shoes"},
{"sentence": "Run without dogs, or without shoes, or without dogs or shoes"},
{"sentence": "Hold this while I finish writing the python script"},
{"sentence": "Is this python script written yet, hey, hold this"},
{"sentence": "Can dogs write python, or a python script?"},
]
def find_ngrams(input_list, n):
    # Slide a window of length n over the token list; each window is one n-gram tuple.
    return list(zip(*[input_list[i:] for i in range(n)]))
df = pd.DataFrame.from_records(data)
df['bigrams'] = df['sentence'].map(lambda x: find_ngrams(x.split(" "), 2))
df.head()
Now on to the frequency counts:
# Bigram Frequency Counts
bigrams = df['bigrams'].tolist()
bigrams = list(chain(*bigrams))
bigrams = [(x.lower(), y.lower()) for x,y in bigrams]
bigram_counts = Counter(bigrams)
bigram_counts.most_common(10)
[(('dogs,', 'or'), 2),
(('shoes,', 'or'), 2),
(('or', 'without'), 2),
(('hold', 'this'), 2),
(('python', 'script'), 2),
(('run', 'with'), 1),
(('with', 'dogs,'), 1),
(('or', 'shoes,'), 1),
(('or', 'dogs'), 1),
(('dogs', 'and'), 1)]
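The counts above are order-sensitive and keep punctuation attached to tokens (note `'dogs,'`). The question asks for "must watch" and "watch must" to count as the same bigram, so here is a minimal sketch of one way to do that, assuming it is acceptable to sort the words inside each n-gram into a canonical (alphabetical) form before counting. The helper names `unordered_ngrams` and `ngram_frequencies` are made up for this illustration, and the data frame is rebuilt from the question's sample with the `**` highlight markers dropped. Only contiguous windows are considered.

```python
import re
from collections import Counter

import pandas as pd

# Input frame from the question (the ** highlight markers are dropped).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "description": [
        "must watch avoid good acting",
        "average movie bad acting",
        "good movie acting good",
        "pathetic avoid",
        "avoid watch must",
    ],
})

def unordered_ngrams(text, n):
    """Return the contiguous n-grams of `text`, each sorted so word order is ignored."""
    # \w+ drops punctuation, so "dogs," and "dogs" become the same token.
    tokens = re.findall(r"\w+", text.lower())
    windows = zip(*[tokens[i:] for i in range(n)])
    return [tuple(sorted(w)) for w in windows]

def ngram_frequencies(descriptions, n):
    """Count order-insensitive n-grams over all descriptions."""
    counts = Counter()
    for text in descriptions:
        counts.update(unordered_ngrams(text, n))
    return pd.DataFrame(
        [(" ".join(words), freq) for words, freq in counts.most_common()],
        columns=["ngram", "frequency"],
    )

bigram_freq = ngram_frequencies(df["description"], 2)
trigram_freq = ngram_frequencies(df["description"], 3)
print(bigram_freq.head())
print(trigram_freq.head())
```

Because each n-gram is stored in sorted form, the trigram the question writes as "must watch avoid" shows up here as "avoid must watch". Passing `4` as `n` gives the 4-word grams in the same way.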