How to extract all ngrams from a text column of a pandas dataframe when the word order differs?

1 python extract n-gram pandas trigram

Below is the input dataframe I have.

id  description
1   **must watch avoid** **good acting**
2   average movie bad acting
3   good movie **acting good**
4   pathetic avoid
5   **avoid watch must**

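I want to extract ngrams, i.e. bigrams, trigrams and 4-word grams, from the words that occur frequently in the phrases. If we tokenize the phrases into words, we can find ngrams even when the frequent words appear in a different order. For example, if the frequent words in one phrase are "good movie" and in a second phrase they are "movie good", we can extract the bigram as "good movie". A sample of what I expect is shown below: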

ngram              frequency
must watch            2
acting good           2
must watch avoid      2
average               1

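As we can see, the frequent words in the first sentence are "must watch", and in the last sentence we have "watch must", i.e. the order of the frequent words has changed. So the bigram "must watch" is extracted with a frequency of 2.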

I need to extract ngrams/bigrams from the frequent words in the phrases.

How can I achieve this with a Python dataframe? Any help is greatly appreciated.

Thanks!

jrj*_*s83 6

import pandas as pd
from collections import Counter
from itertools import chain

data = [
    {"sentence": "Run with dogs, or shoes, or dogs and shoes"},
    {"sentence": "Run without dogs, or without shoes, or without dogs or shoes"},
    {"sentence": "Hold this while I finish writing the python script"},
    {"sentence": "Is this python script written yet, hey, hold this"},
    {"sentence": "Can dogs write python, or a python script?"},
]

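def find_ngrams(input_list, n):
    # Zip the token list with copies of itself shifted by 1..n-1 positions,
    # which yields every contiguous window of n tokens as a tuple.
    return list(zip(*[input_list[i:] for i in range(n)]))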

df = pd.DataFrame.from_records(data)
df['bigrams'] = df['sentence'].map(lambda x: find_ngrams(x.split(" "), 2))
df.head()
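To make the `zip` trick in `find_ngrams` easier to follow, here is a small sketch on a single sentence. The intermediate name `shifted` and the sample tokens are only for illustration and are not part of the code above.

```python
def find_ngrams(input_list, n):
    # shifted[0] is the full list, shifted[1] starts at the second token, etc.
    shifted = [input_list[i:] for i in range(n)]
    # zip stops at the shortest list, so only complete windows are produced.
    return list(zip(*shifted))

tokens = "hold this while I finish".split(" ")

bigrams = find_ngrams(tokens, 2)
trigrams = find_ngrams(tokens, 3)

print(bigrams)   # 4 windows of 2 tokens
print(trigrams)  # 3 windows of 3 tokens
```

A list shorter than `n` simply gives an empty result, so short phrases such as "pathetic avoid" contribute no trigrams.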

Now on to the frequency counts

# Bigram Frequency Counts
bigrams = df['bigrams'].tolist()
bigrams = list(chain(*bigrams))
bigrams = [(x.lower(), y.lower()) for x,y in bigrams]

bigram_counts = Counter(bigrams)
bigram_counts.most_common(10)

 [(('dogs,', 'or'), 2),
 (('shoes,', 'or'), 2),
 (('or', 'without'), 2),
 (('hold', 'this'), 2),
 (('python', 'script'), 2),
 (('run', 'with'), 1),
 (('with', 'dogs,'), 1),
 (('or', 'shoes,'), 1),
 (('or', 'dogs'), 1),
 (('dogs', 'and'), 1)]
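The counts above still treat ("hold", "this") and ("this", "hold") as different bigrams, and punctuation stays attached to tokens such as 'dogs,'. One possible way to get closer to the order-insensitive counts the question asks for is to sort the words inside each window before counting. This is only a sketch under stated assumptions: ngrams are contiguous windows, punctuation is dropped with a simple regular expression, and the helper names `tokenize`, `unordered_ngrams` and `ngram_frequencies` are hypothetical. The sample phrases are the question's data without the `**` highlighting markers.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep only runs of letters, digits and apostrophes,
    # so "dogs," and "dogs" become the same token.
    return re.findall(r"[a-z0-9']+", text.lower())

def unordered_ngrams(tokens, n):
    # Contiguous windows of n tokens, with the words in each window sorted
    # so that ("watch", "must") and ("must", "watch") share one key.
    windows = zip(*[tokens[i:] for i in range(n)])
    return [tuple(sorted(window)) for window in windows]

def ngram_frequencies(texts, n):
    # Count every order-insensitive n-gram across all texts.
    counts = Counter()
    for text in texts:
        counts.update(unordered_ngrams(tokenize(text), n))
    return counts

descriptions = [
    "must watch avoid good acting",
    "average movie bad acting",
    "good movie acting good",
    "pathetic avoid",
    "avoid watch must",
]

bigram_counts = ngram_frequencies(descriptions, 2)
trigram_counts = ngram_frequencies(descriptions, 3)

print(bigram_counts.most_common(5))
print(trigram_counts.most_common(5))
```

On the question's data this gives a count of 2 for ("must", "watch"), ("acting", "good") and the trigram ("avoid", "must", "watch"). It also counts ("avoid", "watch") twice, which the expected output in the question does not list, so some further filtering may be needed. Because `ngram_frequencies` only iterates over strings, passing a dataframe column such as `df['description']` instead of the list should work the same way.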