How to remove stop words and get lemmas in a pandas DataFrame using spaCy?

Question

Alo*_*kin 0 python nlp stop-words pandas spacy

I have a column of tokens in a pandas DataFrame in Python. It looks something like this:

 word_tokens
 (the,cheeseburger,was,great)
 (i,never,did,like,the,pizza,too,much)
 (yellow,submarine,was,only,an,ok,song)

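I want to use the spaCy library to add two new columns to this DataFrame: one containing each row's tokens with the stop words removed, and another containing the lemmas of the tokens in that second column. How can I do that?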

Answer 1

小智 6

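You're right about turning your text into a spaCy type: you want to transform every tuple of tokens into a spaCy Doc. From there, it is best to use the attributes of the tokens to answer the questions "is this token a stop word?" (use token.is_stop) and "what is the lemma of this token?" (use token.lemma_). My implementation is below. I altered your input data slightly to include some plurals so you can see that the lemmatization works properly.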

import spacy
import pandas as pd

nlp = spacy.load('en_core_web_sm')

texts = [('the','cheeseburger','was','great'),
         ('i','never','did','like','the','pizzas','too','much'), 
         ('yellowed','submarines','was','only','an','ok','song')]

df = pd.DataFrame({'word_tokens': texts})

The initial DataFrame looks like this:

	word_tokens
0	('the', 'cheeseburger', 'was', 'great')
1	('i', 'never', 'did', 'like', 'the', 'pizzas', 'too', 'much')
2	('yellowed', 'submarines', 'was', 'only', 'an', 'ok', 'song')

I define functions to perform the main tasks:

tuple of tokens -> spaCy Doc
spaCy Doc -> list of non-stop words
spaCy Doc -> list of non-stop, lemmatized words

def to_doc(words:tuple) -> spacy.tokens.Doc:
    # Create SpaCy documents by joining the words into a string
    return nlp(' '.join(words))

def remove_stops(doc) -> list:
    # Filter out stop words by using the `token.is_stop` attribute
    return [token.text for token in doc if not token.is_stop]

def lemmatize(doc) -> list:
    # Take the `token.lemma_` of each non-stop word
    return [token.lemma_ for token in doc if not token.is_stop]

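One caveat about to_doc: joining the tokens into a string lets spaCy's tokenizer split them again, so the resulting Doc may not have the same tokens as the input tuple. A minimal sketch of an alternative that builds the Doc directly from the existing tokens is shown below. The name to_doc_pretokenized and the sample tuple are hypothetical, and spacy.blank("en") is used only so the sketch runs without a model download. With spacy.load('en_core_web_sm') the loop would also run the tagger and lemmatizer.

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank English pipeline, so no model download is needed.
# It provides stop-word flags but has no lemmatizer.
nlp = spacy.blank("en")

def to_doc_pretokenized(words: tuple) -> Doc:
    # Hypothetical variant of to_doc: keep the given tokens as they are
    # instead of re-tokenizing a joined string.
    doc = Doc(nlp.vocab, words=list(words))
    # Run whatever components the pipeline has (none for a blank pipeline).
    for _name, component in nlp.pipeline:
        doc = component(doc)
    return doc

words = ("the", "cheese-burger", "wasn't", "great")
joined = nlp(" ".join(words))      # tokenizer is free to split tokens again
kept = to_doc_pretokenized(words)  # tokens are exactly the input tuple

print([t.text for t in joined])
print([t.text for t in kept])
```

Whether this matters depends on how the tokens in the DataFrame were produced. If they came from a different tokenizer, keeping them intact avoids row lengths changing unexpectedly.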

Applying these looks like:

# create documents for all tuples of tokens
docs = list(map(to_doc, df.word_tokens))

# apply removing stop words to all
df['removed_stops'] = list(map(remove_stops, docs))

# apply lemmatization to all
df['lemmatized'] = list(map(lemmatize, docs))

The output you get should look like this:

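	word_tokens	removed_stops	lemmatized
0	('the', 'cheeseburger', 'was', 'great')	['cheeseburger', 'great']	['cheeseburger', 'great']
1	('i', 'never', 'did', 'like', 'the', 'pizzas', 'too', 'much')	['like', 'pizzas']	['like', 'pizza']
2	('yellowed', 'submarines', 'was', 'only', 'an', 'ok', 'song')	['yellowed', 'submarines', 'ok', 'song']	['yellow', 'submarine', 'ok', 'song']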

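Depending on your use case, you may want to explore other attributes of spaCy's Doc object (https://spacy.io/api/doc). In particular, take a look at doc.noun_chunks and doc.ents if you want to extract more meaning from your text.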

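It is also worth noting that if you plan to use this on a large number of texts, you should consider nlp.pipe: https://spacy.io/usage/processing-pipelines. It processes your documents in batches instead of one by one, and could make your implementation more efficient.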
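As a rough sketch of what that could look like, the to_doc step can be replaced by a single nlp.pipe call over the joined strings. The batch_size value is an arbitrary illustrative choice, and spacy.blank("en") is my substitution so the sketch runs without a model download. Because a blank pipeline has no lemmatizer, only the stop-word column is computed here. With nlp = spacy.load('en_core_web_sm'), the same docs can be passed to lemmatize as well.

```python
import pandas as pd
import spacy

# Assumption: a blank English pipeline stands in for en_core_web_sm.
# It provides stop-word flags but no lemmas.
nlp = spacy.blank("en")

df = pd.DataFrame({"word_tokens": [
    ("the", "cheeseburger", "was", "great"),
    ("i", "never", "did", "like", "the", "pizzas", "too", "much"),
]})

# nlp.pipe takes an iterable of strings and yields Doc objects in the
# same order, processing them in batches rather than one at a time.
texts = (" ".join(words) for words in df["word_tokens"])
docs = list(nlp.pipe(texts, batch_size=64))

# Same filtering as remove_stops, applied to the batched docs.
df["removed_stops"] = [
    [token.text for token in doc if not token.is_stop] for doc in docs
]
print(df)
```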
