Lemmatizing a column of parsed HTML text in a Pandas DataFrame with spaCy

Question

Lar*_*ova 3 python apply lemmatization pandas spacy

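I want to do something fairly trivial, but I am struggling to write the function for it. For an NLP multiclass classification task I have to preprocess a pandas DataFrame. The column of interest is parsed HTML text (column: "tweet"). I normalize the data (lowercasing, removing punctuation, stop words, etc.), and then I want to lemmatize it with spaCy and write the result back as a column. However, I can't put the pieces together. I found several examples on SO, but they all use lists and I can't translate them to a DataFrame. Because my DataFrame is very large (10 GB), I would like a function that is not too slow. Any help or suggestions would be appreciated. Thanks :)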

# My real text is in German, but since English is more common I use "en_core_web_sm" here
import pandas as pd
import spacy
en_core = spacy.load('en_core_web_sm')

# Create DataFrame
pos_tweets = [('I love this car', 'positive'), ('This view is amazing', 'positive'), ('I feel great this morning', 'positive'), ('I am so excited about the concert', 'positive'), ('He is my best friend', 'positive')]
df = pd.DataFrame(pos_tweets)
df.columns = ["tweet","class"]

# Normalization
df['tweet'] = [entry.lower() for entry in df['tweet']]
# Tokenization
df["tokenized"] = [w.split() for w in df["tweet"]]

# Lemmatization
# This is where I struggle. I can't get together the English Model en_core, lemma_ and stuff :(
df["lemmatized"] = df['tokenized'].apply(lambda x: [en_core(y.lemma_) for y in x])

Answer 1

Wik*_*żew 7

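You need to run it on the text, not on the tokens. The entries in the tokenized column are plain Python strings, which have no lemma_ attribute; only the tokens of a spaCy Doc do.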

df["lemmatized"] = df['tweet'].apply(lambda x: " ".join([y.lemma_ for y in en_core(x)]))

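Here, x is one sentence/text from the tweet column, en_core(x) creates a document (Doc) from it, y stands for each token in that document, and y.lemma_ yields the token's lemma. " ".join(...) joins all the lemmas found into a single space-separated string.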
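Since the question mentions a DataFrame of about 10 GB, calling en_core(x) once per row through apply may be slow. The following is a minimal sketch, not part of the original answer, that streams the texts through spaCy's nlp.pipe in batches instead. The helper name lemmatize_texts, the batch size of 1000, and the choice to disable the "parser" and "ner" components are assumptions for illustration; whether those components can be skipped safely depends on the model and spaCy version in use.

```python
from typing import Iterable, List

import spacy
from spacy.language import Language


def lemmatize_texts(
    texts: Iterable[str],
    nlp: Language,
    batch_size: int = 1000,                      # assumed value, tune for your data
    disable: Iterable[str] = ("parser", "ner"),  # assumed to be unneeded for lemmas
) -> List[str]:
    """Return one space-joined lemma string per input text (hypothetical helper)."""
    # nlp.pipe streams the texts through the pipeline in batches, which
    # avoids the overhead of calling nlp(text) separately for every row.
    docs = nlp.pipe(texts, batch_size=batch_size, disable=list(disable))
    return [" ".join(token.lemma_ for token in doc) for doc in docs]


# Usage with the names from the question (requires the model to be installed):
# en_core = spacy.load("en_core_web_sm")
# df["lemmatized"] = lemmatize_texts(df["tweet"], en_core)
```

The function returns a list with one entry per input text, in input order, so it can be assigned directly to a new column. For data that does not fit in memory, the same function could be applied chunk by chunk.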
