Extracting sentence embedding features with Pandas and spaCy

mrg*_*gou 3 python nlp dataframe pandas spacy

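I am currently learning spaCy and have an exercise on word and sentence embeddings. The sentences are stored in a column of a pandas DataFrame, and we are asked to train a classifier on the vectors of those sentences.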

I have a DataFrame that looks like this:

+---+---------------------------------------------------+
|   |                                          sentence |
+---+---------------------------------------------------+
| 0 | "Whitey on the Moon" is a 1970 spoken word poe... |
+---+---------------------------------------------------+
| 1 | St Anselm's Church is a Roman Catholic church ... |
+---+---------------------------------------------------+
| 2 | Nymphargus grandisonae (common name: giant gla... |
+---+---------------------------------------------------+

Next, I apply the NLP function to these sentences:

import en_core_web_md
nlp = en_core_web_md.load()
df['tokenized'] = df['sentence'].apply(nlp)

Now, if I understand correctly, each item in df['tokenized'] has an attribute (vector) that returns the sentence's vector as a NumPy array. Checking its type and shape for one item yields:

<class 'numpy.ndarray'>
(300,)

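How can I add the contents of this array (300 values) as columns to the df DataFrame, in the row of the corresponding sentence, while ignoring stop words?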

Thanks!

Ser*_*nov 5

Suppose you have a list of sentences:

sents = ["'Whitey on the Moon' is a 1970 spoken word"
         , "St Anselm's Church is a Roman Catholic church"
         , "Nymphargus grandisonae (common name: giant)"]

which you put into a DataFrame:

import pandas as pd

df = pd.DataFrame({"sentence": sents})
print(df)
                                        sentence
0     'Whitey on the Moon' is a 1970 spoken word
1  St Anselm's Church is a Roman Catholic church
2    Nymphargus grandisonae (common name: giant)

Then you can proceed as follows:

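import numpy as np

df['tokenized'] = df['sentence'].apply(nlp)
# Average the non-stop-word token vectors per dimension (axis=0),
# so each sentence gets one 300-dimensional vector rather than a single scalar.
df['sent_vectors'] = df['tokenized'].apply(
    lambda doc: np.mean([token.vector for token in doc if not token.is_stop], axis=0)
)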

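The resulting sent_vectors column holds, for each sentence, the mean of the vector embeddings of all tokens that are not stop words (as determined by the token.is_stop attribute).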
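One detail worth checking when averaging token vectors: np.mean called on a list of vectors without an axis argument collapses everything into a single scalar, whereas axis=0 averages across tokens and keeps the embedding shape. The toy sketch below uses made-up 3-dimensional vectors in place of real spaCy token vectors (an assumption for illustration only), and the helper name mean_vector is hypothetical. It also shows one possible fallback, a zero vector, for a sentence in which every token is a stop word.

```python
import numpy as np

# Toy stand-ins for token.vector values (real ones would come from spaCy).
token_vectors = [
    np.array([1.0, 2.0, 3.0]),
    np.array([3.0, 4.0, 5.0]),
]

# Without axis: averages all six numbers into one scalar.
scalar_mean = np.mean(token_vectors)
# With axis=0: averages across tokens, one value per dimension.
per_dim_mean = np.mean(token_vectors, axis=0)


def mean_vector(vectors, dim):
    """Hypothetical helper: per-dimension mean, or zeros if the list is empty."""
    if len(vectors) == 0:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


print(scalar_mean)              # 3.0
print(per_dim_mean)             # [2. 3. 4.]
print(mean_vector([], dim=3))   # [0. 0. 0.]
```

Whether a zero vector is the right fallback depends on the downstream classifier; dropping such rows is another reasonable choice.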

Note 1: What you refer to as a sentence in your DataFrame is actually an instance of the Doc class.

Note 2: Although you may prefer to work with a pandas DataFrame, the recommended approach is a custom extension attribute with a getter:

import numpy as np
import spacy
from spacy.tokens import Doc
nlp = spacy.load("en_core_web_md")

sents = ["'Whitey on the Moon' is a 1970 spoken word"
         , "St Anselm's Church is a Roman Catholic church"
         , "Nymphargus grandisonae (common name: giant)"]

# Getter: per-dimension mean of the vectors of all non-stop-word tokens in the Doc.
def vector_except_stopwords(doc):
    return np.mean([token.vector for token in doc if not token.is_stop], axis=0)
Doc.set_extension("vector_except_stopwords", getter=vector_except_stopwords)

vecs =[] # for demonstration purposes
for doc in nlp.pipe(sents):
    vecs.append(doc._.vector_except_stopwords)
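To come back to the original question of getting the 300 values into separate columns, here is a minimal sketch of one way to do it once each sentence has a single vector. So that it runs without a spaCy model, the vectors below are random stand-ins rather than real embeddings, and the column names vec_0 ... vec_299 are an arbitrary choice, not something the question or spaCy prescribes.

```python
import numpy as np
import pandas as pd

sents = [
    "'Whitey on the Moon' is a 1970 spoken word",
    "St Anselm's Church is a Roman Catholic church",
    "Nymphargus grandisonae (common name: giant)",
]
df = pd.DataFrame({"sentence": sents})

# ASSUMPTION: random stand-ins for the per-sentence vectors; in practice
# these would be the stop-word-free mean vectors computed with spaCy.
rng = np.random.default_rng(0)
vecs = [rng.normal(size=300) for _ in sents]

# Stack the vectors into an (n_sentences, 300) feature matrix.
X = np.vstack(vecs)

# One column per embedding dimension, aligned with df's index.
vec_cols = pd.DataFrame(
    X, index=df.index, columns=[f"vec_{i}" for i in range(X.shape[1])]
)
df_features = pd.concat([df, vec_cols], axis=1)

print(df_features.shape)  # (3, 301)
```

For training a classifier, the matrix X can usually be used directly as the feature matrix; the wide DataFrame is mainly convenient for inspection or for keeping the features next to the text.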