mrg*_*gou 3 python nlp dataframe pandas spacy
I am currently learning spaCy and have an exercise on word and sentence embeddings. The sentences are stored in a pandas DataFrame column, and we are asked to train a classifier on the vectors of these sentences.
I have a dataframe that looks like this:
+---+---------------------------------------------------+
| | sentence |
+---+---------------------------------------------------+
| 0 | "Whitey on the Moon" is a 1970 spoken word poe... |
+---+---------------------------------------------------+
| 1 | St Anselm's Church is a Roman Catholic church ... |
+---+---------------------------------------------------+
| 2 | Nymphargus grandisonae (common name: giant gla... |
+---+---------------------------------------------------+
Next, I applied the nlp pipeline to these sentences.
Now, if I understand correctly, each item in df['tokenized'] has an attribute that returns the sentence's vector as a two-dimensional array:
import en_core_web_md
nlp = en_core_web_md.load()
df['tokenized'] = df['sentence'].apply(nlp)
print(type(df['tokenized'][0].vector))
print(df['tokenized'][0].vector.shape)
which yields

<class 'numpy.ndarray'>
(300,)
How can I add the contents of this array (300 values) as columns to the dataframe df for the corresponding sentences, ignoring stop words?

Thanks!
Suppose you have a list of sentences:
sents = ["'Whitey on the Moon' is a 1970 spoken word",
         "St Anselm's Church is a Roman Catholic church",
         "Nymphargus grandisonae (common name: giant)"]
which you put into a dataframe:
import pandas as pd

df = pd.DataFrame({"sentence": sents})
print(df)
sentence
0 'Whitey on the Moon' is a 1970 spoken word
1 St Anselm's Church is a Roman Catholic church
2 Nymphargus grandisonae (common name: giant)
Then you can proceed as follows:
import numpy as np

df['tokenized'] = df['sentence'].apply(nlp)
df['sent_vectors'] = df['tokenized'].apply(
    lambda sent: np.mean([token.vector for token in sent if not token.is_stop], axis=0)
)
The resulting sent_vectors column holds, for each sentence, the mean of the vector embeddings of all tokens that are not stop words (tokens whose token.is_stop attribute is False).
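One detail worth stressing: np.mean needs axis=0 here, so the averaging happens component-wise and preserves the 300-dimensional embedding; without it, numpy collapses everything to a single scalar. A minimal numpy-only illustration, using hypothetical 4-dimensional token vectors in place of the real 300-dimensional ones:

```python
import numpy as np

# three hypothetical token vectors of dimension 4
token_vectors = [np.array([1.0, 2.0, 3.0, 4.0]),
                 np.array([3.0, 2.0, 1.0, 0.0]),
                 np.array([2.0, 2.0, 2.0, 2.0])]

# axis=0 averages component-wise, keeping the embedding dimension
mean_vec = np.mean(token_vectors, axis=0)
print(mean_vec)        # [2. 2. 2. 2.]
print(mean_vec.shape)  # (4,)

# without axis, np.mean collapses everything to one scalar
print(np.mean(token_vectors))  # 2.0
```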
Note 1

What you call a sentence in your dataframe is actually an instance of spaCy's Doc class.
Note 2

While you may prefer to work with a pandas dataframe, the recommended approach is to extend Doc with a getter:
import numpy as np
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_md")
sents = ["'Whitey on the Moon' is a 1970 spoken word",
         "St Anselm's Church is a Roman Catholic church",
         "Nymphargus grandisonae (common name: giant)"]

# average the vectors of all non-stop-word tokens in the Doc
vector_except_stopwords = lambda doc: np.mean(
    [token.vector for token in doc if not token.is_stop], axis=0
)
Doc.set_extension("vector_except_stopwords", getter=vector_except_stopwords)

vecs = []  # for demonstration purposes
for doc in nlp.pipe(sents):
    vecs.append(doc._.vector_except_stopwords)
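To finally get the vector components as individual dataframe columns, as the question asks, you can expand each array into its own set of columns and join them back. A minimal sketch, using toy 4-dimensional vectors in place of the real 300-dimensional ones (the vec_ column naming is just an illustrative choice):

```python
import numpy as np
import pandas as pd

# toy stand-in: three "sentences" with 4-dimensional vectors instead of 300
vectors = [np.arange(4, dtype=float) + i for i in range(3)]
df = pd.DataFrame({"sentence": ["a", "b", "c"], "sent_vectors": vectors})

# expand each vector into its own columns: vec_0 ... vec_3
vec_df = pd.DataFrame(df["sent_vectors"].tolist(),
                      columns=[f"vec_{i}" for i in range(4)],
                      index=df.index)
df = df.join(vec_df)
print(df.shape)  # (3, 6): sentence, sent_vectors, plus 4 vector columns
```

With the real data, the same pattern yields 300 vec_ columns aligned with the corresponding sentences by index.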