I am trying to apply textacy's extract.subject_verb_object_triples function to my dataset, but the code I wrote is very slow and uses a lot of memory. Is there a more efficient way to implement this?
import spacy
import textacy

def extract_SVO(text):
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    tuples = textacy.extract.subject_verb_object_triples(doc)
    tuples_to_list = list(tuples)
    if tuples_to_list != []:
        tuples_list.append(tuples_to_list)

tuples_list = []
sp500news['title'].apply(extract_SVO)
print(tuples_list)
    date_publish         title
0   2013-05-14 17:17:05  Italy will not dismantle Montis labour reform minister
1   2014-05-09 20:15:57  Exclusive US agency FinCEN rejected veterans in bid to hire lawyers
4   2018-07-19 10:29:54  Xis campaign to draw people back to graying rural China faces uphill battle
6   2012-04-17 21:02:54  Romney begins to win over conservatives
8   2012-12-12 20:17:56  Oregon mall shooting survivor in serious condition
9   2018-11-08 10:51:49  Polands PGNiG to sign another deal for LNG supplies from US CEO
11  2013-08-25 07:13:31  Australias opposition leader pledges stronger economy if elected PM
12  2015-01-09 00:54:17  New York shifts into Code Blue to get homeless off frigid streets
This should speed it up:
import spacy
import textacy

# Load the model once at module level instead of on every function call.
nlp = spacy.load('en_core_web_sm')

def extract_SVO(doc):
    # doc is already a parsed spaCy Doc, so no parsing happens here.
    tuples = textacy.extract.subject_verb_object_triples(doc)
    tuples_to_list = list(tuples)
    if tuples_to_list:  # keep only titles that yielded at least one triple
        tuples_list.append(tuples_to_list)

tuples_list = []
# Parse every title once, then extract triples from the parsed Docs.
sp500news['title'] = sp500news['title'].apply(nlp)
_ = sp500news['title'].apply(extract_SVO)
print(tuples_list)
Explanation
In the OP's implementation, nlp = spacy.load('en_core_web_sm') is called inside the function, so the model is reloaded on every call. That is almost certainly the biggest bottleneck; moving the load outside the function should speed things up considerably.
In addition, results are appended to tuples_list only after the extracted triples have been converted to a list and found to be non-empty.
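If you want to squeeze out more speed, spaCy's nlp.pipe processes texts in batches and is usually faster than calling nlp on each row individually. Below is a minimal sketch of that approach. It assumes sp500news is a pandas DataFrame whose 'title' column still holds plain strings (names taken from the question); the function extract_svo_batch, the batch_size value, the 'svo_triples' column name, and disabling the NER component (on the assumption that only the dependency parse is needed for triple extraction) are all illustrative choices, not part of the original answer.

import spacy
import textacy

# Load the model once; disabling NER is optional and assumes it is not
# needed for subject-verb-object extraction.
nlp = spacy.load('en_core_web_sm', disable=['ner'])

def extract_svo_batch(titles, batch_size=256):
    # Parse the titles in batches and yield one list of triples per title.
    for doc in nlp.pipe(titles, batch_size=batch_size):
        yield list(textacy.extract.subject_verb_object_triples(doc))

# Store the triples alongside each row instead of in a global list,
# so results stay aligned with the original titles.
sp500news['svo_triples'] = list(extract_svo_batch(sp500news['title']))
print(sp500news[['title', 'svo_triples']].head())

Keeping the results in a DataFrame column also avoids the module-level tuples_list, which makes it easier to see which title produced which triples.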