I want to find the subject of a sentence with spaCy. The code below works fine and prints the dependency tree.
import spacy
from nltk import Tree
en_nlp = spacy.load('en')
doc = en_nlp("The quick brown fox jumps over the lazy dog.")
def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        return Tree(node.orth_, [to_nltk_tree(child) for child in node.children])
    else:
        return node.orth_
[to_nltk_tree(sent.root).pretty_print() for sent in doc.sents]
From this dependency tree, can I find the subject of the sentence?
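For reference, one common way to get the subject (not from the original post; a minimal sketch assuming the same 'en' model as above) is to look at each token's dependency label and keep the tokens attached as nominal subjects:

import spacy

en_nlp = spacy.load('en')
doc = en_nlp("The quick brown fox jumps over the lazy dog.")

# Tokens whose dependency label marks them as (passive) nominal subjects
subjects = [tok for tok in doc if tok.dep_ in ("nsubj", "nsubjpass")]
for subj in subjects:
    # subj.subtree yields the full phrase headed by the subject token
    print(subj.text, "->", " ".join(t.text for t in subj.subtree))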
I'm trying to get the spaCy Matcher example from the official documentation working on my machine.
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
pattern = [{'LOWER': "hello"}, {'LOWER': "world"}]
matcher.add("HelloWorld", None, pattern)
doc = nlp(u'hello world!')
matches = matcher(doc)
Unfortunately I get the following error:
TypeError: add() takes at least 4 positional arguments (3 given)
The corresponding source code can be found here; the important part is:
def add(self, key, on_match, *patterns):
"""Add a match-rule to the matcher. A match-rule consists of: an ID
key, an on_match callback, and one or more patterns.
spaCy was recently updated to version 2.0; I installed that version and successfully linked the English model to it. It feels like I'm missing something very obvious here, but I can't see what I'm doing wrong.
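For what it's worth, the documented spaCy 2.x signature really is add(key, on_match, *patterns), so the three-argument call above should be accepted; a TypeError like this often means an older spaCy is still the one being imported. A minimal sanity check (a sketch, not from the original post):

import spacy
from spacy.matcher import Matcher

# If this prints a 1.x version, the 2.x Matcher API used above won't be available.
print(spacy.__version__)

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
pattern = [{'LOWER': 'hello'}, {'LOWER': 'world'}]
matcher.add('HelloWorld', None, pattern)  # spaCy 2.x: add(key, on_match, *patterns)

doc = nlp(u'hello world!')
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)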
I'm trying to follow the spaCy guide and create a custom entity label called FRUIT using the rule-based Matcher (i.e. adding an on_match rule). I'm using spaCy 2.0.11, so I believe the steps for this have changed compared to spaCy 1.x.
Example: doc = nlp('Tom wants to eat some apples at the United Nations')
Expected text and entity output:
Tom PERSON
apples FRUIT
the United Nations ORG
However, I get the following error: [E084] Error assigning label ID 7429577500961755728 to span: not in StringStore. I've included my code below. Oddly, when I change nlp.vocab.strings['FRUIT'] to nlp.vocab.strings['EVENT'] it works, but then apples gets assigned the entity label EVENT. Has anyone else run into this problem?
doc = nlp('Tom wants to eat some apples at the United Nations')
FRUIT = nlp.vocab.strings['FRUIT']
def add_ent(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    doc.ents += ((FRUIT, start, end),)
matcher = Matcher(nlp.vocab) …
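A common suggestion for E084 (a sketch under that assumption, not taken from the original post) is that a custom label has to be registered in the StringStore before it can be assigned to a span; built-in labels such as EVENT are already there, which would explain why that variant works. StringStore.add returns the hash for the new label:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

# Register the custom label so its hash exists in the StringStore.
FRUIT = nlp.vocab.strings.add('FRUIT')

def add_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    doc.ents += ((FRUIT, start, end),)

matcher = Matcher(nlp.vocab)
matcher.add('FRUIT', add_ent, [{'LOWER': 'apples'}])  # pattern is illustrative

doc = nlp('Tom wants to eat some apples at the United Nations')
matcher(doc)
for ent in doc.ents:
    print(ent.text, ent.label_)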
I'm on Windows 10 and installed spaCy with pip, but now running

import spacy

in the Python shell fails. My error message is:
Traceback (most recent call last):
File "C:\Users\Administrator\errbot-root\plugins\utility\model_training_test.py", line 17, in <module>
import spacy
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\__init__.py", line 4, in <module>
from .cli.info import info as cli_info
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\cli\__init__.py", line 1, in <module>
from .download import download
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\cli\download.py", line 5, in <module>
import requests
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\__init__.py", line 43, in <module>
import urllib3
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\urllib3\__init__.py", line 8, in <module>
from .connectionpool import (
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\urllib3\connectionpool.py", line 11, in <module> …
I'm trying to use spaCy's noun_chunks, but it throws an error. I downloaded the model with:
python -m spacy download en_core_web_sm
AttributeError: 'English' object has no attribute 'noun_chunks'
NLP = spacy.load('en_core_web_sm')
NOUN_CHUNKS = NLP.noun_chunks
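The error appears to come from calling noun_chunks on the Language object returned by spacy.load() rather than on a processed Doc. A minimal sketch of the usual usage (assuming en_core_web_sm is installed, since noun_chunks needs the dependency parse):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("The quick brown fox jumps over the lazy dog.")

# noun_chunks is an attribute of the Doc, not of the pipeline object
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.dep_)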
I'm training a model from scratch to predict food items from text. I've labeled about 500 sentences to train the model, and the accuracy is very good. However, I'm a bit worried about unseen real-world data, so I came up with an idea, and I'd like to hear what more experienced people think of it.
Original training sentences:
Food list:
Newly generated training sentences:
So, is it a good idea to generate training sentences like this? The benefits as I see them:
The problems could be:
Thanks, and please let me know what you think of this approach.
I'm trying to train a custom NER model in spaCy with a new entity label "ANIMAL", but my dataset consists of single words:
TRAIN_DATA = [("Whale_ Blue", {"entities": [(0,11,LABEL)]}), ("Shark_ whale", {"entities": [(0,12,LABEL)]}), ("Elephant_ African", {"entities": [(0,17,LABEL)]}), ("Elephant_ Indian", {"entities": [(0,16,LABEL)]}), ("Giraffe_ male", {"entities": [(0,13,LABEL)]}), ("Mule", {"entities": [(0,4,LABEL)]}), ("Camel", {"entities": [(0,5,LABEL)]}), ("Horse", {"entities": [(0,5,LABEL)]}), ("Cow", {"entities": [(0,3,LABEL)]}), ("Dolphin_ Bottlenose", {"entities": [(0,19,LABEL)]}), ("Donkey", {"entities": [(0,6,LABEL)]}), ("Tapir", {"entities": [(0,5,LABEL)]}), ("Shark_ Hammerhead", {"entities": [(0,17,LABEL)]}), ("Seal_ fur", {"entities": [(0,9,LABEL)]}), ("Manatee", {"entities": [(0,7,LABEL)]}), ("Bear_ Grizzly", {"entities": [(0,13,LABEL)]}), ("Alligator_ American", {"entities": [(0,19,LABEL)]}), ("Sturgeon_ Atlantic", {"entities": [(0,18,LABEL)]}), ("Lion", {"entities": [(0,4,LABEL)]}), ("Bear_ American Black", {"entities": [(0,20,LABEL)]}), ("Ostrich", {"entities": …
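For context, a minimal training loop for a new entity label in the spaCy 2.x API (a sketch only; the TRAIN_DATA above is abbreviated, and training works differently in spaCy 3.x):

import random
import spacy
from spacy.util import minibatch

LABEL = "ANIMAL"  # assumed label name, matching the "animal" entity described above
TRAIN_DATA = [
    ("Whale_ Blue", {"entities": [(0, 11, LABEL)]}),
    ("Camel", {"entities": [(0, 5, LABEL)]}),
    # ... the rest of the single-word examples
]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label(LABEL)

optimizer = nlp.begin_training()
for itn in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=8):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, losses=losses)
    print(losses)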
I've run into a problem. I want to use spaCy's word tokenizer, but I have a constraint: the tokenizer must not split words that contain an apostrophe (').
Example:
Input string: "I can't do this"
Current output: ["I","ca","n't","do","this"]
Expected output: ["I","can't","do","this"]
My attempt:
doc = nlp(sent)
position = [token.i for token in doc if token.i!=0 and "'" in token.text]
with doc.retokenize() as retokenizer:
    for pos in position:
        retokenizer.merge(doc[pos-1:pos+1])
for token in doc:
    print(token.text)
This gives me the expected output, but I don't know whether this approach is right, or whether there is a better way to do the retokenization.
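The retokenizer.merge approach is valid, but it repairs the split after the fact. An alternative that is often suggested, assuming a spaCy version where Tokenizer.rules is writable (2.2.3+), is to drop the tokenizer's special-case rules that contain an apostrophe so contractions are never split in the first place (the exact behaviour still depends on the model's other suffix/infix rules):

import spacy

nlp = spacy.load('en_core_web_sm')

# Contractions like "can't" are split by special-case rules; removing the rules
# whose key contains an apostrophe keeps them as single tokens.
nlp.tokenizer.rules = {
    key: value for key, value in nlp.tokenizer.rules.items() if "'" not in key
}

print([token.text for token in nlp("I can't do this")])  # "can't" should stay whole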
' '.join(token_list) does not reconstruct the original text when there are multiple consecutive whitespace characters or whitespace mixed with punctuation.
For example:
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp = English()
# Create a blank Tokenizer with just the English vocab
tokenizerSpaCy = Tokenizer(nlp.vocab)
context_text = 'this is a test \n \n \t\t test for \n testing - ./l \t'
contextSpaCyToksSpaCyObj = tokenizerSpaCy(context_text)
spaCy_toks = [i.text for i in contextSpaCyToksSpaCyObj]
reconstruct = ' '.join(spaCy_toks)
reconstruct == context_text
>False
Is there an established way to reconstruct the original text from spaCy tokens?
The answer should also work for this edge-case text (you can get it verbatim by clicking the "improve this question" button):
" UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05795279 Date: 01/07/2016\n\n\n …
I have a column of tokens in a pandas dataframe in Python. It looks something like this:
word_tokens
(the,cheeseburger,was,great)
(i,never,did,like,the,pizza,too,much)
(yellow,submarine,was,only,an,ok,song)
I'd like to use the spaCy library to add two more columns to this dataframe: one containing each row's tokens with the stop words removed, and another containing the lemmas of the tokens from that second column. How can I do that?
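One way to do this (a sketch; the column names no_stopwords and lemmas are made up for illustration, and it assumes en_core_web_sm is installed) is to run each row's tokens back through the pipeline, drop tokens flagged as stop words, and keep their lemmas:

import pandas as pd
import spacy

nlp = spacy.load('en_core_web_sm')

df = pd.DataFrame({'word_tokens': [
    ('the', 'cheeseburger', 'was', 'great'),
    ('i', 'never', 'did', 'like', 'the', 'pizza', 'too', 'much'),
    ('yellow', 'submarine', 'was', 'only', 'an', 'ok', 'song'),
]})

def drop_stops_and_lemmatize(tokens):
    # Re-run the tokens through the pipeline so spaCy can tag and lemmatize them.
    doc = nlp(' '.join(tokens))
    kept = [tok for tok in doc if not tok.is_stop]
    return [tok.text for tok in kept], [tok.lemma_ for tok in kept]

df[['no_stopwords', 'lemmas']] = df['word_tokens'].apply(
    lambda toks: pd.Series(drop_stops_and_lemmatize(toks))
)
print(df)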
import re
import spacy
from nltk.corpus import stopwords
import pdfplumber
def extract_All_data(path):
text = ""
try:
with pdfplumber.open(path) as pdf:
for i in pdf.pages:
text += i.extract_text()
return text
except:
return None
resume_text = extract_All_data(r"E:\AllResumesPdfs\37202883_Mumbai_6.pdf")
#resume_text =text.lower()
# load pre-trained model
nlp = spacy.load('en_core_web_lg')
# Grad all general stop words
STOPWORDS = set(stopwords.words('english'))
# Education Degrees
EDUCATION = [
'BE','B.E.', 'B.E', 'BS', 'B.S',
'ME', 'M.E', 'M.E.', 'MS', 'M.S', 'M.C.A.',
'BTECH', 'B.TECH', 'M.TECH', 'MTECH',
'SSC', 'HSC', 'CBSE', 'ICSE', 'X', 'XII'
] …
I wrote the code below. I want to print the words from the first 10 sentences, and I want to remove every word that is not a noun, verb, adjective, adverb, or proper noun, but I don't know how to do that. Can anyone help me?
! pip install wget
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/moby_dick.txt'
wget.download(url, 'moby_dick.txt')
documents = [line.strip() for line in open('moby_dick.txt', encoding='utf8').readlines()]
import spacy
nlp = spacy.load('en')
tokens = [[token.text for token in nlp(sentence)] for sentence in documents[:200]]
pos = [[token.pos_ for token in nlp(sentence)] for sentence in documents[:100]]
pos
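A minimal sketch of the filtering step (assuming the same 'en' model and the moby_dick.txt download from above): keep a token only if its coarse part-of-speech tag is one of the wanted classes.

import spacy

nlp = spacy.load('en')
documents = [line.strip() for line in open('moby_dick.txt', encoding='utf8').readlines()]

# Universal POS tags for nouns, proper nouns, verbs, adjectives and adverbs
KEEP = {'NOUN', 'PROPN', 'VERB', 'ADJ', 'ADV'}

filtered = [
    [token.text for token in nlp(sentence) if token.pos_ in KEEP]
    for sentence in documents[:10]
]
print(filtered)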