ahm*_*ama 11 python nlp cython spacy
I'm new to spaCy. I'm adding this post as documentation, to keep things simple for newbies like me.
import spacy
nlp = spacy.load('en')
doc = nlp(u'KEEP CALM because TOGETHER We Rock !')
for word in doc:
    print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
    print(word.orth_)
I'd like to understand what orth, lemma, tag and pos mean. This code prints their values. Also, what is the difference between print(word) and print(word.orth_)?
alv*_*vas 15
What do orth, lemma, tag and pos mean?
See https://spacy.io/docs/usage/pos-tagging#pos-schemes
What is the difference between print(word) and print(word.orth_)?
In very short:
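word.orth_ and word.text are the same. In spaCy, an attribute ending in an underscore returns the string form of a value, while the name without the underscore (e.g. word.orth) returns its integer ID.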
In short:
When you access the word.orth_ property (https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L537), it looks the word up in the string table that holds the whole vocabulary:
property orth_:
    def __get__(self):
        return self.vocab.strings[self.c.lex.orth]
(For details on self.c.lex.orth, see the explanation under "In long" below.)
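The integer-versus-string pairing can be sketched in plain Python without spaCy. ToyStringStore and ToyToken are hypothetical stand-ins, not spaCy classes, and the sequential IDs are a simplification assumed for illustration; the real library assigns its own integer values.

```python
class ToyStringStore:
    """Hypothetical string table: maps strings to integer IDs and back."""

    def __init__(self):
        self._ids = {}      # string -> integer ID
        self._strings = []  # integer ID -> string

    def add(self, string):
        # Reuse the existing ID if the string was seen before.
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def __getitem__(self, key):
        # An integer key returns the string; a string key returns its ID.
        if isinstance(key, int):
            return self._strings[key]
        return self._ids[key]


class ToyToken:
    """Hypothetical token exposing the orth / orth_ / text trio."""

    def __init__(self, strings, word):
        self._strings = strings
        self.orth = strings.add(word)  # integer ID

    @property
    def orth_(self):
        # The ID is resolved through the string table.
        return self._strings[self.orth]

    @property
    def text(self):
        # text simply forwards to orth_.
        return self.orth_


strings = ToyStringStore()
sentence = "KEEP CALM because TOGETHER We Rock !"
tokens = [ToyToken(strings, w) for w in sentence.split()]
for tok in tokens:
    print(tok.orth, tok.orth_, tok.text)
```

In this sketch orth is always an int, while orth_ and text are the same string.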
And word.text returns the string representation of the word, which is simply the orth_ property; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L128
property text:
    def __get__(self):
        return self.orth_
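And when you call print(word), Python calls the __str__ dunder method (which __repr__ also delegates to). It returns word.__unicode__ or word.__bytes__, both of which point back to word.text; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L55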
cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset

    def __hash__(self):
        return hash((self.doc, self.i))

    def __len__(self):
        """
        Number of unicode characters in token.text.
        """
        return self.c.lex.length

    def __unicode__(self):
        return self.text

    def __bytes__(self):
        return self.text.encode('utf8')

    def __str__(self):
        if is_config(python3=True):
            return self.__unicode__()
        return self.__bytes__()

    def __repr__(self):
        return self.__str__()
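A minimal Python 3 sketch of the same delegation: print() goes through __str__, and __repr__ forwards to it. ToyWord is a hypothetical class written for illustration, and the Python 2 bytes branch is left out.

```python
class ToyWord:
    """Hypothetical token-like class: __repr__ delegates to __str__."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        # print(obj) and str(obj) end up here.
        return self.text

    def __repr__(self):
        # repr(obj), and printing a container of objects, end up here.
        return self.__str__()


word = ToyWord("CALM")
print(word)       # uses __str__
print([word])     # list printing uses __repr__ on each element
as_str = str(word)
as_repr = repr(word)
```

Because both methods return the text, printing the object and printing its text attribute give the same output.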
In long:
Let's walk through this step by step:
>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp(u'This is a foo bar sentence.')
>>> type(doc)
<type 'spacy.tokens.doc.Doc'>
After you pass the sentence to the nlp() function, it produces a spacy.tokens.doc.Doc object. From the docs:
cdef class Doc:
    """
    A sequence of `Token` objects. Access sentences and named entities,
    export annotations to numpy arrays, losslessly serialize to compressed
    binary strings.

    Aside: Internals
        The `Doc` object holds an array of `TokenC` structs.
        The Python-level `Token` and `Span` objects are views of this
        array, i.e. they don't own the data themselves.

    Code: Construction 1
        doc = nlp.tokenizer(u'Some text')

    Code: Construction 2
        doc = Doc(nlp.vocab, orths_and_spaces=[(u'Some', True), (u'text', True)])
    """
So a spacy.tokens.doc.Doc object is a sequence of spacy.tokens.token.Token objects. Within the Token object we see a series of Cython properties, e.g. at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162
property orth:
    def __get__(self):
        return self.c.lex.orth
Tracing it back, we see that self.c = &self.doc.c[offset]:
cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset
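Without thorough documentation we don't really know what self.c means, but from the looks of it, it takes the address of one element of self.doc.c, where self.doc is the Doc doc passed into the __cinit__ function. So most probably it is a shortcut for accessing the token's data.
Looking at Doc.c: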
cdef class Doc:
    def __init__(self, Vocab vocab, words=None, spaces=None, orths_and_spaces=None):
        self.vocab = vocab
        size = 20
        self.mem = Pool()
        # Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds
        # However, we need to remember the true starting places, so that we can
        # realloc.
        data_start = <TokenC*>self.mem.alloc(size + (PADDING*2), sizeof(TokenC))
        cdef int i
        for i in range(size + (PADDING*2)):
            data_start[i].lex = &EMPTY_LEXEME
            data_start[i].l_edge = i
            data_start[i].r_edge = i
        self.c = data_start + PADDING
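Now we see that Doc.c refers to the Cython pointer data_start, a block of memory allocated to store the tokens of the spacy.tokens.doc.Doc object (correct me if my reading of <TokenC*> is wrong).
So going back to self.c = &self.doc.c[offset], it is basically accessing the place in memory where that array is stored, and more specifically the "offset-th" item in the array.
That is what a spacy.tokens.token.Token is.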
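The "view at an offset" idea can be sketched in plain Python. ToyDoc and ToyTokenView are hypothetical names, and a list of dicts stands in for the array of structs; this illustrates the pattern under those assumptions and is not spaCy's implementation.

```python
class ToyDoc:
    """Hypothetical container that owns one record per token."""

    def __init__(self, words):
        # Stand-in for the array of token structs.
        self.c = [{"orth_": w} for w in words]


class ToyTokenView:
    """Hypothetical view: stores only the doc and an offset, not the data."""

    def __init__(self, doc, offset):
        self.doc = doc
        self.i = offset

    @property
    def c(self):
        # Look up the record at the stored offset on every access.
        return self.doc.c[self.i]

    @property
    def text(self):
        return self.c["orth_"]


doc = ToyDoc(["This", "is", "a", "foo", "bar", "sentence", "."])
token = ToyTokenView(doc, 3)

first_read = token.text       # reads the record at offset 3
doc.c[3]["orth_"] = "baz"     # change the record owned by the doc
second_read = token.text      # the view reflects the change

print(first_read, second_read)
```

Since the view holds no copy of the data, a change made through the container is visible through the view right away.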
Going back to the property:
property orth:
    def __get__(self):
        return self.c.lex.orth
We see that self.c.lex is accessing data_start[i].lex from the spacy.tokens.doc.Doc, and that self.c.lex.orth is simply an integer identifying the word in the internal vocabulary kept by the spacy.tokens.doc.Doc.
Thus we see that property orth_ looks up self.vocab.strings using the index from self.c.lex.orth (see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162):
property orth_:
    def __get__(self):
        return self.vocab.strings[self.c.lex.orth]