spaCy documentation for [orth, pos, tag, lemma and text]

ahm*_*ama 11 python nlp cython spacy

I am new to spaCy. I am adding this post as documentation, to keep things simple for newcomers like me.

import spacy
nlp = spacy.load('en')
doc = nlp(u'KEEP CALM because TOGETHER We Rock !')
for word in doc:
    print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
    print(word.orth_)

I want to understand what orth, lemma, tag and pos mean, given the values this code prints. Also, what is the difference between print(word) and print(word.orth_)?

alv*_*vas 15

What is the meaning of orth, lemma, tag and pos?

See https://spacy.io/docs/usage/pos-tagging#pos-schemes

What is the difference between print(word) and print(word.orth_)?

In very short:

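word.orth_ and word.text are the same. In spaCy, an attribute whose name ends with an underscore (orth_, lemma_, tag_, pos_) returns the string form of a value, while the name without the underscore (orth, lemma, tag, pos) returns the corresponding integer ID.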

In short:

When you access the word.orth_ property, defined at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L537, it looks up an index in the table that holds the vocabulary of all words:

property orth_:
    def __get__(self):
        return self.vocab.strings[self.c.lex.orth]

(For details on self.c.lex.orth, see the explanation under "In long" below.)
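To make the `self.vocab.strings[...]` lookup more concrete, here is a minimal pure-Python sketch of a two-way table between strings and integer IDs. `ToyStringStore` and its `add` method are hypothetical names invented for this illustration, and handing out sequential integers is an assumption of the sketch; it is not how spaCy's own string store is implemented.

```python
class ToyStringStore:
    """Toy two-way table between strings and integer IDs (not spaCy's code)."""

    def __init__(self):
        self._id_to_string = []   # position in the list is the integer ID
        self._string_to_id = {}   # reverse lookup: string -> integer ID

    def add(self, string):
        # Reuse the existing ID if the string has been seen before.
        if string not in self._string_to_id:
            self._string_to_id[string] = len(self._id_to_string)
            self._id_to_string.append(string)
        return self._string_to_id[string]

    def __getitem__(self, string_id):
        # Integer in, string out: the direction used by the orth_ property.
        return self._id_to_string[string_id]


strings = ToyStringStore()
orth_id = strings.add("Rock")   # plays the role of self.c.lex.orth
recovered = strings[orth_id]    # plays the role of word.orth_
print(orth_id, recovered)
```

The point of the sketch is only the shape of the lookup: the token stores an integer, and the string is recovered from a shared table on demand.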

word.text returns the string representation of the word, which is simply the orth_ property; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L128

property text:
    def __get__(self):
        return self.orth_

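And when you call print(word), Python calls the __str__ dunder method (__repr__ delegates to it as well), which returns word.__unicode__() or word.__bytes__(), and both of those point back to word.text; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L55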

cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset

    def __hash__(self):
        return hash((self.doc, self.i))

    def __len__(self):
        """
        Number of unicode characters in token.text.
        """
        return self.c.lex.length

    def __unicode__(self):
        return self.text

    def __bytes__(self):
        return self.text.encode('utf8')

    def __str__(self):
        if is_config(python3=True):
            return self.__unicode__()
        return self.__bytes__()

    def __repr__(self):
        return self.__str__()
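The delegation chain in the Token source above can be reproduced in a few lines of plain Python 3. This is only a sketch: `ToyToken` is a hypothetical stand-in, and the Python 2 branch (`__unicode__` / `__bytes__`) is left out on purpose.

```python
class ToyToken:
    """Toy stand-in for a token whose printed form is its text (Python 3 only)."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        # print() and str() end up here.
        return self.text

    def __repr__(self):
        # repr() delegates to __str__, mirroring the Token source quoted above.
        return self.__str__()


word = ToyToken("TOGETHER")
print(word)
shown = [str(word), repr(word), word.text]
```

Because both dunder methods lead back to `text`, printing the object, converting it with `str()` and inspecting it with `repr()` all show the same string.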

In long:

Let's walk through this step by step:

>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp(u'This is a foo bar sentence.')
>>> type(doc)
<type 'spacy.tokens.doc.Doc'>

After the sentence is passed to the nlp() function, it produces a spacy.tokens.doc.Doc object. From the docs:

cdef class Doc:
    """
    A sequence of `Token` objects. Access sentences and named entities,
    export annotations to numpy arrays, losslessly serialize to compressed
    binary strings.
    Aside: Internals
        The `Doc` object holds an array of `TokenC` structs.
        The Python-level `Token` and `Span` objects are views of this
        array, i.e. they don't own the data themselves.
    Code: Construction 1
        doc = nlp.tokenizer(u'Some text')
    Code: Construction 2
        doc = Doc(nlp.vocab, orths_and_spaces=[(u'Some', True), (u'text', True)])
    """

So the spacy.tokens.doc.Doc object is a sequence of spacy.tokens.token.Token objects. Inside the Token object we see a series of cython property definitions, e.g. at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162

property orth:
    def __get__(self):
        return self.c.lex.orth

Tracing it back, we see that self.c = &self.doc.c[offset]:

cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset

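Without thorough documentation, we don't really know what self.c means, but from the look of it, it takes the address of one of the tokens inside self.doc, i.e. the Doc doc that was passed to the __cinit__ function. So most probably it is a shortcut for accessing that token.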

Looking at Doc.c:

cdef class Doc:
    def __init__(self, Vocab vocab, words=None, spaces=None, orths_and_spaces=None):
        self.vocab = vocab
        size = 20
        self.mem = Pool()
        # Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds
        # However, we need to remember the true starting places, so that we can
        # realloc.
        data_start = <TokenC*>self.mem.alloc(size + (PADDING*2), sizeof(TokenC))
        cdef int i
        for i in range(size + (PADDING*2)):
            data_start[i].lex = &EMPTY_LEXEME
            data_start[i].l_edge = i
            data_start[i].r_edge = i
        self.c = data_start + PADDING

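Now we see that Doc.c refers to a cython pointer into the array data_start, which allocates the memory that stores the tokens of the spacy.tokens.doc.Doc object as TokenC structs, shifted forward by PADDING slots (correct me if my interpretation of <TokenC*> is wrong).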
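The comment in the constructor ("Guarantee self.lex[i-x] ... is in bounds") describes a padded allocation: extra slots are reserved on both sides, and the pointer handed out starts PADDING slots in. The sketch below imitates that idea with a Python list. `PaddedArray`, the `"<empty>"` filler and the value `PADDING = 5` are hypothetical choices made for the illustration, not values taken from spaCy.

```python
PADDING = 5  # hypothetical value, chosen only for this sketch


class PaddedArray:
    """Toy array with `padding` spare slots before and after the logical range."""

    def __init__(self, size, padding=PADDING, fill=None):
        self._padding = padding
        # Same shape as the allocation above: size + (PADDING * 2) slots.
        self._data = [fill] * (size + 2 * padding)

    def _physical(self, i):
        # Logical index i lives at physical index i + padding,
        # the list analogue of `self.c = data_start + PADDING`.
        pos = i + self._padding
        if not 0 <= pos < len(self._data):
            raise IndexError(i)
        return pos

    def __getitem__(self, i):
        return self._data[self._physical(i)]

    def __setitem__(self, i, value):
        self._data[self._physical(i)] = value


tokens = PaddedArray(size=20, fill="<empty>")
tokens[0] = "This"
before_start = tokens[-3]  # logical index -3 still falls inside the allocation
```

With this layout, code that looks a few positions to the left of token 0 reads a filler entry instead of running off the start of the allocated block.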

So going back to self.c = &self.doc.c[offset], it is basically accessing the point in memory where the array is stored, and more specifically the "offset-th" item in the array.

That is what a spacy.tokens.token.Token is.
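The "view" relationship described above can be sketched in plain Python. Here a list of dicts stands in for the array of TokenC structs and an ordinary object reference stands in for the C pointer; `ToyDoc` and `ToyToken` are hypothetical names, and none of this is spaCy's actual code.

```python
class ToyDoc:
    """Owns the per-token records (a list of dicts stands in for the TokenC array)."""

    def __init__(self, words):
        self.c = [{"text": w} for w in words]

    def __getitem__(self, offset):
        # Each access builds a lightweight view over one record.
        return ToyToken(self, offset)


class ToyToken:
    """A view: it remembers the doc and the offset, but owns no data itself."""

    def __init__(self, doc, offset):
        self.doc = doc
        self.i = offset
        self.c = doc.c[offset]  # Python analogue of &self.doc.c[offset]

    @property
    def text(self):
        return self.c["text"]


doc = ToyDoc("This is a foo bar sentence .".split())
token = doc[3]
first_text = token.text

# The token reads through to the doc's storage, so a change made to the
# doc's record is visible through the existing view.
doc.c[3]["text"] = "baz"
print(first_text, token.text)
```

This matches the docstring's remark that Token objects "don't own the data themselves": the token is just a doc plus an offset.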


Going back to the property:

property orth:
    def __get__(self):
        return self.c.lex.orth

We see that self.c.lex is accessing data_start[i].lex from the spacy.tokens.doc.Doc, and self.c.lex.orth is simply an integer: the index of the word in the vocabulary that the spacy.tokens.doc.Doc keeps internally.

Thus, we see that the property orth_ looks up self.vocab.strings using the index from self.c.lex.orth; see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162

property orth_:
    def __get__(self):
        return self.vocab.strings[self.c.lex.orth]
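Putting the pieces together, the integer/string pairing of orth and orth_ (and text on top of it) can be sketched as three properties over a shared vocabulary. `ToyVocab`, its `intern` method and `ToyToken` are hypothetical names, and using list positions as IDs is an assumption of the sketch, not a description of spaCy's internals.

```python
class ToyVocab:
    """Shared vocabulary; `strings` maps an integer ID (list index) to a string."""

    def __init__(self):
        self.strings = []

    def intern(self, string):
        # Store each distinct string once and return its integer ID.
        if string not in self.strings:
            self.strings.append(string)
        return self.strings.index(string)


class ToyToken:
    def __init__(self, vocab, orth_id):
        self.vocab = vocab
        self._orth_id = orth_id

    @property
    def orth(self):
        # Integer form, like self.c.lex.orth.
        return self._orth_id

    @property
    def orth_(self):
        # String form, looked up in the shared table.
        return self.vocab.strings[self.orth]

    @property
    def text(self):
        # text is just orth_ under another name.
        return self.orth_


vocab = ToyVocab()
tokens = [ToyToken(vocab, vocab.intern(w)) for w in "We Rock and We roll".split()]
for t in tokens:
    print(t.orth, t.orth_, t.text)
```

In this sketch the two occurrences of "We" share one integer ID, which is the reason for storing an integer per token and keeping the strings in a single shared table.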