How to get token IDs using spaCy (I want to map a text sentence to a sequence of integers)

rag*_*lpr 3 nlp spacy word-embedding

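I want to use spaCy to tokenize sentences and get a sequence of integer token IDs that I can use for downstream tasks. I expect to use something like the code below. Please fill in the ???.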

import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en_core_web_lg')

# Process whole documents
text = (u"When Sebastian Thrun started working on self-driving cars at ")

doc = nlp(text)

idxs = ??????

print(idxs)
# Want output to be something like;
>> array([ 8045, 70727, 24304, 96127, 44091, 37596, 24524, 35224, 36253])

Preferably, the integers would refer to some specific embedding ID in en_core_web_lg.

spacy.io/usage/vectors-similarity gives no hint about which attribute of doc to look for.

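I asked this on Cross Validated, but it was deemed off-topic. Suggestions for the right terms to google or to describe this problem would also be helpful.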

rag*_*lpr 7

Solution:

import spacy
nlp = spacy.load('en_core_web_md')
text = (u"When Sebastian Thrun started working on self-driving cars at ")

doc = nlp(text)

ids = []
for token in doc:
    if token.has_vector:
        id = nlp.vocab.vectors.key2row[token.norm]
    else:
        id = None
    ids.append(id)

print([token for token in doc])
print(ids)
#>> [When, Sebastian, Thrun, started, working, on, self, -, driving, cars, at]
#>> [71, 19994, None, 369, 422, 19, 587, 32, 1169, 1153, 41]
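For a downstream model, the None entries usually have to become real integers. A minimal sketch in plain Python, assuming you reserve index 0 for padding and index 1 for tokens without a vector, and shift every vector row up by two. The names PAD_ID, OOV_ID, OFFSET and to_model_ids are illustrative, not part of spaCy:

```python
PAD_ID = 0  # assumed reserved index for padding
OOV_ID = 1  # assumed reserved index for tokens without a vector
OFFSET = 2  # shift real vector rows past the two reserved indices


def to_model_ids(rows, max_len):
    """Map vector rows (or None) to dense ints, then pad/truncate to max_len."""
    ids = [OOV_ID if row is None else row + OFFSET for row in rows]
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


# Rows as printed by the snippet above
rows = [71, 19994, None, 369, 422, 19, 587, 32, 1169, 1153, 41]
model_ids = to_model_ids(rows, max_len=12)
print(model_ids)
# [73, 19996, 1, 371, 424, 21, 589, 34, 1171, 1155, 43, 0]
```

With this layout, an embedding matrix built from nlp.vocab.vectors.data would need two extra rows prepended for the padding and out-of-vocabulary indices.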

Breaking this down:

# The Vocab: __getitem__ takes a string and returns a Lexeme, whose attributes (e.g. .norm) are hashes
nlp.vocab 
# >>  <spacy.vocab.Vocab at 0x12bcdce48>
nlp.vocab['hello'].norm # hash
# >> 5983625672228268878


# The tensor holding the word-vector
nlp.vocab.vectors.data.shape
# >> (20000, 300)

# A dict mapping hash -> row in this array
nlp.vocab.vectors.key2row
# >> {12646065887601541794: 0,
# >>  2593208677638477497: 1,
# >>  ...}

# So to get int id of 'earth'; 
i = nlp.vocab.vectors.key2row[nlp.vocab['earth'].norm]
nlp.vocab.vectors.data[i]

# Note that tokens have hashes but may not have vector
# (Hence no entry in .key2row)
nlp.vocab['Thrun'].has_vector
# >> False
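A slightly more defensive variant of the lookup, sketched under two assumptions: that the vector table is keyed by the hash of the exact token text (token.orth), which as far as I can tell is what has_vector checks, and that a missing key should give None instead of raising KeyError. To stay runnable without downloading a model, it uses spacy.blank("en") with made-up 4-dimensional vectors; the helper name vector_rows and the toy words are illustrative only:

```python
import numpy as np
import spacy

# Blank English pipeline: tokenizer only, no pretrained vectors
nlp = spacy.blank("en")

# Toy vectors standing in for a real model's table (values are arbitrary)
for word in ["working", "on", "cars"]:
    nlp.vocab.set_vector(word, np.ones(4, dtype="float32"))


def vector_rows(doc, vocab):
    """Return the vector-table row for each token, or None if it has no entry."""
    key2row = vocab.vectors.key2row
    # dict.get avoids a KeyError for hashes that are not in the table
    return [key2row.get(token.orth) for token in doc]


doc = nlp("working on cars today")
rows = vector_rows(doc, nlp.vocab)
print(list(zip([t.text for t in doc], rows)))
```

With a real model, replace the blank pipeline with spacy.load(...) and drop the set_vector loop; the helper itself stays the same.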


Nat*_*Coy 5

spaCy hashes text to get unique IDs. Every Token object exposes several forms of itself for the different use cases a given Token can have within a Document.

If you just want the normalized form of the Tokens, use the .norm attribute, which is an integer representation (a hash) of the text.

>>> import spacy
>>> nlp = spacy.load('en')
>>> text = "here is some test text"
>>> doc = nlp(text)
>>> [token.norm for token in doc]
[411390626470654571, 3411606890003347522, 7000492816108906599, 1618900948208871284, 15099781594404091470]
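These hashes are 64-bit values, so they cannot index an embedding table directly. If you do not need the pretrained vector rows, one option is to assign each distinct hash a small contiguous ID yourself. A sketch in plain Python, where HashIndexer is a hypothetical helper (not a spaCy API) and the hashes are the ones printed above:

```python
class HashIndexer:
    """Assign consecutive integer IDs to hashes in first-seen order."""

    def __init__(self):
        self.hash2id = {}

    def encode(self, hashes):
        ids = []
        for h in hashes:
            # len() is evaluated before insertion, so an unseen hash
            # receives the next free ID; a known hash keeps its old one
            ids.append(self.hash2id.setdefault(h, len(self.hash2id)))
        return ids


hashes = [411390626470654571, 3411606890003347522, 7000492816108906599,
          1618900948208871284, 15099781594404091470]

indexer = HashIndexer()
first = indexer.encode(hashes)
again = indexer.encode([hashes[2], hashes[0]])
print(first)  # [0, 1, 2, 3, 4]
print(again)  # [2, 0]
```

The mapping has to be built on the training data and saved with the model, since the IDs depend on the order in which hashes were first seen.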

You can also use other attributes, such as the lowercase integer attribute .lower, among many others. Call help() on the Document or Token for more information.

>>> help(doc[0])
Help on Token object:

class Token(builtins.object)
 |  An individual token – i.e. a word, punctuation symbol, whitespace,
 |  etc.
 |  
...