How does SpaCy keep track of character and token offsets during tokenization?
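In SpaCy, there is a Span object that keeps the start and end offsets of the token/span: https://spacy.io/api/span#init
There is a `_recalculate_indices` method that seems to retrieve `token_by_start` and `token_by_end`, but that looks like all the recalculation is doing.
When looking at extraneous spaces, it does some smart alignment of the spans.
Does it recalculate after every regex execution? Does it keep track of each character's movement? Or does it do a span search after the regexes have executed?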
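Summary:
During tokenization, this is the part of the code that keeps track of offsets and characters.
Simple answer: it moves through the string one character at a time.
TL;DR is at the bottom.
Explained chunk by chunk: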
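It takes in the string to be tokenized and starts iterating through it one letter/space at a time.
This is a simple `for` loop over the string, where `uc` is the current character in the string.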
for uc in string:
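It first checks whether the current character is whitespace and compares that with the last `in_ws` setting. If the two are the same, it skips the block, jumps down, and increments `i += 1`.
`in_ws` is used to decide whether anything should be processed. The tokenizer needs to act on whitespace as well as on characters, so it cannot simply test `isspace()` and operate only when the result is `False`. Instead, when it first starts, `in_ws` is set to the result of `string[0].isspace()` and is then compared against each character's own `isspace()` result. If `string[0]` is a space, the first comparison evaluates to the same value, so it skips down, increments `i` (discussed later), and moves on to the next `uc`, until it reaches a `uc` whose whitespace status differs from the first one. In practice, this lets it run through a sequence of spaces after having handled the first one, or through a run of characters until it reaches the next whitespace boundary.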
if uc.isspace() != in_ws:
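To make the boundary test concrete, here is a minimal pure-Python sketch (not spaCy's actual Cython code) that records every index at which the text flips between whitespace and non-whitespace. The function name `boundary_indices` is made up for this illustration; only `in_ws`, `uc`, and `i` mirror names from the loop described above.

```python
def boundary_indices(string):
    """Return the indices where the text flips between whitespace and non-whitespace."""
    if not string:
        return []
    in_ws = string[0].isspace()  # whitespace status of the run we are currently inside
    boundaries = []
    for i, uc in enumerate(string):
        if uc.isspace() != in_ws:  # status changed, so a new run starts at i
            boundaries.append(i)
            in_ws = not in_ws
    return boundaries


print(boundary_indices("Hi  there you"))  # [2, 4, 9, 10]
```

Note that the second of the two consecutive spaces (index 3) produces no boundary, because the whitespace status has not changed.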
It keeps walking through the characters until it reaches the next boundary, keeping the index of the current character as `i`.
It tracks two index values: `start` and `i`. `start` is the start of the potential token it is currently on, and `i` is the index of the character it is currently looking at. When the loop starts, `start` is `0`. After one cycle, `start` is the index of the last space plus 1, which makes it the first letter of the current word.
It first checks whether `start` is less than `i`, which tells it whether there is a pending character sequence to look up in the cache and tokenize. This will make sense further down.
if start < i:
`span` is the chunk of text currently being considered for tokenization. It is the string sliced from the `start` index up to, but not including, the `i` index.
span = string[start:i]
It then takes the hash of that chunk (`start` through `i`) and checks the cache dictionary to see whether it has been processed already. If it has not, it calls the `_tokenize` method on that portion of the string.
key = hash_string(span)
cache_hit = self._try_cache(key, doc)
if not cache_hit:
    self._tokenize(doc, span, key)
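The cache lookup can be sketched in plain Python as follows. This is only an illustration under assumptions: Python's built-in `hash` and a `dict` stand in for `hash_string` and the tokenizer's cache, and `split_chunk` is a hypothetical placeholder for `_tokenize` that merely peels off trailing punctuation.

```python
def split_chunk(span):
    """Hypothetical stand-in for _tokenize: peel trailing punctuation off a chunk."""
    core = span.rstrip(".,!?")
    tail = span[len(core):]
    pieces = [core] if core else []
    pieces.extend(tail)  # each trailing punctuation character becomes its own piece
    return pieces


class CachingChunker:
    def __init__(self):
        self.cache = {}   # hash of chunk -> list of pieces
        self.misses = 0   # how many times split_chunk actually ran

    def process(self, span):
        key = hash(span)  # assumption: built-in hash replaces hash_string
        if key in self.cache:
            return self.cache[key]
        self.misses += 1
        pieces = split_chunk(span)
        self.cache[key] = pieces
        return pieces


chunker = CachingChunker()
for word in ["dog.", "cat", "dog."]:
    print(word, chunker.process(word))
print("misses:", chunker.misses)  # 2: the second "dog." is served from the cache
```

The point of the design is that frequent chunks pay the splitting cost only once; every later occurrence is a single lookup.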
Next it checks whether the current character `uc` is exactly a space character (`' '`). If it is, it marks the previous token as being followed by a space and resets `start` to `i + 1`, where `i` is the index of the current character.
if uc == ' ':
    doc.c[doc.length - 1].spacy = True
    start = i + 1
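One way to see why a single space can be skipped rather than stored as its own token: if each token remembers whether a space follows it, both the original text and every token's character offset can be recovered afterwards. The sketch below assumes that is what the `spacy` flag records; the helper names `rebuild_text` and `char_offsets` are invented for this example and are not spaCy functions.

```python
def rebuild_text(pieces):
    """Join (text, trailing_space) pairs back into the original string."""
    return "".join(text + (" " if trailing_space else "") for text, trailing_space in pieces)


def char_offsets(pieces):
    """Compute each piece's starting character offset from lengths and space flags."""
    offsets = []
    position = 0
    for text, trailing_space in pieces:
        offsets.append(position)
        position += len(text) + (1 if trailing_space else 0)
    return offsets


pieces = [("Hello", True), ("world", False), ("!", False)]
print(rebuild_text(pieces))  # Hello world!
print(char_offsets(pieces))  # [0, 6, 11]
```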
If the character is not a space, it sets `start` to the current character's index. It then flips `in_ws`, recording that a whitespace/non-whitespace boundary has just been crossed.
else:
    start = i
in_ws = not in_ws
It then increments `i` with `i += 1` and loops to the next character.
i += 1
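Putting the pieces together, the whole loop can be approximated in pure Python. This is a simplified sketch rather than spaCy's real tokenizer: it skips the cache and the call to `_tokenize`, and simply records each chunk with its character offsets. The name `chunk_offsets` and the final flush after the loop are assumptions of this sketch; the excerpt above does not show how the last chunk is handled.

```python
def chunk_offsets(string):
    """Split on whitespace boundaries; return (start, end, text, trailing_space) tuples."""
    chunks = []
    if not string:
        return chunks
    i = 0
    start = 0
    in_ws = string[0].isspace()
    for uc in string:
        if uc.isspace() != in_ws:          # crossed a whitespace boundary
            if start < i:                  # there is a pending chunk to emit
                chunks.append([start, i, string[start:i], False])
            if uc == " ":
                if chunks:
                    chunks[-1][3] = True   # previous chunk is followed by a space
                start = i + 1              # skip over that single space
            else:
                start = i
            in_ws = not in_ws
        i += 1
    if start < i:                          # assumed final flush for the last chunk
        chunks.append([start, i, string[start:i], False])
    return [tuple(chunk) for chunk in chunks]


text = "Hello  big world"
for start, end, chunk, trailing_space in chunk_offsets(text):
    assert text[start:end] == chunk        # offsets always point back into the input
    print(start, end, repr(chunk), trailing_space)
```

With two consecutive spaces, the first is folded into the preceding chunk's flag and the second is emitted as a whitespace chunk of its own, so no character of the input is lost.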
TL;DR
So, all of that said, it keeps track of the character it is on in the string using `i`, and it keeps the start of the current word using `start`. At each boundary, `start` is reset to the current character's index, or, when that character is a single space, to the space's index plus one (the start of the next word).