How does SpaCy keep track of character and token offsets during tokenization?

alv*_*vas 4 python algorithm nlp cython spacy

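In SpaCy, there's a Span object that keeps the start and end offsets of the token/span: https://spacy.io/api/span#init

There's a _recalculate_indices method that seems to retrieve token_by_start and token_by_end, but that looks like all the recalculation is doing.

When it encounters extraneous spaces, it does some smart alignment of the spans.

Does it recalculate after every regex execution? Does it keep track of the characters' movement? Does it do a post-regex-execution span search?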

MyN*_*leb 5

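Summary:
The loop walked through here is the part of tokenization that keeps track of offsets and characters.

Simple answer: it moves through the string one character at a time.

TL;DR at the bottom.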


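Explained chunk by chunk:

It takes in the string to be tokenized and starts iterating over it one letter/space at a time.

It is a simple for loop over the string, where uc is the current character in the string.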

for uc in string:

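It first checks whether the current character is whitespace and compares that against the last in_ws setting. If the two are the same, it skips the block below and just increments i += 1.

in_ws is used to know whether it should process anything at this character. The authors want to act on whitespace as well as on other characters, so they can't just track isspace() and operate only when it is False. Instead, when the loop first starts, in_ws is set to the result of string[0].isspace(), and each character is then compared against it. If string[0] is whitespace, the comparison finds no difference, so it skips down, increments i (discussed later) and moves on to the next uc, until it reaches a uc that differs from the first one. In practice this lets it run through several spaces after handling the first one, or through several non-space characters, until it reaches the next whitespace boundary.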

    if uc.isspace() != in_ws:

It will keep going through characters until it reaches the next boundary, keeping the index of the current character as i.
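
As a rough illustration of the boundary test described above, here is a minimal pure-Python sketch. It is not spaCy's code: boundary_indices is a hypothetical helper that only records the indices at which uc.isspace() != in_ws fires, and it flips in_ws each time, as the walkthrough describes further down.

```python
def boundary_indices(string):
    """Return every index i where uc.isspace() differs from in_ws."""
    if not string:
        return []
    in_ws = string[0].isspace()  # initial state mirrors the first character
    hits = []
    for i, uc in enumerate(string):
        if uc.isspace() != in_ws:
            hits.append(i)
            in_ws = not in_ws  # we just crossed a whitespace boundary
    return hits


print(boundary_indices("hi  there"))  # [2, 4]
```

Note that the second space (index 3) does not trigger anything: only the switch into whitespace and the switch back out do.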

It tracks two index values: start and i. start is the start of the potential token that it is on, and i is the ending character it is looking at. When the script starts, start will be 0. After a cycle of this, start will be the index of the last space plus 1 which would make it the first letter of the current word.

It first checks whether start is less than i, that is, whether there is a non-empty run of characters to look up in the cache and tokenize. The two are equal right after a single space, because start has already been moved to i + 1 and i has since caught up with it.

        if start < i:

span is the word that is currently being looked at for tokenization. It is the string sliced by the start index value through the i index value.

            span = string[start:i]

It then hashes the span (start through i) and checks the cache to see whether that span has been processed already. If it has not, it calls the _tokenize method on that portion of the string.

            key = hash_string(span)
            cache_hit = self._try_cache(key, doc)
            if not cache_hit:
                self._tokenize(doc, span, key)
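
To make the cache lookup concrete, here is a small sketch of the same hash-then-check pattern using a plain dict. Everything in it is an assumption for illustration: cache, calls, slow_tokenize and tokenize_span are hypothetical names, and Python's built-in hash() stands in for spaCy's hash_string.

```python
cache = {}  # hypothetical stand-in for the tokenizer's cache
calls = []  # records which spans actually needed the slow path


def slow_tokenize(span):
    """Placeholder for the expensive per-span work done on a cache miss."""
    calls.append(span)
    return [span]


def tokenize_span(span):
    key = hash(span)  # stand-in for hash_string(span)
    if key in cache:  # cache hit: reuse the earlier result
        return cache[key]
    tokens = slow_tokenize(span)  # cache miss: do the work once
    cache[key] = tokens
    return tokens


tokenize_span("hello")
tokenize_span("hello")
print(calls)  # ['hello']
```

The second call for "hello" never reaches slow_tokenize, which is the point of hashing the span before tokenizing it.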

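Next it checks whether the current character uc is exactly a single space (' '). If it is, it flags the most recently added token as being followed by a space (spacy = True) and resets start to i + 1, where i is the index of the current character, so that the space itself is not included in the next span.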

        if uc == ' ':
            doc.c[doc.length - 1].spacy = True
            start = i + 1

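If the character is not exactly ' ' (it may be a non-whitespace character, or other whitespace such as a tab), it sets start to the current character's index. In either case it then flips in_ws, because a boundary between whitespace and non-whitespace has just been crossed.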

        else:
            start = i
        in_ws = not in_ws
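
The two ways start gets reset can be traced with a short sketch. This is an illustration under assumptions, not spaCy's implementation: trace_start is a hypothetical helper that keeps only the index bookkeeping and records (i, new start) at every boundary.

```python
def trace_start(string):
    """Record (i, new_start) each time a whitespace boundary is crossed."""
    i, start = 0, 0
    in_ws = string[0].isspace() if string else False
    trace = []
    for uc in string:
        if uc.isspace() != in_ws:
            # An exact ' ' is skipped over; anything else starts the next span.
            start = i + 1 if uc == " " else i
            trace.append((i, start))
            in_ws = not in_ws
        i += 1
    return trace


print(trace_start("ab cd\tef"))  # [(2, 3), (3, 3), (5, 5), (6, 6)]
```

The space at index 2 pushes start past itself to 3, while the tab at index 5 leaves start at 5, so the tab stays inside the next span.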

Finally, it increments i and loops to the next character.

    i += 1

TL;DR
So, all of that said: it keeps track of the character it is currently on using i, and of where the current span begins using start. At each boundary between whitespace and non-whitespace, the span string[start:i] is handed off for tokenization (if it is non-empty), and start is reset: to i + 1 when the boundary character is a single space, so the next word starts right after it, or to i otherwise.
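
Putting the pieces together, the whole loop can be approximated in pure Python. This is a simplified sketch, not spaCy's Cython code: split_with_offsets is a hypothetical name, the cache and _tokenize steps are replaced by simply recording the span, and the final flush after the loop is an assumption added here so that the last span is not dropped.

```python
def split_with_offsets(string):
    """Return (text, start, end, followed_by_space) tuples for each span."""
    pieces = []
    if not string:
        return pieces
    i = 0
    start = 0
    in_ws = string[0].isspace()
    for uc in string:
        if uc.isspace() != in_ws:
            if start < i:
                # Stand-in for the cache lookup / _tokenize call.
                pieces.append([string[start:i], start, i, False])
            if uc == " ":
                if pieces:
                    pieces[-1][3] = True  # mirrors setting .spacy = True
                start = i + 1
            else:
                start = i
            in_ws = not in_ws
        i += 1
    if start < i:
        # Assumed final flush for whatever remains after the last boundary.
        pieces.append([string[start:i], start, i, False])
    return [tuple(p) for p in pieces]


for piece in split_with_offsets("Hello  world !"):
    print(piece)
# ('Hello', 0, 5, True)
# (' ', 6, 7, False)
# ('world', 7, 12, True)
# ('!', 13, 14, False)
```

In this example the first space after "Hello" is absorbed into the followed_by_space flag, the extra space becomes a span of its own, and the start < i check is what prevents an empty span from being emitted between "world " and "!".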