How does spaCy keep track of character and token offsets during tokenization?

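in_ws is used to decide whether a span should be processed. The code needs to act on runs of whitespace as well as on runs of other characters, so it cannot simply call isspace() and operate only when the result is False. Instead, when the loop first starts, in_ws is set to the result of string[0].isspace(), and each character's isspace() is then compared against it. If string[0] is whitespace, the comparison evaluates to the same value, so the code skips the block, increments i (discussed later) and moves on to the next uc, until it reaches a uc whose isspace() differs from the first one. In practice this lets it run through several whitespace characters after handling the first one, or through several non-whitespace characters, until it reaches the next whitespace boundary.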

    if uc.isspace() != in_ws:


It keeps iterating over the characters until it reaches the next boundary, keeping the index of the current character in i.
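To make the boundary check concrete, here is a minimal pure-Python sketch of the same toggle. It is not spaCy's code: the function name whitespace_boundaries is made up for illustration, and it only records the indices at which uc.isspace() stops matching in_ws.

```python
def whitespace_boundaries(string):
    """Return the indices where the text switches between
    whitespace and non-whitespace (hypothetical helper)."""
    if not string:
        return []
    boundaries = []
    in_ws = string[0].isspace()  # state of the run the text starts in
    for i, uc in enumerate(string):
        if uc.isspace() != in_ws:
            # uc belongs to a different kind of run than the previous character
            boundaries.append(i)
            in_ws = not in_ws
    return boundaries


print(whitespace_boundaries("Hello  world\t!"))  # [5, 7, 12, 13]
```

Note that the two consecutive spaces at indices 5 and 6 produce only one boundary, because the second space matches the current in_ws state.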

It tracks two index values: start and i. start is the index where the current candidate token begins, and i is the index of the character currently being examined. When the loop starts, start is 0. After a word followed by a space has been processed, start is the index of that space plus 1, which makes it the first letter of the next word.

It first checks whether start is less than i, that is, whether there is a non-empty span between start and the current character. Only then does it check the cache and tokenize that character sequence. Why the span can be empty will make sense further down.

        if start < i:


span is the chunk of text currently being considered for tokenization. It is the string sliced from index start up to, but not including, index i.

            span = string[start:i]


It then takes the hash of the span (start up to i) and checks the cache to see whether that span has been processed already. If it has not, it calls the _tokenize method on that portion of the string.

            key = hash_string(span)
            cache_hit = self._try_cache(key, doc)
            if not cache_hit:
                self._tokenize(doc, span, key)

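The check-the-cache-first pattern can be sketched in plain Python as follows. Everything here is hypothetical: CachedSplitter and split_trailing_punct are illustrative names, a dict keyed by the span text stands in for hash_string plus spaCy's internal cache, and the toy punctuation rule stands in for _tokenize.

```python
class CachedSplitter:
    """Hypothetical helper: remembers the pieces produced for each span."""

    def __init__(self, split_fn):
        self._split_fn = split_fn  # the per-span work we want to avoid repeating
        self._cache = {}           # span text -> tuple of pieces
        self.misses = 0

    def pieces(self, span):
        cached = self._cache.get(span)
        if cached is None:         # cache miss: do the work once and store it
            self.misses += 1
            cached = tuple(self._split_fn(span))
            self._cache[span] = cached
        return cached


def split_trailing_punct(span):
    """Toy rule: peel one trailing punctuation mark off a span."""
    if len(span) > 1 and span[-1] in ".,!?":
        return [span[:-1], span[-1]]
    return [span]


splitter = CachedSplitter(split_trailing_punct)
for word in ["dog.", "cat", "dog."]:
    print(word, splitter.pieces(word))
print("misses:", splitter.misses)  # misses: 2
```

The second "dog." is served from the cache, so the splitting function runs only twice for three spans.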

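Next it checks whether the current character uc is exactly a single space character (' '), as opposed to any other whitespace. If it is, it marks the most recently added token as being followed by a space (spacy = True) and resets start to i + 1, where i is the index of the current character, so the space itself is skipped rather than tokenized.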

        if uc == ' ':
            doc.c[doc.length - 1].spacy = True
            start = i + 1


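If the character is not exactly ' ' (it is either a non-whitespace character or some other whitespace such as a tab or newline), it sets start to the current character's index. In both cases it then flips in_ws, recording that the loop has crossed from a whitespace run into a non-whitespace run or vice versa.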

        else:
            start = i
        in_ws = not in_ws


It then increments i and loops to the next character.

    i += 1


TL;DR
So all of that said, it keeps track of the character in the string that it is on using i, and it keeps the start of the current span using start. Whenever it crosses a boundary, it tokenizes string[start:i] if that span is non-empty and then resets start: to i + 1 if the current character is a single space (the start of the next word), or to i otherwise (the current character begins the next span).
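Putting the pieces together, the loop can be approximated in pure Python. This is a sketch under assumptions, not spaCy's implementation: split_with_offsets is a made-up name, the cache and _tokenize steps are replaced by simply recording each span, the trailing_space field stands in for the spacy flag, and the flush of the final span after the loop is my assumption, since the excerpts above stop at i += 1.

```python
def split_with_offsets(string):
    """Return (start, end, text, trailing_space) tuples for each span
    found by the whitespace-boundary loop (hypothetical sketch)."""
    spans = []
    if not string:
        return spans
    i = 0
    start = 0
    in_ws = string[0].isspace()
    for uc in string:
        if uc.isspace() != in_ws:
            if start < i:
                # Non-empty span between start and the boundary: record it.
                spans.append([start, i, string[start:i], False])
            if uc == " ":
                # A single space is absorbed as the previous span's trailing space.
                if spans:
                    spans[-1][3] = True
                start = i + 1
            else:
                start = i
            in_ws = not in_ws
        i += 1
    if start < i:
        # Assumption: whatever remains after the last boundary is flushed here.
        spans.append([start, i, string[start:], False])
    return [tuple(s) for s in spans]


for span in split_with_offsets("a  b c"):
    print(span)
# (0, 1, 'a', True)
# (2, 3, ' ', False)
# (3, 4, 'b', True)
# (5, 6, 'c', False)
```

In "a  b c" the first space after "a" is absorbed into the trailing_space flag, while the extra space becomes a span of its own. After "b " is handled, start equals i at the next boundary, which is the empty-span case that the start < i check guards against. Joining each text plus " " where the flag is set reproduces the original string.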
