
How does spacy-io use multi-threading without the GIL?

from spacy.attrs import *
# All strings mapped to integers, for easy export to numpy
np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])

from reddit_corpus import RedditComments
reddit = RedditComments('/path/to/reddit/corpus')
# Parse a stream of documents, with multi-threading (no GIL!)
# Processes over 100,000 tokens per second.
for doc in nlp.pipe(reddit.texts, batch_size=10000, n_threads=4):
    # Multi-word expressions, such as names, dates etc
    # can be merged into single tokens
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, ent.ent_type_)
    # …


python multithreading gil spacy

use*_*820

lucky-day

6 votes

1 answer

1543 views

Reading a corpus of text files in spacy

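All the examples I have seen for spacy read in just a single, small text file. How does one load a whole corpus of text files into spacy?

I can do this with textacy by pickling all the texts in the corpus: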

docs = textacy.io.spacy.read_spacy_docs('E:/spacy/DICKENS/dick.pkl', lang='en')

for doc in docs:
    print(doc)


But it is not clear to me how to use this generator object (docs) for further analysis.
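For context, a minimal sketch of how a generator such as `docs` can be consumed, assuming only standard Python generator behaviour. `fake_docs` is a hypothetical stand-in that yields plain strings instead of parsed documents; it is not textacy's reader.

```python
def fake_docs():
    """Hypothetical stand-in for the `docs` generator; yields plain strings."""
    for text in ["It was the best of times.", "It was the worst of times."]:
        yield text


# Option 1: analyse each item inside the loop, keeping memory use low.
token_counts = [len(doc.split()) for doc in fake_docs()]

# Option 2: materialize the generator once, then reuse the list freely.
docs = list(fake_docs())
first, last = docs[0], docs[-1]

# A generator is exhausted after one pass: a second pass yields nothing.
gen = fake_docs()
first_pass = list(gen)
second_pass = list(gen)
```

Materializing with `list(...)` is convenient for repeated analysis but holds every document in memory at once, so for a large corpus the in-loop approach is probably the safer choice.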

Also, I would rather use spacy than textacy.
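A spacy-only approach could look roughly like the sketch below, which assumes the corpus is a directory of UTF-8 `.txt` files. The helper names `iter_texts` and `parse_corpus` are hypothetical; the only spacy call relied on is `nlp.pipe`, which accepts an iterable of strings and a `batch_size`.

```python
from pathlib import Path


def iter_texts(corpus_dir, pattern="*.txt", encoding="utf-8"):
    """Lazily yield the contents of each matching file, in sorted order."""
    for path in sorted(Path(corpus_dir).glob(pattern)):
        yield path.read_text(encoding=encoding)


def parse_corpus(nlp, corpus_dir, batch_size=50):
    """Stream every file through nlp.pipe and yield one parsed doc per file."""
    # `nlp` is assumed to be a loaded spacy pipeline supplied by the caller.
    yield from nlp.pipe(iter_texts(corpus_dir), batch_size=batch_size)
```

With spacy installed, `nlp` would typically come from `spacy.load(...)` with whichever model is available, followed by something like `for doc in parse_corpus(nlp, corpus_dir): ...`. Because both helpers are generators, only one batch of files needs to be in memory at a time.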

spacy also fails to read in a single large file (~2,000,000 characters).
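A likely cause is spacy's `nlp.max_length` limit, which defaults to 1,000,000 characters in spacy 2.x and later and rejects longer texts. One possible workaround, sketched here under that assumption, is to split the file at blank lines into smaller chunks before parsing. `chunk_text` and `max_chars` are hypothetical names, not part of spacy.

```python
def chunk_text(text, max_chars=100_000):
    """Split text into chunks of at most max_chars, breaking at blank lines.

    A single paragraph longer than max_chars is hard-split as a fallback.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # Hard-split any paragraph that cannot fit in one chunk on its own.
        pieces = [para[i:i + max_chars]
                  for i in range(0, len(para), max_chars)] or [""]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

The resulting chunks can then be passed to `nlp.pipe` like any other list of texts. Raising `nlp.max_length` is the other option, at the cost of more memory for the parser and named entity recognizer.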

Any help is appreciated...

Ravi

nlp pipeline generator multiprocessing spacy

Rav*_*avi

2020-02-11

3 votes

1 answer

6831 views

Tag statistics

spacy ×2

generator ×1

gil ×1

multiprocessing ×1

multithreading ×1

nlp ×1

pipeline ×1

python ×1
