ValueError: [E088] Text of length 1027203 exceeds maximum of 1000000. spacy

Question

I am trying to build a corpus of words from a text. I am using spaCy. Here is my code:

import spacy
nlp = spacy.load('fr_core_news_md')
f = open("text.txt")
doc = nlp(''.join(ch for ch in f.read() if ch.isalnum() or ch == " "))
f.close()
del f
words = []
for token in doc:
    if token.lemma_ not in words:
        words.append(token.lemma_)

f = open("corpus.txt", 'w')
f.write("Number of words:" + str(len(words)) + "\n" + ''.join([i + "\n" for i in sorted(words)]))
f.close()

But it raises this exception:

ValueError: [E088] Text of length 1027203 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

I tried something like this:

import spacy
nlp = spacy.load('fr_core_news_md')
nlp.max_length = 1027203
f = open("text.txt")
doc = nlp(''.join(ch for ch in f.read() if ch.isalnum() or ch == " "))
f.close()
del f
words = []
for token in doc:
    if token.lemma_ not in words:
        words.append(token.lemma_)

f = open("corpus.txt", 'w')
f.write("Number of words:" + str(len(words)) + "\n" + ''.join([i + "\n" for i in sorted(words)]))
f.close()

But I got the same error:

ValueError: [E088] Text of length 1027203 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

How can I fix this?

Answer 1

Rah*_*l P 8

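I disagree with the answer above. I think nlp.max_length does take effect, but the value you set is too low. It looks like you set it to exactly the number from the error message. Increase nlp.max_length to slightly more than that number: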

nlp.max_length = 1030000 # or even higher

Ideally, it should work after this.
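Instead of hard-coding the number, the limit can be derived from the text itself, since the error message points out that the limit is counted in characters and can be checked with `len(text)`. Below is a minimal sketch of that idea. The helper `raise_limit_for` and its `headroom` parameter are hypothetical names for illustration, not part of spaCy; the only spaCy attribute it touches is `nlp.max_length`.

```python
def raise_limit_for(nlp, text, headroom=1000):
    """Raise nlp.max_length so that `text` fits, leaving some headroom.

    `nlp` is expected to be a loaded spaCy pipeline (anything with a
    `max_length` attribute works). The limit is never lowered.
    """
    needed = len(text) + headroom
    if needed > nlp.max_length:
        nlp.max_length = needed
    return nlp.max_length


# Intended use (assumes spaCy and the fr_core_news_md model are installed):
#   nlp = spacy.load('fr_core_news_md')
#   raise_limit_for(nlp, text)
#   doc = nlp(text)
```

Keep in mind the memory warning in the error message: raising the limit only removes the check, it does not reduce how much memory the parser and NER need.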

So your code could be changed to this:

import spacy

nlp = spacy.load('fr_core_news_md')
# Raise the limit above len(text); the default is 1,000,000 characters.
nlp.max_length = 1030000 # or higher

# Keep only alphanumeric characters and spaces before processing.
with open("text.txt") as f:
    text = ''.join(ch for ch in f.read() if ch.isalnum() or ch == " ")
doc = nlp(text)

# A set keeps each lemma once without scanning a list for every token.
words = {token.lemma_ for token in doc}

with open("corpus.txt", 'w') as f:
    f.write("Number of words:" + str(len(words)) + "\n")
    f.writelines(lemma + "\n" for lemma in sorted(words))
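If memory becomes a problem after raising the limit, a different approach (not what the code above does) is to split the text into smaller pieces and feed them through `nlp.pipe`, so no single input comes near the limit. The sketch below assumes the text is already reduced to words separated by spaces, as in the question. `split_into_chunks`, `unique_lemmas` and `max_chars` are hypothetical names chosen for illustration. Lemmas may differ slightly near chunk boundaries, because the model sees less context there.

```python
def split_into_chunks(text, max_chars=100_000):
    """Split text on whitespace into pieces of at most max_chars characters.

    A single word longer than max_chars still becomes its own chunk.
    """
    chunks, current, size = [], [], 0
    for word in text.split():
        # The extra 1 accounts for the joining space in a non-empty chunk.
        extra = len(word) + (1 if current else 0)
        if current and size + extra > max_chars:
            chunks.append(" ".join(current))
            current, size = [], 0
            extra = len(word)
        current.append(word)
        size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


def unique_lemmas(nlp, text, max_chars=100_000):
    """Collect the unique lemmas of text, processing it chunk by chunk."""
    lemmas = set()
    for doc in nlp.pipe(split_into_chunks(text, max_chars)):
        lemmas.update(token.lemma_ for token in doc)
    return lemmas


# Intended use (assumes spaCy and the fr_core_news_md model are installed):
#   nlp = spacy.load('fr_core_news_md', disable=['parser', 'ner'])
#   words = unique_lemmas(nlp, text)
```

Disabling the parser and NER in the usage comment follows the hint in the error message: those are the components it names as memory-hungry, and this script only reads `token.lemma_`. Whether the lemmas stay identical with those components disabled is an assumption worth checking on your spaCy version.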
