IndexError: cannot fit 'int' into an index-sized integer

Tags: python, runtime-error, list, append, indexof

I'm trying to get my program to print out the index of every word and punctuation mark in a text file, in the order each appears. I've got that part working. The problem comes when I try to use those index positions to recreate the original text, punctuation included. Here is my code:

with open('newfiles.txt') as f:
    s = f.read()
import re
#Splitting string into a list using regex and a capturing group:
matches = [x.strip() for x in re.split("([a-zA-Z]+)", s) if x not in ['',' ']]
print (matches)
d = {} 
i = 1
list_with_positions = []
# the dictionary entries:
for match in matches:
    if match not in d.keys():
        d[match] = i
        i+=1
    list_with_positions.append(d[match])

print (list_with_positions)
file = open("newfiletwo.txt","w")
file.write (''.join(str(e) for e in list_with_positions))
file.close()
file = open("newfilethree.txt","w")
file.write(''.join(matches))
file.close()
word_base = None
with open('newfilethree.txt', 'rt') as f_base:
    word_base = [None] + [z.strip() for z in f_base.read().split()]

sentence_seq = None
with open('newfiletwo.txt', 'rt') as f_select:
    sentence_seq = [word_base[int(i)] for i in f_select.read().split()]

print(' '.join(sentence_seq))

As I said, the first part works fine, but then I get this error:

Traceback (most recent call last):
    File "E:\Python\Indexes.py", line 33, in <module>
       sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
    File "E:\Python\Indexes.py", line 33, in <listcomp>
       sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
IndexError: cannot fit 'int' into an index-sized integer

The error occurs when the program reaches the "sentence_seq" line near the bottom of the code.

newfiles is the original text file - a random article made up of several sentences with punctuation.

list_with_positions is a list of the actual positions at which each word occurs in the original text.

matches is the list of separated, distinct words - if a word is repeated in the file (which it is), matches should contain only the distinct words.

Does anyone know why I'm getting the error?

rog*_*osh 1

The problem with your approach is the use of ''.join(), because it concatenates everything with no spaces. So the immediate issue is that you are effectively calling split() on one long string of digits with no whitespace in it; what you get back is a single token containing 100+ digits. That number is far too large to use as an index, hence the overflow when it is treated as an int index. The deeper problem is that the positions can reach two digits and beyond; once the numbers are concatenated without separators, how do you expect split() to tell them apart?
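A tiny repro of both failure modes, using made-up numbers rather than the question's actual file:

```python
# Positions joined with no separator: the digit boundaries are lost
nums = [1, 2, 10, 3]
joined = ''.join(str(n) for n in nums)
print(joined)          # '12103' - is that 1,2,10,3 or 12,10,3 or 1,21,0,3?
print(joined.split())  # ['12103'] - no whitespace to split on, one giant token

# And a 100+ digit token overflows when used as a list index
try:
    [0][int('9' * 100)]
except IndexError as e:
    print(e)  # cannot fit 'int' into an index-sized integer
```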

Beyond that, you aren't treating the punctuation properly. ' '.join() is equally unsuitable for rebuilding the sentence, because your commas, full stops and so on end up with a space on both sides.
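For example, with some hypothetical tokens:

```python
tokens = ['Hello', ',', 'world', '.']
print(' '.join(tokens))  # 'Hello , world .' - stray spaces before the comma and full stop
```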

I've tried to stay as close as possible to your current code/approach (I don't see much value in changing the whole method when you're trying to understand where the problem came from), but it still feels shaky to me. I dropped the regex; maybe it's needed. I'm not immediately aware of a library that does this sort of thing, but there almost certainly must be a better way:

import string

punctuation_list = set(string.punctuation) # Has to be treated differently

word_base = []
index_dict = {}
with open('newfiles.txt', 'r') as infile:
    raw_data = infile.read().split()
    for index, item in enumerate(raw_data):
        index_dict[item] = index
        word_base.append(item)

with open('newfiletwo.txt', 'w') as outfile1, open('newfilethree.txt', 'w') as outfile2:
    for item in word_base:
        outfile1.write(str(item) + ' ')
        outfile2.write(str(index_dict[item]) + ' ')

reconstructed = ''
with open('newfiletwo.txt', 'r') as infile1, open('newfilethree.txt', 'r') as infile2:
    words = infile1.read().split()    # newfiletwo holds the words
    indices = infile2.read().split()  # newfilethree holds the indices
    reconstructed = ''.join(item + ' ' if item in punctuation_list else ' ' + item + ' ' for item in words)
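For comparison, here is a minimal sketch of the smallest fix to the original approach: write the positions separated by spaces (' '.join instead of ''.join), and glue punctuation onto the preceding word when rebuilding. The sample string s is made up for the demo, and file I/O is skipped to keep it self-contained:

```python
import re
import string

s = "Hello, world. Hello again!"

# Same capturing-group split as the question: words and punctuation as tokens
matches = [x.strip() for x in re.split("([a-zA-Z]+)", s) if x not in ['', ' ']]

# Assign each distinct token a position, exactly as in the question
d = {}
i = 1
list_with_positions = []
for match in matches:
    if match not in d:
        d[match] = i
        i += 1
    list_with_positions.append(d[match])

# The fix: separate the numbers with spaces so split() can recover them
encoded = ' '.join(str(e) for e in list_with_positions)

# Invert the token->position dict; positions start at 1, so pad index 0
word_base = [None] * (len(d) + 1)
for word, pos in d.items():
    word_base[pos] = word

tokens = [word_base[int(n)] for n in encoded.split()]

# Attach punctuation to the preceding token instead of spacing it out
reconstructed = ''.join(
    t if t[0] in string.punctuation else ' ' + t for t in tokens
).strip()
print(reconstructed)  # Hello, world. Hello again!
```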