Creating a list of every word from a text file, without spaces or punctuation

Question

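I have a long text file (a screenplay). I want to turn this text file into a list (with every word separated) so that I can search through it later.

The code I have at the moment is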

file = open('screenplay.txt', 'r')
words = list(file.read().split())
print(words)

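I think this splits all the words into a list, but I can't get rid of the extra characters, such as commas and the periods at the ends of words. I would also like to make capital letters lowercase (because I want to be able to search in lowercase and have both capitalized and lowercase words show up). Any help would be great :)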

Answer 1

Col*_*nic 6

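Try the algorithm from /sf/answers/1256592081/, i.e. split the text on whitespace, then trim the punctuation. This carefully removes punctuation from the edges of words without harming apostrophes inside words such as "we're".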

>>> text
"'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"

>>> text.split()
["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]

>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']

You might want to add a .lower() as well.
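
Putting the two steps together might look like the sketch below. It assumes Python 3, and the helper name `words_from_text` is hypothetical rather than something from this answer. Dropping tokens that consist only of punctuation (a lone dash, for instance) is an extra choice that the answer does not mention.

```python
import string


def words_from_text(text):
    """Split on whitespace, trim punctuation from word edges, and lowercase."""
    words = []
    for token in text.split():
        word = token.strip(string.punctuation).lower()
        if word:  # skip tokens that were nothing but punctuation
            words.append(word)
    return words


sample = "'Oh, you can't help that,' said the Cat: 'we're all mad here.'"
print(words_from_text(sample))
```

Because only the edges of each token are trimmed, apostrophes inside words such as "can't" are kept.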

Answer 2

Bri*_*ius 5

This is a job for regular expressions!

For example:

import re

# Read the whole file; .lower() returns a copy with every uppercase character lowercased.
with open('screenplay.txt', 'r') as file:
    text = file.read().lower()
# Replace anything that is not a lowercase letter, a space, or an apostrophe with a space.
text = re.sub(r"[^a-z ']+", " ", text)
words = text.split()
print(words)
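
The same result can also be reached by matching the words directly instead of replacing everything else. The following is only a sketch under that assumption: it targets Python 3, and the function name `tokenize` is hypothetical, not part of this answer.

```python
import re


def tokenize(text):
    """Return every run of lowercase letters and apostrophes in the lowercased text."""
    return re.findall(r"[a-z']+", text.lower())


print(tokenize("Oh, you CAN'T help that!\nNext line."))
```

As with the substitution approach, quotation marks written as apostrophes stay attached to the word they touch, and digits or accented letters are discarded.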

Answer 3

Bri*_*n H 1

Use the replace method.

mystring = mystring.replace(",", "")

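If you want a more elegant solution that you will use many times over, read up on regular expressions (RegEx). Most languages support them, and they are very useful for more complex replacements and the like.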
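
Calling replace once per character gets repetitive when there are many punctuation marks to remove. One way to extend the idea is sketched below. It swaps in `str.translate` in place of repeated `replace` calls, which is not what this answer proposes. It assumes Python 3, and `remove_punctuation` is a hypothetical name.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DELETE_PUNCTUATION = str.maketrans("", "", string.punctuation)


def remove_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(_DELETE_PUNCTUATION)


print(remove_punctuation("Oh, you can't help that!").lower().split())
```

Unlike trimming only the edges of each word, this also removes apostrophes inside words, so "can't" becomes "cant".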
