Python: split a text file while keeping newline characters

Question

Chr*_*hes 4 python counter split newline

I'm trying to split a text file into words, with each \n counted as a word of its own.

My input is this text file:

War and Peace

by Leo Tolstoy/Tolstoi

I want a list like this as output:

['War','and','Peace','\n','\n','by','Leo','Tolstoy/Tolstoi']

Using .split() I get this:

['War', 'and', 'Peace\n\nby', 'Leo', 'Tolstoy/Tolstoi']

So I started writing a program that puts each \n after the word as a separate entry, using the following code:

for oldword in text:
    counter = 0
    newword = oldword
    while "\n" in newword:
        newword = newword.replace("\n","",1)
        counter += 1

    text[text.index(oldword)] = newword

    while counter > 0:
        text.insert(text.index(newword)+1, "\n")
        counter -= 1

However, the program seems to hang on the line counter -= 1, and I can't for the life of me figure out why.

Note: I realize that if this worked, the result would be ['Peaceby',"\n","\n"]; that is a separate problem to solve later.

Answer 1

Kas*_*mvd 6

You don't need such a complicated approach. You can simply use a regular expression with re.findall() to find all the words and newlines:

>>> import re
>>> s="""War and Peace
... 
... by Leo Tolstoy/Tolstoi"""
>>> 
>>> re.findall(r'\S+|\n',s)
['War', 'and', 'Peace', '\n', '\n', 'by', 'Leo', 'Tolstoy/Tolstoi']


'\S+|\n' matches either a run of one or more non-whitespace characters (\S+) or a single newline (\n).

If you want to get the text from a file, you can do the following:
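
To make the two alternatives concrete, here is a minimal sketch with an input that mixes double spaces, a tab, and blank lines. The names TOKEN and sample are just illustrative and are not part of the original answer. Spaces and tabs match neither alternative, so they are skipped, while every newline becomes its own entry.

```python
import re

# \S+ grabs a run of non-whitespace characters; \n grabs one newline.
# Spaces and tabs match neither branch, so findall() steps over them.
TOKEN = re.compile(r"\S+|\n")

sample = "War  and\tPeace\n\nby Leo"
tokens = TOKEN.findall(sample)
print(tokens)  # ['War', 'and', 'Peace', '\n', '\n', 'by', 'Leo']
```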

import re

with open('file_name') as f:
    words = re.findall(r'\S+|\n', f.read())
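
If the file is large, reading it in one go may not be ideal. A possible variation, sketched here under the assumption that the file is UTF-8 text, applies the same pattern line by line and yields the tokens lazily. The helper name iter_tokens and its encoding parameter are hypothetical, not something from the original answer.

```python
import re

TOKEN = re.compile(r"\S+|\n")

def iter_tokens(path, encoding="utf-8"):
    """Yield words and newline tokens from a text file, one line at a time."""
    # Text mode translates platform line endings to "\n" by default,
    # so the pattern only has to look for "\n".
    with open(path, encoding=encoding) as f:
        for line in f:
            yield from TOKEN.findall(line)
```

Something like list(iter_tokens('file_name')) should then give the same list as the read-everything version.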

Read more about regular expressions at http://www.regular-expressions.info/
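
As for why the loop in the question hangs: as far as I can tell, it is because the list is modified while the for loop is iterating over it. Each inserted "\n" is later visited as oldword, turned into an empty string, and followed by yet another inserted "\n", so the list keeps growing and the loop never reaches the end. If you would rather avoid regular expressions, a rough sketch that builds a new list instead of editing the old one could look like this. The function name split_keep_newlines is made up for illustration, and the code assumes lines end in "\n".

```python
def split_keep_newlines(text):
    """Split text into words, keeping each newline as its own entry."""
    result = []
    # keepends=True leaves the trailing "\n" on each line so we can detect it.
    for line in text.splitlines(keepends=True):
        result.extend(line.split())   # words on this line (none for a blank line)
        if line.endswith("\n"):
            result.append("\n")       # one entry per line break
    return result

sample = "War and Peace\n\nby Leo Tolstoy/Tolstoi"
print(split_keep_newlines(sample))
# ['War', 'and', 'Peace', '\n', '\n', 'by', 'Leo', 'Tolstoy/Tolstoi']
```

Because nothing is inserted into the list being looped over, the loop ends after the last line.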
