I have a text file in the following format:
DELIMITER1
extract me
extract me
extract me
DELIMITER2
I want to extract every block of `extract me` lines that sits between DELIMITER1 and DELIMITER2 in a .txt file.
This is my current, bad code:
import re

def GetTheSentences(file):
    fileContents = open(file)
    start_rx = re.compile('DELIMITER')
    end_rx = re.compile('DELIMITER2')

    line_iterator = iter(fileContents)
    start = False
    for line in line_iterator:
        if re.findall(start_rx, line):
            start = True
            break
    while start:
        next_line = next(line_iterator)
        if re.findall(end_rx, next_line):
            break
        print next_line
        continue
    line_iterator.next()
Any ideas?
Bre*_*wey 21
You can simplify this to a single regular expression by using re.S, the DOTALL flag, which lets `.` match newlines as well.
import re

def GetTheSentences(infile):
    with open(infile) as fp:
        # re.S (DOTALL) lets '.' match newlines, so one match can span several lines.
        for result in re.findall('DELIMITER1(.*?)DELIMITER2', fp.read(), re.S):
            print(result)
            # extract me
            # extract me
            # extract me
This also uses the non-greedy operator .*?, so every non-overlapping block enclosed by a DELIMITER1-DELIMITER2 pair will be found, not just one.
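To make the difference concrete, here is a minimal sketch comparing the non-greedy and greedy patterns. The sample text (two blocks with a line of noise between them) is an assumption made up for illustration; it is not from the original question.

```python
import re

# Hypothetical sample text with two delimited blocks.
text = (
    "DELIMITER1\nfirst block\nDELIMITER2\n"
    "noise between blocks\n"
    "DELIMITER1\nsecond block\nDELIMITER2\n"
)

# Non-greedy: each match stops at the nearest DELIMITER2.
lazy = re.findall(r"DELIMITER1(.*?)DELIMITER2", text, re.S)

# Greedy: a single match runs on to the last DELIMITER2 in the text.
greedy = re.findall(r"DELIMITER1(.*)DELIMITER2", text, re.S)

# Each capture keeps the newlines next to the delimiters, so strip them.
blocks = [block.strip() for block in lazy]
print(blocks)
print(len(greedy))
```

With the greedy pattern, the single match swallows the noise line and the inner delimiters, which is usually not what you want here.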
If the delimiters are within a single line:
def get_sentences(filename):
    with open(filename) as file_contents:
        d1, d2 = '.', ','  # just example delimiters
        for line in file_contents:
            i1 = line.find(d1)
            # Look for the closing delimiter only after the opening one.
            i2 = line.find(d2, i1 + len(d1)) if i1 != -1 else -1
            if i2 != -1:
                # Skip the whole opening delimiter, not just one character.
                yield line[i1 + len(d1):i2]

sentences = list(get_sentences('path/to/my/file'))
If they are on their own lines:
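def get_sentences(filename):
    with open(filename) as file_contents:
        d1, d2 = '.', ','  # just example delimiters
        results = None  # None while we are outside a delimited block
        for line in file_contents:
            if d1 in line:
                results = []
            elif d2 in line:
                if results is not None:
                    yield results
                # Reset so lines after the block are not appended to it.
                results = None
            elif results is not None:
                results.append(line)

sentences = list(get_sentences('path/to/my/file'))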
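As a quick check of the own-line approach against the format from the question, here is a self-contained sketch. The function name `iter_blocks`, its parameters, and the trailing `ignore me` line are assumptions for illustration. It takes any iterable of lines instead of a filename and compares whole stripped lines to the delimiters, rather than using a substring test.

```python
def iter_blocks(lines, start="DELIMITER1", end="DELIMITER2"):
    """Yield each list of lines found between a start line and an end line."""
    block = None  # None means we are currently outside a block
    for line in lines:
        stripped = line.strip()
        if stripped == start:
            block = []
        elif stripped == end and block is not None:
            yield block
            block = None
        elif block is not None:
            block.append(stripped)

# Sample input in the question's format, plus one line outside the block.
sample = """DELIMITER1
extract me
extract me
extract me
DELIMITER2
ignore me
""".splitlines()

blocks = list(iter_blocks(sample))
print(blocks)
```

Because it reads one line at a time, passing an open file object instead of `sample` should also work without loading the whole file into memory.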