I have some code that reads a file of names and creates a list:
names_list = open("names", "r").read().splitlines()
Each name is separated by a newline, like so:
Allman
Atkinson
Behlendorf
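I want to ignore any lines that contain only whitespace. I know I can do this by writing a loop, checking each line I read, and adding it to a list if it's not blank.
I was just wondering if there is a more Pythonic way of doing it?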
aar*_*ing 61
I would stack generator expressions:
with open(filename) as f_in:
    lines = (line.rstrip() for line in f_in)  # All lines including the blank ones
    lines = (line for line in lines if line)  # Non-blank lines
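Now, lines is all of the non-blank lines. This saves you from having to call strip on each line twice. If you want a list of lines, then you can just do:
with open(filename) as f_in:
    lines = (line.rstrip() for line in f_in)
    lines = list(line for line in lines if line)  # Non-blank lines in a list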
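One thing worth checking when stacking generators like this: they are lazy, so nothing is read until they are consumed. The sketch below (Python 3; the function names and the `write_sample` helper are made up for illustration) compares consuming the generator after the `with` block has closed the file against building the list while the file is still open.

```python
import os
import tempfile


def lazy_nonblank(path):
    """Builds the generators inside the with block but returns them unconsumed."""
    with open(path) as f_in:
        lines = (line.rstrip() for line in f_in)
        lines = (line for line in lines if line)
    return lines  # the file is already closed at this point


def eager_nonblank(path):
    """Materializes the non-blank lines while the file is still open."""
    with open(path) as f_in:
        lines = (line.rstrip() for line in f_in)
        return [line for line in lines if line]


def write_sample(text):
    """Hypothetical helper: writes text to a temporary file and returns its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    return path


sample_path = write_sample("Allman\n\n   \nAtkinson\nBehlendorf\n")
try:
    names = eager_nonblank(sample_path)
    try:
        list(lazy_nonblank(sample_path))
        lazy_failed = False
    except ValueError:
        # Reading from a closed file raises ValueError
        lazy_failed = True
finally:
    os.remove(sample_path)
```

So if you keep the generator form, iterate over it inside the `with` block; if you need the result afterwards, build the list before the block ends.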
You can also do it in a one-liner (excluding the with statement), but it's no more efficient and harder to read:
with open(filename) as f_in:
    lines = list(line for line in (l.strip() for l in f_in) if line)
I agree that this is ugly because of the repeated tokens. If you prefer, you could just write a generator:
def nonblank_lines(f):
    for l in f:
        line = l.rstrip()
        if line:
            yield line
Then call it like:
with open(filename) as f_in:
    for line in nonblank_lines(f_in):
        pass  # Stuff
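A small illustration of the generator approach, assuming Python 3: `nonblank_lines` only needs an iterable of strings, so it can be tried on an in-memory `io.StringIO` or a plain list without touching the disk. Note that it uses `rstrip()`, so leading whitespace on a name is kept; swap in `strip()` if both sides should be trimmed.

```python
import io


def nonblank_lines(f):
    """Yields each line with trailing whitespace removed, skipping blank ones."""
    for l in f:
        line = l.rstrip()
        if line:
            yield line


# Works on a file-like object...
fake_file = io.StringIO("Allman\n   \n\n  Atkinson  \nBehlendorf\n")
from_file = list(nonblank_lines(fake_file))

# ...and on any other iterable of strings
from_list = list(nonblank_lines(["Allman\n", "\t\n", "Behlendorf"]))
```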
Update 2:
with open(filename) as f_in:
    lines = filter(None, (line.rstrip() for line in f_in))
and on CPython (with deterministic reference counting):
lines = filter(None, (line.rstrip() for line in open(filename)))
In Python 2, use itertools.ifilter if you want a generator; in Python 3, just pass the whole thing to list if you want a list.
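To make the Python 3 behaviour concrete, here is a minimal sketch (the `nonblank` helper name is an assumption, not part of the answer above). In Python 3, `filter` returns a lazy iterator rather than a list, so wrap it in `list` when you need one.

```python
import io


def nonblank(lines):
    """Lazily yields right-stripped, non-empty lines (filter is lazy in Python 3)."""
    return filter(None, (line.rstrip() for line in lines))


source = io.StringIO("Allman\n\nAtkinson\n \nBehlendorf\n")
lazy = nonblank(source)
is_list = isinstance(lazy, list)  # False on Python 3: it is a filter object
names = list(lazy)                # materialize while the source is still open
```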
Fel*_*ing 17
You could use a list comprehension:
with open("names", "r") as f:
    names_list = [line.strip() for line in f if line.strip()]
Update: removed the unnecessary readlines().
To avoid calling line.strip() twice, you can use a generator:
names_list = [l for l in (line.strip() for line in f) if l]
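On Python 3.8 or later (an assumption, since the answer predates it), an assignment expression is another way to strip each line only once. A minimal sketch, using an in-memory `io.StringIO` in place of the real file:

```python
import io

# Stand-in for the opened names file
f = io.StringIO("Allman\n  \nAtkinson \n\nBehlendorf\n")

# `stripped` is bound once per line and reused as both the filter and the value
names_list = [stripped for line in f if (stripped := line.strip())]
```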
小智 8
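I think there is a simple solution, which I used recently after going through so many of the answers here.
with open(file_name) as f_in:
    for line in f_in:
        if len(line.split()) == 0:
            continue
        # process the non-blank line here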
This does the same job, ignoring all blank lines.
If you want, you can just put what you had in a list comprehension:
names_list = [line for line in open("names.txt", "r").read().splitlines() if line]
or
all_lines = open("names.txt", "r").read().splitlines()
names_list = [name for name in all_lines if name]
splitlines() has already removed the line endings.
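One caveat, sketched below with a made-up sample string: because splitlines() only removes the line endings, a bare `if line` test drops empty lines but keeps lines made of spaces or tabs. Testing `line.strip()` instead filters those out as well, which is what the question asks for.

```python
text = "Allman\n\n   \nAtkinson\n\t\nBehlendorf\n"

# Drops only truly empty lines; whitespace-only lines survive
only_empty_removed = [line for line in text.splitlines() if line]

# Drops empty and whitespace-only lines
whitespace_removed = [line for line in text.splitlines() if line.strip()]
```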
I don't think those are as clear as just looping explicitly, though:
names_list = []
with open('names.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if line:
            names_list.append(line)
Edit:
Although, filter looks quite readable and concise:
names_list = filter(None, open("names.txt", "r").read().splitlines())
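When text has to be processed in order to extract data from it, I always think of regexes first, because:
- as far as I know, regexes were invented for exactly that
- iterating over lines seems clumsy to me: it essentially consists of searching for the newlines and then searching each line for the data to extract, which makes two searches instead of a single direct one with a regex
- bringing regexes into play is easy; only writing the regex string to be compiled into a regex object is sometimes hard, but in those cases the line-by-line treatment would be complicated too
For the problem discussed here, a regex solution is fast and easy to write: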
import re
names = re.findall(r'\S+', open(filename).read())
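A caveat worth illustrating: `\S+` matches runs of non-whitespace, so it matches the line-based answers only when each name is a single token. The sketch below uses a made-up sample with multi-word names and shows one possible line-anchored pattern (my own assumption, not part of the answer) that keeps each line whole while still skipping blank lines.

```python
import re

text = "Van Buren\n\n   \nAllman \nde la Cruz\n"

# Splits on any whitespace, so multi-word names are broken apart
tokens = re.findall(r"\S+", text)

# One match per non-blank line, trimmed of surrounding spaces and tabs
whole_lines = re.findall(r"^[ \t]*(\S.*?)[ \t]*$", text, re.MULTILINE)
```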
I compared the speed of several solutions:
import re
from time import clock

A,AA,B1,B2,BS,reg = [],[],[],[],[],[]
D,Dsh,C1,C2 = [],[],[],[]
F1,F2,F3 = [],[],[]

def nonblank_lines(f):
    for l in f:
        line = l.rstrip()
        if line: yield line

def short_nonblank_lines(f):
    for l in f:
        line = l[0:-1]
        if line: yield line

for essays in xrange(50):

    te = clock()
    with open('raa.txt') as f:
        names_listA = [line.strip() for line in f if line.strip()] # Felix Kling
    A.append(clock()-te)

    te = clock()
    with open('raa.txt') as f:
        names_listAA = [line[0:-1] for line in f if line[0:-1]] # Felix Kling with line[0:-1]
    AA.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        namesB1 = [ name for name in (l.strip() for l in f_in) if name ] # aaronasterling without list()
    B1.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        namesB2 = [ name for name in (l[0:-1] for l in f_in) if name ] # aaronasterling without list() and with line[0:-1]
    B2.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        namesBS = [ name for name in f_in.read().splitlines() if name ] # a list comprehension with read().splitlines()
    BS.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f:
        xreg = re.findall('\S+',f.read()) # eyquem
    reg.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        linesC1 = list(line for line in (l.strip() for l in f_in) if line) # aaronasterling
    C1.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        linesC2 = list(line for line in (l[0:-1] for l in f_in) if line) # aaronasterling with line[0:-1]
    C2.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        yD = [ line for line in nonblank_lines(f_in) ] # aaronasterling update
    D.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        yDsh = [ name for name in short_nonblank_lines(f_in) ] # nonblank_lines with line[0:-1]
    Dsh.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        linesF1 = filter(None, (line.rstrip() for line in f_in)) # aaronasterling update 2
    F1.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        linesF2 = filter(None, (line[0:-1] for line in f_in)) # aaronasterling update 2 with line[0:-1]
    F2.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        linesF3 = filter(None, f_in.read().splitlines()) # aaronasterling update 2 with read().splitlines()
    F3.append(clock()-te)

print 'names_listA == names_listAA==namesB1==namesB2==namesBS==xreg\n is ',\
      names_listA == names_listAA==namesB1==namesB2==namesBS==xreg
print 'names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3\n is ',\
      names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3,'\n\n\n'

def displ((fr,it,what)):
    print fr + str( min(it) )[0:7] + ' ' + what

map(displ,(('* ', A, '[line.strip() for line in f if line.strip()] * Felix Kling\n'),
           (' ', B1, ' [name for name in (l.strip() for l in f_in) if name ] aaronasterling without list()'),
           ('* ', C1, 'list(line for line in (l.strip() for l in f_in) if line) * aaronasterling\n'),
           ('* ', reg, 're.findall("\S+",f.read()) * eyquem\n'),
           ('* ', D, '[ line for line in nonblank_lines(f_in) ] * aaronasterling update'),
           (' ', Dsh, '[ line for line in short_nonblank_lines(f_in) ] nonblank_lines with line[0:-1]\n'),
           ('* ', F1 , 'filter(None, (line.rstrip() for line in f_in)) * aaronasterling update 2\n'),
           (' ', B2, ' [name for name in (l[0:-1] for l in f_in) if name ] aaronasterling without list() and with line[0:-1]'),
           (' ', C2, 'list(line for line in (l[0:-1] for l in f_in) if line) aaronasterling with line[0:-1]\n'),
           (' ', AA, '[line[0:-1] for line in f if line[0:-1] ] Felix Kling with line[0:-1]\n'),
           (' ', BS, '[name for name in f_in.read().splitlines() if name ] a list comprehension with read().splitlines()\n'),
           (' ', F2 , 'filter(None, (line[0:-1] for line in f_in)) aaronasterling update 2 with line[0:-1]'),
           (' ', F3 , 'filter(None, f_in.read().splitlines() aaronasterling update 2 with read().splitlines()'))
    )
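The regex solution is straightforward and neat. Still, it isn't among the fastest. aaronasterling's solution with filter() is surprisingly fast to me (I wasn't aware of how fast filter() is), and the time of the most optimized solution goes down to 27% of the longest time. I wonder what makes the filter-splitlines combination work such a miracle: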
names_listA == names_listAA==namesB1==namesB2==namesBS==xreg
is True
names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3
is True
* 0.08266 [line.strip() for line in f if line.strip()] * Felix Kling
0.07535 [name for name in (l.strip() for l in f_in) if name ] aaronasterling without list()
* 0.06912 list(line for line in (l.strip() for l in f_in) if line) * aaronasterling
* 0.06612 re.findall("\S+",f.read()) * eyquem
* 0.06486 [ line for line in nonblank_lines(f_in) ] * aaronasterling update
0.05264 [ line for line in short_nonblank_lines(f_in) ] nonblank_lines with line[0:-1]
* 0.05451 filter(None, (line.rstrip() for line in f_in)) * aaronasterling update 2
0.04689 [name for name in (l[0:-1] for l in f_in) if name ] aaronasterling without list() and with line[0:-1]
0.04582 list(line for line in (l[0:-1] for l in f_in) if line) aaronasterling with line[0:-1]
0.04171 [line[0:-1] for line in f if line[0:-1] ] Felix Kling with line[0:-1]
0.03265 [name for name in f_in.read().splitlines() if name ] a list comprehension with read().splitlines()
0.03638 filter(None, (line[0:-1] for line in f_in)) aaronasterling update 2 with line[0:-1]
0.02198 filter(None, f_in.read().splitlines() aaronasterling update 2 with read().splitlines()
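For anyone wanting to repeat this kind of comparison on Python 3, where `time.clock`, `xrange` and the print statement are gone, a rough sketch with the standard `timeit` module might look like the following. The three candidate functions and the `benchmark` helper are names chosen here for illustration, only a subset of the variants above is covered, and no timings are claimed; numbers will differ by machine and Python version.

```python
import re
import timeit


def with_listcomp(path):
    """List comprehension that strips each line twice."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def with_filter_splitlines(path):
    """filter() over read().splitlines(), wrapped in list for Python 3."""
    with open(path) as f:
        return list(filter(None, f.read().splitlines()))


def with_regex(path):
    """Regex that collects every run of non-whitespace characters."""
    with open(path) as f:
        return re.findall(r"\S+", f.read())


def benchmark(path, repeat=5, number=20):
    """Returns the best total time (seconds for `number` calls) per candidate."""
    candidates = (with_listcomp, with_filter_splitlines, with_regex)
    return {
        fn.__name__: min(timeit.repeat(lambda: fn(path), repeat=repeat, number=number))
        for fn in candidates
    }
```

Calling `benchmark('raa.txt')` on a file like the one described in this answer returns a dict mapping each function name to its best time. As in the original measurements, taking the minimum over several repeats reduces noise from other processes.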
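But this problem is a particular one, and the simplest of all: there is only one name per line. So the solutions are just games with lines, splits and [0:-1] slices.
A regex, on the contrary, doesn't care about lines; it goes straight to the desired data. I consider it a more natural approach, one that applies from the simplest to the more complex cases, and hence often the preferred way to process text.
EDIT
I forgot to say that I use Python 2.7, and that I measured the above times with a file containing the following sequence of names repeated 500 times: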
SMITH
JONES
WILLIAMS
TAYLOR
BROWN
DAVIES
EVANS
WILSON
THOMAS
JOHNSON
ROBERTS
ROBINSON
THOMPSON
WRIGHT
WALKER
WHITE
EDWARDS
HUGHES
GREEN
HALL
LEWIS
HARRIS
CLARKE
PATEL
JACKSON
WOOD
TURNER
MARTIN
COOPER
HILL
WARD
MORRIS
MOORE
CLARK
LEE
KING
BAKER
HARRISON
MORGAN
ALLEN
JAMES
SCOTT
PHILLIPS
WATSON
DAVIS
PARKER
PRICE
BENNETT
YOUNG
GRIFFITHS
MITCHELL
KELLY
COOK
CARTER
RICHARDSON
BAILEY
COLLINS
BELL
SHAW
MURPHY
MILLER
COX
RICHARDS
KHAN
MARSHALL
ANDERSON
SIMPSON
ELLIS
ADAMS
SINGH
BEGUM
WILKINSON
FOSTER
CHAPMAN
POWELL
WEBB
ROGERS
GRAY
MASON
ALI
HUNT
HUSSAIN
CAMPBELL
MATTHEWS
OWEN
PALMER
HOLMES
MILLS
BARNES
KNIGHT
LLOYD
BUTLER
RUSSELL
BARKER
FISHER
STEVENS
JENKINS
MURRAY
DIXON
HARVEY