I'm trying to find an efficient way to parse files that hold fixed-width lines. For example, the first 20 characters represent one column, characters 21 to 30 the next, and so on.
Assuming each line holds 100 characters, what would be an efficient way to parse a line into its components?
I could use string slicing on each line, but that gets a little ugly when the lines are long. Is there some other, faster way?
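(For reference, a rough sketch of the manual per-line slicing being described; the file name and the exact column boundaries here are only placeholders:)

with open('data.txt') as f:        # hypothetical input file
    for line in f:
        col1 = line[0:20]          # first 20 characters
        col2 = line[20:30]         # characters 21-30
        col3 = line[30:100]        # and so on for each remaining column
        # ... repeating this for many columns is what gets ugly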
Rei*_*cke 65
I'm not sure whether this is efficient, but it should be readable (as opposed to slicing by hand). I defined a function slices that takes a string plus the column lengths and returns the substrings. I made it a generator, so for really long lines it doesn't build a temporary list of substrings.
def slices(s, *args):
    position = 0
    for length in args:
        yield s[position:position + length]
        position += length
Example:
In [32]: list(slices('abcdefghijklmnopqrstuvwxyz0123456789', 2))
Out[32]: ['ab']
In [33]: list(slices('abcdefghijklmnopqrstuvwxyz0123456789', 2, 10, 50))
Out[33]: ['ab', 'cdefghijkl', 'mnopqrstuvwxyz0123456789']
In [51]: d,c,h = slices('dogcathouse', 3, 3, 5)
In [52]: d,c,h
Out[52]: ('dog', 'cat', 'house')
However, I think the advantage of the generator is lost if you need all the columns at once. Where it can pay off is when you want to process the columns one by one, say in a loop.
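For instance, assuming the slices generator above, each column could be consumed lazily in a loop without ever materializing the full list (the widths below are just an illustration):

line = 'abcdefghijklmnopqrstuvwxyz0123456789'
for i, column in enumerate(slices(line, 2, 10, 24)):
    print('column', i, '->', column)   # each substring is produced on demand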
mar*_*eau 63
Using the struct module from the Python standard library is both fairly easy and very fast, since it is written in C.
Here is how it could be used to do what you want. It also allows columns of characters to be skipped, by specifying a negative value for the number of characters in a field.
import struct

fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                     for fw in fieldwidths)
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from
print('fmtstring: {!r}, recsize: {} chars'.format(fmtstring, fieldstruct.size))

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)
print('fields: {}'.format(fields))
Output:
fmtstring: '2s 10x 24s', recsize: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
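As a rough illustration of mapping this to the layout described in the question (first 20 characters one column, characters 21-30 the next), something like the following might do. The trailing 70-character field is an assumption to fill out a 100-character record, and the input is given as bytes because in Python 3 struct operates on bytes (the modification below deals with text input):

fieldwidths = (20, 10, 70)          # assumed layout for a 100-char record
fmtstring = ' '.join('{}s'.format(fw) for fw in fieldwidths)
parse = struct.Struct(fmtstring).unpack_from

line = b'A' * 20 + b'B' * 10 + b'C' * 70   # dummy 100-character record
print(parse(line))   # -> (b'AAA...', b'BBBBBBBBBB', b'CCC...')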
The following modification makes it work in both Python 2 and 3 (and handle Unicode input):
import sys

fieldstruct = struct.Struct(fmtstring)
if sys.version_info[0] < 3:
    parse = fieldstruct.unpack_from
else:
    # converts unicode input to a byte string and the results back to unicode strings
    unpack = fieldstruct.unpack_from
    parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))
Here is a way to do it with string slices, as you were considering but were worried might get too ugly. The nice thing about it is that, besides not being all that ugly, it works unchanged in both Python 2 and 3, and it can handle Unicode strings. I haven't benchmarked it, but I suspect it could be competitive with the speed of the struct module version. It could be sped up slightly by stripping out the ability to have padding fields (a simplified variant along those lines is sketched after the output below).
try:
    from itertools import izip_longest  # added in Py 2.6
except ImportError:
    from itertools import zip_longest as izip_longest  # name change in Py 3.x

try:
    from itertools import accumulate  # added in Py 3.2
except ImportError:
    def accumulate(iterable):
        'Return running totals (simplified version).'
        total = next(iterable)
        yield total
        for value in iterable:
            total += value
            yield total

def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths)  # bool values for padding fields
    flds = tuple(izip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
    # optional informational function attributes
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                               for fw in fieldwidths)
    return parse

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24) # negative widths represent ignored padding fields
parse = make_parser(fieldwidths)
fields = parse(line)
print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
print('fields: {}'.format(fields))
Output:
format: '2s 10x 24s', rec size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
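As a sketch of the simplification mentioned above (dropping padding-field support entirely, Python 3 shown, not part of the original answer), make_parser could be reduced to something like this:

from itertools import accumulate

def make_parser_simple(fieldwidths):
    # running end offsets, then (start, end) pairs for plain slicing
    cuts = tuple(accumulate(fieldwidths))
    flds = tuple(zip((0,) + cuts, cuts))
    return lambda line: tuple(line[i:j] for i, j in flds)

parse = make_parser_simple((2, 10, 24))
print(parse('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'))
# ('AB', 'CDEFGHIJKL', 'MNOPQRSTUVWXYZ0123456789')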
Tom*_*m M 22
Here are two more options that are easier and prettier than the solutions already mentioned.
The first is to use pandas:
import pandas as pd
path = 'filename.txt'
# Using Pandas with a column specification
col_specification = [(0, 20), (21, 30), (31, 50), (51, 100)]
data = pd.read_fwf(path, colspecs=col_specification)
The second option uses numpy.loadtxt:
import numpy as np
# Using NumPy and letting it figure it out automagically
data_also = np.loadtxt(path)
It really depends on how you want to use your data afterwards.
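For example, with the pandas route the columns can then be named and used like any other DataFrame; the column names below are invented for illustration:

data = pd.read_fwf(path, colspecs=col_specification,
                   header=None, names=['field_a', 'field_b', 'field_c', 'field_d'])
print(data.head())      # first few parsed records
print(data['field_a'])  # each fixed-width column becomes a regular Series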
Joh*_*hin 12
The code below gives a sketch of what you might want to do if you have some serious fixed-column-width file handling ahead of you.
"Serious" = multiple record types in each of multiple file types, records up to 1000 bytes long, the layout defined by (and the files produced/consumed by) a government department with attitude, layout changes that leave unused columns behind, up to a million records per file, ...
Features: precompiled struct format. Ignores unwanted columns. Converts input strings to the required data types (the sketch omits error handling). Converts records into object instances (or dicts, or named tuples if you prefer).
Code:
import struct, datetime, cStringIO, pprint

# functions for converting input fields to usable data
cnv_text = str.rstrip
cnv_int = int
cnv_date_dmy = lambda s: datetime.datetime.strptime(s, "%d%m%Y")  # ddmmyyyy
# etc

# field specs (field name, start pos (1-relative), len, converter func)
fieldspecs = [
    ('surname', 11, 20, cnv_text),
    ('given_names', 31, 20, cnv_text),
    ('birth_date', 51, 8, cnv_date_dmy),
    ('start_date', 71, 8, cnv_date_dmy),
    ]
fieldspecs.sort(key=lambda x: x[1])  # just in case

# build the format for struct.unpack
unpack_len = 0
unpack_fmt = ""
for fieldspec in fieldspecs:
    start = fieldspec[1] - 1
    end = start + fieldspec[2]
    if start > unpack_len:
        unpack_fmt += str(start - unpack_len) + "x"
    unpack_fmt += str(end - start) + "s"
    unpack_len = end
field_indices = range(len(fieldspecs))
print unpack_len, unpack_fmt
unpacker = struct.Struct(unpack_fmt).unpack_from

class Record(object):
    pass
# or use named tuples

raw_data = """\
....v....1....v....2....v....3....v....4....v....5....v....6....v....7....v....8
          Featherstonehaugh   Algernon Marmaduke  31121969            01012005XX
"""

f = cStringIO.StringIO(raw_data)
headings = f.next()
for line in f:
    # The guts of this loop would of course be hidden away in a function/method
    # and could be made less ugly
    raw_fields = unpacker(line)
    r = Record()
    for x in field_indices:
        setattr(r, fieldspecs[x][0], fieldspecs[x][3](raw_fields[x]))
    pprint.pprint(r.__dict__)
    print "Customer name:", r.given_names, r.surname
Output:
78 10x20s20s8s12x8s
{'birth_date': datetime.datetime(1969, 12, 31, 0, 0),
'given_names': 'Algernon Marmaduke',
'start_date': datetime.datetime(2005, 1, 1, 0, 0),
'surname': 'Featherstonehaugh'}
Customer name: Algernon Marmaduke Featherstonehaugh
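The answer notes that dicts or named tuples could stand in for the Record class; a minimal sketch of the namedtuple variant, reusing the fieldspecs, field_indices, unpacker and raw_data defined above (not part of the original answer), might look like this:

from collections import namedtuple

# namedtuple type whose field order matches the sorted fieldspecs
Customer = namedtuple('Customer', [spec[0] for spec in fieldspecs])

f = cStringIO.StringIO(raw_data)
headings = f.next()
for line in f:
    raw_fields = unpacker(line)
    r = Customer._make(fieldspecs[x][3](raw_fields[x]) for x in field_indices)
    print("Customer name: {0.given_names} {0.surname}".format(r))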