Extracting data from a file in Python

Mic*_*Lee 1 python file-io

I'm trying to extract the blocks that contain a specific line ("extractme") from a text file like this:

----
data1
data1
data1
extractme
----
data2
data2
data2
----
data3
data3
extractme
----

and then dump them to a text file:

----
data1
data1
data1
extractme
---
data3
data3
extractme
---

Thanks for your help.

Pet*_*ons 5

This works well for me. Your sample data goes in a file named "data.txt", and the output goes to "result.txt":

# Buffer each data set; write it out only if it contained an "extractme" line.
with open("data.txt") as inFile, open("result.txt", "w") as outFile:
    buffer = []
    keepCurrentSet = True  # True initially so the leading "----" is written
    for line in inFile:
        buffer.append(line)
        if line.startswith("----"):
            # "----" closes the current data set
            if keepCurrentSet:
                outFile.write("".join(buffer))
            # now reset our state
            keepCurrentSet = False
            buffer = []
        elif line.startswith("extractme"):
            keepCurrentSet = True
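
The same buffer-and-flag logic can be tried without creating any files. This is only a sketch: the function name `filter_blocks` and its `delimiter` and `marker` parameters are hypothetical (they are not part of the answer above), and it accepts any iterable of lines, here an `io.StringIO` object.

```python
import io

def filter_blocks(lines, delimiter="----", marker="extractme"):
    """Return the text of every block that contains a marker line.

    Hypothetical helper: same buffer-and-flag approach as above, but it
    reads from an iterable of lines and returns a string.
    """
    kept = []
    buffer = []
    keep_current = True  # True at the start so the leading delimiter is kept
    for line in lines:
        buffer.append(line)
        if line.startswith(delimiter):
            # A delimiter closes the current block: keep it or drop it.
            if keep_current:
                kept.extend(buffer)
            keep_current = False
            buffer = []
        elif line.startswith(marker):
            keep_current = True
    return "".join(kept)

sample = io.StringIO(
    "----\ndata1\nextractme\n----\ndata2\n----\ndata3\nextractme\n----\n"
)
result = filter_blocks(sample)
print(result)
```

As with the file-based version, a final block that is not followed by a delimiter line is never flushed, so the input is assumed to end with "----".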


Ale*_*lli 5

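I imagine the change in the number of dashes (4 in the input, sometimes 4 and sometimes 3 in the output) is an error and not actually desired, since no algorithm is even hinted at that would explain how many dashes to output on different occasions.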

I would structure the task in terms of reading and yielding one block of lines at a time:

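def readbyblock(f):
    block = []
    for line in f:
        if line == '----\n':
            # A separator ends the current block; skip empty blocks so a
            # leading separator (as in the sample input) does not stop us.
            if block:
                yield block
            block = []
        else:
            block.append(line)
    # Yield a final block that has no closing separator.
    if block:
        yield block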

This way the (selective) output can be neatly separated from the input:

with open('infile.txt') as fin:
    with open('oufile.txt', 'w') as fou:
        for block in readbyblock(fin):
            if 'extractme\n' in block:
                fou.writelines(block)
                fou.write('----\n')

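If the blocks are large, this is not optimal performance-wise, because the `in` test in the `if` clause implies a separate loop over all the lines in the block. So a good refactoring might be: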

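def selectivereadbyblock(f, marker='extractme\n'):
    block = []
    extract = False
    for line in f:
        if line == '----\n':
            # Separator: yield the finished block only if it held the marker.
            if block and extract:
                yield block
            block = []
            extract = False
        else:
            block.append(line)
            if line == marker:
                extract = True
    # Handle a final block that has no closing separator.
    if block and extract:
        yield block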

with open('infile.txt') as fin:
    with open('oufile.txt', 'w') as fou:
        for block in selectivereadbyblock(fin):
            fou.writelines(block)
            fou.write('----\n')

Parameterizing the separator (currently hard-coded as '----\n' for both input and output) is another reasonable coding tweak.
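
One way that parameterization might look is sketched below, assuming the input and output separators should be configurable independently. The names `read_marked_blocks`, `copy_marked_blocks`, `in_sep`, and `out_sep` are hypothetical and do not come from the answer. The demonstration uses `io.StringIO` objects in place of real files.

```python
import io

def read_marked_blocks(f, marker='extractme\n', in_sep='----\n'):
    """Yield each block of lines (delimited by in_sep) that contains marker."""
    block = []
    extract = False
    for line in f:
        if line == in_sep:
            if block and extract:
                yield block
            block = []
            extract = False
        else:
            block.append(line)
            if line == marker:
                extract = True
    # A final block without a closing separator is still considered.
    if block and extract:
        yield block

def copy_marked_blocks(fin, fou, marker='extractme\n',
                       in_sep='----\n', out_sep='----\n'):
    """Write every marked block from fin to fou, each followed by out_sep."""
    for block in read_marked_blocks(fin, marker, in_sep):
        fou.writelines(block)
        fou.write(out_sep)

# In-memory demonstration: '====' separates input blocks, '---' output blocks.
fin = io.StringIO("====\na\nextractme\n====\nb\n====\nc\nextractme\n")
fou = io.StringIO()
copy_marked_blocks(fin, fou, in_sep='====\n', out_sep='---\n')
output = fou.getvalue()
print(output)
```

Keeping the two separators as distinct parameters would also cover the case in the question, where the output uses fewer dashes than the input, if that difference turns out to be intended.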