How to process only new (unprocessed) files in Linux

Cil*_*vic 5 python linux bash

Given a directory containing a huge number of small files (> 1 million), what is a fast way to remember which files have already been processed (for a database import)?

The first solution I tried was a bash script:

#find all gz files
for f in $(find "$rawdatapath" -name '*.gz'); do
    filename=$(basename "$f")

    #check whether the filename is already contained in the processed list
    onlist=$(grep "$filename" "$processed_files")
    if [[ -z $onlist ]]
        then
            echo "processing, new: $filename"
            #unzip file and import into mongodb

            #write filename into processed list
            echo "$filename" #>> $processed_files
    fi
done

For a smaller sample (160k files) this took about 8 minutes (without doing any actual processing), presumably because a fresh grep process is spawned and the whole processed list rescanned for every single file.

Next I tried a Python script:

import os

path = "/home/b2blogin/webapps/mongodb/rawdata/segment_slideproof_testing"
processed_files_file = os.path.join(path,"processed_files.txt")
processed_files = [line.strip() for line in open(processed_files_file)]

with open(processed_files_file, "a") as pff:
  for root, dirs, files in os.walk(path):
      for file in files:
          if file.endswith(".gz"):
              if file not in processed_files:
                  pff.write("%s\n" % file)

This finishes in under 2 minutes.

Is there a clearly faster way that I'm missing?

Other options:

  • Since I use s3sync to download new files, moving processed files to a different location is not convenient.
  • Since the files have a timestamp as part of their name, I might rely on processing them in order and only compare each name against a "last processed" date.
  • Alternatively, I could keep track of when processing last ran and only process files modified since then (a sketch follows this list).
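
For that last option, a minimal sketch of the modification-time idea (the last_run.stamp marker file is an assumed convention of mine, not something from the question):

import os

path = "/home/b2blogin/webapps/mongodb/rawdata/segment_slideproof_testing"
stamp_file = os.path.join(path, "last_run.stamp")  # hypothetical marker file

# mtime of the marker from the previous run; 0.0 means "process everything".
last_run = os.path.getmtime(stamp_file) if os.path.exists(stamp_file) else 0.0

for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(".gz"):
            full_path = os.path.join(root, name)
            if os.path.getmtime(full_path) > last_run:
                print("processing, new: %s" % name)
                # unzip file and import into mongodb here

# Touch the marker so the next run only looks at files newer than now.
with open(stamp_file, "a"):
    pass
os.utime(stamp_file, None)

Note that this only works if new files actually arrive with fresh mtimes; a sync tool that preserves the original timestamps would let files slip past the marker.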

Dan*_*iel 6

Just use a set:

import os

path = "/home/b2blogin/webapps/mongodb/rawdata/segment_slideproof_testing"
processed_files_file = os.path.join(path,"processed_files.txt")

# A set gives O(1) average-time membership tests, unlike a list.
processed_files = set(line.strip() for line in open(processed_files_file))

with open(processed_files_file, "a") as pff:
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(".gz"):
                if file not in processed_files:
                    pff.write("%s\n" % file)
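
The only change from the question's script is building processed_files as a set instead of a list. With a list, every "file not in processed_files" test scans the whole list, making the loop quadratic in the number of files; a set does a hash lookup in constant average time, which is why the same directory traversal finishes so much faster.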