Given a directory containing a huge number of small files (> 1 million), what is a fast way to remember which files have already been processed (for a database import)?
The first solution I tried was a bash script:
# find all .gz files
for f in $(find "$rawdatapath" -name '*.gz'); do
    filename=$(basename "$f")
    # check whether the filename is already on the processed list
    # (-F: literal match, -x: whole line, to avoid substring false positives)
    onlist=$(grep -Fx "$filename" "$processed_files")
    if [[ -z $onlist ]]; then
        echo "processing, new: $filename"
        # unzip file and import into mongodb
        # write filename into processed list
        echo "$filename" #>> $processed_files
    fi
done
On a smaller sample (160k files) this takes about 8 minutes (without doing any actual processing).
Next I tried a python script:
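Much of that time goes into spawning one `grep` process per file and rescanning the whole list each time. A single-pass sketch (assuming GNU `find` for `-printf`, and hypothetical temp paths standing in for `$rawdatapath` and `$processed_files`) computes all new filenames at once:

```shell
# Demo setup with hypothetical paths: a data dir and a processed list.
rawdatapath=$(mktemp -d)
processed_files=$(mktemp)
touch "$rawdatapath/a.gz" "$rawdatapath/b.gz" "$rawdatapath/c.txt"
printf 'a.gz\n' > "$processed_files"

# One pass: print every .gz basename, drop the ones already recorded
# (-F literal, -x whole line, -v invert, -f patterns from file),
# then print the remainder and append it to the processed list.
find "$rawdatapath" -name '*.gz' -printf '%f\n' \
  | grep -Fxvf "$processed_files" \
  | tee -a "$processed_files"
```

This replaces 160k `grep` invocations with one, at the cost of holding the pattern list in memory.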
import os

path = "/home/b2blogin/webapps/mongodb/rawdata/segment_slideproof_testing"
processed_files_file = os.path.join(path, "processed_files.txt")
processed_files = [line.strip() for line in open(processed_files_file)]

with open(processed_files_file, "a") as pff:
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(".gz"):
                if file not in processed_files:
                    pff.write("%s\n" % file)
This completes in under 2 minutes.
Is there an obviously faster way I'm missing?
Answer:
Just use a set:
import os

path = "/home/b2blogin/webapps/mongodb/rawdata/segment_slideproof_testing"
processed_files_file = os.path.join(path, "processed_files.txt")
processed_files = set(line.strip() for line in open(processed_files_file))

with open(processed_files_file, "a") as pff:
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(".gz"):
                if file not in processed_files:
                    pff.write("%s\n" % file)
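The one-character-visible change matters because `file not in processed_files` is an O(n) scan for a list but an O(1) hash lookup for a set. A minimal self-contained sketch (with hypothetical filenames standing in for the real processed list) makes the gap measurable:

```python
import timeit

# Hypothetical filenames standing in for the processed list.
names = ["file_%06d.gz" % i for i in range(100_000)]
as_list = list(names)
as_set = set(names)

probe = "file_099999.gz"  # worst case for the list: the last element

# Time 100 membership tests against each container.
t_list = timeit.timeit(lambda: probe in as_list, number=100)
t_set = timeit.timeit(lambda: probe in as_set, number=100)

# Set lookup hashes once; list lookup compares up to 100k strings.
print(t_set < t_list)
```

With a million files, the list version performs up to 10^12 string comparisons in total, which is where most of the runtime goes; the set version makes that term effectively constant per file.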