use*_*514 95 python recursion list os.walk
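I'm writing a script that recursively walks the subfolders of a main folder and builds a list of files of a certain type. I'm having a problem with the script. It is currently set up as follows: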
for root, subFolder, files in os.walk(PATH):
    for item in files:
        if item.endswith(".txt") :
            fileNamePath = str(os.path.join(root,subFolder,item))
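The problem is that the subFolder variable pulls in a list of subfolders rather than the folder in which the ITEM file is located. I was thinking of running a for loop over the subfolders first and joining the first part of the path, but I figured I'd double-check to see whether anyone has a suggestion before I do that. Thanks for your help!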
Joh*_*ooy 136
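You should be using the dirpath, which you call root. The dirnames are supplied so that you can prune the list if there are folders you don't want os.walk to recurse into.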
import os
result = [os.path.join(dp, f) for dp, dn, filenames in os.walk(PATH) for f in filenames if os.path.splitext(f)[1] == '.txt']
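The pruning mentioned above can be sketched as follows. The helper name find_txt_files, the skip_dirs parameter, and the folder names in the demo are assumptions made up for illustration; the only part taken from os.walk itself is that editing dirnames in place (with the default topdown=True) controls which folders are visited.

```python
import os
import tempfile

def find_txt_files(top, skip_dirs=()):
    """Collect .txt paths under top, skipping any directory named in skip_dirs."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(top):
        # Slice assignment edits the very list os.walk uses for recursion;
        # rebinding the name (dirnames = [...]) would have no effect.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            if os.path.splitext(name)[1] == ".txt":
                matches.append(os.path.join(dirpath, name))
    return matches

# Demo on a throwaway tree (hypothetical folder names)
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "keep"))
    os.makedirs(os.path.join(top, "cache"))
    for rel in ("keep/a.txt", "cache/b.txt", "c.txt"):
        open(os.path.join(top, *rel.split("/")), "w").close()
    found = sorted(os.path.relpath(p, top)
                   for p in find_txt_files(top, skip_dirs={"cache"}))
    print(found)
```

The file under "cache" is never reached, because os.walk does not descend into a directory that has been removed from dirnames.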
Edit:
After the latest downvote, it occurred to me that glob is a better tool for selecting by extension.
import os
from glob import glob
result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]
Also a generator version:
from itertools import chain
result = (chain.from_iterable(glob(os.path.join(x[0], '*.txt')) for x in os.walk('.')))
Edit2 for Python 3.4+
from pathlib import Path
result = list(Path(".").rglob("*.[tT][xX][tT]"))
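A small sketch of what the bracketed pattern above does, run against a temporary tree. The file names are invented for the demo, and the suffix-based variant is an alternative added here for comparison, not part of the answer.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as top:
    root = Path(top)
    (root / "sub").mkdir()
    for rel in ("a.txt", "sub/b.TXT", "sub/c.md"):
        (root / rel).touch()

    # Each bracket pair matches either case of a single letter
    matched = sorted(p.name for p in root.rglob("*.[tT][xX][tT]"))

    # Alternative: match everything and compare the lowercased suffix
    by_suffix = sorted(p.name for p in root.rglob("*")
                       if p.suffix.lower() == ".txt")

print(matched)
print(by_suffix)
```

Both approaches pick up a.txt and b.TXT here; the suffix check is arguably easier to read once more than one extension is involved.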
Rot*_*eti 76
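Changed in Python 3.5: support for recursive globs using "**".
glob.glob() got a new recursive parameter.
If you want to get every .txt file under my_path (recursively, including subdirectories):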
import glob
files = glob.glob(my_path + '/**/*.txt', recursive=True)
# my_path/ the dir
# **/ every file and dir under my_path
# *.txt every file that ends with '.txt'
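To make the effect of the recursive flag concrete, here is a minimal sketch on a temporary tree (the file and folder names are made up). It relies on two documented behaviours of glob: with recursive=True, "**" matches zero or more directory levels, and without it "**" acts like a plain "*".

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as my_path:
    os.makedirs(os.path.join(my_path, "a", "b"))
    for rel in ("top.txt", "a/mid.txt", "a/b/deep.txt", "a/b/skip.md"):
        open(os.path.join(my_path, *rel.split("/")), "w").close()

    pattern = os.path.join(my_path, "**", "*.txt")

    # "**" spans zero or more directories, so top.txt is included as well
    recursive_hits = sorted(os.path.relpath(p, my_path)
                            for p in glob.glob(pattern, recursive=True))

    # Without recursive=True the same pattern only looks one level down
    flat_hits = sorted(os.path.relpath(p, my_path)
                       for p in glob.glob(pattern))

print(recursive_hits)
print(flat_hits)
```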
If you need an iterator, you can use iglob as an alternative:
for file in glob.iglob(my_path + '/**/*.txt', recursive=True):
    print(file)  # handle each matching path here
use*_*036 36
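This seems to be the fastest solution I could come up with; it is faster than os.walk and a lot faster than any glob solution.
You can choose to return either full paths or just the file names by changing f.path to f.name (do not change it for subfolders!).
Args: dir: str, ext: list.
The function returns two lists: subfolders, files.
See below for a detailed speed analysis.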
import os

def run_fast_scandir(dir, ext):    # dir: str, ext: list
    subfolders, files = [], []

    # Sort the entries of this directory into subfolders and matching files
    for f in os.scandir(dir):
        if f.is_dir():
            subfolders.append(f.path)
        if f.is_file():
            if os.path.splitext(f.name)[1].lower() in ext:
                files.append(f.path)

    # Recurse into a copy of the list, because it is extended while looping
    for dir in list(subfolders):
        sf, f = run_fast_scandir(dir, ext)
        subfolders.extend(sf)
        files.extend(f)
    return subfolders, files

subfolders, files = run_fast_scandir(folder, [".jpg"])
If you need the file sizes, you can also create a sizes list and append f.stat().st_size to it, like this to display MiB:
sizes.append(f"{f.stat().st_size/1024/1024:.0f} MiB")
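One way the size collection could be folded into the scan is sketched below. The function name scan_with_sizes and its return shape (a list of (path, size) pairs) are assumptions for illustration, not the answer's own code; it keeps raw byte counts and leaves formatting to the caller.

```python
import os
import tempfile

def scan_with_sizes(top, ext):
    """Return (path, size_in_bytes) pairs for files under top whose extension is in ext."""
    results = []
    for entry in os.scandir(top):
        if entry.is_dir():
            results.extend(scan_with_sizes(entry.path, ext))
        elif entry.is_file() and os.path.splitext(entry.name)[1].lower() in ext:
            # DirEntry.stat() may reuse data already fetched by scandir
            results.append((entry.path, entry.stat().st_size))
    return results

# Demo: one 2048-byte file in a nested folder (hypothetical names)
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "sub"))
    with open(os.path.join(top, "sub", "photo.jpg"), "wb") as fh:
        fh.write(b"\0" * 2048)
    pairs = scan_with_sizes(top, [".jpg"])
    for path, size in pairs:
        print(f"{os.path.basename(path)}: {size / 1024 / 1024:.3f} MiB")
```

Storing bytes rather than a preformatted string makes it possible to sort or sum the sizes later.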
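Speed analysis
for various methods of getting all files with a specific file extension inside all subfolders and the main folder.
tl;dr:
fast_scandir clearly wins and is roughly twice as fast as all other solutions, except os.walk.
os.walk comes second and is slightly slower.
Using glob will greatly slow down the process.
fast_scandir took 499 ms. Found files: 16596. Found subfolders: 439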
os.walk took 589 ms. Found files: 16596
find_files took 919 ms. Found files: 16596
glob.iglob took 998 ms. Found files: 16596
glob.glob took 1002 ms. Found files: 16596
pathlib.rglob took 1041 ms. Found files: 16596
os.walk-glob took 1043 ms. Found files: 16596
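Tests were done with W7x64, Python 3.8.1, 20 runs. 16596 files in 439 (partially nested) subfolders.
find_files is from /sf/answers/3195245021/ and lets you search for several extensions.
fast_scandir was written by myself and also returns a list of subfolders. You can give it a list of extensions to search for (I tested a list with one entry against a simple if ... == ".jpg" and there was no significant difference).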
# -*- coding: utf-8 -*-
# Python 3

import time
import os
from glob import glob, iglob
from pathlib import Path

directory = r"<folder>"
RUNS = 20


def run_os_walk():
    a = time.time_ns()
    for i in range(RUNS):
        fu = [os.path.join(dp, f) for dp, dn, filenames in os.walk(directory) for f in filenames if
              os.path.splitext(f)[1].lower() == '.jpg']
    print(f"os.walk\t\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_os_walk_glob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = [y for x in os.walk(directory) for y in glob(os.path.join(x[0], '*.jpg'))]
    print(f"os.walk-glob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_glob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = glob(os.path.join(directory, '**', '*.jpg'), recursive=True)
    print(f"glob.glob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_iglob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = list(iglob(os.path.join(directory, '**', '*.jpg'), recursive=True))
    print(f"glob.iglob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_pathlib_rglob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = list(Path(directory).rglob("*.jpg"))
    print(f"pathlib.rglob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def find_files(files, dirs=[], extensions=[]):
    # /sf/answers/3195245021/

    new_dirs = []
    for d in dirs:
        try:
            new_dirs += [ os.path.join(d, f) for f in os.listdir(d) ]
        except OSError:
            if os.path.splitext(d)[1].lower() in extensions:
                files.append(d)

    if new_dirs:
        find_files(files, new_dirs, extensions )
    else:
        return


def run_fast_scandir(dir, ext):    # dir: str, ext: list
    # /sf/answers/4186265541/

    subfolders, files = [], []

    for f in os.scandir(dir):
        if f.is_dir():
            subfolders.append(f.path)
        if f.is_file():
            if os.path.splitext(f.name)[1].lower() in ext:
                files.append(f.path)

    for dir in list(subfolders):
        sf, f = run_fast_scandir(dir, ext)
        subfolders.extend(sf)
        files.extend(f)
    return subfolders, files


if __name__ == '__main__':
    run_os_walk()
    run_os_walk_glob()
    run_glob()
    run_iglob()
    run_pathlib_rglob()

    a = time.time_ns()
    for i in range(RUNS):
        files = []
        find_files(files, dirs=[directory], extensions=[".jpg"])
    print(f"find_files\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}")

    a = time.time_ns()
    for i in range(RUNS):
        subf, files = run_fast_scandir(directory, [".jpg"])
    print(f"fast_scandir\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}. Found subfolders: {len(subf)}")
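Before comparing timings it can be worth confirming that the candidates actually return the same set of files. The sketch below is not the benchmark above: the helper names (via_walk, via_glob, via_rglob, timed) and the tiny temporary tree are assumptions for illustration, and time.perf_counter is used in place of time.time_ns as one reasonable choice for measuring intervals.

```python
import glob
import os
import tempfile
import time
from pathlib import Path

def via_walk(top):
    return {os.path.join(dp, f) for dp, _, fs in os.walk(top) for f in fs
            if os.path.splitext(f)[1].lower() == ".jpg"}

def via_glob(top):
    return set(glob.glob(os.path.join(top, "**", "*.jpg"), recursive=True))

def via_rglob(top):
    return {str(p) for p in Path(top).rglob("*.jpg")}

def timed(func, top, runs=5):
    """Return (average milliseconds per run, result of the last run)."""
    start = time.perf_counter()
    for _ in range(runs):
        result = func(top)
    return (time.perf_counter() - start) / runs * 1000, result

# Three small folders with one lowercase .jpg each (hypothetical layout)
with tempfile.TemporaryDirectory() as top:
    for i in range(3):
        os.makedirs(os.path.join(top, f"d{i}"))
        open(os.path.join(top, f"d{i}", f"img{i}.jpg"), "w").close()
    timings = {f.__name__: timed(f, top) for f in (via_walk, via_glob, via_rglob)}

results = [found for _, found in timings.values()]
for name, (ms, found) in timings.items():
    print(f"{name}: {ms:.2f} ms, {len(found)} files")
```

Note that the glob patterns are case-sensitive on POSIX systems while the os.walk variant lowercases the extension, so on a tree containing .JPG files the sets may legitimately differ.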
Jef*_*ima 17
I'll translate John La Rooy's list comprehension into nested for loops, in case anyone else has trouble understanding it.
result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]
Should be equivalent to:
import glob
import os

result = []

for x in os.walk(PATH):
    for y in glob.glob(os.path.join(x[0], '*.txt')):
        result.append(y)
Here is the documentation for list comprehensions and for the functions os.walk and glob.glob.
Las*_*yes 11
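Your original solution was very nearly correct, but the variable "root" is updated dynamically as os.walk works its way through the tree. os.walk() is a recursive generator. Each tuple (root, subFolder, files) belongs to one specific root, the way you have it set up.
I.e.: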
root = 'C:\\'
subFolder = ['Users', 'ProgramFiles', 'ProgramFiles (x86)', 'Windows', ...]
files = ['foo1.txt', 'foo2.txt', 'foo3.txt', ...]
root = 'C:\\Users\\'
subFolder = ['UserAccount1', 'UserAccount2', ...]
files = ['bar1.txt', 'bar2.txt', 'bar3.txt', ...]
...
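The same sequence of tuples can be observed on a small temporary tree. The folder and file names below are borrowed from the illustration above but are otherwise made up, and the paths are printed relative to the starting folder to keep the output short.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "Users", "UserAccount1"))
    open(os.path.join(top, "foo1.txt"), "w").close()
    open(os.path.join(top, "Users", "bar1.txt"), "w").close()

    visited = []
    for root, subFolder, files in os.walk(top):
        rel = os.path.relpath(root, top)
        # One tuple per directory: its path, its child folders, its files
        visited.append((rel, sorted(subFolder), sorted(files)))
        print(rel, subFolder, files)
```

Each iteration describes exactly one directory, which is why joining root with a file name from the same iteration gives a valid path.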
I made a slight tweak to your code to print a full list.
import os
for root, subFolder, files in os.walk(PATH):
    for item in files:
        if item.endswith(".txt") :
            fileNamePath = str(os.path.join(root,item))
            print(fileNamePath)
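Hope this helps!
EDIT: (based on feedback)
OP misunderstood/mislabeled the subFolder variable: it actually holds all the subfolders in "root". Because of this, OP, you are trying to call os.path.join(str, list, str), which probably doesn't work out the way you expected.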
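A minimal sketch of what that call does in Python 3, using made-up values shaped like the ones os.walk would yield: passing a list as a path component raises a TypeError, while joining only the root and the file name works.

```python
import os

root = "C:\\Users"
subFolder = ["UserAccount1", "UserAccount2"]  # a list, as os.walk yields it
item = "bar1.txt"

try:
    os.path.join(root, subFolder, item)
    error = None
except TypeError as exc:
    # os.path.join only accepts str, bytes or path-like components
    error = exc

print(type(error).__name__)

# Joining just the directory and the file name is what was intended
good = os.path.join(root, item)
print(good)
```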
To help add clarity, you could try this naming scheme:
import os
for current_dir_path, current_subdirs, current_files in os.walk(RECURSIVE_ROOT):
    for aFile in current_files:
        if aFile.endswith(".txt") :
            txt_file_path = str(os.path.join(current_dir_path, aFile))
            print(txt_file_path)
You can return a list of files with their absolute paths this way.
import os


def list_files_recursive(path):
    """
    Function that receives as a parameter a directory path
    :return list_: File List and Its Absolute Paths
    """
    files = []

    # r = root, d = directories, f = files
    for r, d, f in os.walk(path):
        for file in f:
            # abspath makes the result absolute even if path is relative
            files.append(os.path.abspath(os.path.join(r, file)))

    return files


if __name__ == '__main__':
    result = list_files_recursive('/tmp')
    print(result)
It's not the most Pythonic answer, but I'll put it here for fun because it's a neat lesson in recursion.
import os

def find_files( files, dirs=[], extensions=[]):
    new_dirs = []
    for d in dirs:
        try:
            # Listing succeeds only for directories: queue up their entries
            new_dirs += [ os.path.join(d, f) for f in os.listdir(d) ]
        except OSError:
            # Not a directory, so treat it as a file and check its extension
            if os.path.splitext(d)[1] in extensions:
                files.append(d)

    if new_dirs:
        find_files(files, new_dirs, extensions )
    else:
        return
On my machine I have two folders, root and root2:
mender@multivax ]ls -R root root2
root:
temp1 temp2
root/temp1:
temp1.1 temp1.2
root/temp1/temp1.1:
f1.mid
root/temp1/temp1.2:
f.mi f.mid
root/temp2:
tmp.mid
root2:
dummie.txt temp3
root2/temp3:
song.mid
Let's say I want to find all .txt and all .mid files in either of these directories; then I can just do:
files = []
find_files( files, dirs=['root','root2'], extensions=['.mid','.txt'] )
print(files)
#['root2/dummie.txt',
# 'root/temp2/tmp.mid',
# 'root2/temp3/song.mid',
# 'root/temp1/temp1.1/f1.mid',
# 'root/temp1/temp1.2/f.mid']
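For comparison, the same multi-folder, multi-extension search can be sketched as a generator on top of os.walk. This is a different technique from the recursive function above, and the name iter_files is an assumption for illustration. It matches on the end of the file name with str.endswith, which accepts a tuple of suffixes, so edge cases such as a file literally named ".mid" may be treated differently than with os.path.splitext.

```python
import os
import tempfile

def iter_files(dirs, extensions):
    """Yield paths under each directory in dirs whose name ends with one of extensions."""
    suffixes = tuple(extensions)  # str.endswith accepts a tuple of suffixes
    for top in dirs:
        for dirpath, _, filenames in os.walk(top):
            for name in filenames:
                if name.endswith(suffixes):
                    yield os.path.join(dirpath, name)

# Demo on a reduced copy of the layout shown above
with tempfile.TemporaryDirectory() as base:
    layout = ("root/temp2/tmp.mid", "root/temp1/temp1.2/f.mi", "root2/dummie.txt")
    for rel in layout:
        path = os.path.join(base, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    tops = [os.path.join(base, "root"), os.path.join(base, "root2")]
    found_names = sorted(os.path.basename(p)
                         for p in iter_files(tops, [".mid", ".txt"]))

print(found_names)
```

Because it yields paths lazily, the caller can stop early or stream results without building the full list first.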
The new pathlib library simplifies this to one line:
from pathlib import Path
result = list(Path(PATH).glob('**/*.txt'))
You can also use the generator version:
from pathlib import Path
for file in Path(PATH).glob('**/*.txt'):
    pass
This returns Path objects, which you can use for pretty much anything, or you can get the file name as a string via file.name.
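A short sketch of a few things a returned Path object offers, run against a temporary folder with one made-up file. All attributes used here (name, stem, suffix, read_text) are standard pathlib members; os.fspath is one way to get the full path as a plain string.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as top:
    (Path(top) / "notes").mkdir()
    (Path(top) / "notes" / "todo.txt").write_text("buy milk")

    for file in Path(top).glob("**/*.txt"):
        name = file.name            # file name with extension
        stem = file.stem            # file name without extension
        suffix = file.suffix        # extension including the dot
        as_text = os.fspath(file)   # full path as a plain str
        content = file.read_text()  # Path objects can read themselves
        print(name, stem, suffix, content)
```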