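I wrote a simple application in Java that takes a list of paths and generates a file containing the paths of all files under each entry in that list.
For example, if paths.txt contains: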
c:\folder1\
c:\folder2\
...
...
c:\folder1000\
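My application runs a recursive function on each path, using multiple threads, and returns a file containing the paths of all files under those folders.
Now I want to write this application in Python.
I wrote a simple application that uses os.walk() to traverse a given folder and print the file paths.
Now I want to run it in parallel, and I've seen that Python has modules for this: threading and multiprocessing.
Which is the best way to do this, and how would it be implemented?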
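For reference, the serial os.walk() version described above might look roughly like the sketch below. The function names `list_files` and `write_listing` and the single-output-file layout are assumptions made for illustration, not the asker's actual code.

```python
import os


def list_files(root):
    """Yield the full path of every file under root (serial baseline)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def write_listing(roots, output_path):
    """Write every file path under the given roots to output_path, one per line.

    Returns the number of paths written.
    """
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for root in roots:
            for path in list_files(root):
                out.write(path + "\n")
                count += 1
    return count
```

Having a baseline like this makes it easier to check later that a parallel version returns the same set of paths.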
Ray*_*ger 25
Here is a multiprocessing solution:
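import os
from multiprocessing import JoinableQueue, Pool


def explore_path(path):
    """Write the names of the files directly under path to a text file
    and return the full paths of its subdirectories."""
    directories = []
    nondirectories = []
    for filename in os.listdir(path):
        fullname = os.path.join(path, filename)
        if os.path.isdir(fullname):
            directories.append(fullname)
        else:
            nondirectories.append(filename)
    # one output file per directory, named after the directory path
    outputfile = path.replace(os.sep, '_') + '.txt'
    with open(outputfile, 'w') as f:
        for filename in nondirectories:
            print(filename, file=f)
    return directories


def init_worker(queue):
    # a queue has to reach the pool workers by inheritance,
    # so it is handed over once when each worker starts
    global unsearched
    unsearched = queue


def parallel_worker():
    while True:
        path = unsearched.get()
        try:
            for newdir in explore_path(path):
                unsearched.put(newdir)
        except OSError:
            pass  # skip directories that cannot be read
        finally:
            # always mark the item as done so that join() can return
            unsearched.task_done()


if __name__ == '__main__':
    # acquire the list of paths
    with open('paths.txt') as f:
        paths = [line.strip() for line in f if line.strip()]
    queue = JoinableQueue()
    for path in paths:
        queue.put(path)
    pool = Pool(5, initializer=init_worker, initargs=(queue,))
    for i in range(5):
        pool.apply_async(parallel_worker)
    queue.join()
    pool.terminate()  # the workers loop forever, so stop them explicitly
    print('Done')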
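If the top-level folders are of roughly similar size, a simpler variant is to give each one to a pool worker and let os.walk() do the recursion inside that worker. The following is only a sketch under that assumption, not the approach used in the answer above; `walk_root` and `walk_roots` are hypothetical names.

```python
import os
from multiprocessing import Pool


def walk_root(root):
    """Return the paths of all files under root, walking it in this process."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return found


def walk_roots(roots, processes=5):
    """Walk each root in a separate pool worker and merge the results."""
    with Pool(processes) as pool:
        # map() blocks until every root has been walked
        results = pool.map(walk_root, roots)
    return [path for sublist in results for path in sublist]
```

Call `walk_roots(paths)` from under an `if __name__ == '__main__':` guard so that it also works on platforms that start worker processes by spawning. The trade-off compared with the queue-based version is that one very large folder is still handled by a single worker.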
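Here is a threading pattern in Python that has worked for me. Because of how threads work in CPython (the global interpreter lock), I'm not sure threading will actually improve performance.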
import os
import queue
import threading


class PathThread(threading.Thread):
    def __init__(self, pathqueue):
        threading.Thread.__init__(self)
        self.queue = pathqueue

    def printfiles(self, p):
        for path, dirs, files in os.walk(p):
            for f in files:
                print(os.path.join(path, f))

    def run(self):
        while True:
            path = self.queue.get()
            self.printfiles(path)
            self.queue.task_done()


# thread-safe queue
pathqueue = queue.Queue()
paths = ["foo", "bar", "baz"]

# spawn threads
for i in range(5):
    t = PathThread(pathqueue)
    t.daemon = True  # daemon threads do not keep the program alive at exit
    t.start()

# add paths to queue
for path in paths:
    pathqueue.put(path)

# wait for queue to get empty
pathqueue.join()
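Whether threads help here is worth measuring rather than guessing. CPython generally releases the global interpreter lock during blocking I/O calls, so threads may still pay off when the walk is limited by the disk or the network. Below is a rough timing sketch; it swaps the hand-written thread class for concurrent.futures.ThreadPoolExecutor, and `count_files`, `time_serial` and `time_threaded` are hypothetical helper names.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def count_files(root):
    """Count the files under root with a plain os.walk."""
    return sum(len(files) for _path, _dirs, files in os.walk(root))


def time_serial(roots):
    """Walk the roots one after another; return (total files, seconds)."""
    start = time.perf_counter()
    total = sum(count_files(root) for root in roots)
    return total, time.perf_counter() - start


def time_threaded(roots, workers=5):
    """Walk the roots on a thread pool; return (total files, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(count_files, roots))
    return total, time.perf_counter() - start
```

Run both on the same list of folders and compare the elapsed times. Repeated runs can be skewed by the operating system's file cache, so the first run on a cold cache is usually the most telling.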