This is a commonly asked question.
The scenario is:
folderA____folderA1____folderA1a
       \____folderA2____folderA2a
                    \___folderA2b
...and the question is how to list all the files in all the folders under the root, folderA.
pin*_*yid 26
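The first thing to understand is that in Google Drive, a folder is not a folder!
We are all used to the idea of folders (aka directories) in Windows/nix etc. In the real world, a folder is a container into which documents are placed. Smaller folders can also be placed inside bigger folders, so the big folder can be thought of as containing all of the documents inside its smaller child folders.
However, in Google Drive, a folder is NOT a container, so much so that in the first release of Google Drive they weren't even called folders; they were called Collections. A folder is simply a file with (a) no content and (b) a special mime type (application/vnd.google-apps.folder). Folders are used in exactly the same way that tags (aka labels) are used. The best way to understand this is to consider GMail. If you look at the top of an open mail item, you see two icons: a folder with the tooltip "Move to" and a label with the tooltip "Labels". Click on either of these and the same dialogue box appears, and it is all about labels. Your labels are listed down the left-hand side, in a tree display that looks a lot like folders. Importantly, a mail item can have multiple labels, or you could say, a mail item can be in multiple folders. Google Drive's folders work in exactly the same way as GMail labels.
Having established that a folder is simply a label, there is nothing stopping you from organising your labels in a hierarchy that resembles a folder tree; in fact this is the most common way of doing it.
It should now be clear that a file (let's call it MyFile) in folderA2b is NOT a child or grandchild of folderA. It is simply a file with a label (confusingly called a parent) of "folderA2b".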
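To make the label idea concrete, here is a minimal sketch using in-memory records shaped roughly like Drive file resources. The IDs and the helper `is_labelled_with` are made up for illustration; nothing here calls the Drive API.

```python
FOLDER_MIME = "application/vnd.google-apps.folder"

# Hypothetical metadata records: every item, folder or not, just lists its parent IDs.
items = {
    "idA":    {"name": "folderA",   "mimeType": FOLDER_MIME,  "parents": []},
    "idA2":   {"name": "folderA2",  "mimeType": FOLDER_MIME,  "parents": ["idA"]},
    "idA2b":  {"name": "folderA2b", "mimeType": FOLDER_MIME,  "parents": ["idA2"]},
    "idFile": {"name": "MyFile",    "mimeType": "text/plain", "parents": ["idA2b"]},
}


def is_labelled_with(item_id, folder_id):
    """True only if folder_id appears directly in the item's parents list."""
    return folder_id in items[item_id]["parents"]


print(is_labelled_with("idFile", "idA2b"))  # MyFile carries the folderA2b label
print(is_labelled_with("idFile", "idA"))    # but knows nothing about folderA
```

The point is that MyFile's own metadata never mentions folderA, so a single `'idA' in parents` query cannot find it.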
OK, so how DO I get all the files "under" folderA?
Alternative 1. Recursion
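The temptation would be to list the children of folderA, then for any children that are folders, recursively list their children, rinse, repeat. In a very small number of cases this might be the best approach, but for most it has one big problem: it needs a server round trip for every subfolder, which is very slow unless you can guarantee that the tree is small.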
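A rough sketch of Alternative 1, to show where the cost comes from. `list_children` is a hypothetical callable standing in for one `files.list` request with `q="'<id>' in parents"`; here it is backed by an in-memory fake of the tree from the question so the number of calls can be counted.

```python
FOLDER_MIME = "application/vnd.google-apps.folder"


def walk(folder_id, list_children):
    """Return all non-folder items under folder_id, making one call per folder."""
    found = []
    for item in list_children(folder_id):
        if item["mimeType"] == FOLDER_MIME:
            found.extend(walk(item["id"], list_children))
        else:
            found.append(item)
    return found


# In-memory stand-in for Drive: folder ID -> direct children.
FAKE_TREE = {
    "A":   [{"id": "A1", "mimeType": FOLDER_MIME}, {"id": "A2", "mimeType": FOLDER_MIME}],
    "A1":  [{"id": "A1a", "mimeType": FOLDER_MIME}],
    "A1a": [],
    "A2":  [{"id": "A2a", "mimeType": FOLDER_MIME}, {"id": "A2b", "mimeType": FOLDER_MIME}],
    "A2a": [],
    "A2b": [{"id": "MyFile", "mimeType": "text/plain"}],
}
calls = []


def fake_list_children(folder_id):
    calls.append(folder_id)  # each append represents one HTTP round trip
    return FAKE_TREE[folder_id]


files = walk("A", fake_list_children)
print([f["id"] for f in files], "found with", len(calls), "calls")
```

Even for this tiny tree the walk makes one call per folder, and against the real API each of those would be a network round trip (more if a folder's listing is paginated).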
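Alternative 2. The common parent
This works best if all of the files are being created by your app (i.e. you are using the drive.file scope). As well as the folder hierarchy above, create a dummy parent folder called, say, "MyAppCommonParent". As you create each file as a child of its particular folder, you also make it a child of MyAppCommonParent. This becomes a lot more intuitive if you remember to think of folders as labels. You can now easily retrieve all descendants with the simple query 'MyAppCommonParent' in parents.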
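A small sketch of what Alternative 2 amounts to. The helper names `create_body` and `descendants_query` are made up, and the sketch assumes, as the answer describes, that a file may carry more than one parent ID; that is worth verifying against the current Drive API, which may restrict it.

```python
def create_body(name, folder_id, common_parent_id):
    """Metadata for a new file labelled with its own folder AND the common parent.

    Assumption: the API accepts more than one entry in 'parents'.
    """
    return {"name": name, "parents": [folder_id, common_parent_id]}


def descendants_query(common_parent_id):
    """Query string that matches every item labelled with the common parent."""
    return f"'{common_parent_id}' in parents and trashed = false"


body = create_body("MyFile", "folderA2b-ID", "MyAppCommonParent-ID")
query = descendants_query("MyAppCommonParent-ID")
print(body)
print(query)
```

The query string would go into the `q` parameter of a single `files.list` request, so the depth of the folder tree no longer matters.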
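Alternative 3. Folders first
Start by getting all folders. Yep, all of them. Once you have them all in memory, you can crawl through their parents properties and build your tree structure and a list of folder IDs. You can then do a single files.list?q='folderA' in parents or 'folderA1' in parents or 'folderA1a' in parents... Using this technique you can get everything in two http calls.
Pseudo code for option 3 is a bit like...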
// get all folders from Drive
files.list?q=mimeType='application/vnd.google-apps.folder' and trashed=false&fields=files(id,name,parents)
// store them in a Map, keyed by ID
// find the entry for folderA and note its ID
// find any entries whose parents include that ID, and note their IDs
// for each such entry, repeat recursively
// use all of the IDs noted above to construct a ...
// files.list?q='folderA-ID' in parents or 'folderA1-ID' in parents or 'folderA1a-ID' in parents...
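The in-memory part of the pseudo code above could be sketched like this. It assumes the first call has already returned a flat list of folder records with `id` and `parents`; the sample data and the helper names `collect_folder_ids` and `build_query` are made up for illustration. It uses a breadth-first walk over a children index rather than literal recursion.

```python
from collections import defaultdict, deque


def collect_folder_ids(root_id, folders):
    """Return root_id plus the IDs of every folder below it, breadth first."""
    children = defaultdict(list)  # parent ID -> IDs of its direct subfolders
    for folder in folders:
        for parent_id in folder.get("parents", []):
            children[parent_id].append(folder["id"])

    ordered, seen, queue = [], {root_id}, deque([root_id])
    while queue:
        current = queue.popleft()
        ordered.append(current)
        for child in children[current]:
            if child not in seen:  # guards against a folder reachable twice
                seen.add(child)
                queue.append(child)
    return ordered


def build_query(folder_ids):
    """One "'<id>' in parents" segment per folder, joined with 'or'."""
    return " or ".join(f"'{fid}' in parents" for fid in folder_ids)


# Made-up result of the "get all folders" call.
folders = [
    {"id": "A", "parents": ["root"]},
    {"id": "A1", "parents": ["A"]},
    {"id": "A1a", "parents": ["A1"]},
    {"id": "A2", "parents": ["A"]},
    {"id": "A2a", "parents": ["A2"]},
    {"id": "A2b", "parents": ["A2"]},
    {"id": "B", "parents": ["root"]},
]
ids = collect_folder_ids("A", folders)
query = build_query(ids)
print(query)
```

The resulting string is what the second `files.list` call would send as `q`; folder B is left out because it is not reachable from A.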
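Alternative 2 is the most efficient, but only works if you have control of file creation. Alternative 3 is generally more efficient than Alternative 1, but there may be certain small tree sizes where 1 is best.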
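Sharing a Python solution to the excellent Alternative 3 by @pinoyyid, above, in case it's useful to anyone. I'm not a developer so it's probably hopelessly un-Pythonic... but it works, only makes 2 API calls, and is pretty quick.
It builds a Google Drive file query with one '<folder-id>' in parents segment per subfolder found. Interestingly, Google Drive seems to have a hard limit of 599 '<folder-id>' in parents segments per query, so if your folder-to-search has more subfolders than this, you need to chunk the list.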
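FOLDER_TO_SEARCH = '123456789'  # ID of folder to search
DRIVE_ID = '654321'  # ID of shared drive in which it lives
MAX_PARENTS = 500  # Limit set safely below Google max of 599 parents per query.
# drive_api_ref is assumed to be an authorized Drive v3 service object created elsewhere,
# for example with googleapiclient.discovery.build('drive', 'v3', credentials=...).
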
def get_all_folders_in_drive():
    """
    Return a dictionary of all the folder IDs in a drive mapped to their parent folder IDs (or to the
    drive itself if a top-level folder). That is, flatten the entire folder structure.
    """
    folders_in_drive_dict = {}
    page_token = None
    max_allowed_page_size = 1000
    just_folders = "trashed = false and mimeType = 'application/vnd.google-apps.folder'"
    while True:
        results = drive_api_ref.files().list(
            pageSize=max_allowed_page_size,
            fields="nextPageToken, files(id, name, mimeType, parents)",
            includeItemsFromAllDrives=True, supportsAllDrives=True,
            corpora='drive',
            driveId=DRIVE_ID,
            pageToken=page_token,
            q=just_folders).execute()
        folders = results.get('files', [])
        page_token = results.get('nextPageToken', None)
        for folder in folders:
            folders_in_drive_dict[folder['id']] = folder['parents'][0]
        if page_token is None:
            break
    return folders_in_drive_dict

def get_subfolders_of_folder(folder_to_search, all_folders):
    """
    Yield subfolders of the folder-to-search, and then subsubfolders etc. Must be called by an iterator.
    :param folder_to_search: ID of the folder whose subfolders are wanted.
    :param all_folders: The dictionary returned by :meth:`get_all_folders_in_drive`.
    """
    temp_list = [k for k, v in all_folders.items() if v == folder_to_search]  # Get all subfolders
    for sub_folder in temp_list:  # For each subfolder...
        yield sub_folder  # Return it
        yield from get_subfolders_of_folder(sub_folder, all_folders)  # Get subsubfolders etc

def get_relevant_files(relevant_folders):
    """
    Get files under the folder-to-search and all its subfolders.
    """
    relevant_files = {}
    # Split the folder list into chunks that stay below the per-query limit
    chunked_relevant_folders_list = [relevant_folders[i:i + MAX_PARENTS] for i in
                                     range(0, len(relevant_folders), MAX_PARENTS)]
    for folder_list in chunked_relevant_folders_list:
        query_term = ' or '.join(f"'{f}' in parents" for f in folder_list)
        relevant_files.update(get_all_files_in_folders(query_term))
    return relevant_files


def get_all_files_in_folders(parent_folders):
    """
    Return a dictionary of file IDs mapped to file names for the specified parent folders.
    :param parent_folders: Query fragment of "'<folder-id>' in parents" segments joined by "or".
    """
    files_under_folder_dict = {}
    page_token = None
    max_allowed_page_size = 1000
    just_files = f"mimeType != 'application/vnd.google-apps.folder' and trashed = false and ({parent_folders})"
    while True:
        results = drive_api_ref.files().list(
            pageSize=max_allowed_page_size,
            fields="nextPageToken, files(id, name, mimeType, parents)",
            includeItemsFromAllDrives=True, supportsAllDrives=True,
            corpora='drive',
            driveId=DRIVE_ID,
            pageToken=page_token,
            q=just_files).execute()
        files = results.get('files', [])
        page_token = results.get('nextPageToken', None)
        for file in files:
            files_under_folder_dict[file['id']] = file['name']
        if page_token is None:
            break
    return files_under_folder_dict

if __name__ == "__main__":
    all_folders_dict = get_all_folders_in_drive()  # Flatten folder structure
    relevant_folders_list = [FOLDER_TO_SEARCH]  # Start with the folder-to-archive
    for folder in get_subfolders_of_folder(FOLDER_TO_SEARCH, all_folders_dict):
        relevant_folders_list.append(folder)  # Recursively search for subfolders
    relevant_files_dict = get_relevant_files(relevant_folders_list)  # Get the files