Get a list of file names from HDFS using Python

Question

Hadoop newbie here.

I've searched for tutorials on getting started with Hadoop and Python without much success. I do not need to do any work with mappers and reducers yet; this is more of an access issue.

As part of a Hadoop cluster, there are a bunch of .dat files on HDFS.

In order to access those files on my client (local computer) using Python,

what do I need to have on my computer?

How do I query for filenames on HDFS?

Any links would be helpful too.

Answer 1

JGC*_*JGC 10

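As far as I know there is no out-of-the-box solution, and most answers I found resort to calling the hdfs command. I'm running on Linux and faced the same challenge. I found the sh package useful: it handles running OS commands for you and manages stdin/stdout/stderr.

See here for more information: https://amoffat.github.io/sh/

It's not the neatest solution, but it's one line(ish) and uses standard packages.

Here is my cut-down code for getting an HDFS directory listing. It lists files and folders alike, so you may need to modify it if you need to distinguish between them.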

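import sh
hdfsdir = '/somedirectory'
# Keep the last field (the path) of each non-empty line, then drop the "Found N items" header with [1:]
filelist = [line.rsplit(None, 1)[-1] for line in sh.hdfs('dfs', '-ls', hdfsdir).split('\n') if len(line.rsplit(None, 1))][1:]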

My output (in this case these are all directories):

[u'/somedirectory/transaction_basket_fct/date_id=2015-01-01',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-02',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-03',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-04',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-05',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-06',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-07',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-08']

Let's break it down:

To run the hdfs dfs -ls /somedirectory command, we can use the sh package like this:

import sh
sh.hdfs('dfs','-ls',hdfsdir)

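sh lets you call OS commands seamlessly, as if they were functions on the module. You pass the command's arguments as function arguments. Really neat.

For me this returns something like: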

Found 366 items
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-01
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-02
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-03
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-04
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-05

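Split it into lines on the newline character using .split('\n').

Get the last "word" in each line using line.rsplit(None,1)[-1].

To avoid problems with empty elements in the list, use if len(line.rsplit(None,1)).

Finally, remove the first element of the list (the Found 366 items line) using [1:].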
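
If you do need to tell files from directories, the same parsing steps can be extended a little. The following is only a sketch: parse_ls_output is a hypothetical helper name, it assumes the eight-column layout shown in the listing above (permissions, replication, owner, group, size, date, time, path), and the second sample line is made up for illustration. Splitting at most seven times keeps paths containing spaces intact.

```python
def parse_ls_output(text):
    """Parse `hdfs dfs -ls` output into (path, is_dir) tuples."""
    entries = []
    for line in text.splitlines():
        # Assumed layout: permissions, replication, owner, group, size, date, time, path
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # skips blank lines and the "Found N items" header
        permissions, path = fields[0], fields[7]
        # A leading "d" in the permission string marks a directory
        entries.append((path, permissions.startswith("d")))
    return entries


# Sample text: the first entry is from the listing above, the second is invented
sample = """Found 2 items
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-01
-rw-r--r--   3 impala hive       1024 2016-05-10 13:52 /somedirectory/part 0001.dat
"""

entries = parse_ls_output(sample)
dirs = [path for path, is_dir in entries if is_dir]
files = [path for path, is_dir in entries if not is_dir]
```

In practice you would pass the text returned by sh.hdfs('dfs','-ls',hdfsdir) instead of the sample string. Note that this column assumption would not hold if the sizes themselves contain spaces.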

Answer 2

Ehs*_*thi 6

For "querying file names on HDFS", using only the built-in subprocess library in Python 3:

from subprocess import Popen, PIPE

hdfs_path = '/path/to/the/designated/folder'
process = Popen(f'hdfs dfs -ls -h {hdfs_path}', shell=True, stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()
# Skip the "Found N items" header ([1:]) and the empty string after the final newline ([:-1])
lines = std_out.decode().split('\n')[1:][:-1]
list_of_file_names = [fn.split(' ')[-1].split('/')[-1] for fn in lines]
list_of_file_names_with_full_address = [fn.split(' ')[-1] for fn in lines]

Answer 3

sam*_*sam 5

what do I need to have on my computer?

You need Hadoop installed and running and, of course, Python.

How do I query for filenames on HDFS?

You can try something like this. I haven't tested the code, so don't just rely on it.

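from subprocess import Popen, PIPE

process = Popen('hdfs dfs -cat filename.dat', shell=True, stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()

# Check the return code and std_err before using the output
if process.returncode == 0:
    # Everything is OK: do whatever you need with std_out (raw bytes)
    data = std_out
else:
    # The command failed: do something else, e.g. report std_err
    print(std_err.decode(errors='replace'))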

You can also look at Pydoop which is a Python API for Hadoop.

Although my example includes shell=True, you can try running without it, since it is a security risk (see "Why you shouldn't use shell=True?").
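
As a rough illustration of running without a shell, the command can be passed as a list of arguments so the path is never interpreted by a shell. This is a sketch rather than tested cluster code: list_hdfs_paths is a hypothetical helper name, it assumes the hdfs executable is on your PATH, and the run parameter exists only so the function can be exercised without a cluster.

```python
import subprocess


def list_hdfs_paths(hdfs_path, run=subprocess.run):
    """Run `hdfs dfs -ls <hdfs_path>` without a shell and return the listed paths."""
    # A list of arguments (and no shell=True) means hdfs_path is passed as-is
    result = run(["hdfs", "dfs", "-ls", hdfs_path], capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"hdfs dfs -ls failed: {result.stderr.strip()}")
    # Keep the last whitespace-separated field of each entry line
    return [
        line.rsplit(None, 1)[-1]
        for line in result.stdout.splitlines()
        if line.strip() and not line.startswith("Found ")
    ]
```

On a machine with Hadoop installed, list_hdfs_paths('/some/directory') would return the paths printed by the command, and raise RuntimeError if the command reports a failure.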

Answer 4

小智 2

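You should have login access to a node in the cluster. Have the cluster administrator pick a node, set up the account, and tell you how to access the node securely. If you are the administrator, let me know whether the cluster is local or remote and, if remote, whether it is hosted on your own machine, inside a corporation, or on a third-party cloud, and I can provide more relevant information.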

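To query file names in HDFS, log in to a cluster node and run hadoop fs -ls [path]. The path is optional; if it is not provided, the files in your home directory are listed. If -R is given as an option, all files under the path are listed recursively. The command has other options as well. For more about this and other Hadoop file system shell commands, see http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html.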
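
If you end up driving this command from Python, the optional path and the -R option can be assembled into an argument list. A minimal sketch, where build_ls_args is a hypothetical helper name and not part of any Hadoop package:

```python
def build_ls_args(path=None, recursive=False):
    """Build the argument list for `hadoop fs -ls [-R] [path]`."""
    args = ["hadoop", "fs", "-ls"]
    if recursive:
        args.append("-R")  # list everything under the path recursively
    if path is not None:
        args.append(path)  # with no path, the home directory is listed
    return args
```

The resulting list can be handed to subprocess.run on the cluster node.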

An easy way to query HDFS file names in Python is esutil.hdfs.ls(hdfs_url='', recurse=False, full=False), which executes hadoop fs -ls hdfs_url in a subprocess. It also has functions for a number of other Hadoop file system shell commands (see the source at http://code.google.com/p/esutil/source/browse/trunk/esutil/hdfs.py). esutil can be installed with pip install esutil. It is on PyPI at https://pypi.python.org/pypi/esutil, its documentation is at http://code.google.com/p/esutil/, and its GitHub site is https://github.com/esheldon/esutil.

Archived:	10 years, 6 months ago
Views:	18,633
Last activity:	5 years, 11 months ago