How do I list the contents of an HDFS directory using WebHDFS?

DPE*_*PEZ 2 python json hadoop hdfs webhdfs

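Is it possible to check the contents of a directory in HDFS using webhdfs?

This would work the way hdfs dfs -ls normally does, but using webhdfs instead.

How do I list a directory through webhdfs using Python 2.6?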

Mic*_*ill 5

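You can use the LISTSTATUS operation. It is documented under "List a Directory" in the WebHDFS REST API documentation, which is where the following snippets come from.

With curl, this is what it looks like: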

curl -i  "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS"
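The same request can be made from Python using only the standard library. The following is a minimal sketch, not a definitive client: the helper names liststatus_url and list_names are hypothetical, the host and port in the usage note are placeholders, and it assumes an unsecured cluster and a path that needs no URL escaping. The import fallback is there so it runs on both Python 2.6 and Python 3.

```python
import json

try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2.6 / 2.7


def liststatus_url(host, port, path):
    """Build the LISTSTATUS URL for an absolute HDFS path such as "/user/me".

    Hypothetical helper; it does no URL escaping of the path.
    """
    return "http://{0}:{1}/webhdfs/v1/{2}?op=LISTSTATUS".format(
        host, port, path.lstrip("/"))


def list_names(host, port, path):
    """Fetch the listing and return the pathSuffix of every entry."""
    response = urlopen(liststatus_url(host, port, path))
    try:
        payload = json.loads(response.read().decode("utf-8"))
    finally:
        response.close()
    return [item["pathSuffix"]
            for item in payload["FileStatuses"]["FileStatus"]]
```

Calling list_names("namenode.example.com", 50070, "/user/me") would then return a plain list of names, assuming the NameNode is reachable at that (made-up) address.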

The response is a FileStatuses JSON object, which has this JSON schema:

{
  "name"      : "FileStatuses",
  "properties":
  {
    "FileStatuses":
    {
      "type"      : "object",
      "properties":
      {
        "FileStatus":
        {
          "description": "An array of FileStatus",
          "type"       : "array",
          "items"      : fileStatusProperties
        }
      }
    }
  }
}

fileStatusProperties (used for the items field) has this JSON schema:

var fileStatusProperties =
{
  "type"      : "object",
  "properties":
  {
    "accessTime":
    {
      "description": "The access time.",
      "type"       : "integer",
      "required"   : true
    },
    "blockSize":
    {
      "description": "The block size of a file.",
      "type"       : "integer",
      "required"   : true
    },
    "group":
    {
      "description": "The group owner.",
      "type"       : "string",
      "required"   : true
    },
    "length":
    {
      "description": "The number of bytes in a file.",
      "type"       : "integer",
      "required"   : true
    },
    "modificationTime":
    {
      "description": "The modification time.",
      "type"       : "integer",
      "required"   : true
    },
    "owner":
    {
      "description": "The user who is the owner.",
      "type"       : "string",
      "required"   : true
    },
    "pathSuffix":
    {
      "description": "The path suffix.",
      "type"       : "string",
      "required"   : true
    },
    "permission":
    {
      "description": "The permission represented as a octal string.",
      "type"       : "string",
      "required"   : true
    },
    "replication":
    {
      "description": "The number of replication of a file.",
      "type"       : "integer",
      "required"   : true
    },
    "type":
    {
      "description": "The type of the path object.",
      "enum"       : ["FILE", "DIRECTORY"],
      "required"   : true
    }
  }
};
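To make the schema above concrete, here is a small sketch that takes a response of that shape and prints it roughly the way hdfs dfs -ls would. The sample response is made up for illustration, and permission_to_rwx and format_listing are hypothetical helper names, not part of any library.

```python
import json

# Illustrative response shaped like the schema; all values are invented.
SAMPLE_RESPONSE = """
{
  "FileStatuses": {
    "FileStatus": [
      {"accessTime": 1320171722771, "blockSize": 33554432,
       "group": "supergroup", "length": 24930,
       "modificationTime": 1320171722771, "owner": "alice",
       "pathSuffix": "a.patch", "permission": "644",
       "replication": 1, "type": "FILE"},
      {"accessTime": 0, "blockSize": 0,
       "group": "supergroup", "length": 0,
       "modificationTime": 1320895981256, "owner": "alice",
       "pathSuffix": "bar", "permission": "711",
       "replication": 0, "type": "DIRECTORY"}
    ]
  }
}
"""


def permission_to_rwx(octal_string):
    """Turn an octal permission string such as "644" into "rw-r--r--".

    Only the lower nine bits are rendered; a sticky bit is ignored.
    """
    bits = int(octal_string, 8)
    out = []
    for i, flag in enumerate("rwxrwxrwx"):
        out.append(flag if bits & (1 << (8 - i)) else "-")
    return "".join(out)


def format_listing(response):
    """Return one ls-style line per FileStatus entry."""
    lines = []
    for item in response["FileStatuses"]["FileStatus"]:
        kind = "d" if item["type"] == "DIRECTORY" else "-"
        lines.append("{0}{1} {2} {3} {4} {5}".format(
            kind, permission_to_rwx(item["permission"]),
            item["owner"], item["group"],
            item["length"], item["pathSuffix"]))
    return lines


for line in format_listing(json.loads(SAMPLE_RESPONSE)):
    print(line)
```

With the sample data this prints "-rw-r--r-- alice supergroup 24930 a.patch" followed by "drwx--x--x alice supergroup 0 bar".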

You can use pywebhdfs to work with the file names in Python, like this:

from pprint import pprint
from pywebhdfs.webhdfs import PyWebHdfsClient

# Use your own host/port/user_name config
hdfs = PyWebHdfsClient(host='host', port='50070', user_name='hdfs')

# Use your preferred directory, without the leading "/"
data = hdfs.list_dir("dir/dir")

file_statuses = data["FileStatuses"]
pprint(file_statuses)  # Display the dict

for item in file_statuses["FileStatus"]:
    print(item["pathSuffix"])  # Display the item filename

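Instead of just printing each object, you can use the items however you need. The resulting file_statuses is just a Python dict, so it can be used like any other dict, as long as you use the right keys.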
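As one example of working with the dict directly, the sketch below separates files from directories and adds up the file sizes. The function names split_entries and total_bytes are hypothetical, and the input is assumed to be the value stored under the "FileStatuses" key, as in the snippet above.

```python
def split_entries(file_statuses):
    """Return (file_names, directory_names) from a FileStatuses dict."""
    files, dirs = [], []
    for item in file_statuses["FileStatus"]:
        if item["type"] == "DIRECTORY":
            dirs.append(item["pathSuffix"])
        else:
            files.append(item["pathSuffix"])
    return files, dirs


def total_bytes(file_statuses):
    """Sum the length of every entry whose type is FILE."""
    return sum(item["length"]
               for item in file_statuses["FileStatus"]
               if item["type"] == "FILE")


# Made-up data standing in for data["FileStatuses"].
example = {"FileStatus": [
    {"pathSuffix": "a.txt", "type": "FILE", "length": 10},
    {"pathSuffix": "logs", "type": "DIRECTORY", "length": 0},
    {"pathSuffix": "b.txt", "type": "FILE", "length": 32},
]}

files, dirs = split_entries(example)
print(files)                 # ['a.txt', 'b.txt']
print(dirs)                  # ['logs']
print(total_bytes(example))  # 42
```

Only the type, pathSuffix and length keys are read here; the other keys from the schema are available on each item in the same way.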