Reading a CSV file from HDFS into a DataFrame

lor*_*tar 8 python hadoop hdfs pandas

I am using pydoop to read a file from HDFS, and when I run:

import pydoop.hdfs as hd
with hd.open("/home/file.csv") as f:
    print(f.read())

it prints the contents of the file to stdout.

Is there a way to read this file into a DataFrame? I tried pandas' read_csv("/home/file.csv"), but it tells me the file cannot be found. The exact code and error are:

>>> import pandas as pd
>>> pd.read_csv("/home/file.csv")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 275, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 590, in __init__
    self._make_engine(self.engine)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 731, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1103, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 353, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3246)
  File "pandas/parser.pyx", line 591, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6111)
IOError: File /home/file.csv does not exist

hpa*_*ulj 15

I know next to nothing about hdfs, but I wonder whether the following would work:

import pandas as pd
import pydoop.hdfs as hd

with hd.open("/home/file.csv") as f:
    df = pd.read_csv(f)

I assume read_csv works with a file handle, or in fact any iterable that feeds it lines. I know the numpy csv readers do.
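As a minimal sketch of that assumption, the snippet below passes an in-memory file-like object to read_csv instead of a path. Here io.StringIO is only a stand-in for an open handle, and the sample data is made up for illustration; whether the object returned by hd.open behaves the same way is not verified here.

```python
import io

import pandas as pd

# Made-up sample data; io.StringIO stands in for an already opened file handle.
handle = io.StringIO(u"a,b\n1,2\n3,4\n")

# read_csv accepts either a path or an object that exposes a read() method.
df = pd.read_csv(handle)
print(df.shape)
```

If the pydoop handle is accepted in the same way, the `f` from `hd.open(...)` would simply take the place of `handle`.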

pd.read_csv("/home/file.csv") would work if a regular Python file open works, that is, if the file can be read as a regular local file:

with open("/home/file.csv") as f:
    print(f.read())

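But evidently hd.open is using some other location or protocol, so the file is not local. If my suggestion doesn't work, then you (or we) need to dig further into the hdfs documentation.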
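If passing the handle directly does not work, one possible fallback is to read the whole file into memory and wrap the bytes in io.BytesIO before parsing. The sketch below assumes the file fits in memory. `ReadOnlyHandle` and `read_csv_via_buffer` are hypothetical names introduced for illustration (the class merely imitates a remote handle that only offers read()), and the sample bytes are invented.

```python
import io

import pandas as pd


class ReadOnlyHandle(object):
    """Hypothetical stand-in for a remote file handle exposing only read()."""

    def __init__(self, data):
        self._data = data

    def read(self):
        return self._data

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False


def read_csv_via_buffer(handle):
    # Pull the full contents first, then let pandas parse an in-memory buffer.
    return pd.read_csv(io.BytesIO(handle.read()))


with ReadOnlyHandle(b"x,y\n10,20\n30,40\n") as f:
    df = read_csv_via_buffer(f)
print(df["y"].sum())
```

The trade-off is memory: the entire file is held as bytes before pandas builds the DataFrame, so this is only reasonable for files of moderate size.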