pyspark: load multiple partitioned files in a single load

Tags: partitioned-view, apache-spark, apache-spark-sql, pyspark, orc

I'm trying to load multiple files in a single load. They are all partitioned files. It works when I try it with 1 file, but when I list 24 files it gives me the error below. I couldn't find any documentation on this limitation, or any workaround other than doing a union after loading. Is there another option?

The code below reproduces the problem:

basePath = '/file/' 
paths = ['/file/df201601.orc', '/file/df201602.orc', '/file/df201603.orc',  
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',  
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',  
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',  
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',  
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',  
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',  
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc', ]   

df = sqlContext.read.format('orc') \
               .options(header='true', inferschema='true', basePath=basePath) \
               .load(*paths)

The error received:

 TypeError                                 Traceback (most recent call last)
 <ipython-input-43-7fb8fade5e19> in <module>()

---> 37 df = sqlContext.read.format('orc')                .options(header='true', inferschema='true',basePath=basePath)                .load(*paths)
     38 

TypeError: load() takes at most 4 arguments (24 given)

Answered by zer*_*323 (5 votes)

As explained in the official documentation, to read multiple files you should pass a list:

    path – optional string or a list of strings, for file-system backed data sources.

So in your case:

(sqlContext.read
    .format('orc')
    .options(basePath=basePath)
    .load(path=paths))

Argument unpacking (`*`) would only make sense if `load` were defined with variadic arguments, for example:

def load(this, *paths):
    ...