How to insert data into a single file in Hive

Question

INSERT OVERWRITE DIRECTORY 'wasb:///hiveblob/' SELECT * FROM table1; works. But when we issue a command such as INSERT OVERWRITE DIRECTORY 'wasb:///hiveblob/sample.csv' SELECT * FROM table1; it fails with the exception: Unable to rename: wasb://incrementalhive-1@crmdbs.blob.core.windows.net/hive/scratch/hive_2015-06-08_10-01-03_930_4881174794406290153-1/-ext-10000 to: wasb:/hiveblob/sample.csv

So, is there any way to insert the data into a single file?

Answer 1

Phi*_* P. 7

I don't think you can tell Hive to write directly to a specific file such as wasb:///hiveblob/foo.csv.

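What you can do is:

1. Before running your query, tell Hive to merge the output files into one. This way you can have as many reducers as you want and still end up with a single output file.
2. Run your query, e.g. INSERT OVERWRITE DIRECTORY ...
3. Then use dfs -mv inside Hive to rename the file to whatever you want.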

This may be less painful than running a separate hadoop fs -getmerge /your/src/folder /your/dest/folder/yourFileName, as Ramzy suggested.

The way to instruct Hive to merge files may differ depending on which runtime engine you use.

For example, if you use tez as the runtime engine for your Hive queries, you can do this:

-- Set the tez execution engine
-- And instruct to merge the results
set hive.execution.engine=tez;
set hive.merge.tezfiles=true;

-- Your query goes here.
-- The results should end up in wasb:///hiveblob/000000_0 file.
INSERT OVERWRITE DIRECTORY 'wasb:///hiveblob/' SELECT * from table1;


-- Rename the output file into whatever you want
dfs -mv 'wasb:///hiveblob/000000_0' 'wasb:///hiveblob/foo.csv';


(The above worked for me with these versions: HDP 2.2, Tez 0.5.2 and Hive 0.14.0.)
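
One caveat worth knowing: as far as the Hive configuration documentation describes it, the extra merge step only runs when the average output file size is below hive.merge.smallfiles.avgsize, and merged files are sized around hive.merge.size.per.task. If several files still show up, raising both thresholds may help. The values below are illustrative assumptions, not tuned recommendations, and have not been verified on the versions listed above:

```sql
-- Assumed values: choose thresholds larger than the expected total output size.
set hive.merge.smallfiles.avgsize=1073741824;  -- merge when the average file size is below ~1 GB
set hive.merge.size.per.task=1073741824;       -- aim for ~1 GB per merged file

-- After the INSERT, list the directory to confirm the file name before renaming.
dfs -ls wasb:///hiveblob/;
```

Listing the directory first is a cheap way to check that there really is a single file and that it is named 000000_0 before running dfs -mv.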

For the MapReduce engine (which is the default), you can try these settings, although I haven't tried them myself:

-- Try this if you use MapReduce engine.
set hive.execution.engine=mr;
set hive.merge.mapredfiles=true;
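
Put together, the MapReduce variant might look like the following. This is an untested sketch that simply mirrors the Tez example: hive.merge.mapfiles (which applies to map-only jobs, such as a plain SELECT *) is added as an assumption, and the 000000_0 output file name is assumed to be the same as under Tez.

```sql
-- Untested sketch for the MapReduce engine, mirroring the Tez example.
set hive.execution.engine=mr;
set hive.merge.mapfiles=true;     -- merge outputs of map-only jobs (assumed relevant for SELECT *)
set hive.merge.mapredfiles=true;  -- merge outputs of map-reduce jobs

INSERT OVERWRITE DIRECTORY 'wasb:///hiveblob/' SELECT * FROM table1;

-- Check the actual output file name, then rename it.
dfs -ls wasb:///hiveblob/;
dfs -mv wasb:///hiveblob/000000_0 wasb:///hiveblob/foo.csv;
```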


Asked:	10 years, 5 months ago
Viewed:	7291 times
Last active:	10 years, 4 months ago