Sqoop import - Parquet files with CDH5

Tao*_*a_k 2 hadoop sqoop

I'm trying to import data from MySQL straight into Parquet, but it doesn't seem to work...

I'm using CDH 5.3, which ships with Sqoop 1.4.5.

Here is my command line:

sqoop import --connect jdbc:mysql://xx.xx.xx.xx/database --username username --password mypass --query 'SELECT page_id,user_id FROM pages_users WHERE $CONDITIONS' --split-by page_id --hive-import --hive-table default.pages_users3 --target-dir hive_pages_users --as-parquetfile

And I get this error:

Warning: /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/01/09 14:31:49 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.0
15/01/09 14:31:49 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/01/09 14:31:49 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
15/01/09 14:31:49 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
15/01/09 14:31:49 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
15/01/09 14:31:49 INFO tool.CodeGenTool: Beginning code generation
15/01/09 14:31:50 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE  (1 = 0) 
15/01/09 14:31:50 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE  (1 = 0) 
15/01/09 14:31:50 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE  (1 = 0) 
15/01/09 14:31:50 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
Note: /tmp/sqoop-root/compile/b90e7b492f5b66554f2cca3f88ef7a61/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/01/09 14:31:51 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/b90e7b492f5b66554f2cca3f88ef7a61/QueryResult.jar
15/01/09 14:31:51 INFO mapreduce.ImportJobBase: Beginning query import.
15/01/09 14:31:51 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/01/09 14:31:51 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE  (1 = 0) 
15/01/09 14:31:51 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE  (1 = 0) 
15/01/09 14:31:51 WARN spi.Registration: Not loading URI patterns in org.kitesdk.data.spi.hive.Loader
15/01/09 14:31:51 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI: hive?dataset=default.pages_users3
org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI: hive?dataset=default.pages_users3
    at org.kitesdk.data.spi.Registration.lookupDatasetUri(Registration.java:109)
    at org.kitesdk.data.Datasets.create(Datasets.java:189)
    at org.kitesdk.data.Datasets.create(Datasets.java:240)
    at org.apache.sqoop.mapreduce.ParquetJob.createDataset(ParquetJob.java:81)
    at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:70)
    at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:112)
    at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:262)
    at org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:721)
    at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:499)
    at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:236)

I have no problem importing the data into Hive's default file format, but Parquet is a problem... Any idea why this happens?

Thanks :)

小智 9

Don't use <db>.<table> with --hive-table. That doesn't work for Parquet imports. Sqoop uses the Kite SDK to write Parquet files, and Kite doesn't accept the <db>.<table> format.

Instead, use --hive-database and --hive-table. For your command, it should be:

    sqoop import --connect jdbc:mysql://xx.xx.xx.xx/database \
    --username username --password mypass \
    --query 'SELECT page_id,user_id FROM pages_users WHERE $CONDITIONS' --split-by page_id \
    --hive-import --hive-database default --hive-table pages_users3 \
    --target-dir hive_pages_users --as-parquetfile
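If the import then completes, a quick sanity check from the shell (a sketch, reusing the database and table names from the command above):

    hive -e 'DESCRIBE FORMATTED default.pages_users3; SELECT COUNT(*) FROM default.pages_users3;'

DESCRIBE FORMATTED should list a Parquet InputFormat for the table, and the count should match the source MySQL table.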


Tag*_*gar 5

Here is my pipeline for importing Parquet files into Hive from JDBC on CDH 5.5. The JDBC source here is Oracle, but the instructions below also work for MySQL.

1) Sqoop:

$ sqoop import --connect "jdbc:oracle:thin:@(complete TNS descriptor)" \
    --username MRT_OWNER -P \
    --compress --compression-codec snappy \
    --as-parquetfile \
    --table TIME_DIM \
    --warehouse-dir /user/hive/warehouse \
    --num-mappers 1

I chose --num-mappers 1 because the TIME_DIM table has only about 20k rows, and splitting a Parquet table into multiple files is not recommended for such a small dataset. Each mapper creates its own output (Parquet) file.

(p.s. for Oracle users: I had to connect as the owner of the source table; otherwise I had to specify "MRT_OWNER.TIME_DIM" and got the error org.kitesdk.data.ValidationException: Namespace MRT_OWNER.TIME_DIM is not alphanumeric (plus '_'), which looks like a Sqoop bug.)

(p.s.2: the table name had to be all uppercase. Not sure whether this is Oracle-specific (it shouldn't be) or another Sqoop bug.)

(p.s.3: the --compress --compression-codec snappy parameters were recognized but didn't seem to have any effect.)
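For a larger table you would raise --num-mappers and tell Sqoop how to split the work with --split-by on a numeric key; a sketch under those assumptions (BIG_FACT_TABLE and its ID column are hypothetical):

$ sqoop import --connect "jdbc:oracle:thin:@(complete TNS descriptor)" \
    --username MRT_OWNER -P \
    --as-parquetfile \
    --table BIG_FACT_TABLE \
    --split-by ID \
    --num-mappers 8 \
    --warehouse-dir /user/hive/warehouse

Each of the eight mappers then writes its own Parquet file under the table's directory.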

2) The command above creates a directory named

/user/hive/warehouse/TIME_DIM

It's a good idea to move it into the directory of a specific Hive database, e.g.:

$ hadoop fs -mv /user/hive/warehouse/TIME_DIM /user/hive/warehouse/dwh.db/time_dim

This assumes the Hive database/schema is named "dwh".
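If the "dwh" database doesn't exist yet, you can create it first; in Hive this one-liner also creates the dwh.db directory under the warehouse:

$ hive -e 'CREATE DATABASE IF NOT EXISTS dwh;'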

3) Create the Hive table, taking the schema directly from the Parquet file:

$ hadoop fs -ls /user/hive/warehouse/dwh.db/time_dim | grep parquet

-rwxrwx--x+  3 hive hive       1216 2016-02-04 23:56 /user/hive/warehouse/dwh.db/time_dim/62679a1c-b848-426a-bb8e-9372328ddad7.parquet

If the command above returns more than one Parquet file (which means you had multiple mappers, i.e. a higher --num-mappers value), you can pick any one of them.
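If you'd rather not copy a filename by hand, a small shell sketch (same path as above) that picks the first one:

# Any of the mapper output files works here, since they all share the same schema.
PARQUET_FILE=$(hadoop fs -ls /user/hive/warehouse/dwh.db/time_dim \
    | grep -o '/user/hive/warehouse/dwh\.db/time_dim/.*\.parquet' \
    | head -1)
echo "$PARQUET_FILE"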

The CREATE TABLE below has to be run in Impala, not in Hive. Hive currently can't infer a schema from a Parquet file, but Impala can:

[impala-shell] > CREATE TABLE dwh.time_dim
LIKE PARQUET '/user/hive/warehouse/dwh.db/time_dim/62679a1c-b848-426a-bb8e-9372328ddad7.parquet'
COMMENT 'sqooped from MRT_OWNER.TIME_DIM'
STORED AS PARQUET
LOCATION     'hdfs:///user/hive/warehouse/dwh.db/time_dim'
;

P.S. You can also use Spark to infer the schema from Parquet, e.g.:

spark.read.parquet('hdfs:///user/hive/warehouse/dwh.db/time_dim').schema

4) Since the table wasn't created through Hive (which would have collected statistics automatically), it's a good idea to gather statistics:

[impala-shell] > compute stats dwh.time_dim;
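To verify what was collected, Impala can display the table's stats afterwards (same table as above):

[impala-shell] > show table stats dwh.time_dim;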

https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_literal_sqoop_import_literal