Sqoop import: data type of the --split-by column

Bag*_*thi 3 hive sqoop

In a Sqoop import, does the --split-by column have to be a numeric data type (integer, bigint, numeric)? Can't it be a string?

dev*_*v ツ 10

Yes, you can split on a non-numeric column, such as a string.

But it is not recommended.

Why?

To split the data, Sqoop first fires a query like:

SELECT MIN(col1), MAX(col1) FROM TABLE

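It then divides that range by the number of mappers.

Now take an integer as the --split-by column.

Say the table has an id column with values from 1 to 100, and you are using 4 mappers (-m 4 in your sqoop command).

Sqoop gets the MIN and MAX values with: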

SELECT MIN(id), MAX(id) FROM TABLE

OUTPUT:

1, 100

Splitting an integer range is easy. You get 4 parts:

  • 1-25
  • 26-50
  • 51-75
  • 76-100
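
The four parts above can be reproduced with a short sketch. This is a simplified illustration of even range splitting, not Sqoop's actual splitter code; the function names `integer_splits` and `where_clauses` are made up for this example, and the generated conditions only show the general shape of a per-mapper filter.

```python
def integer_splits(lo, hi, num_mappers):
    """Split the inclusive range [lo, hi] into contiguous inclusive sub-ranges.

    Simplified sketch: any remainder is spread over the first sub-ranges.
    """
    total = hi - lo + 1
    base, extra = divmod(total, num_mappers)
    splits = []
    start = lo
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more mappers than distinct values
        end = start + size - 1
        splits.append((start, end))
        start = end + 1
    return splits


def where_clauses(column, splits):
    """Render one hypothetical filter condition per mapper."""
    return [f"{column} >= {a} AND {column} <= {b}" for a, b in splits]


if __name__ == "__main__":
    for clause in where_clauses("id", integer_splits(1, 100, 4)):
        print(clause)
```

With MIN = 1, MAX = 100 and 4 mappers, each sub-range covers 25 ids, so no range overlaps another and no id is read twice.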

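Now take a string as the --split-by column.

Say the table has a name column with values from "dev" to "sam", and you are using 4 mappers (-m 4 in your sqoop command).

Sqoop gets the MIN and MAX values with: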

SELECT MIN(name), MAX(name) FROM TABLE

OUTPUT:

dev, sam

Now, how would that be divided into 4 parts? According to the Sqoop docs:

/**
   * This method needs to determine the splits between two user-provided
   * strings.  In the case where the user's strings are 'A' and 'Z', this is
   * not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
   * splits for strings beginning with each letter, etc.
   *
   * If a user has provided us with the strings "Ham" and "Haze", however, we
   * need to create splits that differ in the third letter.
   *
   * The algorithm used is as follows:
   * Since there are 2**16 unicode characters, we interpret characters as
   * digits in base 65536. Given a string 's' containing characters s_0, s_1
   * .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2.. s_n in
   * base 65536. Having mapped the low and high strings into floating-point
   * values, we then use the BigDecimalSplitter to establish the even split
   * points, then map the resulting floating point values back into strings.
   */
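
The idea in that comment can be sketched in a few lines of Python. This is only an illustration of the base-65536 mapping, not Sqoop's code: it uses exact `Fraction` arithmetic in place of BigDecimal, assumes every character fits in 16 bits, and the function names and the `max_chars` cut-off are assumptions made for the example.

```python
from fractions import Fraction

BASE = 65536  # one "digit" per 16-bit character, as in the comment above


def string_to_fraction(s):
    """Interpret s as the number 0.s_0 s_1 ... s_n in base 65536."""
    value = Fraction(0)
    place = Fraction(1, BASE)
    for ch in s:
        value += ord(ch) * place
        place /= BASE
    return value


def fraction_to_string(value, max_chars):
    """Map a fraction in [0, 1) back to a string of at most max_chars characters."""
    chars = []
    for _ in range(max_chars):
        if value == 0:
            break
        value *= BASE
        digit = int(value)  # the integer part is the next character code
        chars.append(chr(digit))
        value -= digit
    return "".join(chars)


def text_split_points(lo, hi, num_splits, max_chars=8):
    """Return lo, the evenly spaced interior split points, and hi."""
    lo_val, hi_val = string_to_fraction(lo), string_to_fraction(hi)
    step = (hi_val - lo_val) / num_splits
    points = [fraction_to_string(lo_val + i * step, max_chars)
              for i in range(1, num_splits)]
    return [lo] + points + [hi]


if __name__ == "__main__":
    # repr() keeps unusual characters in the split points readable
    print([repr(p) for p in text_split_points("dev", "sam", 4)])
```

For "A" and "Z" with two splits, the interior point starts with "M", matching the ['A', 'M') and ['M', 'Z'] example in the comment. The rest of a split point is usually not a readable name, because it is just the fractional remainder turned back into characters.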

You will also see these warnings in the code:

LOG.warn("Generating splits for a textual index column.");
LOG.warn("If your database sorts in a case-insensitive order, "
    + "this may result in a partial import or duplicate records.");
LOG.warn("You are strongly encouraged to choose an integral split column.");

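In the integer example, all mappers get a balanced load (each one fetches 25 records from the RDBMS).

With a string column, the values are much less likely to be spread evenly between the minimum and the maximum, so it is hard to give every mapper a similar load.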
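
A rough way to see the imbalance is to count how many rows fall into each range. The sketch below uses made-up data: the name list is hypothetical, and the one-letter boundaries "g", "k", "o" are simplified stand-ins for evenly spaced split points between "dev" and "sam".

```python
from bisect import bisect_left, bisect_right


def rows_per_split(keys, boundaries):
    """Count keys in each range [b_i, b_(i+1)); the last range includes its upper bound."""
    keys = sorted(keys)
    counts = []
    last = len(boundaries) - 2
    for i in range(len(boundaries) - 1):
        lo_idx = bisect_left(keys, boundaries[i])
        if i == last:
            hi_idx = bisect_right(keys, boundaries[i + 1])
        else:
            hi_idx = bisect_left(keys, boundaries[i + 1])
        counts.append(hi_idx - lo_idx)
    return counts


if __name__ == "__main__":
    # Contiguous integer ids: every mapper gets the same number of rows.
    print(rows_per_split(range(1, 101), [1, 26, 51, 76, 100]))

    # Hypothetical names clustered near "d": most rows land on one mapper.
    names = ["dev", "dhruv", "dinesh", "divya", "diya", "esha", "gita", "sam"]
    print(rows_per_split(names, ["dev", "g", "k", "o", "sam"]))
```

In this toy data set the first string range receives six of the eight rows and one range receives none, while the integer ranges each hold 25 rows.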


In a nutshell, go for an integer column as the --split-by column.