In sqoop import, does the split-by column have to be a numeric data type (integer, bigint, numeric)? Can't it be a string?
dev*_*v ツ 10
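Yes, you can split on a non-numeric data type such as a string, so the column does not have to be numeric.
But it is not recommended.
To split the data, Sqoop fires the following query: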
SELECT MIN(col1), MAX(col1) FROM TABLE
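It then divides that range according to your number of mappers.
Now take an example with an integer as the --split-by column.
Suppose the table has an id column with values from 1 to 100, and you are using 4 mappers (-m 4 in your sqoop command).
Sqoop gets the MIN and MAX values using: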
SELECT MIN(id), MAX(id) FROM TABLE
OUTPUT:
1, 100
Splitting integers is easy. You get 4 parts of 25 ids each: 1-25, 26-50, 51-75 and 76-100.
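As a rough illustration of the integer case, the sketch below divides an inclusive id range into one contiguous range per mapper. It is a simplified model, not Sqoop's own splitter, which may place the boundaries slightly differently. The function name `integer_splits` and the shape of the printed WHERE clauses are assumptions made for this example.

```python
def integer_splits(lo, hi, num_mappers):
    """Divide the inclusive range [lo, hi] into up to num_mappers contiguous ranges.

    Simplified model for illustration only; not Sqoop's actual implementation.
    """
    total = hi - lo + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = lo
    for i in range(num_mappers):
        # Spread any remainder over the first `extra` ranges.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more mappers than values
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


splits = integer_splits(1, 100, 4)
for lo, hi in splits:
    # Hypothetical per-mapper query shape, shown only to make the ranges concrete.
    print(f"SELECT * FROM TABLE WHERE id >= {lo} AND id <= {hi}")
```

With MIN = 1, MAX = 100 and 4 mappers, every range covers 25 ids, which is why the load is balanced when the ids are dense.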
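Now take a string as the --split-by column.
Suppose the table has a name column with values from "dev" to "sam", and you are using 4 mappers (-m 4 in your sqoop command).
Sqoop gets the MIN and MAX values using: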
SELECT MIN(name), MAX(name) FROM TABLE
OUTPUT:
dev, sam
Now, how is this range divided into 4 parts? According to the documentation comment in the Sqoop source code:
/**
* This method needs to determine the splits between two user-provided
* strings. In the case where the user's strings are 'A' and 'Z', this is
* not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
* splits for strings beginning with each letter, etc.
*
* If a user has provided us with the strings "Ham" and "Haze", however, we
* need to create splits that differ in the third letter.
*
* The algorithm used is as follows:
* Since there are 2**16 unicode characters, we interpret characters as
* digits in base 65536. Given a string 's' containing characters s_0, s_1
* .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2.. s_n in
* base 65536. Having mapped the low and high strings into floating-point
* values, we then use the BigDecimalSplitter to establish the even split
* points, then map the resulting floating point values back into strings.
*/
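The mapping described in that comment can be sketched as follows. This is a minimal illustration under stated assumptions, not Sqoop's code: it uses exact fractions instead of BigDecimal, handles only characters in the Basic Multilingual Plane, and the function names and the `max_chars` cap are choices made for this example.

```python
from fractions import Fraction

BASE = 65536  # 2**16, the base named in the quoted comment


def string_to_fraction(s, max_chars=8):
    """Interpret s as the base-65536 number 0.s_0 s_1 s_2 ... (first max_chars chars)."""
    value = Fraction(0)
    place = Fraction(1, BASE)
    for ch in s[:max_chars]:
        code = ord(ch)
        if code >= BASE:
            raise ValueError("this sketch only handles characters below U+10000")
        value += code * place
        place /= BASE
    return value


def fraction_to_string(value, max_chars=8):
    """Map a fraction in [0, 1) back to a string, one base-65536 digit per character."""
    chars = []
    for _ in range(max_chars):
        if value == 0:
            break
        value *= BASE
        digit = int(value)  # integer part is the next digit
        chars.append(chr(digit))
        value -= digit
    return "".join(chars)


def string_split_points(lo, hi, num_splits, max_chars=8):
    """Return num_splits + 1 evenly spaced boundary strings from lo to hi."""
    lo_f = string_to_fraction(lo, max_chars)
    hi_f = string_to_fraction(hi, max_chars)
    step = (hi_f - lo_f) / num_splits
    return [fraction_to_string(lo_f + i * step, max_chars)
            for i in range(num_splits + 1)]


points = string_split_points("dev", "sam", 4)
for p in points:
    # ascii() escapes the arbitrary code units that follow the first character.
    print(ascii(p))
```

For "dev" and "sam" with 4 splits, the boundary strings begin with d, g, k, o and s. The characters after the first one in the inner boundaries are arbitrary code units rather than readable text, which hints at why textual split points are awkward to reason about.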
You will also see these warnings in the code:
LOG.warn("Generating splits for a textual index column.");
LOG.warn("If your database sorts in a case-insensitive order, "
+ "this may result in a partial import or duplicate records.");
LOG.warn("You are strongly encouraged to choose an integral split column.");
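In the integer example, all mappers get a balanced load (each fetches 25 records from the RDBMS).
With a string column, the values are less likely to be spread evenly across the range, so it is hard to give all mappers a similar load.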
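To make the difference in load concrete, here is a small sketch that counts how many rows fall into each mapper's range. The list of names and the inner split points "g", "k" and "o" are hypothetical values chosen for this illustration, and `bucket_counts` is a helper written for this example, not part of Sqoop.

```python
from bisect import bisect_right
from collections import Counter


def bucket_counts(values, boundaries):
    """Count values per range; `boundaries` holds the sorted inner split points.

    A value equal to a boundary is counted in the range that starts at it.
    """
    counts = Counter(bisect_right(boundaries, v) for v in values)
    return [counts.get(i, 0) for i in range(len(boundaries) + 1)]


# Integer column: ids 1..100 with ranges starting at 1, 26, 51 and 76.
id_counts = bucket_counts(list(range(1, 101)), [26, 51, 76])

# String column: hypothetical names between "dev" and "sam".
names = ["dev", "dina", "dylan", "ed", "emma", "eric",
         "fay", "finn", "gus", "mia", "ruth", "sam"]
name_counts = bucket_counts(names, ["g", "k", "o"])

print("rows per mapper (id):  ", id_counts)
print("rows per mapper (name):", name_counts)
```

In this made-up sample the id ranges hold 25 rows each, while the first name range holds 8 of the 12 names, so one mapper would do most of the work.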
In a nutshell, go for an integer column as the --split-by column.