In our organization, we have been trying to implement ETL using tools based on the Hadoop ecosystem. While the ecosystem itself is large, we currently use only a very limited set of tools. Our typical pipeline looks like this:
Source Database (1 or more) -> sqoop import -> pig scripts -> sqoop export -> Destination Database (1 or more)
Over time, we have run into several problems with the ETL approach described above. One issue we noticed is that fields fail to line up correctly when Pig tries to read them from HDFS (where the data on HDFS was typically imported via sqoop), and the Pig scripts fail with errors. For example, because of the misalignment, a string may end up in a field of numeric type.
There appear to be two ways to address this problem:
Remove known problem characters from the fields before processing with Pig. This is the approach we have taken in the past. We know there is some bad data in our source databases - typically newlines and tabs in fields where they should not exist. (Note: we used to use tab as the field delimiter.) So what we did was use either a DB view or sqoop's free-form query option, which in turn used the REPLACE function or whatever equivalent is available in the source database (usually MySQL, less often Postgres). This approach does work, but it has the side effect that the data on HDFS no longer matches the source data. In addition, some other imported fields no longer make sense - for example, suppose you have an MD5 or SHA1 hash over some field, but that field has been modified to replace certain characters; then we have to recompute the MD5 or SHA1 for consistency rather than import it from the source DB. Moreover, this approach involves a certain amount of trial and error. We don't necessarily know in advance which fields need to be modified (and which characters need to be removed), so it may take several iterations to reach the final goal.
Use sqoop's enclosure feature combined with escaping, and pair it with an appropriately typed loader in Pig. That way, not only do the fields line up correctly, but a given field (and its associated value) is represented the same way as the data moves through the pipeline.
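The hash-consistency side effect of the first approach is easy to demonstrate with a short sketch (Python here purely for illustration; the cleansing rule mirrors the REPLACE-based views described above, and the sample value is made up):

```python
import hashlib

def cleanse(value: str) -> str:
    # Mimic the REPLACE-based cleansing done in the source DB view:
    # strip the newlines and tabs that broke our tab-delimited files.
    return value.replace("\n", "").replace("\t", "")

original = "line one\nline two"
cleaned = cleanse(original)

md5_original = hashlib.md5(original.encode()).hexdigest()
md5_cleaned = hashlib.md5(cleaned.encode()).hexdigest()

# A hash imported from the source DB was computed over the original value,
# so it no longer matches a hash over the cleansed field on HDFS.
print(md5_original == md5_cleaned)  # False
```

This is why, under the first approach, any imported checksum over a cleansed field has to be recomputed downstream.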
Here are the specific versions of the software used for this experiment:
Sqoop: 1.4.3
Pig: 0.12.0
Hadoop: 2.0.0
Since our datasets are typically large (and take hours to process), I figured I would come up with a very small dataset that mimics some of the data issues we have run into. To that end, I created a small table in MySQL (which will serve as the source database):
mysql> desc example;
+-------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+---------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| name | varchar(1024) | YES | | NULL | |
| v1 | int(11) | YES | | NULL | |
| v2 | int(11) | YES | | NULL | |
| v3 | int(11) | YES | | NULL | |
+-------+---------------+------+-----+---------+----------------+
5 rows in set (0.00 sec)
After adding data with INSERT statements, here are the contents of the example table:
mysql> select * from example;
+----+----------------------------------------------------------------------------+------+------+------+
| id | name | v1 | v2 | v3 |
+----+----------------------------------------------------------------------------+------+------+------+
| 1 | Some string, with a comma. | 1 | 2 | 3 |
| 2 | Another "string with quotes" | 4 | 5 | 6 |
| 3 | A string with
new line | 7 | 8 | 9 |
| 4 | A string with 3 new lines -
first new line
second new line
third new line | 10 | 11 | 12 |
| 5 | a string with "quote" and a
new line | 13 | 14 | 15 |
| 6 | clean record | 0 | 1 | 2 |
| 7 | single
newline | 0 | 1 | 2 |
| 8 | | 51 | 52 | 53 |
| 9 | NULL | 105 | NULL | 103 |
+----+----------------------------------------------------------------------------+------+------+------+
9 rows in set (0.00 sec)
We can see newlines in the name field. I did not include tabs in this dataset because I am switching the delimiter from tab to comma, so there is one record with a comma instead. Since the typical enclosing character is a double quote, there are some records with double quotes. Finally, in the last two records (id = 8 and 9), I wanted to see how an empty string and a null would be represented in a char-type field, and how a null would be represented in a numeric-type field.
I tried the following sqoop import against the table above:
sqoop import --connect jdbc:mysql://localhost/test --username user --password pass --table example --columns 'id, name, v1, v2, v3' --verbose --split-by id --target-dir example --fields-terminated-by , --escaped-by \\ --enclosed-by \" --num-mappers 1
Note that backslash is used as the escape character, double quote as the enclosure, and comma as the field delimiter.
Here is how the data looks on HDFS:
$hadoop fs -cat example/part-m-00000
"1","Some string, with a comma.","1","2","3"
"2","Another \"string with quotes\"","4","5","6"
"3","A string with
new line","7","8","9"
"4","A string with 3 new lines -
first new line
second new line
third new line","10","11","12"
"5","a string with \"quote\" and a
new line","13","14","15"
"6","clean record","0","1","2"
"7","single
newline","0","1","2"
"8","","51","52","53"
"9","null","105","null","103"
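As a sanity check on this format, the line for id = 2 above can be parsed with any CSV reader configured for backslash escaping and full enclosure - e.g., Python's csv module (shown purely as an illustration of the format, not as part of the pipeline):

```python
import csv
import io

# One line as sqoop emits it with --enclosed-by '"' and --escaped-by '\':
line = '"2","Another \\"string with quotes\\"","4","5","6"\n'

# doublequote=False + escapechar='\\' matches sqoop's backslash convention.
reader = csv.reader(io.StringIO(line),
                    delimiter=",",
                    quotechar='"',
                    escapechar="\\",
                    doublequote=False)
row = next(reader)
print(row)  # ['2', 'Another "string with quotes"', '4', '5', '6']
```

So the on-disk representation is unambiguous; the question is whether the Pig loader interprets it the same way.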
I created a small Pig script to read and parse the data above:
REGISTER '……./pig/contrib/piggybank/java/piggybank.jar';
data = LOAD 'example' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE') AS (id:int, name:chararray, v1:int, v2:int, v3:int);
dump data;
Note the use of the CSVExcelStorage loader available in piggybank. Since we have newlines in the incoming dataset, we enabled the MULTILINE option. The script above produces the following output:
(1,Some string, with a comma.,1,2,3)
(2,Another \string with quotes\",4,5,6)
(3,A string with
new line,7,8,9)
(4,A string with 3 new lines -
first new line
second new line
third new line,10,11,12)
(5,a string with \quote\" and a
new line,13,14,15)
(6,clean record,0,1,2)
(7,single
newline,0,1,2)
(8,",51,52,53)
(9,null,105,,103)
In the records with ids 2 and 5, a backslash remains in place of the first double quote, while for the subsequent double quotes both the backslash and the quote are retained. This is not what I wanted. Noticing that CSVExcelStorage, which is based on Excel 2007, uses a double quote to escape a quote (i.e., two consecutive double quotes are treated as a single quote), I made the escape character a double quote:
sqoop import --connect jdbc:mysql://localhost/test --username user --password pass --table example --columns 'id, name, v1, v2, v3' --verbose --split-by id --target-dir example --fields-terminated-by , --escaped-by '\"' --enclosed-by '\"' --num-mappers 1
Before executing the above command, I removed the existing data: $ hadoop fs -rm -r example
After the sqoop import run, here is how the data looks on HDFS:
$hadoop fs -cat example/part-m-00000
"1","Some string, with a comma.","1","2","3"
"2","Another """"string with quotes""""","4","5","6"
"3","A string with
new line","7","8","9"
"4","A string with 3 new lines -
first new line
second new line
third new line","10","11","12"
"5","a string with """"quote"""" and a
new line","13","14","15"
"6","clean record","0","1","2"
"7","single
newline","0","1","2"
"8","","51","52","53"
"9","null","105","null","103"
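Note that with the same character serving as both escape and enclosure, each original quote now occupies four characters on disk. A standard doubled-quote parser (the same RFC 4180/Excel convention CSVExcelStorage follows) collapses each `""` pair to one literal quote, which explains the Pig output that follows. A quick sketch with Python's csv module, using the record for id = 2 above:

```python
import csv
import io

# The name field from record 2 as sqoop wrote it with --escaped-by '"'
# and --enclosed-by '"': each original quote became four characters.
line = '"2","Another """"string with quotes""""","4","5","6"\n'

# csv.reader's default doublequote=True implements the Excel convention:
# inside a quoted field, "" collapses to a single literal quote.
row = next(csv.reader(io.StringIO(line)))
print(row[1])  # Another ""string with quotes""
```

Each quadruple collapses to a doubled quote, matching what the Pig dump shows below.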
I ran the same Pig script once more on this data, and it produced the following output:
(1,Some string, with a comma.,1,2,3)
(2,Another ""string with quotes"",4,5,6)
(3,A string with
new line,7,8,9)
(4,A string with 3 new lines -
first new line
second new line
third new line,10,11,12)
(5,a string with ""quote"" and a
new line,13,14,15)
(6,clean record,0,1,2)
(7,single
newline,0,1,2)
(8,",51,52,53)
(9,null,105,,103)
Noticing that any double quotes in the string are now effectively doubled, I can get rid of this by using the REPLACE function in Pig:
data2 = FOREACH data GENERATE id, REPLACE(name, '""', '"') as name, v1, v2, v3;
dump data2;
The above script produces the following output:
(1,Some string, with a comma.,1,2,3)
(2,Another "string with quotes",4,5,6)
(3,A string with
new line,7,8,9)
(4,A string with 3 new lines -
first new line
second new line
third new line,10,11,12)
(5,a string with "quote" and a
new line,13,14,15)
(6,clean record,0,1,2)
(7,single
newline,0,1,2)
(8,",51,52,53)
(9,null,105,,103)
The above looks much more like the output I want. One last item I need to ensure is that nulls and empty strings for chararray type and nulls for int type are accounted for.
Towards that end, I add one more section to the above pig script that generates null and empty strings for char type and null for int type:
data3 = FOREACH data2 GENERATE id, name, v1, v2, v3, null as name2:chararray, '' as name3:chararray, null as v4:int;
dump data3;
The output looks as follows:
(1,Some string, with a comma.,1,2,3,,,)
(2,Another "string with quotes",4,5,6,,,)
(3,A string with
new line,7,8,9,,,)
(4,A string with 3 new lines -
first new line
second new line
third new line,10,11,12,,,)
(5,a string with "quote" and a
new line,13,14,15,,,)
(6,clean record,0,1,2,,,)
(7,single
newline,0,1,2,,,)
(8,",51,52,53,,,)
(9,null,105,,103,,,)
I stored the same output in HDFS using the following pig script:
STORE data3 INTO 'example_output' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE');
Here is how the data looks on HDFS:
$hadoop fs -cat example_output/part-m-00000
1,"Some string, with a comma.",1,2,3,,,
2,"Another ""string with quotes""",4,5,6,,,
3,"A string with
new line",7,8,9,,,
4,"A string with 3 new lines -
first new line
second new line
third new line",10,11,12,,,
5,"a string with ""quote"" and a
new line",13,14,15,,,
6,clean record,0,1,2,,,
7,"single
newline",0,1,2,,,
8,"""",51,52,53,,,
9,null,105,,103,,,
For nulls and empty strings, the only two records of interest are the bottom two (id = 8 and 9). It's clear that an empty string and a null coming from the source via sqoop are represented differently from those generated by Pig. I could account for nulls and empty strings in the name field above similar to how I handled the double quote, but it seems rather manual and involves more steps than should be needed.
Notice that although we used the "enclosed-by" option in the sqoop import (as opposed to "optionally-enclosed-by"), the output from Pig applies enclosure only when there is a need for it, i.e., if a quote or a comma appears in the field then enclosing is performed, otherwise not - in other words, this behaves like the equivalent of sqoop's "optionally-enclosed-by" option.
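This "quote only when needed" behavior corresponds to minimal quoting in most CSV writers; the following Python sketch (illustrative only, using the csv module's QUOTE_MINIMAL policy) reproduces the same two styles of row seen in the output above:

```python
import csv
import io

buf = io.StringIO()
# QUOTE_MINIMAL quotes a field only when it contains the delimiter,
# the quote character, or a line terminator - like Pig's output above.
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerow([6, "clean record", 0, 1, 2])               # nothing to quote
writer.writerow([1, "Some string, with a comma.", 1, 2, 3])  # comma forces quoting
print(buf.getvalue())
# 6,clean record,0,1,2
# 1,"Some string, with a comma.",1,2,3
```

So a clean record is written bare while a field with an embedded delimiter gets enclosed, exactly the asymmetry noted above.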
The final stage in the pipeline is sqoop export. I put together the following table:
mysql> desc example_output;
+-------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+---------------+------+-----+---------+-------+
| id | int(11) | YES | | NULL | |
| name | varchar(1024) | YES | | NULL | |
| v1 | int(11) | YES | | NULL | |
| v2 | int(11) | YES | | NULL | |
| v3 | int(11) | YES | | NULL | |
| name2 | varchar(1024) | YES | | NULL | |
| name3 | varchar(1024) | YES | | NULL | |
| v4 | int(11) | YES | | NULL | |
+-------+---------------+------+-----+---------+-------+
8 rows in set (0.00 sec)
Here is the sqoop export command I used:
sqoop export --connect jdbc:mysql://localhost/test --username user --password pass --table example_output --export-dir example_output --input-fields-terminated-by , --input-escaped-by '\"' --input-optionally-enclosed-by '\"' --num-mappers 1 --verbose
The export options are similar to import options except that the "enclosed-by" has been replaced by "optionally-enclosed-by" and an "input-" prefix has been added to some of the options (e.g: --input-fields-terminated-by) since sqoop export uses those while reading input from HDFS.
This fails with the following error in the logs:
2014-02-25 22:19:05,750 ERROR org.apache.sqoop.mapreduce.TextExportMapper: Exception:
java.lang.RuntimeException: Can't parse input data: 'Some string, with a comma.,1,2,3,,,'
at example_output.__loadFromFields(example_output.java:396)
at example_output.parse(example_output.java:309)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:794)
at example_output.__loadFromFields(example_output.java:366)
... 12 more
2014-02-25 22:19:05,756 ERROR org.apache.sqoop.mapreduce.TextExportMapper: On input: 1,"Some string, with a comma.",1,2,3,,,
2014-02-25 22:19:05,757 ERROR org.apache.sqoop.mapreduce.TextExportMapper: On input file: hdfs://nameservice1/user/xyz/example_output/part-m-00000
2014-02-25 22:19:05,757 ERROR org.apache.sqoop.mapreduce.TextExportMapper: At position 0
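The error message shows the record with the enclosure already consumed but the embedded comma still splitting the name field, leaving one column too many. My reading (not confirmed sqoop internals) is that the field split is not honoring the enclosure; the arithmetic of the failure can be sketched as:

```python
import csv
import io

line = '1,"Some string, with a comma.",1,2,3,,,\n'

# A naive split on the delimiter breaks the quoted field into two pieces,
# yielding 9 columns instead of 8 - and the parser then runs out of fields.
naive = line.rstrip("\n").split(",")
print(len(naive))  # 9

# A quote-aware parse keeps the enclosed field intact:
row = next(csv.reader(io.StringIO(line)))
print(len(row))  # 8
```

That matches the stack trace: `NoSuchElementException` from `__loadFromFields` is what you would expect when the column count no longer lines up with the table definition.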
In an effort to troubleshoot this problem, I created a HDFS location that has only one record (id = 6) from the input data set:
$ hadoop fs -cat example_output_single_record/part-m-00000
6,clean record,0,1,2,,,
Now the sqoop export command becomes:
sqoop export --connect jdbc:mysql://localhost/test --username user --password pass --table example_output --export-dir example_output_single_record --input-fields-terminated-by , --input-escaped-by '\"' --input-optionally-enclosed-by '\"' --num-mappers 1 --verbose
The above command runs through fine and produces the desired result of inserting the single record into the destination DB:
mysql> select * from example_output;
+------+--------------+------+------+------+-------+-------+------+
| id | name | v1 | v2 | v3 | name2 | name3 | v4 |
+------+--------------+------+------+------+-------+-------+------+
| 6 | clean record | 0 | 1 | 2 | | | NULL |
+------+--------------+------+------+------+-------+-------+------+
1 row in set (0.00 sec)
While the null value has been preserved for the numeric field, both the null and the empty string were mapped to an empty string in the destination DB.
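This collapse is easy to reproduce outside the pipeline: most delimited-text writers serialize both a null and an empty string as an empty field, so the distinction is lost on the way out. A sketch with Python's csv module (illustrative only):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow([8, "", 51])     # empty string in the char-type column
writer.writerow([9, None, 105])  # null in the char-type column
print(buf.getvalue())
# 8,,51
# 9,,105
# Both rows now carry an indistinguishable empty field.
```

Without an out-of-band convention (such as sqoop's --null-string / --input-null-string substitution values), nothing in the text itself tells a downstream reader which empty field was originally null.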
I think it would be easier if we can ensure that a given value for a given data type will be represented/processed exactly the same way regardless of whether it’s coming from sqoop or generated by pig. Has anyone figured out a way to ensure consistent representation/processing of a given data type while preserving the original field values? I have covered only two data types here (chararray and int) but I suppose some of the other data types also have potentially similar issues.
I have used "enclosed-by" option in sqoop import instead of "optionally-enclosed-by" so that every field value will be enclosed within double quotes. I just thought it would be a source of less confusion if every value in every field was enclosed instead of just those that need enclosing. What do others use and has one of these options worked better for your use case relative to the other? It looks like CSVExcelStorage doesn't support a notion of "enclosed-by" - are there any other storage functions that support this mechanism?
Any suggestions on how to get the sqoop export to work as intended on the full output of pig script (i.e., example_output on HDFS)?
Perhaps you need to take a step back and pick a simpler solution. So you have newlines, tabs, commas, double quotes, nulls, exotic characters? There may even be some garbage in your data, but how random is it really? Can you pick an obscure delimiter and live with it?
For example, use 0x17 as the field delimiter.
The delimiter in sqoop:
--fields-terminated-by \0x17
And in pig:
LOAD 'input.dat' USING PigStorage('\\0x17') as (x,y,z);
Or maybe you can use some other obscure ASCII value: http://en.wikipedia.org/wiki/ASCII
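The appeal of a control character like 0x17 (ETB) is that it is vanishingly unlikely to appear in real text, so a plain split recovers the fields without any quoting or escaping machinery. A quick sketch (illustrative; the sample record is made up):

```python
SEP = "\x17"  # ASCII 0x17, End of Transmission Block

# Commas and quotes in the data no longer need any special treatment:
record = SEP.join(["1", "Some string, with a comma.", 'and "quotes" too'])
fields = record.split(SEP)
print(fields)  # ['1', 'Some string, with a comma.', 'and "quotes" too']
```

Note that this only solves the delimiter problem: embedded newlines still terminate records, so they would have to be handled separately (or stripped) under this scheme.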