从CSV创建表,其值包含用引号括起来的逗号

nxl*_*xl4 6 sql hadoop impala

我正在尝试使用我上传到HDFS目录的CSV在Impala中创建一个表.CSV包含用引号括起来的逗号的值.

例:

1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.66.128.0/17,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.0.0/17,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.128.0/18,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.192.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
Run Code Online (Sandbox Code Playgroud)

帕拉的文件说,这可以用一个来解决ESCAPED BY条款.这是我目前的代码:

DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;

CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
  network STRING
 ,isp STRING
 ,organization STRING
 ,autonomous_system_number STRING
 ,autonomous_system_organization STRING
  )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'

LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';

INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;

LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/' 
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
Run Code Online (Sandbox Code Playgroud)

我也试过使用这个ESCAPED BY '"'条款.在这两种情况下,Impala都使用引号中的逗号并将其用作分隔符,将值拆分为两列.

有关如何修复代码的任何想法,以便不会发生这种情况?

编辑(6/9/2015)

所以,根据@KS Nidhin和@JTUP的建议,我已经完成了以下变化.但是,每个变体返回的结果与没有SERDEPROPERTIES运算符的查询返回相同的结果,逗号仍然会导致值显示在错误的列中:

变化1

DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;

CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
  network STRING
 ,isp STRING
 ,organization STRING
 ,autonomous_system_number STRING
 ,autonomous_system_organization STRING
  )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
WITH SERDEPROPERTIES ( "quoteChar" = "'", "escapeChar" = "\\" ) 

LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';

INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;

LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/' 
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
Run Code Online (Sandbox Code Playgroud)

变化2

DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;

CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
  network STRING
 ,isp STRING
 ,organization STRING
 ,autonomous_system_number STRING
 ,autonomous_system_organization STRING
  )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
WITH SERDEPROPERTIES ( 'quoteChar' = '"', 'escapeChar' = '\\' )

LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';

INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;

LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/' 
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
Run Code Online (Sandbox Code Playgroud)

变化3

DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;

CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
  network STRING
 ,isp STRING
 ,organization STRING
 ,autonomous_system_number STRING
 ,autonomous_system_organization STRING
  )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
WITH SERDEPROPERTIES (
   "separatorChar" = "\,",
   "quoteChar"     = "\""
)

LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';

INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;

LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/' 
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
Run Code Online (Sandbox Code Playgroud)

SERDEPROPERTIES尝试其他任何想法,还是操作员的进一步变化?

编辑(2016年10月6日)

我能够使用SERDESERDEPROPERTIES运算符在Hive中工作(基于Hive文档中提供的代码)获取查询的不同变体,并创建正确的表:

DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;

CREATE TABLE GeoIP2_ISP_Blocks_IPv4(network STRING
 ,isp STRING
 ,organization STRING
 ,autonomous_system_number STRING
 ,autonomous_system_organization STRING)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'

WITH SERDEPROPERTIES (
   'separatorChar' = ',',
   'quoteChar'     = '"',
   'escapeChar'    = '\\'
)   
STORED AS TEXTFILE;

LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/' 
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
Run Code Online (Sandbox Code Playgroud)

由于SERDEImpala中没有运营商,因此该解决方案无法在那里运行.我很好在Hive中创建表格,但我仍然无法在Impala中找到可行的解决方案.

JT4*_*T4U 2

DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;

CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
  network STRING
 ,isp STRING
 ,organization STRING
 ,autonomous_system_number STRING
 ,autonomous_system_organization STRING
  )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'

WITH SERDEPROPERTIES (
   "separatorChar" = "\,",
   "quoteChar"     = "\""
)

LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';

INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;

LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/' 
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
Run Code Online (Sandbox Code Playgroud)

添加 SERDEPROPERTIES 应该可以解决问题