Tags: python, csv, delimiter, apache-spark, pyspark
I am trying to read a csv file in pyspark using the ^A (\001) delimiter. I went through the link below and tried the same approach it describes; it works as expected, i.e. I am able to read the csv files and process them further.
Link: How to parse a csv that uses ^A (i.e. \001) as the delimiter with spark-csv?
Working:
spark.read.option("wholeFile", "true"). \
option("inferSchema", "false"). \
option("header", "true"). \
option("quote", "\""). \
option("multiLine", "true"). \
option("delimiter", "\u0001"). \
csv("path/to/csv/file.csv")
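This hardcoded form works because the Python parser decodes the "\u0001" escape in the source file into the actual single control character before Spark ever sees it. A quick illustrative check (not from the original post):

hardcoded = "\u0001"
print(len(hardcoded), repr(hardcoded))  # -> 1 '\x01', i.e. one real U+0001 character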
Instead of hardcoding the delimiter, I want to read it from a database. Below is what I tried.
update table set field_delimiter= 'field_delimiter=\\u0001'
(These are stored as key-value pairs; using the key, I access the value.)
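For context, a minimal sketch of how such a lookup might behave; this Config class is hypothetical, standing in for whatever database-backed module config actually is here:

class Config:
    """Hypothetical stand-in for the database-backed config module."""
    def __init__(self, rows):
        self._rows = dict(rows)  # key -> value pairs loaded from the table

    def __getattr__(self, key):
        # config.FIELD_DELIMITER looks up the "field_delimiter" row
        return self._rows[key.lower()]

config = Config([("field_delimiter", r"\u0001")])
print(config.FIELD_DELIMITER)  # prints the six-character literal text: \u0001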
delimiter = config.FIELD_DELIMITER  # fetches the delimiter from the database
>>> print(delimiter)
\u0001
Not working:
spark.read.option("wholeFile", "true"). \
option("inferSchema", "false"). \
option("header", "true"). \
option("quote", "\""). \
option("multiLine", "true"). \
option("delimiter", delimiter). \
csv("path/to/csv/file.csv")
Error:
: java.lang.IllegalArgumentException: Unsupported special character for delimiter: \u0001
at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:106)
at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83)
at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:178)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:178)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:177)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
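The trace hints at the root cause: what comes back from the database is the six-character literal string \u0001 (backslash, u, 0, 0, 0, 1), not the single U+0001 control character, and the Spark 2.x CSV source only accepts a one-character delimiter plus a handful of recognized escapes such as \t. A minimal sketch of how to check and fix the value before the read, assuming config.FIELD_DELIMITER returns a plain Python string:

import codecs

delimiter = config.FIELD_DELIMITER      # e.g. the stored literal text "\u0001"
print(len(delimiter), repr(delimiter))  # 6 '\\u0001' if the escape was stored literally

# Decode the stored escape sequence into the real control character.
if len(delimiter) > 1:
    delimiter = codecs.decode(delimiter, "unicode_escape")

print(len(delimiter), repr(delimiter))  # 1 '\x01'

With the decoded one-character value, the spark.read call from the "Working" block above should no longer raise the IllegalArgumentException.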
Answer:
I was working with a file that has the same delimiter, i.e. "\u0001".
To make it work on Python 3.x, I imported:
from __future__ import unicode_literals
and read my file into a dataframe:
df = spark.read.format("csv").option("inferSchema", True)\
.option("delimiter",u"\u0001").load(r"/application/file.csv")
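Two notes on this snippet. On Python 3, the from __future__ import unicode_literals line and the u"..." prefix are no-ops, since string literals are already unicode; they only matter on Python 2. And to tie it back to the original question, the same read also works with a delimiter decoded from the database value instead of a hardcoded one. A sketch under that assumption (header is enabled here because the output below shows named columns):

import codecs

# Decode the stored literal "\u0001" into the real U+0001 character.
delimiter = codecs.decode(config.FIELD_DELIMITER, "unicode_escape")

df = spark.read.format("csv") \
    .option("header", True) \
    .option("inferSchema", True) \
    .option("delimiter", delimiter) \
    .load(r"/application/file.csv")
df.show(truncate=False)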
Output:
+--------------+------------------------------------+------+---------------------+---------------------+-----------+------------------+-------------------+-----------------+
|ts |id |source|FaM |record_Num |primlim_no |first_name |middle_name |last_name |
+--------------+------------------------------------+------+---------------------+---------------------+-----------+------------------+-------------------+-----------------+
|20150728133902|3d942d41-edde-419c-a15b |AS4 |AGC |300104 |76000389072|lalal |H |RAMEN |
|20150728133902|5277f150-6890-4c99-b85a |AS4 |AGC |3001261 |76000027136|roberta |null |BIRDY |
|20150728133902|10c8f16b-cc2f-42b4-810d |AS4 |AGC |400005920 |76000328013|bobby |L |LORDS |
|20150728133902|5c1a8c4c-a590-4b3b-95f5 |AS4 |AGC |3154018172 |76000054981|jackie |A |DOWN |
|20150728133902|a510763b-57da-4767-972d |AS4 |AGC |3059318259 |76000350660|rob |W |THORN |