Tags: scala, utf-16, apache-spark
I need to read an ISO-8859-1 encoded file, do some processing, and then save it (also in ISO-8859-1 encoding). To test this, I lifted a test case I found in the Databricks spark-csv package: https://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/CsvSuite.scala
- specifically: test("DSL test for iso-8859-1 encoded file")
val fileDF = spark.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("charset", "iso-8859-1")
.option("delimiter", "~") // bogus - hopefully something not in the file, just want 1 record per line
.load("s3://.../cars_iso-8859-1.csv")
fileDF.collect // I see the non-ascii characters correctly
val selectedData = fileDF.select("_c0") // just so show an operation
selectedData.write
.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter", "~")
.option("charset", "iso-8859-1")
.save("s3://.../carOutput8859")
This code runs without errors - but it doesn't seem to honor the iso-8859-1 option on the output. At a Linux prompt (after copying from S3 to local Linux):
file -i cars_iso-8859-1.csv
cars_iso-8859-1.csv: text/plain; charset=iso-8859-1
file -i carOutput8859.csv
carOutput8859.csv: text/plain; charset=utf-8
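(As a sanity check independent of file -i: ISO-8859-1 accepts every byte value, so the only reliable signal is whether the non-ASCII bytes also happen to be valid UTF-8. A small sketch of that check on a locally copied file - the helper name and path argument are just placeholders:)
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}
import java.nio.file.{Files, Paths}

// Returns true if the file's bytes decode as strict UTF-8. Latin-1 text with
// accented characters (e.g. 0xE9 for "é") normally fails this check.
def isStrictUtf8(path: String): Boolean = {
  val decoder = StandardCharsets.UTF_8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT)
    .onUnmappableCharacter(CodingErrorAction.REPORT)
  try { decoder.decode(ByteBuffer.wrap(Files.readAllBytes(Paths.get(path)))); true }
  catch { case _: CharacterCodingException => false }
}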
I'm just looking for some good examples of reading and writing non-UTF-8 files. At this point I have a lot of flexibility in the approach (it doesn't have to be a CSV reader). Any recommendations/examples?
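One direction I'm experimenting with is Spark's built-in CSV source instead of the spark-csv package (a sketch; the paths and the "~" delimiter are the same placeholders as above, and as far as I can tell the write-side encoding option is only honored in newer Spark releases - on older versions the output still comes out UTF-8):
import org.apache.spark.sql.SparkSession

// Sketch using the built-in CSV source ("encoding" is an alias of "charset").
val spark = SparkSession.builder().appName("latin1-csv").getOrCreate()

val latin1DF = spark.read
  .option("header", "false")
  .option("encoding", "iso-8859-1")   // decode the input as Latin-1
  .option("delimiter", "~")
  .csv("s3://.../cars_iso-8859-1.csv")

latin1DF.select("_c0").write
  .option("header", "false")
  .option("delimiter", "~")
  .option("encoding", "iso-8859-1")   // only effective where the writer supports it
  .csv("s3://.../carOutput8859")
If the writer-side encoding isn't available in my Spark version, the only fallback I can think of is formatting each row myself and writing the ISO-8859-1 bytes directly (e.g. via mapPartitions and the Hadoop FileSystem API), but that gives up the CSV writer's quoting and escaping, so I'd prefer something cleaner.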