I am trying to use the PySpark CSV reader with the following behavior:

Here is what I have tried.
file: ab.csv
------
a,b
1,2
3,four
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
DDL = "a INTEGER, b INTEGER"
df = spark.read.csv('ab.csv', header=True, schema=DDL, enforceSchema=False,
                    columnNameOfCorruptRecord='broken')
df.show()
Output:
+----+----+
| a| b|
+----+----+
| 1| 2|
|null|null|
+----+----+
This command does not store the corrupt record anywhere. If I add `broken` to the schema and drop the header validation, the command emits a warning instead.
DDL = "a INTEGER, b INTEGER, broken STRING"
df = spark.read.csv('ab.csv', header=True, schema=DDL, enforceSchema=True,
                    columnNameOfCorruptRecord='broken')
df.show()
Output:
WARN CSVDataSource:66 - Number of column in CSV header is not equal to …