无法使用pyspark从json读取数据

Pra*_*tel 6 apache-spark pyspark

我是PySpark的新手.任何人都可以帮助我如何使用pyspark读取json数据.我们所做的,

(1)main.py

import os.path
from pyspark.sql import SparkSession

def fileNameInput(filename,spark):

    try:
        if(os.path.isfile(filename)):
            loadFileIntoHdfs(filename,spark)
        else:
            print("File does not exists")
    except OSError:
        print("Error while finding file")


def loadFileIntoHdfs(fileName,spark):
    df = spark.read.json(fileName)
    df.show()


if __name__ == '__main__':

    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
    file_name = input("Enter file location : ")
    fileNameInput(file_name,spark)
Run Code Online (Sandbox Code Playgroud)

当我运行上面的代码时,它会抛出错误消息

 File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o41.showString.
: org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
Run Code Online (Sandbox Code Playgroud)

提前致谢

ern*_*t_k 12

你的JSON在我的pyspark中运行.当记录文本跨越多行时,我会收到类似的错误.请确保每条记录适合一行.或者,告诉它支持多行记录:

spark.read.json(filename, multiLine=True)
Run Code Online (Sandbox Code Playgroud)

什么有效:

{ "employees": [{ "firstName": "John", "lastName": "Doe" }, { "firstName": "Anna", "lastName": "Smith" }, { "firstName": "Peter", "lastName": "Jones" } ] }
Run Code Online (Sandbox Code Playgroud)

那输出:

spark.read.json('/home/ernest/Desktop/brokenjson.json').printSchema()
root
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
Run Code Online (Sandbox Code Playgroud)

当我尝试这样的输入时:

{
  "employees": [{ "firstName": "John", "lastName": "Doe" }, { "firstName": "Anna", "lastName": "Smith" }, { "firstName": "Peter", "lastName": "Jones" } ] }
Run Code Online (Sandbox Code Playgroud)

然后我得到架构中的损坏记录:

root
 |-- _corrupt_record: string (nullable = true)
Run Code Online (Sandbox Code Playgroud)

但是当与多行选项一起使用时,后一个输入也可以.