Using pyspark, I am reading multiple files, each containing one JSON object, from the folder contentdata2:
df = spark.read\
.option("mode", "DROPMALFORMED")\
.json("./data/contentdata2/")
df.printSchema()
content = df.select('fields').collect()
where df.printSchema() yields
root
|-- fields: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- field: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- value: string (nullable = true)
|-- id: string (nullable = true)
|-- score: double (nullable = true)
|-- siteId: string (nullable = true)
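I want to access fields.element.field and, for each JSON object, store the field whose name equals body as well as the field whose name equals urlhash.
The format of content is a Row (fields) containing other Rows, like this: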
[Row(fields=[Row(field='body', type=None, value='["First line of text","Second line …
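For illustration, here is a minimal sketch of the extraction described above, applied to the collected rows. It is not a definitive solution. To keep it runnable without a Spark session, `Field` and `Record` are hypothetical namedtuple stand-ins for `pyspark.sql.Row`, the helper name `pick_fields` is made up, and the sample values (including `abc123`) are invented. It assumes each entry exposes `field` and `value` attributes, as the schema suggests.

```python
from collections import namedtuple

# Hypothetical stand-ins for pyspark.sql.Row, used only so the sketch
# runs without Spark; real Rows offer the same attribute access.
Field = namedtuple("Field", ["field", "type", "value"])
Record = namedtuple("Record", ["fields"])


def pick_fields(record, wanted=("body", "urlhash")):
    """Return {field name: value} for the wanted entries of one record."""
    # The schema marks both the array and its elements as nullable,
    # so guard against None in both places.
    return {
        f.field: f.value
        for f in (record.fields or [])
        if f is not None and f.field in wanted
    }


# Invented sample data in the shape of df.select('fields').collect().
content = [
    Record(fields=[
        Field("body", None, '["First line of text","Second line"]'),
        Field("urlhash", None, "abc123"),
        Field("title", None, "ignored"),
    ]),
    Record(fields=None),
]

# One dict per JSON object, holding only the body and urlhash values.
extracted = [pick_fields(r) for r in content]
print(extracted)
```

If `value` holds a JSON-encoded list, as the truncated sample seems to suggest, `json.loads` could decode it afterwards. On larger data, the same selection could probably be done on the DataFrame itself (for example with `explode` from `pyspark.sql.functions` and a filter on the field name) before collecting.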