How to read a whole file into a single string

Question


Kum*_*mar 8 apache-spark apache-spark-sql

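I want to read a JSON or XML file in PySpark. My file is split across multiple lines when I read it with:

rdd = sc.textFile(json_or_xml_path)  # path to the JSON or XML file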


Input

{
  "employees":
  [
    {
      "firstName":"John",
      "lastName":"Doe"
    },
    {
      "firstName":"Anna"
    }
  ]
}


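The input is spread across multiple lines.

Expected output: {"employees":[{"firstName":"John",......]}

How can I get the complete file in a single line using PySpark?

Please help me; I am new to Spark.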

Answer 1

Jus*_*ony 5

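If your data is not laid out one record per line, as textFile expects, use wholeTextFiles instead. It gives you the entire content of each file, so you can parse it into whatever format you want.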
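
A minimal PySpark sketch of this suggestion, assuming an existing SparkContext named sc and files that each hold exactly one valid JSON document. The helper names to_single_line and read_json_files are hypothetical, introduced only for illustration. Each (path, content) pair from wholeTextFiles is parsed with the standard json module and written back compactly, which gives the single-line form the question asks for.

```python
import json


def to_single_line(pair):
    """Turn one (path, content) pair from wholeTextFiles into (path, one-line JSON)."""
    path, content = pair
    # json.loads accepts the multi-line text; json.dumps with compact
    # separators writes it back without newlines or extra spaces.
    return path, json.dumps(json.loads(content), separators=(",", ":"))


def read_json_files(sc, path):
    """Assumes sc is a live SparkContext; returns an RDD of (path, one-line JSON)."""
    return sc.wholeTextFiles(path).map(to_single_line)
```

With a running context, read_json_files(sc, '/home/folder_with_text_files/*').collect() would return one (path, string) pair per file. For XML the same pattern should apply with an XML parser in place of json.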

Answer 2

abb*_*obh 5

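There are 3 ways (I invented the 3rd; the first two are standard built-in Spark functions). The solutions here are in PySpark:

textFile, wholeTextFiles, and a labeled textFile (key = file, value = 1 line from the file; this is a mix of the two built-in ways of parsing files).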

1.) textFile

Input: rdd = sc.textFile('/home/folder_with_text_files/input_file')

Output: an array in which each entry holds one line of the file, i.e. [line1, line2, ...]

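2.) wholeTextFiles

Input: rdd = sc.wholeTextFiles('/home/folder_with_text_files/*')

Output: an array of tuples; the first item is the "key" holding the file path, and the second item holds the entire contents of that one file, i.e.

[(u'file:/home/folder_with_text_files/', u'file1_contents'), (u'file:/home/folder_with_text_files/', u'file2_contents'), ...]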

3.) "Labeled" textFile

Input:

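import glob
from pyspark import SparkContext
from pyspark.sql import SQLContext  # missing in the original snippet

sc.stop()  # stop the context the PySpark shell already created
sc = SparkContext("local", "example")  # if running locally
sqlContext = SQLContext(sc)

Data_File = "/home/folder_with_text_files"  # assumed input folder

# Key each line by its source file, then union the per-file RDDs.
Spark_Full = sc.emptyRDD()
for filename in glob.glob(Data_File + "/*"):
    # Bind filename as a default argument so each lambda keeps its own value.
    Spark_Full += sc.textFile(filename).keyBy(lambda x, filename=filename: filename)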


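Output: an array in which each entry is a tuple, with the filename as the key and one line of the file as the value. (Technically, with this method you could also use a key other than the actual file path, perhaps a hash representation to save memory.) i.e.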

[('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
 ('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
 ('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'),
 ('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
  ...]
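
The remark above about using a hash instead of the full path as the key could look roughly like this. It is only a sketch: short_key and labeled_text_file are hypothetical helpers, and the choice of SHA-256 truncated to 8 hex characters is an assumption made for illustration (a shorter digest saves more memory but raises the chance of two files sharing a key).

```python
import hashlib


def short_key(filename, length=8):
    """Return a short, deterministic hex key derived from a file path."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    return digest[:length]


def labeled_text_file(sc, filename):
    """Assumes sc is a live SparkContext; keys every line by the hashed path."""
    key = short_key(filename)  # computed once on the driver
    return sc.textFile(filename).keyBy(lambda line: key)
```

Keeping a small driver-side dict that maps short_key(filename) back to filename lets you recover the original path later.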


You can also regroup the result into a list of lines per file:

Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()

[('/home/folder_with_text_files/file1.txt', ['file1_contents_line1', 'file1_contents_line2','file1_contents_line3']),
 ('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]


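Or recombine each whole file back into a single string (in this example the result matches what you get from wholeTextFiles, except that the lines are joined with spaces and the string "file:" is stripped from the file path):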

Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))).collect()
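
One caveat with the two regrouping snippets: as far as I know, groupByKey does not promise that the values for a key come back in their original line order once a file spans several partitions. Here is a hedged sketch that makes the order explicit by attaching a line index with zipWithIndex and joins with newlines rather than spaces. The helper names join_in_order and files_as_strings are hypothetical.

```python
def join_in_order(indexed_lines, sep="\n"):
    """Join (index, line) pairs into one string, sorted by index."""
    return sep.join(line for _, line in sorted(indexed_lines))


def files_as_strings(sc, filenames):
    """Assumes sc is a live SparkContext and filenames is a non-empty list.

    Returns an RDD of (filename, whole file text).
    """
    per_file = [
        sc.textFile(name)
        .zipWithIndex()  # yields (line, index within this file)
        .map(lambda pair, name=name: (name, (pair[1], pair[0])))
        for name in filenames
    ]
    return sc.union(per_file).groupByKey().mapValues(join_in_order)
```

If you only need the whole file as a string, wholeTextFiles remains the simpler route; this variant is mainly useful when you already have the labeled per-line RDD.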

Answer 3

Ani*_*Jha 5

This is how you would do it in Scala:

val rdd = sc.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
rdd.collect.foreach(t => println(t._2))


Answer 4

con*_*xyz 5

"How to read a whole [HDFS] file in one string [in Spark, to use as sql]":

e.g.

// Put file to hdfs from edge-node's shell...

hdfs dfs -put <filename>

// Within spark-shell...

// 1. Load file as one string
val f = sc.wholeTextFiles("hdfs:///user/<username>/<filename>")
val hql = f.take(1)(0)._2

// 2. Use string as sql/hql
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val results = hiveContext.sql(hql)
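
For readers working in PySpark rather than the Scala shell, a rough equivalent of the two steps above might look like this. It assumes a live SparkContext sc and some object that exposes a sql method (for example a HiveContext or SparkSession); the function names are hypothetical.

```python
def first_file_as_string(sc, path):
    """Return the content of the first file matched by path as one string."""
    # wholeTextFiles yields (path, content) pairs; take(1) returns a list.
    return sc.wholeTextFiles(path).take(1)[0][1]


def run_sql_file(sc, sql_runner, path):
    """Load the file as one string and pass it to sql_runner.sql()."""
    return sql_runner.sql(first_file_as_string(sc, path))
```

Passing the context objects in as arguments keeps the helpers independent of how the session was created.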


Archived:	10 years, 5 months ago
Views:	13030
Last activity:	6 years, 3 months ago