How to work with a large amount of data in Spark


I'm using Spark from Python and trying to map PDF files through some custom parsing. Currently I load the PDFs with pdfs = sparkContext.binaryFiles("some_path/*.pdf"), and I mark the RDD as spillable to disk with pdfs.persist(pyspark.StorageLevel.MEMORY_AND_DISK).
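
For reference, binaryFiles gives back an RDD of (path, bytes) pairs, one element per matched file, so each PDF arrives fully materialised in memory as a single value. A quick way to see that (just an inspection snippet, not part of my job):

# Each element is (file_path, file_contents_as_bytes),
# i.e. the whole PDF is held in memory as one value.
path, data = pdfs.first()
print(path, len(data))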

Then I map the parsing operation over the RDD and save the result as a pickle file, but it fails with an out-of-memory error on the heap. Can you please help me?

Here is a simplified version of my code:

from pyspark import SparkConf, SparkContext
import pyspark

# There is some code here that sets up an args object with argparse,
# but it's not very interesting and a bit long, so I skip it.

def extractArticles(tupleData):
    url, bytesData = tupleData
    # Convert bytesData into `content`, a list of dicts
    return content

sc = SparkContext("local[*]", "Legilux PDF Analyser")

inMemoryPDFs = sc.binaryFiles(args.filePattern)
inMemoryPDFs.persist(pyspark.StorageLevel.MEMORY_AND_DISK)

pdfData = inMemoryPDFs.flatMap(extractArticles)
pdfData.persist(pyspark.StorageLevel.MEMORY_AND_DISK)
pdfData.saveAsPickleFile(args.output)
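
In case it matters, the body of extractArticles is roughly shaped like the sketch below. This is only an illustration, not my exact code: pypdf and the dict fields are stand-ins for my real parsing.

import io
from pypdf import PdfReader  # stand-in for my actual PDF parser

def extractArticles(tupleData):
    url, bytesData = tupleData
    # binaryFiles hands over the whole file as bytes; wrap it in a
    # file-like object so the PDF library can read it.
    reader = PdfReader(io.BytesIO(bytesData))
    # One dict per page here; the real code splits pages into articles.
    return [{"url": url, "page": i, "text": page.extract_text()}
            for i, page in enumerate(reader.pages)]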