I'm using Spark from Python and trying to map PDF files through some custom parsing. Currently I load the PDFs with `pdfs = sparkContext.binaryFiles("some_path/*.pdf")` and mark the RDD as cacheable to disk with `pdfs.persist(pyspark.StorageLevel.MEMORY_AND_DISK)`.

Then I map the parsing operation over the RDD and save the result as a pickle file, but the job fails with a heap out-of-memory error. Could you please help me?
Here is a simplified version of my code:
from pyspark import SparkConf, SparkContext
import pyspark

# There is some code here that sets up an `args` object with argparse,
# but it is not very interesting and a bit long, so I skip it.

def extractArticles(tupleData):
    url, bytesData = tupleData
    # Convert bytesData into `content`, a list of dicts
    return content

sc = SparkContext("local[*]", "Legilux PDF Analyser")

inMemoryPDFs = sc.binaryFiles(args.filePattern)
inMemoryPDFs.persist(pyspark.StorageLevel.MEMORY_AND_DISK)

pdfData = inMemoryPDFs.flatMap(extractArticles)
pdfData.persist(pyspark.StorageLevel.MEMORY_AND_DISK)
pdfData.saveAsPickleFile(args.output)
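For clarity, here is a plain-Python sketch (no Spark) of what the `flatMap` step above does with the `(path, bytes)` pairs that `binaryFiles` produces. The parser body is a hypothetical stand-in for my real PDF extraction, just to show the shapes involved:

```python
def extractArticles(tupleData):
    url, bytesData = tupleData
    # Hypothetical "parse": produce one dict per line of the payload.
    # The real code runs a PDF parser over bytesData instead.
    content = [{"url": url, "text": line}
               for line in bytesData.decode("utf-8").splitlines()]
    return content

# Fake stand-ins for the (path, bytes) pairs binaryFiles would yield.
pdfs = [("some_path/a.pdf", b"article one\narticle two"),
        ("some_path/b.pdf", b"article three")]

# flatMap = apply extractArticles to each pair, then flatten the lists.
pdfData = [article for pair in pdfs for article in extractArticles(pair)]
print(len(pdfData))  # 3 articles in total
```

So each input PDF can expand into several article dicts, and they all end up in one flat RDD.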