PySpark: read a CSV at a URL into a DataFrame without writing it to disk

Rob*_*inL 8 csv apache-spark pyspark

How can I read a CSV at a URL into a DataFrame in PySpark without writing it to disk?

I have tried the following, but with no luck:

import urllib.request
from io import StringIO

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"
response = urllib.request.urlopen(url)
data = response.read()      
text = data.decode('utf-8')  


f = StringIO(text)

df1 = sqlContext.read.csv(f, header=True, schema=customSchema)
df1.show()

hi-*_*zir 5

TL;DR This is not possible, and in general funneling data through the driver is a dead end.

If the file is small, I would just use SparkFiles:

from pyspark import SparkFiles

spark.sparkContext.addFile(url)

spark.read.csv(SparkFiles.get("iris.csv"), header=True)