AK9*_*K91 1 python apache-spark pyspark
Is there a way to read a .tsv.gz file from a URL in PySpark?
from pyspark.sql import SparkSession
def create_spark_session():
    return SparkSession.builder.appName("wikipediaClickstream").getOrCreate()
spark = create_spark_session()
url = "https://dumps.wikimedia.org/other/clickstream/2017-11/clickstream-jawiki-2017-11.tsv.gz"
# df = spark.read.csv(url, sep="\t") # doesn't work
df = spark.read.option("sep", "\t").csv(url) # doesn't work either
df.show(10)
This produces the following error:
Py4JJavaError: An error occurred while calling o65.csv.
: java.lang.UnsupportedOperationException
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
/var/folders/sn/4dk4tbz9735crf4npgcnlt8r0000gn/T/ipykernel_1443/4137722240.py in <module>
1 url = "https://dumps.wikimedia.org/other/clickstream/2017-11/clickstream-jawiki-2017-11.tsv.gz"
2 # df = spark.read.csv(url, sep="\t")
----> 3 df = spark.read.option("sep", "\t").csv(url)
4 df.show(10)
spark.version is 3.1.2.
You can use SparkContext.addFile to download the file to every node before reading it, as follows:
from pyspark import SparkFiles
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").getOrCreate()
url = "https://dumps.wikimedia.org/other/clickstream/2017-11/clickstream-jawiki-2017-11.tsv.gz"
spark.sparkContext.addFile(url)
df = spark.read.option("sep", "\t").csv("file://" + SparkFiles.get("clickstream-jawiki-2017-11.tsv.gz"))
df.show(10)
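The name passed to SparkFiles.get has to match the last path segment of the URL given to addFile. As a minimal sketch, the name can be derived from the URL instead of being hard-coded. The helpers `filename_from_url` and `read_tsv_from_url` below are hypothetical names introduced for illustration, not part of PySpark:

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path segment of the URL (hypothetical helper)."""
    return posixpath.basename(urlparse(url).path)


def read_tsv_from_url(spark, url):
    """Fetch `url` with SparkContext.addFile, then read it as a tab-separated DataFrame."""
    # Imported lazily so filename_from_url can be used without PySpark installed.
    from pyspark import SparkFiles

    spark.sparkContext.addFile(url)
    local_path = SparkFiles.get(filename_from_url(url))
    return spark.read.option("sep", "\t").csv("file://" + local_path)


url = "https://dumps.wikimedia.org/other/clickstream/2017-11/clickstream-jawiki-2017-11.tsv.gz"
print(filename_from_url(url))  # clickstream-jawiki-2017-11.tsv.gz
```

With a running session, `df = read_tsv_from_url(spark, url)` followed by `df.show(10)` mirrors the snippet above. The path returned by SparkFiles.get is local to the process that calls it, so on a multi-node cluster a "file://" path resolved on the driver may not exist on the executors; treat this as a sketch that is most likely to behave as shown in local mode.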