Viv*_*Viv 10 python amazon-s3 apache-spark pyspark
I'm new to Spark and can't figure this out. I have a lot of Parquet files uploaded to an S3 location:
s3://a-dps/d-l/sco/alpha/20160930/parquet/
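The total size of this folder is 20+ GB. How can I load all of these files into a dataframe, reading them in chunks if necessary?
The memory allocated to the Spark cluster is 6 GB.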
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark import SparkConf
from pyspark.sql import SparkSession
import pandas
# SparkConf().set("spark.jars.packages","org.apache.hadoop:hadoop-aws:3.0.0-alpha3")
sc = SparkContext.getOrCreate()
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", 'A')
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", 's')
sqlContext = SQLContext(sc)
df2 = sqlContext.read.parquet("s3://sm/data/scor/alpha/2016/parquet/*")
Error:
Py4JJavaError: An error occurred while calling o33.parquet.
: java.io.IOException: No FileSystem for scheme: s3
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:372)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
eli*_*sah 15
The file scheme you are using (s3) is not correct. You need to use the s3n scheme, or s3a (for larger S3 objects):
// use sqlContext instead for spark <2
val df = spark.read
.load("s3n://bucket-name/object-path")
I suggest you read more about the Hadoop-AWS module: Integration with Amazon Web Services Overview.
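Since the question uses PySpark, here is a rough Python sketch of the same idea with the s3a scheme. It is only an illustration under some assumptions: the helper names (`to_s3a`, `s3a_options`, `read_parquet_from_s3`) are made up for this example, the credentials are placeholders, and it presumes a hadoop-aws package matching your Hadoop version is already on the Spark classpath. The `fs.s3a.access.key` and `fs.s3a.secret.key` properties are the s3a credential settings, passed here through Spark's `spark.hadoop.` prefix.

```python
def to_s3a(uri):
    """Rewrite an s3:// or s3n:// URI to s3a://; other URIs are returned unchanged."""
    for prefix in ("s3://", "s3n://"):
        if uri.startswith(prefix):
            return "s3a://" + uri[len(prefix):]
    return uri


def s3a_options(access_key, secret_key):
    """Build Spark config entries that forward s3a credentials to Hadoop."""
    return {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def read_parquet_from_s3(uri, access_key, secret_key):
    """Hypothetical helper: read Parquet from S3 via s3a (needs pyspark + hadoop-aws)."""
    # Imported lazily so the pure helpers above work without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("read-parquet-s3a")
    for key, value in s3a_options(access_key, secret_key).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    # The read is lazy: no data is pulled until an action runs on the DataFrame.
    return spark.read.parquet(to_s3a(uri))
```

With this in place, something like `read_parquet_from_s3("s3://a-dps/d-l/sco/alpha/20160930/parquet/", "A", "s")` would read from the `s3a://` form of the path from the question, provided the cluster can actually resolve the s3a filesystem.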
Art*_*iev 12
Since Spark 2.0, you should use SparkSession instead of sqlContext:
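from pyspark.sql import SparkSession
# Parentheses let the builder chain span several lines; Python spells the boolean True.
spark = (SparkSession.builder
    .master("local")
    .appName("app name")
    .config("spark.some.config.option", True)
    .getOrCreate())
df = spark.read.parquet("s3://path/to/parquet/file.parquet")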