zwe*_*nde 6 amazon-s3 apache-spark parquet apache-spark-sql pyspark
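I am trying to efficiently select individual partitions from a SparkSQL table (Parquet in S3). However, I see evidence of Spark opening all Parquet files in the table, not just those that pass the filter. This makes even small queries expensive for tables with large numbers of partitions.
Here is an illustrative example. I created a simple partitioned table on S3 using SparkSQL and a Hive metastore: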
import pandas

# Make some data
df = pandas.DataFrame({'pk': ['a']*5 + ['b']*5 + ['c']*5,
                       'k': ['a', 'e', 'i', 'o', 'u']*3,
                       'v': range(15)})
# Convert to a SparkSQL DataFrame (hiveContext is an existing HiveContext)
sdf = hiveContext.createDataFrame(df)
# And save it as a Parquet table partitioned by pk
sdf.write.partitionBy('pk').saveAsTable('dataset',
                                        format='parquet',
                                        path='s3a://bucket/dataset')
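To make the expectation concrete, here is a minimal pure-Python sketch of the Hive-style directory layout that `partitionBy('pk')` produces for this data, and of which directories a filter on `pk` should need to touch. It does not use Spark; the helper names `partition_dirs` and `prune` are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Same 15 rows as the pandas DataFrame above.
rows = [{'pk': 'abc'[i // 5], 'k': 'aeiou'[i % 5], 'v': i} for i in range(15)]


def partition_dirs(rows, column, root):
    """Group rows under Hive-style '<root>/<column>=<value>' directories."""
    layout = defaultdict(list)
    for row in rows:
        layout['%s/%s=%s' % (root, column, row[column])].append(row)
    return dict(layout)


def prune(dirs, column, value):
    """Keep only the directories whose partition value matches the filter."""
    suffix = '/%s=%s' % (column, value)
    return [d for d in sorted(dirs) if d.endswith(suffix)]


layout = partition_dirs(rows, 'pk', 's3a://bucket/dataset')
selected = prune(layout, 'pk', 'b')
print(sorted(layout))
print(selected)
```

With ideal pruning, only the files under the single selected directory would ever need to be opened for a `pk == 'b'` query.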
In a subsequent session, I want to select a subset of this table:
dataset = hiveContext.table('dataset')
filtered_dataset = dataset.filter(dataset.pk == 'b')
print(filtered_dataset.toPandas())
In the logs that are printed afterwards, I see that pruning is supposed to happen:
15/07/05 02:39:39 INFO DataSourceStrategy: Selected 1 partitions out of 3, pruned -200.0% partitions.
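As an aside, the "-200.0%" in that message looks odd. One reading, which is an assumption and not something verified against the Spark source here, is that the percentage was computed with the ratio inverted. The arithmetic below shows that this reading reproduces the logged number, while the intuitive formula gives roughly 66.7% for 1 partition selected out of 3. Both function names are hypothetical.

```python
def percent_pruned_as_logged(total, selected):
    """Assumed (inverted) formula that matches the logged value."""
    return (1 - float(total) / selected) * 100


def percent_pruned_expected(total, selected):
    """Share of partitions skipped, as one would intuitively expect."""
    return (1 - float(selected) / total) * 100


logged = percent_pruned_as_logged(total=3, selected=1)
expected = percent_pruned_expected(total=3, selected=1)
print(logged)
print(round(expected, 1))
```

Either way, the message still indicates that only one of the three partitions was selected.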
But then I see Parquet files being opened from all partitions:
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=a/part-r-00001.gz.parquet to seek to new offset 508
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=a/part-r-00001.gz.parquet at pos 508
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=b/part-r-00001.gz.parquet to seek to new offset 509
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=b/part-r-00001.gz.parquet at pos 509
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/_common_metadata to seek to new offset 262
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/_common_metadata at pos 262
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=c/part-r-00001.gz.parquet to seek to new offset 509
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=c/part-r-00001.gz.parquet at pos 509
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=b/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=b/part-r-00001.gz.parquet at pos 152
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=a/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=a/part-r-00001.gz.parquet at pos 151
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/_common_metadata to seek to new offset -266
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/_common_metadata at pos 4
15/07/05 02:39:39 INFO S3AFileSystem: Reopening dataset/pk=c/part-r-00001.gz.parquet to seek to new offset -365
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=c/part-r-00001.gz.parquet at pos 152
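With only three partitions, this is not a problem, but with thousands of partitions it causes a noticeable delay. Why are all these irrelevant files being opened?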
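For reference, this is roughly how the log excerpt can be tallied to confirm the observation. It is a small sketch that only parses the "Actually opening file" lines quoted above with a regular expression; the pattern and variable names are my own and assume the log format shown.

```python
import re
from collections import Counter

# "Actually opening file" lines copied from the log excerpt above.
log_lines = """\
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=a/part-r-00001.gz.parquet at pos 508
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=b/part-r-00001.gz.parquet at pos 509
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/_common_metadata at pos 262
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=c/part-r-00001.gz.parquet at pos 509
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=b/part-r-00001.gz.parquet at pos 152
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=a/part-r-00001.gz.parquet at pos 151
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/_common_metadata at pos 4
15/07/05 02:39:39 INFO S3AFileSystem: Actually opening file dataset/pk=c/part-r-00001.gz.parquet at pos 152
""".splitlines()

# Capture the partition value from paths such as dataset/pk=a/part-....parquet
pattern = re.compile(r'Actually opening file dataset/pk=(\w+)/')

opened = Counter()
for line in log_lines:
    match = pattern.search(line)
    if match:
        opened[match.group(1)] += 1

# Partitions that were opened even though the filter was pk == 'b'
unexpected = sorted(p for p in opened if p != 'b')
print(dict(opened))
print(unexpected)
```

The tally shows files from `pk=a` and `pk=c` being opened as often as the one partition that the filter actually selects.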