Aru*_*thy 5 hive hiveql apache-spark-sql
According to the Spark docs at https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#supported-hive-features, the Hive statement CLUSTER BY is supported. However, when I try to create a table from beeline with the query below
CREATE TABLE set_bucketing_test (key INT, value STRING) CLUSTERED BY (key) INTO 10 BUCKETS;
I get the following error
Error: org.apache.spark.sql.catalyst.parser.ParseException:
Operation not allowed: CREATE TABLE ... CLUSTERED BY(line 1, pos 0)
I'm not sure what I'm doing wrong. Any help?
小智 0
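You can use the cluster-by (bucketing) feature in spark-sql for table creation, table joins, and so on. It behaves like Hive's bucketing and avoids data exchange and sort in Spark 2.1+.
See https://issues.apache.org/jira/browse/SPARK-15453
Hive currently does not recognize this feature because the metadata is incompatible between Spark and Hive. That is why you cannot use the same syntax: even when the table is visible on the Hive side, Hive treats all columns as array.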
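To make the "avoids data exchange" point more concrete, here is a minimal plain-Scala sketch (no Spark involved) of the idea behind a bucketed join. It assumes both tables were written with the same bucket count and the same hash function. `BucketJoinSketch` and its methods are hypothetical names for illustration, and `hashCode` is only a stand-in: Spark's own bucketing uses a Murmur3-based hash.

```scala
object BucketJoinSketch {
  // Stand-in for the bucket hash (assumption: Spark uses Murmur3, not hashCode).
  def bucketId(key: String, numBuckets: Int): Int = {
    val m = key.hashCode % numBuckets
    if (m < 0) m + numBuckets else m // keep the result in [0, numBuckets)
  }

  // Group rows by the bucket their key falls into, as a bucketed write would.
  def bucketize[V](rows: Seq[(String, V)], numBuckets: Int): Map[Int, Seq[(String, V)]] =
    rows.groupBy { case (key, _) => bucketId(key, numBuckets) }

  // Join bucket i of the left table only with bucket i of the right table.
  // Equal keys always hash to the same bucket, so no row has to move.
  def bucketJoin[A, B](
      left: Map[Int, Seq[(String, A)]],
      right: Map[Int, Seq[(String, B)]]): Seq[(String, A, B)] =
    left.toSeq.flatMap { case (id, leftRows) =>
      val rightRows = right.getOrElse(id, Seq.empty)
      for {
        (lk, lv) <- leftRows
        (rk, rv) <- rightRows
        if lk == rk
      } yield (lk, lv, rv)
    }
}
```

Calling `bucketJoin(bucketize(a, 4), bucketize(b, 4))` on two small sequences gives the same rows as an ordinary equi-join on the key, which is why the planner can skip the shuffle when both sides are bucketed the same way.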
The following example may give you some ideas:
val df = (0 until 80000).map(i => (i, i.toString, i.toString)).toDF("item_id", "country", "state").coalesce(1)
Scroll right in the log output below and you will see "which is NOT compatible with Hive."
df.write.bucketBy(100, "country", "state").sortBy("country", "state").saveAsTable("kofeng.lstg_bucket_test")
17/03/13 15:12:01 WARN HiveExternalCatalog: Persisting bucketed data source table `kofeng`.`lstg_bucket_test` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
df.write.bucketBy(100, "country", "state").sortBy("country", "state").saveAsTable("kofeng.lstg_bucket_test2")
Because the data volume is small, disable broadcast join first.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.sql.autoBroadcastJoinThreshold", "0").getOrCreate()
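The plan avoids both exchange and sort in Spark 2.1.0, and avoids exchange in Spark 2.0. Only filter and scan steps remain, which shows that data locality is being used.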
val query = """
|SELECT *
|FROM
| kofeng.lstg_bucket_test a
|JOIN
| kofeng.lstg_bucket_test2 b
|ON a.country=b.country AND
| a.state=b.state
""".stripMargin
val joinDF = sql(query)
scala> joinDF.queryExecution.executedPlan
res10: org.apache.spark.sql.execution.SparkPlan =
*SortMergeJoin [country#71, state#72], [country#74, state#75], Inner
:- *Project [item_id#70, country#71, state#72]
:  +- *Filter (isnotnull(country#71) && isnotnull(state#72))
:     +- *FileScan parquet kofeng.lstg_bucket_test[item_id#70,country#71,state#72] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://ares-lvs-nn-ha/user/hive/warehouse/kofeng.db/lstg_bucket_test], PartitionFilters: [], PushedFilters: [IsNotNull(country), IsNotNull(state)], ReadSchema: struct<item_id:int,country:int,state:string>
+- *Project [item_id#73, country#74, state#75]
   +- *Filter (isnotnull(country#74) && isnotnull(state#75))
      +- *FileScan parquet kofeng.lstg_bucket_test2[item_id#73,country#74,state#75] Batched: true, Format: Parquet...
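If you would rather not read the plan by eye, a small helper along these lines can check the printed plan text for shuffle or sort operators. This is only a sketch: `PlanCheck` is a hypothetical name, and it assumes shuffles show up as an `Exchange` operator and sorts as a `Sort` operator in the plan string, as they do in Spark 2.x output. It does not flag `BroadcastExchange`.

```scala
object PlanCheck {
  // Tree-drawing prefix (":", "+", "-", spaces), then an optional codegen
  // marker ("*" or "*(1)"), then the operator name as the first word.
  private val OperatorLine = """^[\s:+\-]*(?:\*(?:\(\d+\))?\s*)?(\w+).*$""".r

  // Extract the operator name from every line of a printed physical plan.
  def operators(planText: String): Seq[String] =
    planText.split("\\r?\\n").toSeq.collect { case OperatorLine(name) => name }

  // True if the plan contains a shuffle (Exchange) or an explicit Sort.
  // SortMergeJoin is a different operator name and is not counted.
  def hasShuffleOrSort(planText: String): Boolean =
    operators(planText).exists(op => op == "Exchange" || op == "Sort")
}
```

With the join above you could call `PlanCheck.hasShuffleOrSort(joinDF.queryExecution.executedPlan.toString)`. For the plan shown here it should return false, since the only operators are SortMergeJoin, Project, Filter and FileScan.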