I am trying to understand Spark physical plans, but I don't understand some parts because they look different from those of a traditional RDBMS. For example, the plan below is for a query against a Hive table. The query is this:
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem
where
l_shipdate <= '1998-09-16'
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus;
== Physical Plan ==
Sort [l_returnflag#35 ASC,l_linestatus#36 ASC], true, 0
+- ConvertToUnsafe
   +- Exchange rangepartitioning(l_returnflag#35 ASC,l_linestatus#36 ASC,200), None
      +- ConvertToSafe
         +- TungstenAggregate(key=[l_returnflag#35,l_linestatus#36], functions=[(sum(l_quantity#31),mode=Final,isDistinct=false),(sum(l_extendedprice#32),mode=Final,isDistinct=false),(sum((l_extendedprice#32 * (1.0 - …
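For context, the plan reads bottom-up: `TungstenAggregate` computes the grouped aggregates, `Exchange rangepartitioning` redistributes rows by the sort keys, and `Sort` produces the final order. A minimal sketch of how such a plan can be printed, assuming Spark 1.6 with an existing `SparkContext` named `sc`:

```scala
// Sketch, assuming Spark 1.6 and an existing SparkContext `sc`.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// explain(true) prints the parsed, analyzed and optimized logical plans
// followed by the physical plan, like the one shown above.
hiveContext.sql("""
  select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty
  from lineitem
  where l_shipdate <= '1998-09-16'
  group by l_returnflag, l_linestatus
  order by l_returnflag, l_linestatus
""").explain(true)
```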
When I start Spark, I get the following warnings:
Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.8.0_77)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
16/04/03 15:07:31 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/03 15:07:31 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/03 15:07:39 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0 …
I installed this Spark version: spark-1.6.1-bin-hadoop2.6.tgz.
Now, when I start Spark with the ./spark-shell command, I get this problem (it prints many error lines, so I only include the ones that look important):
Cleanup action completed
16/03/27 00:19:35 ERROR Schema: Failed initialising database.
Failed to create database 'metastore_db', see the next exception for details.
org.datanucleus.exceptions.NucleusDataStoreException: Failed to create database 'metastore_db', see the next exception for details.
at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516)
Caused by: java.sql.SQLException: Directory /usr/local/spark-1.6.1-bin-hadoop2.6/bin/metastore_db cannot be created.
org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
... 128 more
Caused by: ERROR XBM0H: Directory /usr/local/spark-1.6.1-bin-hadoop2.6/bin/metastore_db cannot be created.
Nested Throwables StackTrace:
java.sql.SQLException: Failed to create database 'metastore_db', see the next exception for details.
org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
... 128 …
I execute this query with Spark using HiveQL:
var hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
result = hiveContext.sql("select linestatus, sum(quantity) as sum_qty,count(*) as count_order from lineitem
where shipdate <= '1990-09-16' group by linestatus order by
linestatus")
But I get this error:
<console>:1: error: unclosed character literal
where shipdate <= '1990-09-16' group by linestatus order by
Do you know why?
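A likely cause, for reference: the Scala REPL parses input line by line, and a double-quoted string literal cannot span multiple lines; once the string is broken, `'1990-09-16'` is read as a (malformed) character literal, which matches the error message. A sketch of a fix using a triple-quoted string, which may span lines:

```scala
// Triple-quoted strings may span multiple lines, so the query
// can keep its formatting inside the REPL:
val result = hiveContext.sql("""
  select linestatus, sum(quantity) as sum_qty, count(*) as count_order
  from lineitem
  where shipdate <= '1990-09-16'
  group by linestatus
  order by linestatus
""")
```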
I am trying to create a Hive ORC table from a file stored in HDFS.
I have a "partsupp.tbl" file in which each line has the following format:
1|25002|8076|993.49|ven ideas. quickly even packages print. pending multipliers must have to are fluff|
I created a Hive table like this:
create table if not exists partsupp (PS_PARTKEY BIGINT,
PS_SUPPKEY BIGINT,
PS_AVAILQTY INT,
PS_SUPPLYCOST DOUBLE,
PS_COMMENT STRING)
STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY")
;
Now I am trying to load the data from the .tbl file into the table, like this:
LOAD DATA LOCAL INPATH '/tables/partsupp/partsupp.tbl' INTO TABLE partsupp
But I run into this problem:
No files matching path file:/tables/partsupp/partsupp.tbl
But the file does exist in HDFS...
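Two things are likely going on here. `LOAD DATA LOCAL INPATH` resolves the path on the local filesystem, so for a file that lives in HDFS the `LOCAL` keyword must be dropped. Also, `LOAD DATA` moves files as-is without converting formats, so a '|'-delimited text file cannot be loaded directly into an ORC table; it is usually staged in a text table first and then rewritten. A sketch under those assumptions, using the Spark `HiveContext` from earlier (the staging table name `partsupp_text` is hypothetical):

```scala
// Sketch, assuming an existing HiveContext `hiveContext` and that
// /tables/partsupp/partsupp.tbl lives in HDFS. `partsupp_text` is a
// hypothetical staging table name.

// Stage the '|'-delimited file in a plain text table first, because
// LOAD DATA does not convert the file to ORC.
hiveContext.sql("""
  create table if not exists partsupp_text (
    PS_PARTKEY BIGINT, PS_SUPPKEY BIGINT, PS_AVAILQTY INT,
    PS_SUPPLYCOST DOUBLE, PS_COMMENT STRING)
  row format delimited fields terminated by '|'
  stored as textfile
""")

// No LOCAL keyword: the path is resolved in HDFS, not on local disk.
hiveContext.sql(
  "LOAD DATA INPATH '/tables/partsupp/partsupp.tbl' INTO TABLE partsupp_text")

// Rewrite the staged rows into the ORC table.
hiveContext.sql("insert into table partsupp select * from partsupp_text")
```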