I want to run some transformations on only a subset of an RDD (to experiment faster in the REPL).
Is that possible?
RDD has a take(num: Int): Array[T] method; I think I need something similar, but one that returns an RDD[T].
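Two approaches are commonly used for this kind of REPL experimentation: pull the first few rows back to the driver with take() and re-parallelize them, or keep things distributed with sample(). A minimal sketch (the local session, RDD contents, and sizes below are illustrative stand-ins, not from the original question):

```scala
import org.apache.spark.sql.SparkSession

// Stand-ins for the spark-shell's built-in `spark` and `sc`
// and a toy RDD (in the shell these already exist).
val spark = SparkSession.builder().master("local[1]").appName("subset-demo").getOrCreate()
val sc = spark.sparkContext
val rdd = sc.parallelize(1 to 1000)

// Option 1: take() collects the first n rows to the driver as an Array[T];
// parallelize() turns that array back into an RDD[T].
val first10 = sc.parallelize(rdd.take(10))

// Option 2: sample() stays distributed; the fraction is approximate,
// so the resulting count is not exact.
val sampled = rdd.sample(withReplacement = false, fraction = 0.01, seed = 42L)

println(first10.count()) // 10
spark.stop()
```

Option 1 gives an exact, deterministic subset but routes the data through the driver; option 2 never leaves the cluster, which matters once the RDD is large.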
When querying Hive through spark-shell 2.0:
spark.sql("SELECT * FROM schemaname.tablename")
it throws an error:
16/08/13 09:24:17 INFO execution.SparkSqlParser: Parsing command: SELECT * FROM schemaname.tablename
org.apache.spark.sql.AnalysisException: Table or view not found: `schemaname`.`tablename`; line 1 pos 14
...
Hive access appears to be configured correctly via hive-site.xml. In the shell, Spark prints:
scala> spark.conf.get("spark.sql.warehouse.dir")
res5: String = /user/hive/warehouse
Hive is configured in conf/hive-site.xml, and Spark can see that configuration. Listing the databases shows the existing default database, but no tables appear inside default:
scala> spark.catalog.listDatabases.show(false)
+-------+----------------+-----------------------------------------------+
|name   |description     |locationUri                                    |
+-------+----------------+-----------------------------------------------+
|default|default database|hdfs://hdfs-server-uri:8020/user/hive/warehouse|
+-------+----------------+-----------------------------------------------+
scala> spark.catalog.listTables("default").show()
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
+----+--------+-----------+---------+-----------+
What might be missing for accessing Hive?
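A frequent cause of this symptom in Spark 2.0 is that the session is backed by Spark's built-in in-memory catalog rather than the Hive metastore, in which case every Hive table is invisible even though the warehouse path looks right. A diagnostic sketch (this is an assumption about the setup, not a confirmed diagnosis; the app name is an example):

```scala
import org.apache.spark.sql.SparkSession

// When building a session yourself, Hive support must be requested
// explicitly; spark-shell only wires it in when hive-site.xml is on
// the classpath ($SPARK_HOME/conf) and Spark was built with Hive.
val spark = SparkSession.builder()
  .appName("hive-check")
  .enableHiveSupport() // throws if Spark was built without Hive classes
  .getOrCreate()

// "hive" means the Hive metastore is wired in; "in-memory" means
// hive-site.xml was not picked up, so schemaname.tablename can
// never resolve regardless of spark.sql.warehouse.dir.
println(spark.conf.get("spark.sql.catalogImplementation"))
```

If this prints in-memory, check that hive-site.xml actually sits in $SPARK_HOME/conf of the host running the shell (not only on the Hive server) and that the metastore URI in it is reachable.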
I am using maven-deploy-plugin in a multi-module project with the deployAtEnd property set to true.
After running mvn deploy on the root project, the deploy plugin is executed for each subproject; I can see output like this:
[INFO] --- maven-deploy-plugin:2.8.2:deploy (default-deploy) @ subproject-name ---
[INFO] Deploying package:subproject-name:v1.1 at end
The final invocation is for the root project:
[INFO] --- maven-deploy-plugin:2.8.2:deploy (default-deploy) @ parent-project ---
[INFO] Deploying package:parent-project:v1.1 at end
and that is all; no actual deployment is performed.
How can I make the deploy plugin work correctly with deployAtEnd=true in a multi-module project?
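One configuration pattern worth checking is to declare the plugin, with the flag, exactly once in the parent pom's pluginManagement, so that every module resolves the same plugin version and configuration; deployAtEnd is known to silently fall back to per-module behavior when modules end up with differing plugin definitions. A sketch, with the version shown here taken from the log output above and everything else illustrative:

```xml
<!-- In the parent (root) pom.xml -->
<build>
  <pluginManagement>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-deploy-plugin</artifactId>
        <version>2.8.2</version>
        <configuration>
          <!-- Defer all artifact uploads until the last module
               in the reactor has built successfully. -->
          <deployAtEnd>true</deployAtEnd>
        </configuration>
      </plugin>
    </plugins>
  </pluginManagement>
</build>
```

Child modules should then not redeclare the plugin version or the flag at all; they inherit both from pluginManagement.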