How to read a Spark table again in a new Spark session?

okw*_*wap 5 python apache-spark apache-spark-sql pyspark

I can read the table right after creating it, but how can I read it again in another Spark session?

Given this code:

```
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .getOrCreate()

df = spark.read.parquet("examples/src/main/resources/users.parquet")
(df
 .write
 .saveAsTable("people_partitioned_bucketed"))

# retrieve rows from the table, as expected
spark.sql("select * from people_partitioned_bucketed").show()

spark.stop()

# open spark session again
spark = SparkSession \
    .builder \
    .getOrCreate()

# this time the table does not exist
spark.sql("select * from people_partitioned_bucketed").show()

```

Execution result:

```
+------+----------------+--------------+
|  name|favorite_numbers|favorite_color|
+------+----------------+--------------+
|Alyssa|  [3, 9, 15, 20]|          null|
|   Ben|              []|           red|
+------+----------------+--------------+

Traceback (most recent call last):
  File "/home//workspace/spark/examples/src/main/python/sql/datasource.py", line 246, in <module>
    spark.sql("select * from people_partitioned_bucketed").show()
  File "/home//virtualenvs/spark/local/lib/python2.7/site-packages/pyspark/sql/session.py", line 603, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/home//virtualenvs/spark/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home//virtualenvs/spark/local/lib/python2.7/site-packages/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Table or view not found: people_partitioned_bucketed; line 1 pos 14'
```

Sha*_*ica 1

Looking at the documentation:

For file-based data sources, e.g. text, parquet, json, etc. you can specify a custom table path via the path option, e.g. df.write.option("path", "/some/path").saveAsTable("t"). When the table is dropped, the custom table path will not be removed and the table data is still there. If no custom table path is specified, Spark will write data to a default table path under the warehouse directory. When the table is dropped, the default table path will be removed too.

That is, you need to specify a path via the path option when saving the table, as shown in the sketch below. If no path is specified, the table will be deleted when the Spark session is closed.
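
A minimal sketch of that suggestion, assuming a writable illustrative path /tmp/people_table (both the path and the reuse of the question's example parquet file are assumptions, not taken from the answer):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("examples/src/main/resources/users.parquet")

# Write the table data to an explicit, external path instead of the
# default warehouse location; /tmp/people_table is illustrative.
(df
 .write
 .option("path", "/tmp/people_table")
 .saveAsTable("people_partitioned_bucketed"))

spark.stop()

# In a fresh session the files at /tmp/people_table still exist, so the
# data can be read back directly as a file-based source even if the
# table name is not registered in the new session's catalog.
spark = SparkSession.builder.getOrCreate()
spark.read.parquet("/tmp/people_table").show()
```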

  • That is not actually the case: _"When the table is dropped, the default table path will be removed too."_ I did not drop the table, and I confirmed that the table data was still in the `spark-warehouse` directory after the script finished. (2 upvotes)
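
To illustrate the comment's observation, the leftover files can be read back directly from the warehouse directory. This is a sketch, assuming Spark's default `spark-warehouse` location relative to the working directory:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The table name is gone from the new session's catalog, but the files
# written by saveAsTable are still on disk under the default warehouse
# directory, so they can be read as plain parquet.
spark.read.parquet("spark-warehouse/people_partitioned_bucketed").show()
```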