The question is simple:
master_dim.py calls dim_1.py and dim_2.py and runs them in parallel. Is this possible in Databricks PySpark?
The diagram below explains what I want to do; for some reason it errors out. Am I missing something here?
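For reference, one common pattern (a minimal sketch, not the poster's code) is to launch the child notebooks from the driver notebook with dbutils.notebook.run inside a thread pool, which runs them concurrently. It assumes dim_1 and dim_2 are notebooks in the same folder as master_dim:

from concurrent.futures import ThreadPoolExecutor

def run_notebook(path, timeout_seconds=3600, params=None):
    # dbutils.notebook.run blocks until the child notebook finishes,
    # so each call gets its own thread. dbutils is injected into
    # Databricks notebooks; no import is needed there.
    return dbutils.notebook.run(path, timeout_seconds, params or {})

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(run_notebook, nb) for nb in ["dim_1", "dim_2"]]
    results = [f.result() for f in futures]  # re-raises any child failure

print(results)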
amazon-web-services databricks azure-databricks aws-databricks databricks-community-edition
I managed to download a dataset from Kaggle using the Kaggle API. The data is stored under the /databricks/driver directory.
%sh pip install kaggle
%sh
export KAGGLE_USERNAME=my_name
export KAGGLE_KEY=my_key
kaggle competitions download -c ncaaw-march-mania-2021
%sh unzip ncaaw-march-mania-2021.zip
The question is: how do I use these files from DBFS? Here is how I read the data, and the error I get when trying to read the CSV file with PySpark:
spark.read.csv('/databricks/driver/WDataFiles_Stage1/Cities.csv')
AnalysisException: Path does not exist: dbfs:/databricks/driver/WDataFiles_Stage1/Cities.csv
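One likely cause (a hedged sketch, not from the original post): the %sh cell downloads to the driver's local disk, while spark.read resolves bare paths against dbfs:/. Either copy the files into DBFS first, or point Spark at the local disk with the file:/ scheme (the dbfs:/tmp target below is an arbitrary choice):

# Copy from the driver's local filesystem into DBFS, then read from DBFS
dbutils.fs.cp("file:/databricks/driver/WDataFiles_Stage1/Cities.csv",
              "dbfs:/tmp/WDataFiles_Stage1/Cities.csv")
df = spark.read.csv("dbfs:/tmp/WDataFiles_Stage1/Cities.csv", header=True)

# Or read straight off the driver's disk; this is fine on a single-node
# (e.g. Community Edition) cluster, where driver and executors share a machine
df_local = spark.read.csv("file:/databricks/driver/WDataFiles_Stage1/Cities.csv", header=True)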
I want to do a few small practice projects, and I'd like to use a Databricks cluster for them. Can this be done? I'm hoping there is some way to connect to a Databricks cluster through the databricks-connect utility; I just need the steps. Thanks in advance.
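As a rough sketch (not an authoritative guide): the classic databricks-connect flow is pip install databricks-connect, pinned to your cluster's runtime version, then databricks-connect configure and databricks-connect test from a terminal, after which a plain SparkSession runs against the remote cluster. Note that databricks-connect is generally not available on Community Edition clusters.

from pyspark.sql import SparkSession

# With databricks-connect installed and configured locally, getOrCreate()
# returns a session whose jobs execute on the remote Databricks cluster
spark = SparkSession.builder.getOrCreate()
print(spark.range(5).collect())  # executes remotely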
apache-spark pyspark databricks databricks-connect databricks-community-edition
Trying to read a Delta log file on a Databricks Community Edition cluster (Databricks Runtime 7.2):
df=spark.range(100).toDF("id")
df.show()
df.repartition(1).write.mode("append").format("delta").save("/user/delta_test")
with open('/user/delta_test/_delta_log/00000000000000000000.json','r')  as f:
  for l in f:
    print(l)
Getting file not found error:
FileNotFoundError: [Errno 2] No such file or directory: '/user/delta_test/_delta_log/00000000000000000000.json'
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<command-1759925981994211> in <module>
----> 1 with open('/user/delta_test/_delta_log/00000000000000000000.json','r')  as f:
      2   for l in f:
      3     print(l)
FileNotFoundError: [Errno 2] No such file or directory: '/user/delta_test/_delta_log/00000000000000000000.json'
I tried adding the /dbfs/ and dbfs:/ prefixes, but neither worked; I still get the same error.
with open('/dbfs/user/delta_test/_delta_log/00000000000000000000.json','r')  as f:
  for l in f:
    print(l)
But with dbutils.fs.head I was able to read the file. …
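A minimal sketch of that working approach: dbutils.fs.head returns the first bytes of a DBFS file as a string, so it can show the Delta log without any local path.

# Read up to 64 KB of the JSON log entry straight from DBFS
log_text = dbutils.fs.head("/user/delta_test/_delta_log/00000000000000000000.json", 65536)
print(log_text)

The open() attempts fail because Python's open() only sees the driver's local filesystem, and the /dbfs/ FUSE mount that normally bridges it to DBFS is not available on Community Edition clusters; there, dbutils.fs (or copying the file to the driver with dbutils.fs.cp) is the workaround.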
apache-spark pyspark databricks dbutils databricks-community-edition