Linting Databricks Python code on GitHub with flake8 in a workflow

Question

Kas*_*yap 8 python github flake8 apache-spark databricks

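My Databricks Python code lives on GitHub. I set up a basic workflow to lint it with flake8. This fails because the names that are implicitly available when my script runs on Databricks (e.g. spark, sc, dbutils, getArgument) are not available when flake8 lints the code outside Databricks (on the GitHub Ubuntu VM).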

How can I lint Databricks notebooks on GitHub using flake8?

For example, these are the errors I get:

test.py:1:1: F821 undefined name 'dbutils'
test.py:3:11: F821 undefined name 'getArgument'
test.py:5:1: F821 undefined name 'dbutils'
test.py:7:11: F821 undefined name 'spark'

My notebook on GitHub:

dbutils.widgets.text("my_jdbcurl", "default my_jdbcurl")

jdbcurl = getArgument("my_jdbcurl")

dbutils.fs.ls(".")

df_node = spark.read.format("jdbc")\
  .option("driver", "org.mariadb.jdbc.Driver")\
  .option("url", jdbcurl)\
  .option("dbtable", "my_table")\
  .option("user", "my_username")\
  .option("password", "my_pswd")\
  .load()

My .github/workflows/lint.yml:

on:
  pull_request:
    branches: [ master ]

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2
    - uses: actions/setup-python@v1
      with:
        python-version: 3.8
    - run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
    - name: Lint with flake8
      run: |
        pip install flake8
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics

Answer 1

Kas*_*yap 1

TL;DR

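Don't use the built-in dbutils variable in code that needs to run both locally (IDE, unit tests, etc.) and on Databricks (production). Instead, create your own instance of the DBUtils class.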
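To see why the built-in names trip the linter: flake8 checks names statically, so it never sees what Databricks injects at runtime. The snippet below is a much-simplified sketch of such a check using only the standard library. It is not flake8's actual implementation, and `undefined_names`, `NOTEBOOK` and `FIXED` are hypothetical names used for illustration.

```python
import ast
import builtins


def undefined_names(source: str) -> list:
    """Return names that are read but never bound anywhere in the module.

    Simplified on purpose: scopes and ordering are ignored.
    """
    tree = ast.parse(source)
    known = set(dir(builtins))

    # First pass: collect every name the module binds itself.
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            known.add(node.id)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                known.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            known.add(node.name)
        elif isinstance(node, ast.arg):
            known.add(node.arg)

    # Second pass: any name that is read but not known is reported.
    missing = {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name)
        and isinstance(node.ctx, ast.Load)
        and node.id not in known
    }
    return sorted(missing)


# The notebook relies on names injected by Databricks.
NOTEBOOK = '''
dbutils.widgets.text("my_jdbcurl", "default my_jdbcurl")
jdbcurl = getArgument("my_jdbcurl")
df_node = spark.read.format("jdbc").option("url", jdbcurl).load()
'''

# The same logic with every name imported or assigned explicitly.
FIXED = '''
from pyspark.sql import SparkSession
from dbk_utils import get_dbutils

spark = SparkSession.builder.getOrCreate()
my_dbutils = get_dbutils(spark)
my_dbutils.widgets.text("my_jdbcurl", "default my_jdbcurl")
jdbcurl = my_dbutils.widgets.getArgument("my_jdbcurl")
df_node = spark.read.format("jdbc").option("url", jdbcurl).load()
'''

print(undefined_names(NOTEBOOK))
print(undefined_names(FIXED))
```

The source is only parsed, never executed, so the sketch runs without pyspark installed. Once each name is bound in the file itself, a static check has nothing left to report.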

Here is what we ended up doing:

Created a new dbk_utils.py:

from pyspark.sql import SparkSession

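def get_dbutils(spark: SparkSession):
    """Return a dbutils handle for the given Spark session."""
    try:
        # Works where pyspark ships the Databricks dbutils module.
        from pyspark.dbutils import DBUtils
        return DBUtils(spark)
    except ModuleNotFoundError:
        # Otherwise fall back to the instance in the notebook namespace.
        import IPython
        return IPython.get_ipython().user_ns["dbutils"]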

And updated the code that used dbutils to use this utility:

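from pyspark.sql import SparkSession

from dbk_utils import get_dbutils

# Get the session explicitly instead of relying on the injected `spark`.
spark = SparkSession.builder.getOrCreate()
# get_dbutils requires the Spark session as its argument.
my_dbutils = get_dbutils(spark)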

my_dbutils.widgets.text("my_jdbcurl", "default my_jdbcurl")
my_dbutils.fs.ls(".")

jdbcurl = my_dbutils.widgets.getArgument("my_jdbcurl")

df_node = spark.read.format("jdbc")\
  .option("driver", "org.mariadb.jdbc.Driver")\
  .option("url", jdbcurl)\
  .option("dbtable", "my_table")\
  .option("user", "my_username")\
  .option("password", "my_pswd")\
  .load()

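One benefit of holding dbutils in an ordinary variable is that local code can swap in a stand-in. The following is a minimal sketch of that idea, assuming the code under test only touches `widgets.text` and `widgets.getArgument`. `make_fake_dbutils` and `resolve_jdbc_url` are hypothetical helpers, not part of Databricks or of the code above.

```python
from types import SimpleNamespace


def make_fake_dbutils(widget_values=None):
    """Build a stand-in that mimics only the widget calls used above."""
    values = dict(widget_values or {})

    def text(name, default_value, label=None):
        # Like a real text widget, only set the default if no value exists yet.
        values.setdefault(name, default_value)

    def get_argument(name):
        return values[name]

    widgets = SimpleNamespace(text=text, getArgument=get_argument)
    return SimpleNamespace(widgets=widgets)


def resolve_jdbc_url(dbutils):
    """Notebook logic written against whatever dbutils object is passed in."""
    dbutils.widgets.text("my_jdbcurl", "default my_jdbcurl")
    return dbutils.widgets.getArgument("my_jdbcurl")


# With no preset value, the widget default is returned.
print(resolve_jdbc_url(make_fake_dbutils()))

# A preset value simulates a parameter passed to the notebook.
preset = {"my_jdbcurl": "jdbc:mariadb://localhost:3306/test"}
print(resolve_jdbc_url(make_fake_dbutils(preset)))
```

On Databricks the same `resolve_jdbc_url` would receive the object returned by `get_dbutils(spark)`. Locally, the fake keeps the test free of any cluster dependency.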

If you are also trying to unit test, take a look at:
