Apache Airflow: DAG task marked as zombie while its background process is still running on a remote server

bku*_*rac 8 airflow

**Apache Airflow version:** 1.10.9-composer

Kubernetes version: Client Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.12-gke.6002", GitCommit:"035184604aff4de66f7db7fddadb8e7be76b6717", GitTreeState:"clean", BuildDate:"2020-12-01T23:13:35Z", GoVersion:"go1.12.17b4", Compiler:"gc", Platform:"linux/amd64"}

Environment: Airflow, running on top of Kubernetes - Linux version 4.19.112

  • OS: Linux version 4.19.112+ (builder@7fc5cdead624) (Chromium OS 9.0_pre361749_p20190714-r4 clang version 9.0.0 (/var/cache/chromeos-cache/distfiles/host/egit-src/llvm-project c11de5eada2decd0a495ea02676b6f483 8cd54fb) (based on LLVM 9.0.0svn)) #1 SMP Fri Sep 4 12:00:04 PDT 2020
  • Kernel: Linux gke-europe-west2-asset-c-default-pool-dc35e2f2-0vgz 4.19.112+ #1 SMP Fri Sep 4 12:00:04 PDT 2020 x86_64 Intel(R) Xeon(R) CPU @ 2.20GHz GenuineIntel GNU/Linux

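What happened? A running task is marked as "zombie" once its execution time exceeds the latest heartbeat + 5 minutes. The task runs in the background on another application server and is triggered with SSHOperator.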

    [2021-01-18 11:53:37,491] {taskinstance.py:888} INFO - Executing <Task(SSHOperator): load_trds_option_composite_file> on 2021-01-17T11:40:00+00:00
    [2021-01-18 11:53:37,495] {base_task_runner.py:131} INFO - Running on host: airflow-worker-6f6fd78665-lm98m
    [2021-01-18 11:53:37,495] {base_task_runner.py:132} INFO - Running: ['airflow', 'run', 'dsp_etrade_process_trds_option_composite_0530', 'load_trds_option_composite_file', '2021-01-17T11:40:00+00:00', '--job_id', '282759', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/dsp_etrade_trds_option_composite_0530.py', '--cfg_path', '/tmp/tmpge4_nva0']

Task execution time:

dag_id      dsp_etrade_process_trds_option_composite_0530
duration    7270.47
start_date  2021-01-18 11:53:37,491
end_date    2021-01-18 13:54:47.799728+00:00
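
As a quick sanity check on these figures, the elapsed time between `start_date` and `end_date` can be recomputed. This is only a sketch: it assumes `start_date` is in UTC like `end_date`, and the variable names are illustrative. The result is close to the reported `duration` and is well short of the roughly 2 hours 30 minutes the backend process needs.

```python
from datetime import datetime, timezone

# Values copied from the task details above.
# Assumption: start_date is UTC, like end_date.
start = datetime.strptime(
    "2021-01-18 11:53:37,491", "%Y-%m-%d %H:%M:%S,%f"
).replace(tzinfo=timezone.utc)
end = datetime.fromisoformat("2021-01-18 13:54:47.799728+00:00")

elapsed = (end - start).total_seconds()
expected = 2.5 * 3600  # backend run time of about 2 h 30 min, in seconds

print(f"elapsed:  {elapsed:.2f} s ({elapsed / 3600:.2f} h)")
print(f"expected: {expected:.0f} s, shortfall: {expected - elapsed:.0f} s")
```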

Scheduler logs for that period:

[2021-01-18 13:54:54,432] {taskinstance.py:1135} ERROR - <TaskInstance: dsp_etrade_process_etrd.push_run_date 2021-01-18 13:30:00+00:00 [running]> detected as zombie
{
textPayload: "[2021-01-18 13:54:54,432] {taskinstance.py:1135} ERROR - <TaskInstance: dsp_etrade_process_etrd.push_run_date 2021-01-18 13:30:00+00:00 [running]> detected as zombie"
insertId: "1ca8zyfg3zvma66"
resource: {
type: "cloud_composer_environment"
labels: {3}
}
timestamp: "2021-01-18T13:54:54.432862699Z"
severity: "ERROR"
logName: "projects/asset-control-composer-prod/logs/airflow-scheduler"
receiveTimestamp: "2021-01-18T13:54:55.714437665Z"
}

Airflow webserver logs:

X.X.X.X - - [18/Jan/2021:13:54:39 +0000] "GET /_ah/health HTTP/1.1" 200 187 "-" "GoogleHC/1.0"
{
textPayload: "172.17.0.5 - - [18/Jan/2021:13:54:39 +0000] "GET /_ah/health HTTP/1.1" 200 187 "-" "GoogleHC/1.0"
"
insertId: "1sne0gqg43o95n3"
resource: {2}
timestamp: "2021-01-18T13:54:45.401670481Z"
logName: "projects/asset-control-composer-prod/logs/airflow-webserver"
receiveTimestamp: "2021-01-18T13:54:50.598807514Z"
}

Airflow info logs:

2021-01-18 08:54:47.799 EST
{
textPayload: "NoneType: None
"
insertId: "1ne3hqgg47yzrpf"
resource: {2}
timestamp: "2021-01-18T13:54:47.799661030Z"
severity: "INFO"
logName: "projects/asset-control-composer-prod/logs/airflow-scheduler"
receiveTimestamp: "2021-01-18T13:54:50.914461159Z"
}

[2021-01-18 13:54:47,800] {taskinstance.py:1192} INFO - Marking task as FAILED.dag_id=dsp_etrade_process_trds_option_composite_0530, task_id=load_trds_option_composite_file, execution_date=20210117T114000, start_date=20210118T115337, end_date=20210118T135447
{
textPayload: "[2021-01-18 13:54:47,800] {taskinstance.py:1192} INFO - Marking task as FAILED.dag_id=dsp_etrade_process_trds_option_composite_0530, task_id=load_trds_option_composite_file, execution_date=20210117T114000, start_date=20210118T115337, end_date=20210118T135447"
insertId: "1ne3hqgg47yzrpg"
resource: {2}
timestamp: "2021-01-18T13:54:47.800605248Z"
severity: "INFO"
logName: "projects/asset-control-composer-prod/logs/airflow-scheduler"
receiveTimestamp: "2021-01-18T13:54:50.914461159Z"
}

The Airflow database shows the latest heartbeat as:

select state, latest_heartbeat from job where id=282759
--------------------------------------
state   | latest_heartbeat
running | 2021-01-18 13:48:41.891934

Airflow configuration:

celery
worker_concurrency=6

scheduler
scheduler_health_check_threshold=60
scheduler_zombie_task_threshold=300
max_threads=2

core
dag_concurrency=6

Kubernetes Cluster :
Worker nodes : 6
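
To relate the heartbeat to this configuration, the gap between `latest_heartbeat` and the "Marking task as FAILED" log line can be compared with `scheduler_zombie_task_threshold`. The sketch below is a simplified illustration of the "latest heartbeat + 5 minutes" rule described above, not Airflow's actual implementation; `is_zombie` is a hypothetical helper, and the heartbeat timestamp is assumed to be in UTC like the log timestamps.

```python
from datetime import datetime, timedelta

# scheduler_zombie_task_threshold from the configuration above
ZOMBIE_THRESHOLD = timedelta(seconds=300)


def is_zombie(latest_heartbeat, now, threshold=ZOMBIE_THRESHOLD):
    """Hypothetical helper: True when the last heartbeat is older than the threshold."""
    return now - latest_heartbeat > threshold


# latest_heartbeat of job 282759 (assumed UTC)
latest_heartbeat = datetime(2021, 1, 18, 13, 48, 41, 891934)
# Timestamp of the "Marking task as FAILED" scheduler log line
marked_failed = datetime(2021, 1, 18, 13, 54, 47, 800000)

gap = marked_failed - latest_heartbeat
print(f"heartbeat gap: {gap.total_seconds():.1f} s")
print("exceeds threshold:", is_zombie(latest_heartbeat, marked_failed))
```

Under these assumptions the job's last recorded heartbeat was about six minutes old when the task was failed, which is past the 300-second threshold. That would point at the heartbeat having stopped updating rather than at the remote process having exited.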

What did you expect to happen?

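  • The backend process takes roughly 2 hours 30 minutes to finish. During such a long-running job the task is detected as a zombie, even though the worker node is still processing it. The job's state is still marked as "running". The actual state of the task is not known while it is running.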