JOH*_*OHN 7 python async-await airflow
我正在编写一个气流任务来读取一个大的 csv 并将其保存到 postgresql 数据库。我发现这个 asyncpg 包有一个复制功能,运行速度比任何其他包都要快。但是,它是异步的,我不知道如何将其合并到 Airflow 中。这是一个示例代码:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from pandas import DataFrame
import asyncpg
async def to_sql(dataframe, table_name, schema_name='public', timeout=None, truncate=False):
connection = await asyncpg.connect(user='postgres', host='host.docker.internal', database='quantaxis', password='123456')
result = await connection.copy_records_to_table(
table_name,
records=dataframe.values.tolist(),
columns=shared_columns,
schema_name=schema_name,
timeout=timeout)
await connection.close()
return result
default_args = {
'owner': 'Airflow',
'depends_on_past': False,
'start_date': datetime(2020, 1, 1),
'retries': 1,
'retry_delay': timedelta(minutes=1),
}
dag = DAG('pythonexp2123', default_args=default_args, schedule_interval=timedelta(days=1))
async def save_file_to_database(ds):
df = pd.read_csv("data{0}.csv".format(ds))
r = await to_sql(df, 'test')
return r
t1 = PythonOperator(
task_id='pushing_task',
provide_context=True,
python_callable=save_file_to_database,
dag=dag
)
t1
Run Code Online (Sandbox Code Playgroud)
当我运行它时,它会返回错误:
Can't Pickle Object <Corountine>
Run Code Online (Sandbox Code Playgroud)
我如何更改函数以使此 Dag 工作?我仍然想使用 asyncpg 包,因为它的速度。
小智 8
您可以尝试使用 asyncio 在 eventloop 中运行 async 函数。如果您使用的是 python 3.7 > 您可以简单地调用 asyncio.run(async_function())
https://docs.python.org/3/library/asyncio-task.html
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from pandas import DataFrame
import asyncpg
import asyncio
async def to_sql(dataframe, table_name, schema_name='public', timeout=None, truncate=False):
connection = await asyncpg.connect(user='postgres', host='host.docker.internal', database='quantaxis', password='123456')
result = await connection.copy_records_to_table(
table_name,
records=dataframe.values.tolist(),
columns=shared_columns,
schema_name=schema_name,
timeout=timeout)
await connection.close()
return result
default_args = {
'owner': 'Airflow',
'depends_on_past': False,
'start_date': datetime(2020, 1, 1),
'retries': 1,
'retry_delay': timedelta(minutes=1),
}
dag = DAG('pythonexp2123', default_args=default_args, schedule_interval=timedelta(days=1))
async def save_file_to_database(ds):
df = pd.read_csv("data{0}.csv".format(ds))
r = await to_sql(df, 'test')
return r
def run_async(ds):
loop = asyncio.get_event_loop()
result = loop.run_until_complete(save_file_to_database(ds))
return result
t1 = PythonOperator(
task_id='pushing_task',
provide_context=True,
python_callable=run_async,
dag=dag
)
t1
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
2413 次 |
| 最近记录: |