使用SqlAlchemy和cx_Oracle将Pandas DataFrame写入Oracle数据库时加速to_sql()

bre*_*mri 2 oracle performance sqlalchemy dataframe pandas

使用pandas dataframe的to_sql方法,我可以很容易地将少量行写入oracle数据库中的表:

from sqlalchemy import create_engine
import cx_Oracle
dsn_tns = "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=<host>)(PORT=1521))\
       (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=<servicename>)))"
pwd = input('Please type in password:')
engine = create_engine('oracle+cx_oracle://myusername:' + pwd + '@%s' % dsn_tns)
df.to_sql('test_table', engine.connect(), if_exists='replace')
Run Code Online (Sandbox Code Playgroud)

但是对于任何常规大小的数据帧(我的有60k行,不是那么大),代码变得无法使用,因为它在我愿意等待的时间内(从而超过10分钟)从未完成.我用Google搜索并搜索了几次,最接近的解决方案是ansonw这个问题中给出的答案.但那个是关于mysql,而不是oracle.正如Ziggy Eunicien指出的那样,它对甲骨文不起作用.有任何想法吗?

编辑

以下是数据框中的行示例:

id          name            premium     created_date    init_p  term_number uprate  value   score   group   action_reason
160442353   LDP: Review     1295.619617 2014-01-20  1130.75     1           7       -42 236.328243  6       pass
164623435   TRU: Referral   453.224880  2014-05-20  0.00        11          NaN     -55 38.783290   1       suppress
Run Code Online (Sandbox Code Playgroud)

这是df的数据类型:

id               int64
name             object
premium          float64
created_date     object
init_p           float64
term_number      float64
uprate           float64
value            float64
score            float64
group            int64
action_reason    object
Run Code Online (Sandbox Code Playgroud)

Max*_*axU 14

熊猫+ SQLAlchemy的每默认情况下所有保存object(串)的列CLOB Oracle数据库中,这使得插入极其缓慢.

以下是一些测试:

import pandas as pd
import cx_Oracle
from sqlalchemy import types, create_engine

#######################################################
### DB connection strings config
#######################################################
tns = """
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = my-db-scan)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = my_service_name)
    )
  )
"""

usr = "test"
pwd = "my_oracle_password"

engine = create_engine('oracle+cx_oracle://%s:%s@%s' % (usr, pwd, tns))

# sample DF [shape: `(2000, 11)`]
# i took your 2 rows DF and replicated it: `df = pd.concat([df]* 10**3, ignore_index=True)`
df = pd.read_csv('/path/to/file.csv')
Run Code Online (Sandbox Code Playgroud)

DF信息:

In [61]: df.shape
Out[61]: (2000, 11)

In [62]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 11 columns):
id               2000 non-null int64
name             2000 non-null object
premium          2000 non-null float64
created_date     2000 non-null datetime64[ns]
init_p           2000 non-null float64
term_number      2000 non-null int64
uprate           1000 non-null float64
value            2000 non-null int64
score            2000 non-null float64
group            2000 non-null int64
action_reason    2000 non-null object
dtypes: datetime64[ns](1), float64(4), int64(4), object(2)
memory usage: 172.0+ KB
Run Code Online (Sandbox Code Playgroud)

我们来看看将它存储到Oracle DB需要多长时间:

In [57]: df.shape
Out[57]: (2000, 11)

In [58]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace')
1 loop, best of 1: 16 s per loop
Run Code Online (Sandbox Code Playgroud)

在Oracle DB中(注意CLOB):

AAA> desc test.test_table
 Name                            Null?    Type
 ------------------------------- -------- ------------------
 ID                                       NUMBER(19)
 NAME                                     CLOB        #  !!!
 PREMIUM                                  FLOAT(126)
 CREATED_DATE                             DATE
 INIT_P                                   FLOAT(126)
 TERM_NUMBER                              NUMBER(19)
 UPRATE                                   FLOAT(126)
 VALUE                                    NUMBER(19)
 SCORE                                    FLOAT(126)
 group                                    NUMBER(19)
 ACTION_REASON                            CLOB        #  !!!
Run Code Online (Sandbox Code Playgroud)

现在让我们指示pandas将所有object列保存为VARCHAR数据类型:

In [59]: dtyp = {c:types.VARCHAR(df[c].str.len().max())
    ...:         for c in df.columns[df.dtypes == 'object'].tolist()}
    ...:

In [60]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace', dtype=dtyp)
1 loop, best of 1: 335 ms per loop
Run Code Online (Sandbox Code Playgroud)

这次是约.快了48倍

签入Oracle DB:

 AAA> desc test.test_table
 Name                          Null?    Type
 ----------------------------- -------- ---------------------
 ID                                     NUMBER(19)
 NAME                                   VARCHAR2(13 CHAR)        #  !!!
 PREMIUM                                FLOAT(126)
 CREATED_DATE                           DATE
 INIT_P                                 FLOAT(126)
 TERM_NUMBER                            NUMBER(19)
 UPRATE                                 FLOAT(126)
 VALUE                                  NUMBER(19)
 SCORE                                  FLOAT(126)
 group                                  NUMBER(19)
 ACTION_REASON                          VARCHAR2(8 CHAR)        #  !!!
Run Code Online (Sandbox Code Playgroud)

让我们用200.000行DF测试它:

In [69]: df.shape
Out[69]: (200000, 11)

In [70]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace', dtype=dtyp, chunksize=10**4)
1 loop, best of 1: 4.68 s per loop
Run Code Online (Sandbox Code Playgroud)

在我的测试(不是最快的)环境中,200K行DF花了大约5秒钟.

结论:在将DataFrames保存到Oracle DB时,使用以下技巧以显式指定dtype的dtype所有DF列object.否则它将被保存为CLOB数据类型,这需要特殊处理并使其非常慢

dtyp = {c:types.VARCHAR(df[c].str.len().max())
        for c in df.columns[df.dtypes == 'object'].tolist()}

df.to_sql(..., dtype=dtyp)
Run Code Online (Sandbox Code Playgroud)

  • 谢谢。根据我的数据,将 2 个“对象”列指定为 VARCHAR 类型后,执行时间从 30 多分钟缩短到几秒钟! (3认同)
  • @MaxU 确保您从 sqlalchemy 导入类型调用 - - (2认同)
  • @MaxU我正在寻找这个确切的解决方案。我正在为这个解决方案添加书签,因为我的大多数 python 到 oracle 导出都具有字符串数据类型,这些数据类型正在作为对象导入到 oracle 数据库中。+1 (2认同)