Pandas/Google BigQuery: Schema mismatch makes the upload fail

use*_*204 3 python pandas google-bigquery

The schema of my Google BigQuery table looks like this:

price_datetime : DATETIME,
symbol         : STRING,
bid_open       : FLOAT,
bid_high       : FLOAT,
bid_low        : FLOAT,
bid_close      : FLOAT,
ask_open       : FLOAT,
ask_high       : FLOAT,
ask_low        : FLOAT,
ask_close      : FLOAT

After I do a pandas.read_gbq, I get a dataframe with column dtypes like this:

price_datetime     object
symbol             object
bid_open          float64
bid_high          float64
bid_low           float64
bid_close         float64
ask_open          float64
ask_high          float64
ask_low           float64
ask_close         float64
dtype: object

Now I want to use to_gbq, so I convert my local dataframe (which I just created) from these dtypes:

price_datetime    datetime64[ns]
symbol                    object
bid_open                 float64
bid_high                 float64
bid_low                  float64
bid_close                float64
ask_open                 float64
ask_high                 float64
ask_low                  float64
ask_close                float64
dtype: object

to these dtypes:

price_datetime     object
symbol             object
bid_open          float64
bid_high          float64
bid_low           float64
bid_close         float64
ask_open          float64
ask_high          float64
ask_low           float64
ask_close         float64
dtype: object

by doing:

df['price_datetime'] = df['price_datetime'].astype(object)

Now I (think) I can use to_gbq, so I do:

import pandas
pandas.io.gbq.to_gbq(df, <table_name>, <project_name>, if_exists='append')

But I get the error:

---------------------------------------------------------------------------
InvalidSchema                             Traceback (most recent call last)
<ipython-input-15-d5a3f86ad382> in <module>()
      1 a = time.time()
----> 2 pandas.io.gbq.to_gbq(df, <table_name>, <project_name>, if_exists='append')
      3 b = time.time()
      4 
      5 print(b-a)

C:\Users\me\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key)
    825         elif if_exists == 'append':
    826             if not connector.verify_schema(dataset_id, table_id, table_schema):
--> 827                 raise InvalidSchema("Please verify that the structure and "
    828                                     "data types in the DataFrame match the "
    829                                     "schema of the destination table.")

InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.

use*_*204 6

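I had to do two things to solve this issue. First, I deleted my table and re-uploaded it with the column typed as TIMESTAMP rather than DATETIME. This makes the schemas match when a pandas.DataFrame with a datetime64[ns] column is uploaded using to_gbq, which converts datetime64[ns] to the TIMESTAMP type and not to the DATETIME type (for now).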

The second thing I did was upgrade from pandas 0.19 to pandas 0.20. These two things solved my schema mismatch problem.
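
The local half of that fix can be sketched as follows. This is a minimal illustration, not the poster's actual code: the sample rows are made up, and strings stand in for whatever object-dtype values the column holds. The point is only that the column should be datetime64 before the upload, since that is what to_gbq maps to TIMESTAMP.

```python
import pandas as pd

# Made-up sample rows; price_datetime starts out as a non-datetime column,
# standing in for the object-dtype column described in the question.
df = pd.DataFrame({
    'price_datetime': ['2017-05-01 00:00:00', '2017-05-01 00:01:00'],
    'symbol': ['EURUSD', 'EURUSD'],
    'bid_open': [1.0912, 1.0915],
})

# Parse the values into a real datetime column instead of casting to object.
df['price_datetime'] = pd.to_datetime(df['price_datetime'])

# dtype.kind is 'M' for datetime64 columns and 'f' for float columns.
print(df.dtypes)

# With the destination column recreated as TIMESTAMP, the upload would be the
# same call as in the question (placeholders left exactly as in the original):
# pandas.io.gbq.to_gbq(df, <table_name>, <project_name>, if_exists='append')
```

The astype(object) step from the question is dropped here, because it is what turns the column into something to_gbq treats as STRING.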


Wil*_*uks 5

This is possibly an issue in pandas itself. If you check the code for to_gbq, you'll see that it runs this:

table_schema = _generate_bq_schema(dataframe)

where _generate_bq_schema is defined as:

def _generate_bq_schema(df, default_type='STRING'):
    """ Given a passed df, generate the associated Google BigQuery schema.
    Parameters
    ----------
    df : DataFrame
    default_type : string
        The default big query type in case the type of the column
        does not exist in the schema.
    """

    type_mapping = {
        'i': 'INTEGER',
        'b': 'BOOLEAN',
        'f': 'FLOAT',
        'O': 'STRING',
        'S': 'STRING',
        'U': 'STRING',
        'M': 'TIMESTAMP'
    }

    fields = []
    for column_name, dtype in df.dtypes.iteritems():
        fields.append({'name': column_name,
                       'type': type_mapping.get(dtype.kind, default_type)})

    return {'fields': fields}

As you can see, no dtype is mapped to DATETIME. Your column inevitably gets mapped to STRING (because its dtype.kind is "O"), and then the conflict occurs.
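
To see the mapping in action without touching BigQuery, here is a small sketch that reuses the same kind-to-type table on a made-up two-column dataframe. The helper name inferred_bq_types and the sample values are hypothetical, and df.dtypes.items() is used in place of iteritems() so the snippet also runs on recent pandas versions.

```python
import pandas as pd

# Copy of the kind-to-type table from the quoted _generate_bq_schema.
TYPE_MAPPING = {
    'i': 'INTEGER',
    'b': 'BOOLEAN',
    'f': 'FLOAT',
    'O': 'STRING',
    'S': 'STRING',
    'U': 'STRING',
    'M': 'TIMESTAMP',
}


def inferred_bq_types(df, default_type='STRING'):
    """Hypothetical helper: column name -> BigQuery type the quoted logic picks."""
    return {name: TYPE_MAPPING.get(dtype.kind, default_type)
            for name, dtype in df.dtypes.items()}


# Made-up sample data with a real datetime64 column.
df = pd.DataFrame({
    'price_datetime': pd.to_datetime(['2017-05-01 00:00:00',
                                      '2017-05-01 00:01:00']),
    'bid_open': [1.0912, 1.0915],
})
as_datetime = inferred_bq_types(df)

# Same data after the astype(object) cast used in the question.
df_obj = df.copy()
df_obj['price_datetime'] = df_obj['price_datetime'].astype(object)
as_object = inferred_bq_types(df_obj)

print(as_datetime)
print(as_object)
```

The datetime64 column comes out as TIMESTAMP and the object column as STRING, so neither variant can match a destination column declared as DATETIME.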

The only workaround I know of for now is to change your table schema from DATETIME to either TIMESTAMP or STRING.
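
If you go the STRING route, one way to prepare the local column is to format the datetimes explicitly, so you control the text that ends up in the table. This is only a sketch: the sample data and the format string are assumptions chosen for illustration, not something the pandas code requires.

```python
import pandas as pd

# Made-up sample data with a datetime64 column.
df = pd.DataFrame({
    'price_datetime': pd.to_datetime(['2017-05-01 00:00:00',
                                      '2017-05-01 00:01:00']),
    'bid_open': [1.0912, 1.0915],
})

# Render each timestamp as text; the format is an illustrative choice.
df['price_datetime'] = df['price_datetime'].dt.strftime('%Y-%m-%d %H:%M:%S')

values = df['price_datetime'].tolist()
print(values)
```

The trade-off is that the column is no longer a time type on the BigQuery side, so any time-based filtering would need to parse the strings in the query.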

It would probably be a good idea to open a new issue on the pandas-bq repository asking for this code to be updated to accept DATETIME as well.

[EDIT]:

I've opened this issue in their repository.