BigQuery TypeError: to_pandas() got an unexpected keyword argument 'timestamp_as_object'

Sam*_*Sam 7 python pandas google-bigquery google-cloud-platform

Environment details

  • OS type and version: 1.5.29-debian10
  • Python version: 3.7
  • google-cloud-bigquery version: 2.8.0

I am configuring a Dataproc cluster that fetches data from BigQuery into a pandas DataFrame. As my data keeps growing, I want to improve performance, and I have heard about using the BigQuery Storage client.

I ran into the same problem in the past and worked around it by pinning google-cloud-bigquery to version 1.26.1. With that version I get the following message:

/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/client.py:407: UserWarning: Cannot create BigQuery Storage client, the dependency google-cloud-bigquery-storage is not installed.
  "Cannot create BigQuery Storage client, the dependency "

The snippet runs, but slowly. If I don't pin the pip version, I hit the error below.

\n

Steps to reproduce

1. Create a cluster on Dataproc:

gcloud dataproc clusters create testing-cluster \
    --region=europe-west1 \
    --zone=europe-west1-b \
    --master-machine-type n1-standard-16 \
    --single-node \
    --image-version 1.5-debian10 \
    --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh \
    --metadata 'PIP_PACKAGES=elasticsearch google-cloud-bigquery google-cloud-bigquery-storage pandas pandas_gbq'

2. Run the following script on the cluster:
from google.cloud import bigquery

bqclient = bigquery.Client(project=project)
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("query_start", "STRING", '2021-02-09 00:00:00'),
        bigquery.ScalarQueryParameter("query_end", "STRING", '2021-02-09 23:59:59.99'),
    ]
)
df = bqclient.query(query, job_config=job_config).to_dataframe(create_bqstorage_client=True)

2021-02-11 10:10:14,069 - preprocessing logger initialized
2021-02-11 10:10:14,069 - arguments = [file, arg1, arg2, arg3, arg4, project_id, arg5, arg6]
Traceback (most recent call last):
  File "/tmp/782503bcc80246258560a07d2179891f/immo_preprocessing-pageviews_kyero.py", line 104, in <module>
    df = bqclient.query(base_query, job_config=job_config).to_dataframe(create_bqstorage_client=True)
  File "/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/job/query.py", line 1333, in to_dataframe
    date_as_object=date_as_object,
  File "/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/table.py", line 1793, in to_dataframe
    df = record_batch.to_pandas(date_as_object=date_as_object, **extra_kwargs)
  File "pyarrow/array.pxi", line 414, in pyarrow.lib._PandasConvertible.to_pandas
TypeError: to_pandas() got an unexpected keyword argument 'timestamp_as_object'
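The traceback points at a version mismatch rather than at the query itself: google-cloud-bigquery 2.x forwards a `timestamp_as_object` keyword to pyarrow's `to_pandas()`, which older pyarrow builds do not accept. As a defensive sketch (the helper name and the `>= 1.0.0` threshold are my assumptions, not part of the library), the Storage-client fast path can be gated on the installed pyarrow version:

```python
# Sketch: only enable the BigQuery Storage fast path when the installed
# pyarrow is new enough to accept to_pandas(timestamp_as_object=...).
# The >= 1.0.0 cutoff is an assumption drawn from this issue.
def storage_client_safe(pyarrow_version: str) -> bool:
    """Return True for pyarrow versions that accept the newer kwargs."""
    major = int(pyarrow_version.split(".")[0])
    return major >= 1

print(storage_client_safe("0.15.0"))  # False: Dataproc's default build
print(storage_client_safe("3.0.0"))   # True
```

At the call site this could look like `to_dataframe(create_bqstorage_client=storage_client_safe(pyarrow.__version__))`, falling back to the slower REST download instead of crashing.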

Going through pandas-gbq gives exactly the same error:

import pandas as pd

query_config = {
    'query': {
        'parameterMode': 'NAMED',
        'queryParameters': [
            {
                'name': 'query_start',
                'parameterType': {'type': 'STRING'},
                'parameterValue': {'value': '2021-02-09 00:00:00'}
            },
            {
                'name': 'query_end',
                'parameterType': {'type': 'STRING'},
                'parameterValue': {'value': '2021-02-09 23:59:59.99'}
            },
        ]
    }
}
df = pd.read_gbq(base_query,
                 configuration=query_config,
                 progress_bar_type='tqdm',
                 use_bqstorage_api=True)

2021-02-11 09:21:19,532 - preprocessing logger initialized
2021-02-11 09:21:19,532 - arguments = [file, arg1, arg2, arg3, arg4, project_id, arg5, arg6]
started
Downloading: 100%|██████████| 3107858/3107858 [00:14<00:00, 207656.33rows/s]
Traceback (most recent call last):
  File "/tmp/1830d5bcf198440e9e030c8e42a1b870/immo_preprocessing-pageviews.py", line 98, in <module>
    use_bqstorage_api=True)
  File "/opt/conda/default/lib/python3.7/site-packages/pandas/io/gbq.py", line 193, in read_gbq
    **kwargs,
  File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 977, in read_gbq
    dtypes=dtypes,
  File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 536, in run_query
    user_dtypes=dtypes,
  File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 590, in _download_results
    **to_dataframe_kwargs
  File "/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/table.py", line 1793, in to_dataframe
    df = record_batch.to_pandas(date_as_object=date_as_object, **extra_kwargs)
  File "pyarrow/array.pxi", line 414, in pyarrow.lib._PandasConvertible.to_pandas
TypeError: to_pandas() got an unexpected keyword argument 'timestamp_as_object'

https://github.com/googleapis/python-bigquery/issues/519


Dav*_*Liu 6

@Sam answered the question, but I just want to spell out the actionable commands:

In a Jupyter notebook:

!pip install pyarrow==3.0.0

In your virtual environment:

pip install pyarrow==3.0.0


Sam*_*Sam 3

Dataproc installs pyarrow 0.15.0 by default, while the BigQuery Storage API requires a newer version. Manually pinning pyarrow to 3.0.0 at install time fixed the issue. That said, PySpark has a compatibility setting for PyArrow >= 0.15.0 (https://spark.apache.org/docs/3.0.0-preview/sql-pyspark-pandas-with-arrow.html#apache-arrow-in-spark); I checked the Dataproc release notes, and that environment variable has been set by default since May 2020.
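Since the root cause is an environment problem, a fail-fast check at the top of the job script can make future regressions obvious. This guard is my own sketch, not something Dataproc or the BigQuery client provides, and the minimum version is an assumption from this thread:

```python
# Sketch: fail fast with an actionable message if the cluster still has
# the old default pyarrow (minimum version is an assumed value).
def check_pyarrow(installed: str, minimum: str = "3.0.0") -> None:
    """Raise RuntimeError when the installed version is below the minimum."""
    as_tuple = lambda v: tuple(int(p) for p in v.split("."))
    if as_tuple(installed) < as_tuple(minimum):
        raise RuntimeError(
            f"pyarrow {installed} is too old; add pyarrow=={minimum} "
            "to PIP_PACKAGES when creating the cluster."
        )

check_pyarrow("3.0.0")     # passes silently
# check_pyarrow("0.15.0")  # would raise RuntimeError
```

In the job script this would be called once at startup with `pyarrow.__version__`, so a misconfigured cluster fails immediately with a clear message instead of deep inside `to_dataframe()`.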

  • Can confirm `!pip install pyarrow==3.0.0` fixed this for me (2 upvotes)