I'm trying to use PySpark's toPandas() on a simple DataFrame with an id column (int), a score column (float), and a "pass" column (boolean).
My problem is that whenever I call the function, I get this error:
> raise AttributeError("module {!r} has no attribute "
"{!r}".format(__name__, attr))
E AttributeError: module 'numpy' has no attribute 'bool'
/usr/local/lib/python3.8/site-packages/numpy/__init__.py:284: AttributeError
The column:
0 False
1 False
2 False
3 True
Name: pass, dtype: bool
Column<'pass'>
Do I need to manually cast this column to another type?
TL;DR
The solution is to cast the boolean columns to integers before converting to a pandas DataFrame:
import pyspark.sql.functions as F
import pyspark.sql.types as T

# Get the names of all boolean columns
bool_columns = [name for name, dtype in df.dtypes if dtype == 'boolean']

# Cast booleans to integers
dft = df
for col in bool_columns:
    dft = dft.withColumn(col, F.col(col).cast(T.IntegerType()))

# Convert to pandas
dfp = dft.toPandas()
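Alternatively, since the failure is just PySpark looking up the removed `np.bool` alias, a quick stopgap (a sketch, not an official fix) is to restore the alias before calling toPandas():

```python
import numpy as np

# Compatibility shim: NumPy 1.24 removed the deprecated `np.bool` alias,
# but PySpark < 3.4.0 still references it inside toPandas(). Restoring
# the alias avoids the AttributeError without touching the DataFrame.
if not hasattr(np, "bool"):
    np.bool = bool  # type: ignore[attr-defined]
```

Run this once before the first toPandas() call; upgrading PySpark is the cleaner long-term fix.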
Explanation
I ran into the same problem:
import numpy
import pyspark
print('numpy.__version__', numpy.__version__)
print('pyspark.__version__', pyspark.__version__)
which returns:
numpy.__version__ 1.24.2
pyspark.__version__ 3.3.1
When I convert to pandas, I get the same error:
python3.10/site-packages/pyspark/sql/pandas/conversion.py:298: FutureWarning: In the future `np.bool` will be defined as the corresponding NumPy scalar.
return np.bool # type: ignore[attr-defined]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[2], line 3
1 # Spark read
2 df = spark.read.parquet('results/results_processed.parquet').dropDuplicates(['filename'])
----> 3 dfp = df.toPandas() # Convert to pandas
python3.10/site-packages/pyspark/sql/pandas/conversion.py:216, in PandasConversionMixin.toPandas(self)
213 else:
214 pandas_col = pdf[field.name]
--> 216 pandas_type = PandasConversionMixin._to_corrected_pandas_type(field.dataType)
217 # SPARK-21766: if an integer field is nullable and has null values, it can be
218 # inferred by pandas as a float column. If we convert the column with NaN back
219 # to integer type e.g., np.int16, we will hit an exception. So we use the
220 # pandas-inferred float type, rather than the corrected type from the schema
221 # in this case.
222 if pandas_type is not None and not (
223 isinstance(field.dataType, IntegralType)
224 and field.nullable
225 and pandas_col.isnull().any()
226 ):
File python3.10/site-packages/pyspark/sql/pandas/conversion.py:298, in PandasConversionMixin._to_corrected_pandas_type(dt)
296 return np.float64
297 elif type(dt) == BooleanType:
--> 298 return np.bool # type: ignore[attr-defined]
299 elif type(dt) == TimestampType:
300 return np.datetime64
File python3.10/site-packages/numpy/__init__.py:305, in __getattr__(attr)
300 warnings.warn(
301 f"In the future `np.{attr}` will be defined as the "
302 "corresponding NumPy scalar.", FutureWarning, stacklevel=2)
304 if attr in __former_attrs__:
--> 305 raise AttributeError(__former_attrs__[attr])
307 # Importing Tester requires importing all of UnitTest which is not a
308 # cheap import Since it is mainly used in test suits, we lazy import it
309 # here to save on the order of 10 ms of import time for most users
310 #
311 # The previous way Tester was imported also had a side effect of adding
312 # the full `numpy.testing` namespace
313 if attr == 'testing':
AttributeError: module 'numpy' has no attribute 'bool'.
`np.bool` was a deprecated alias for the builtin `bool`. To avoid this error in existing code, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
As @arcteryxoveralteryx said, this looks like a deprecation issue with the NumPy version. However, for the versions I'm using, PySpark only requires numpy >= 1.15, so pip will happily install an incompatible pair. It will be fixed in PySpark 3.4.0, as you can see here.
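To tell whether an environment is hit by this, a small sketch can compare the installed versions against the boundaries implied by the traceback (PySpark < 3.4.0 still references `np.bool`; NumPy >= 1.24 removed the alias) — the `affected` helper here is illustrative, not a PySpark API:

```python
from importlib.metadata import version, PackageNotFoundError


def affected(pyspark_version: str, numpy_version: str) -> bool:
    """True if this pyspark/numpy pair likely raises the np.bool error.

    Assumption (from the traceback above): PySpark < 3.4.0 still
    references np.bool, and NumPy >= 1.24 removed the alias.
    """
    def major_minor(v: str) -> tuple:
        return tuple(int(p) for p in v.split(".")[:2])

    return (major_minor(pyspark_version) < (3, 4)
            and major_minor(numpy_version) >= (1, 24))


# Check the actually installed environment, if both packages are present
try:
    print("affected:", affected(version("pyspark"), version("numpy")))
except PackageNotFoundError:
    print("pyspark or numpy not installed")
```

For the versions in this question, affected("3.3.1", "1.24.2") is True, which matches the observed failure.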