Pyspark error when converting a boolean column to pandas

Ped*_*mas 4 pyspark

I'm trying to use pyspark's toPandas() function on a simple DataFrame with an id column (int), a score column (float), and a "pass" column (boolean).
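
For reference, a minimal sketch of the setup (the SparkSession and the example values here are assumptions for illustration, not my actual data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data: id (int), score (float), "pass" (boolean)
df = spark.createDataFrame(
    [(0, 0.5, False), (1, 0.2, False), (2, 0.4, False), (3, 0.9, True)],
    ["id", "score", "pass"],
)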

My problem is that whenever I call the function, I get this error:

>       raise AttributeError("module {!r} has no attribute "
                             "{!r}".format(__name__, attr))
E       AttributeError: module 'numpy' has no attribute 'bool'

/usr/local/lib/python3.8/site-packages/numpy/__init__.py:284: AttributeError

The column:

0    False
1    False
2    False
3     True
Name: pass, dtype: bool
Column<'pass'>

Do I need to manually cast this column to another type?

小智 5

TL;DR

The solution is to cast the boolean columns to integers before converting to a pandas DataFrame:

import pyspark.sql.functions as F
import pyspark.sql.types as T

# Get the names of all boolean columns
bool_columns = [col[0] for col in df.dtypes if col[1] == 'boolean']

# Cast the boolean columns to integers
dft = df
for col in bool_columns:
    dft = dft.withColumn(col, F.col(col).cast(T.IntegerType()))

# Convert to pandas
dfp = dft.toPandas()
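Note that after this cast the column arrives in pandas as integers (0/1). If you want booleans back on the pandas side, a short sketch using the bool_columns list from above:

dfp[bool_columns] = dfp[bool_columns].astype(bool)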

Explanation

I ran into the same issue:

import numpy
import pyspark
print('numpy.__version__', numpy.__version__)
print('pyspark.__version__', pyspark.__version__)

which returns:

numpy.__version__ 1.24.2
pyspark.__version__ 3.3.1

When I convert to pandas, I get the same error:

python3.10/site-packages/pyspark/sql/pandas/conversion.py:298: FutureWarning: In the future `np.bool` will be defined as the corresponding NumPy scalar.
  return np.bool  # type: ignore[attr-defined]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[2], line 3
      1 # Spark read
      2 df = spark.read.parquet('results/results_processed.parquet').dropDuplicates(['filename'])
----> 3 dfp = df.toPandas() # Convert to pandas

python3.10/site-packages/pyspark/sql/pandas/conversion.py:216, in PandasConversionMixin.toPandas(self)
    213 else:
    214     pandas_col = pdf[field.name]
--> 216 pandas_type = PandasConversionMixin._to_corrected_pandas_type(field.dataType)
    217 # SPARK-21766: if an integer field is nullable and has null values, it can be
    218 # inferred by pandas as a float column. If we convert the column with NaN back
    219 # to integer type e.g., np.int16, we will hit an exception. So we use the
    220 # pandas-inferred float type, rather than the corrected type from the schema
    221 # in this case.
    222 if pandas_type is not None and not (
    223     isinstance(field.dataType, IntegralType)
    224     and field.nullable
    225     and pandas_col.isnull().any()
    226 ):

File python3.10/site-packages/pyspark/sql/pandas/conversion.py:298, in PandasConversionMixin._to_corrected_pandas_type(dt)
    296     return np.float64
    297 elif type(dt) == BooleanType:
--> 298     return np.bool  # type: ignore[attr-defined]
    299 elif type(dt) == TimestampType:
    300     return np.datetime64

File python3.10/site-packages/numpy/__init__.py:305, in __getattr__(attr)
    300     warnings.warn(
    301         f"In the future `np.{attr}` will be defined as the "
    302         "corresponding NumPy scalar.", FutureWarning, stacklevel=2)
    304 if attr in __former_attrs__:
--> 305     raise AttributeError(__former_attrs__[attr])
    307 # Importing Tester requires importing all of UnitTest which is not a
    308 # cheap import Since it is mainly used in test suits, we lazy import it
    309 # here to save on the order of 10 ms of import time for most users
    310 #
    311 # The previous way Tester was imported also had a side effect of adding
    312 # the full `numpy.testing` namespace
    313 if attr == 'testing':

AttributeError: module 'numpy' has no attribute 'bool'.
`np.bool` was a deprecated alias for the builtin `bool`. To avoid this error in existing code, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

As @arcteryxoveralteryx said, this seems to be a deprecation issue tied to the numpy version: np.bool was deprecated in NumPy 1.20 and removed in 1.24, but for the versions I'm using, pyspark only requires numpy >= 1.15, so the dependency constraint does not rule out a numpy release that has already removed the alias. It will be fixed in PySpark 3.4.0, as you can see here.
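
Until you can upgrade, another workaround you may see is restoring the removed alias before calling toPandas(). This is a fragile monkeypatch rather than an official fix, shown here only as a sketch:

import numpy as np

# Fragile workaround: pyspark <= 3.3 still references np.bool, which numpy 1.24 removed.
# Restoring the alias lets _to_corrected_pandas_type() succeed; prefer upgrading to
# PySpark >= 3.4.0 or pinning numpy < 1.24 instead.
if not hasattr(np, "bool"):
    np.bool = bool  # type: ignore[attr-defined]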