Select only numeric/string column names from a Spark DF in pyspark

Mar*_*ara 5 python apache-spark pyspark

I have a Spark DataFrame in pyspark (2.1.0) and I want to get only the names of the numeric columns, or only the names of the string columns.

For example, this is my DF's schema:

root
 |-- Gender: string (nullable = true)
 |-- SeniorCitizen: string (nullable = true)
 |-- MonthlyCharges: double (nullable = true)
 |-- TotalCharges: double (nullable = true)
 |-- Churn: string (nullable = true)

This is what I need:

num_cols = [MonthlyCharges, TotalCharges]
str_cols = [Gender, SeniorCitizen, Churn]
Run Code Online (Sandbox Code Playgroud)

How can I do this? Thanks!

小智 13

`dtypes` is a list of `(columnName, type)` tuples, so you can use a simple filter:

 columnList = [item[0] for item in df.dtypes if item[1].startswith('string')]

  • Shorter: `[c for c, t in df.dtypes if t.startswith('string')]` (3 upvotes)
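The same `dtypes` filter gives the numeric columns from the question's schema by matching the type name instead. A minimal, self-contained sketch (hard-coding the question's schema as the `(columnName, type)` tuples that `df.dtypes` would return, so it runs without a SparkSession):

```python
# df.dtypes returns (columnName, typeName) tuples; here we hard-code
# the question's schema in place of a live DataFrame
dtypes = [
    ("Gender", "string"),
    ("SeniorCitizen", "string"),
    ("MonthlyCharges", "double"),
    ("TotalCharges", "double"),
    ("Churn", "string"),
]

# Match on the type-name string, as in the answer above
num_cols = [c for c, t in dtypes if t in ("double", "float", "int", "bigint")]
str_cols = [c for c, t in dtypes if t == "string"]

print(num_cols)  # ['MonthlyCharges', 'TotalCharges']
print(str_cols)  # ['Gender', 'SeniorCitizen', 'Churn']
```

The tuple of numeric type names to accept is an assumption here; extend it (e.g. with `"decimal"` variants) to suit your data.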

abi*_*sis 7

PySpark provides a rich API for working with schema types. As @DanieldePaula mentioned, you can access the fields' metadata through `df.schema.fields`.

Here is a different approach, based on checking the static types:

from pyspark.sql.types import StringType, DoubleType

df = spark.createDataFrame([
  [1, 2.3, "t1"],
  [2, 5.3, "t2"],
  [3, 2.1, "t3"],
  [4, 1.5, "t4"]
], ["cola", "colb", "colc"])

# get string
str_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]
# ['colc']

# or double
dbl_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, DoubleType)]
# ['colb']
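In real PySpark, `DoubleType`, `IntegerType`, and the other numeric types all subclass a common `NumericType` base in `pyspark.sql.types`, so a single `isinstance` check against `NumericType` collects every numeric column at once, whatever its exact type. A sketch of that idea using stand-in classes that mirror the PySpark hierarchy, so it runs without a SparkSession (in real code, import the names from `pyspark.sql.types` and use `df.schema.fields`):

```python
# Stand-ins mirroring pyspark.sql.types' class hierarchy: in real
# PySpark, DoubleType and IntegerType both subclass NumericType.
class DataType: ...
class NumericType(DataType): ...
class DoubleType(NumericType): ...
class IntegerType(NumericType): ...
class StringType(DataType): ...

class StructField:
    def __init__(self, name, dataType):
        self.name = name
        self.dataType = dataType

# The question's schema, shaped like df.schema.fields
fields = [
    StructField("Gender", StringType()),
    StructField("SeniorCitizen", StringType()),
    StructField("MonthlyCharges", DoubleType()),
    StructField("TotalCharges", DoubleType()),
    StructField("Churn", StringType()),
]

# One isinstance check against the base class covers doubles,
# ints, floats, and decimals alike
num_cols = [f.name for f in fields if isinstance(f.dataType, NumericType)]
str_cols = [f.name for f in fields if isinstance(f.dataType, StringType)]

print(num_cols)  # ['MonthlyCharges', 'TotalCharges']
print(str_cols)  # ['Gender', 'SeniorCitizen', 'Churn']
```

This avoids maintaining a hand-written list of type-name strings, which is the main fragility of the `dtypes` approach above.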