colRegex returns an error in pyspark 3.0 - python 3.7

use*_*077 0 python-3.x pyspark

I have a PySpark DataFrame that contains some columns with the suffix _24:

df.columns = ['timestamp',
 'air_temperature_median_24',
 'air_temperature_median_6',
 'wind_direction_mean_24',
 'wind_speed',
 'building_id']

I tried to select them using the colRegex method, but the code below raises an exception:

df.select(ashrae.colRegex(".+'_24'")).show()

    ---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-103-a8189f0298e6> in <module>
----> 1 ashrae.select(ashrae.colRegex(".+'_24'")).show()

C:\spark\spark-3.0.0-preview-bin-hadoop2.7\python\pyspark\sql\dataframe.py in colRegex(self, colName)
    957         if not isinstance(colName, basestring):
    958             raise ValueError("colName should be provided as string")
--> 959         jc = self._jdf.colRegex(colName)
    960         return Column(jc)
    961 

C:\spark\spark-3.0.0-preview-bin-hadoop2.7\python\lib\py4j-0.10.8.1-src.zip\py4j\java_gateway.py in __call__(self, *args)
   1284         answer = self.gateway_client.send_command(command)
   1285         return_value = get_return_value(
-> 1286             answer, self.gateway_client, self.target_id, self.name)
   1287 
   1288         for temp_arg in temp_args:

C:\spark\spark-3.0.0-preview-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
     96     def deco(*a, **kw):
     97         try:
---> 98             return f(*a, **kw)
     99         except py4j.protocol.Py4JJavaError as e:
    100             converted = convert_exception(e.java_exception)

C:\spark\spark-3.0.0-preview-bin-hadoop2.7\python\lib\py4j-0.10.8.1-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o151.colRegex.
: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.charAt(Unknown Source)
    at scala.collection.immutable.StringOps$.apply$extension(StringOps.scala:41)
    at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:202)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
    at org.apache.spark.sql.Dataset.resolve(Dataset.scala:259)
    at org.apache.spark.sql.Dataset.colRegex(Dataset.scala:1364)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Unknown Source)

PySpark itself runs without any problems otherwise, so this is most likely a syntax error.

On the other hand, this syntax fails as well:

df.select(df.colRegex("'.+_24$'")).show()

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
C:\spark\spark-3.0.0-preview-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
     97         try:
---> 98             return f(*a, **kw)
     99         except py4j.protocol.Py4JJavaError as e:

C:\spark\spark-3.0.0-preview-bin-hadoop2.7\python\lib\py4j-0.10.8.1-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:

Py4JJavaError: An error occurred while calling o151.colRegex.
: org.apache.spark.sql.AnalysisException: Cannot resolve column name "'.+[_24]$'" among (timestamp, log1p_meter_reading, square_feet, air_temperature_max_168, air_temperature_max_48, air_temperature_median_120, air_temperature_median_24, air_temperature_median_6, air_temperature_min_168, dew_temperature, dew_temperature_max_120, dew_temperature_max_48, dew_temperature_min_48, precip_depth_1_hr_mean_48, precip_depth_1_hr_mean_72, sea_level_pressure_mean_168, wind_direction_mean_12, wind_direction_mean_168, wind_direction_mean_24, wind_direction_mean_48, wind_speed, building_id, floor_count, hour, primary_use, week, year_built, dayofweek, log1p_meter_reading_pred_ridge, log1p_meter_reading_pred_RF, log1p_meter_reading_pred_xgboost, log1p_meter_reading_pred_lgb);
    at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:267)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.Dataset.resolve(Dataset.scala:260)
    at org.apache.spark.sql.Dataset.colRegex(Dataset.scala:1364)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Unknown Source)


During handling of the above exception, another exception occurred:

AnalysisException                         Traceback (most recent call last)
<ipython-input-114-d25b73f87021> in <module>
----> 1 df.select(df.colRegex("'.+_24$'")).show()

C:\spark\spark-3.0.0-preview-bin-hadoop2.7\python\pyspark\sql\dataframe.py in colRegex(self, colName)
    957         if not isinstance(colName, basestring):
    958             raise ValueError("colName should be provided as string")
--> 959         jc = self._jdf.colRegex(colName)
    960         return Column(jc)
    961 

C:\spark\spark-3.0.0-preview-bin-hadoop2.7\python\lib\py4j-0.10.8.1-src.zip\py4j\java_gateway.py in __call__(self, *args)
   1284         answer = self.gateway_client.send_command(command)
   1285         return_value = get_return_value(
-> 1286             answer, self.gateway_client, self.target_id, self.name)
   1287 
   1288         for temp_arg in temp_args:

C:\spark\spark-3.0.0-preview-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
    100             converted = convert_exception(e.java_exception)
    101             if not isinstance(converted, UnknownException):
--> 102                 raise converted
    103             else:
    104                 raise

AnalysisException: Cannot resolve column name "'.+[_24]$'" among (timestamp, log1p_meter_reading, square_feet, air_temperature_max_168, air_temperature_max_48, air_temperature_median_120, air_temperature_median_24, air_temperature_median_6, air_temperature_min_168, dew_temperature, dew_temperature_max_120, dew_temperature_max_48, dew_temperature_min_48, precip_depth_1_hr_mean_48, precip_depth_1_hr_mean_72, sea_level_pressure_mean_168, wind_direction_mean_12, wind_direction_mean_168, wind_direction_mean_24, wind_direction_mean_48, wind_speed, building_id, floor_count, hour, primary_use, week, year_built, dayofweek, log1p_meter_reading_pred_ridge, log1p_meter_reading_pred_RF, log1p_meter_reading_pred_xgboost, log1p_meter_reading_pred_lgb);

What is causing the exception, and how should I correct the code?

jxc*_*jxc 6

Try the following syntax:

df.select(df.colRegex("`.+_24$`")).show()

When using colRegex, the column-name regex is enclosed in backticks.
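
For reference, here is a minimal, self-contained sketch of the backtick syntax (the SparkSession setup and the sample column names are made up for illustration), together with a plain-Python alternative that filters df.columns:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("colRegex-demo").getOrCreate()

# toy DataFrame whose column names mimic the ones in the question
df = spark.createDataFrame(
    [(1.0, 2.0, 3.0, "b1")],
    ["air_temperature_median_24", "air_temperature_median_6",
     "wind_direction_mean_24", "building_id"],
)

# colRegex expects the regex wrapped in backticks, not single quotes
df.select(df.colRegex("`.+_24$`")).show()

# equivalent alternative: filter the column list in Python and pass it to select
cols_24 = [c for c in df.columns if c.endswith("_24")]
df.select(*cols_24).show()

The backtick form is parsed by Spark itself (the same quoting Spark SQL uses for identifiers), while the list-comprehension version does the filtering on the Python side; both should return only the *_24 columns.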