How to parse a URL in Spark SQL (Scala)

use*_*344 2 scala apache-spark

I am using the following function to parse a URL, but it throws an error:

val b = Seq(("http://spark.apache.org/path?query=1"),("https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative")).toDF("url_col")
        .withColumn("host",parse_url($"url_col","HOST"))
        .withColumn("query",parse_url($"url_col","QUERY"))
        .show(false)

Error:

<console>:285: error: not found: value parse_url
               .withColumn("host",parse_url($"url_col","HOST"))
                                  ^
<console>:286: error: not found: value parse_url
               .withColumn("query",parse_url($"url_col","QUERY"))
                                   ^

Please advise how to parse a URL into its different parts.

Ram*_*jan 5

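parse_url is available only as a SQL function, not as a function in the DataFrame API. See parse_url.

So you should use it in a SQL query rather than calling it as a function through the API.

Register the DataFrame as a temporary view and run a query like the following: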

val b = Seq(("http://spark.apache.org/path?query=1"),("https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative")).toDF("url_col")

b.createOrReplaceTempView("temp")
spark.sql("SELECT url_col, parse_url(`url_col`, 'HOST') as HOST, parse_url(`url_col`,'QUERY') as QUERY from temp").show(false)

This should give you the following output:

+--------------------------------------------------------------------------------------------+-----------------+-------+
|url_col                                                                                     |HOST             |QUERY  |
+--------------------------------------------------------------------------------------------+-----------------+-------+
|http://spark.apache.org/path?query=1                                                        |spark.apache.org |query=1|
|https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative|people.apache.org|null   |
+--------------------------------------------------------------------------------------------+-----------------+-------+
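If you would rather stay in the DataFrame API without registering a temp view, one option is to pass the same SQL expression through `expr` from `org.apache.spark.sql.functions`. The following is only a minimal sketch: the object name `ParseUrlWithExpr`, the helper `withUrlParts`, and the local `SparkSession` settings are hypothetical and chosen just for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

object ParseUrlWithExpr {
  // Hypothetical helper (not part of Spark): adds "host" and "query" columns
  // by evaluating the SQL function parse_url through expr().
  def withUrlParts(df: DataFrame, urlCol: String = "url_col"): DataFrame =
    df.withColumn("host", expr(s"parse_url(`$urlCol`, 'HOST')"))
      .withColumn("query", expr(s"parse_url(`$urlCol`, 'QUERY')"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("parse-url-expr")
      .getOrCreate()
    import spark.implicits._

    val urls = Seq(
      "http://spark.apache.org/path?query=1",
      "https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative"
    ).toDF("url_col")

    withUrlParts(urls).show(false)
    spark.stop()
  }
}
```

`urls.selectExpr("url_col", "parse_url(url_col, 'HOST') as host")` is an equivalent shorthand when you only need to project columns.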

I hope this answer helps.


T. *_*ęda 5

The answer by @Ramesh is correct, but you might also want a hacky way to use this function without SQL queries :)

The hack relies on the fact that the `callUDF` function can call not only UDFs but any available function.

So you can write:

import org.apache.spark.sql._
import org.apache.spark.sql.functions._

b.withColumn("host", callUDF("parse_url", $"url_col", lit("HOST"))).
 withColumn("query", callUDF("parse_url", $"url_col", lit("QUERY"))).
 show(false)

Edit: once this pull request is merged, you will be able to use parse_url like a normal function. The PR was made after this question was asked :)