use*_*344 2 scala apache-spark
I am using the following function to parse a URL, but it throws an error:
val b = Seq(("http://spark.apache.org/path?query=1"),("https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative")).toDF("url_col")
.withColumn("host",parse_url($"url_col","HOST"))
.withColumn("query",parse_url($"url_col","QUERY"))
.show(false)
Error:
<console>:285: error: not found: value parse_url
.withColumn("host",parse_url($"url_col","HOST"))
^
<console>:286: error: not found: value parse_url
.withColumn("query",parse_url($"url_col","QUERY"))
^
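Please advise how to parse a URL into its separate parts.
parse_url is only available in SQL, not as a function in the DataFrame API. See parse_url.
So you should use it in a SQL query rather than calling it as a function through the API.
Register the DataFrame as a temporary view and run a query like the one below: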
val b = Seq(("http://spark.apache.org/path?query=1"),("https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative")).toDF("url_col")
b.createOrReplaceTempView("temp")
spark.sql("SELECT url_col, parse_url(`url_col`, 'HOST') as HOST, parse_url(`url_col`,'QUERY') as QUERY from temp").show(false)
This should give you the following output:
+--------------------------------------------------------------------------------------------+-----------------+-------+
|url_col |HOST |QUERY |
+--------------------------------------------------------------------------------------------+-----------------+-------+
|http://spark.apache.org/path?query=1 |spark.apache.org |query=1|
|https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative|people.apache.org|null |
+--------------------------------------------------------------------------------------------+-----------------+-------+
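To make the `null` in the QUERY column less surprising: the second URL has a fragment (`#negative`) but no `?...` query component. The sketch below uses plain `java.net.URI` to show which piece of a URL each part name refers to. It is an illustration only, not Spark's implementation, and the `UrlParts` object and its method names are hypothetical.

```scala
import java.net.URI

// Hypothetical helper, for illustration only; not part of Spark.
object UrlParts {
  // Option(...) turns a missing (null) component into None.
  def host(url: String): Option[String] = Option(new URI(url).getHost)
  def query(url: String): Option[String] = Option(new URI(url).getQuery)
  def fragment(url: String): Option[String] = Option(new URI(url).getFragment)

  def main(args: Array[String]): Unit = {
    val urls = Seq(
      "http://spark.apache.org/path?query=1",
      "https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative"
    )
    // Print the three components for each sample URL from the question.
    urls.foreach { u =>
      println(s"host=${host(u)} query=${query(u)} fragment=${fragment(u)}")
    }
  }
}
```

The second URL has a host and a fragment but no query, which matches the `null` shown in the table above.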
I hope the answer is helpful.
The answer by @Ramesh is correct, but you might also want a hacky way to use this function without SQL queries :)
The hack relies on the fact that the "callUDF" function can call not only UDFs, but any available function, including built-in ones.
So you can write:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
b.withColumn("host", callUDF("parse_url", $"url_col", lit("HOST"))).
withColumn("query", callUDF("parse_url", $"url_col", lit("QUERY"))).
show(false)
Edit: once this Pull Request is merged, you can use parse_url like a normal function. The PR was opened after this question was asked :)
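Another way to stay in the DataFrame API is `functions.expr`, which parses a SQL expression string into a Column, so SQL-only functions such as parse_url can be used without registering a temporary view. This is a minimal sketch, assuming a local SparkSession and the `url_col` column name from the question. The `ParseUrlWithExpr` object and `withUrlParts` helper are hypothetical names, and in spark-shell the `spark` session already exists.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

// Hypothetical wrapper object, for illustration only.
object ParseUrlWithExpr {
  // Adds "host" and "query" columns by evaluating parse_url as a SQL expression.
  // Assumes the input DataFrame has a string column named url_col.
  def withUrlParts(df: DataFrame): DataFrame =
    df.withColumn("host", expr("parse_url(url_col, 'HOST')"))
      .withColumn("query", expr("parse_url(url_col, 'QUERY')"))

  def main(args: Array[String]): Unit = {
    // Local session for the sketch; not needed inside spark-shell.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("parse-url-expr")
      .getOrCreate()
    import spark.implicits._

    val urls = Seq(
      "http://spark.apache.org/path?query=1",
      "https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative"
    ).toDF("url_col")

    withUrlParts(urls).show(false)
    spark.stop()
  }
}
```

`df.selectExpr("url_col", "parse_url(url_col, 'HOST') as host")` is an equivalent shorthand when only a projection is needed.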