Ram*_*esh 11 scala apache-spark apache-spark-sql
val df = sc.parallelize(Seq((1,"Emailab"), (2,"Phoneab"), (3, "Faxab"),(4,"Mail"),(5,"Other"),(6,"MSL12"),(7,"MSL"),(8,"HCP"),(9,"HCP12"))).toDF("c1","c2")
+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
|  6|  MSL12|
|  7|    MSL|
|  8|    HCP|
|  9|  HCP12|
+---+-------+
I want to filter out the records where the first three characters of column 'c2' are "MSL" or "HCP".
So the output should look like this:
+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
+---+-------+
Can anyone help?
I know df.filter($"c2".rlike("MSL")) selects the matching records, but how do I exclude them instead?
Versions: Spark 1.6.2, Scala 2.10
Jeg*_*gan 20
This also works. It is concise and reads very much like SQL.
df.filter("c2 not like 'MSL%' and c2 not like 'HCP%'").show
+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
+---+-------+
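import org.apache.spark.sql.functions.{col, not, substring}
// Drop rows whose first three characters of c2 are "MSL" or "HCP" (substring positions are 1-based).
df.filter(not(
  substring(col("c2"), 1, 3).isin("MSL", "HCP"))
)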
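The `rlike` approach from the question can also be negated directly with `!` on the column expression. The sketch below is only an illustration, assuming a local standalone run: the object name `ExcludePrefixes`, the helper names `viaRegex` and `viaStartsWith`, the app name, and the `local[*]` master are all made up for the example. It uses `SQLContext` so that it matches the Spark 1.6 API mentioned in the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col

object ExcludePrefixes {
  // Negated regex: `^` anchors the match to the start of the string,
  // so only values that begin with MSL or HCP are excluded.
  def viaRegex(df: DataFrame): DataFrame =
    df.filter(!col("c2").rlike("^(MSL|HCP)"))

  // Same result without a regex: negate two prefix tests joined by OR.
  def viaStartsWith(df: DataFrame): DataFrame =
    df.filter(!(col("c2").startsWith("MSL") || col("c2").startsWith("HCP")))

  def main(args: Array[String]): Unit = {
    // App name and local master are assumptions for a standalone run.
    val conf = new SparkConf().setAppName("exclude-prefixes").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Same sample data as in the question.
    val df = sc.parallelize(Seq(
      (1, "Emailab"), (2, "Phoneab"), (3, "Faxab"), (4, "Mail"), (5, "Other"),
      (6, "MSL12"), (7, "MSL"), (8, "HCP"), (9, "HCP12")
    )).toDF("c1", "c2")

    viaRegex(df).show()
    viaStartsWith(df).show()

    sc.stop()
  }
}
```

In a spark-shell session where `df` already exists, the first variant reduces to `df.filter(!($"c2".rlike("^(MSL|HCP)")))`. Without the `^` anchor, the pattern would also drop values that contain MSL or HCP anywhere in the string.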
小智 6
I used the following to filter rows in a DataFrame; it worked for me on Spark 2.2.
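// SQLContext still works in Spark 2.x, although SparkSession is the preferred entry point there.
val spark = new org.apache.spark.sql.SQLContext(sc)
// Read a pipe-delimited CSV file with a header row, inferring the column types.
val data = spark.read.format("csv").
  option("header", "true").
  option("delimiter", "|").
  option("inferSchema", "true").
  load("D:\\test.csv")
import spark.implicits._
// Keep only the rows whose dept is "IT".
val filter = data.filter($"dept" === "IT")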
Or, to exclude those rows instead:
val filter = data.filter($"dept" =!= "IT")
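Note that `=!=` was added to `Column` in Spark 2.0, so it is not available on the Spark 1.6.2 setup from the question (1.6 uses `!==`, which was deprecated in 2.0). A version-neutral alternative is to wrap the equality test in `not(...)`. The following is a minimal sketch under stated assumptions: the object name `NotEqualFilter`, the helper `excludeDept`, the sample rows, and the local master are hypothetical and not part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.{col, not}

object NotEqualFilter {
  // Keep the rows whose dept differs from `excluded`.
  // not(a === b) compiles on both Spark 1.6 and 2.x.
  def excludeDept(df: DataFrame, excluded: String): DataFrame =
    df.filter(not(col("dept") === excluded))

  def main(args: Array[String]): Unit = {
    // App name and local master are assumptions for a standalone run.
    val conf = new SparkConf().setAppName("not-equal-filter").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical sample rows standing in for the CSV file.
    val data = sc.parallelize(Seq(
      ("Ann", "IT"), ("Bob", "HR"), ("Cid", "IT"), ("Dee", "Sales")
    )).toDF("name", "dept")

    excludeDept(data, "IT").show()

    sc.stop()
  }
}
```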