Dee*_*tty 0 scala apache-spark
I have a dataset similar to the following sample:
tmj_dc_mgmt, Washington, en, 483, 457, 256, ['hiring', 'BusinessMgmt', 'Washington', 'Job']
SRiku0728, ???, ja, 6705, 357, 273, ['None']
BesiktaSeyma_, Akyurt, tr, 12921, 1801, 283, ['None']
AnnaKFrick, Virginia, en, 5731, 682, 1120, ['Investment', 'PPP', 'Bogota', 'jobs']
Accprimary, Manchester, en, 1650, 268, 404, ['None']
Wandii_S, Johannesburg, en, 15510, 828, 398, ['None']
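The entries inside the square brackets are hashtags (excluding 'None').
I am trying to find the top 10 hashtags in the dataset using Spark and Scala.
This is how far I got: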
val file = sc.textFile("/data")
val tmp1 = file
.map(_.split(","))
.map( p=>p(6))
.map(_.replaceAll("\\[|\\]",""))
.map(_.replaceAll("'",""))
.filter(x => x != " None")
.map(word => (word, 1))
.reduceByKey(_ + _)
I don't know how to sort this and take the top 10 from it. I am new to Scala and Spark.
Any help would be appreciated.
You can use top with a custom Ordering to get what you want:
val r = sc.parallelize(Seq(
"tmj_dc_mgmt, Washington, en, 483, 457, 256, ['hiring', 'BusinessMgmt', 'Washington', 'Job']",
"SRiku0728, ???, ja, 6705, 357, 273, ['None']",
"BesiktaSeyma_, Akyurt, tr, 12921, 1801, 283, ['None']",
"AnnaKFrick, Virginia, en, 5731, 682, 1120, ['Investment', 'PPP', 'BusinessMgmt', 'Bogota', 'jobs']",
"Accprimary, Manchester, en, 1650, 268, 404, ['None']",
"Wandii_S, Johannesburg, en, 15510, 828, 398, ['None']",
"Wandii_S, Johannesburg, en, 15510, 828, 398, ['Investment']"
))
val tag = ".*\\[([^\\]]*)\\]".r
val ordering = Ordering.by[(String, Int), Int](_._2)
r.collect{case tag(t) => t.split(",\\s*")}.flatMap(_.map(_.drop(1).dropRight(1))).filter(_ != "None").map(_ -> 1)
.reduceByKey(_ + _).top(10)(ordering).foreach(println)
Result:
(BusinessMgmt,2)
(Investment,2)
(Washington,1)
(Bogota,1)
(PPP,1)
(jobs,1)
(Job,1)
(hiring,1)
(I modified your test data so that some hashtags occur more than once.)
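To make the role of the custom Ordering clearer, here is a minimal sketch using plain Scala collections instead of an RDD. It is a local analogue, not the Spark API itself: `tags`, `counts`, and `top2` are illustrative names, and `sorted(ordering.reverse).take(n)` stands in for what `top(n)(ordering)` returns, assuming the data fits in memory.

```scala
// Hypothetical, already-extracted hashtags (not the original dataset).
val tags = List("hiring", "BusinessMgmt", "Investment", "BusinessMgmt", "Investment", "PPP")

// Local stand-in for map(_ -> 1).reduceByKey(_ + _): count each tag.
val counts = tags.groupBy(identity).map { case (t, ts) => (t, ts.size) }.toList

// Same Ordering as in the answer: compare (tag, count) pairs by count only.
val ordering = Ordering.by[(String, Int), Int](_._2)

// Largest counts first, then keep the first two entries.
val top2 = counts.sorted(ordering.reverse).take(2)

top2.foreach(println)
```

Because the Ordering looks only at the count, pairs with equal counts compare as equal, so their relative order in the output is not guaranteed. The same holds for the ties in the result above.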
Alternatively, if the distinct hashtags fit in the driver's memory, you can use countByValue instead of reduceByKey and do the final sort locally:
r.collect{case tag(t) => t.split(",\\s*")}.flatMap(_.map(_.drop(1).dropRight(1))).filter(_ != "None")
.countByValue().toList.sortBy(-_._2).take(10).foreach(println)
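Also note that I used a different approach to extract the hashtags, because I believe yours leads to incorrect results: selecting column 6 after splitting on commas yields only the first tag, e.g. ['hiring' or ['Investment', rather than the full list.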
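The difference can be checked without Spark. The following sketch applies both extraction approaches to one sample line from the question; `line`, `byColumn`, and `byRegex` are illustrative names that do not appear in the original posts.

```scala
// One sample line from the question's dataset.
val line = "tmj_dc_mgmt, Washington, en, 483, 457, 256, ['hiring', 'BusinessMgmt', 'Washington', 'Job']"

// Question's approach: split on every comma and keep column 6.
// The commas inside the brackets are split too, so only the first tag survives.
val byColumn = line.split(",")(6)
  .replaceAll("\\[|\\]", "")
  .replaceAll("'", "")

// Answer's approach: capture everything between the brackets first,
// then split the captured text into individual tags and strip the quotes.
val tag = ".*\\[([^\\]]*)\\]".r
val byRegex = line match {
  case tag(t) => t.split(",\\s*").map(_.drop(1).dropRight(1)).toList
  case _      => Nil
}

println(byColumn) // " hiring" (with a leading space)
println(byRegex)  // List(hiring, BusinessMgmt, Washington, Job)
```

The leading space left by the column-based approach is also why the question's filter has to compare against " None" rather than "None".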