Ang*_*nda 4 string scala apache-spark
I have a dataset whose rows have the following (tab-separated) format:
Title<\t>Text
Now, for each word in Text, I want to create a (Word, Title) pair. For example:
ABC<\t>Hello World
should give me
(Hello, ABC)
(World, ABC)
Using Scala, I wrote the following:
val file = sc.textFile("s3n://file.txt")
val title = file.map(line => line.split("\t")(0))
val wordtitle = file.map(line => (line.split("\t")(1).split(" ").map(word => (word, line.split("\t")(0)))))
But this gives me the following output:
[Lscala.Tuple2;@2204b589
[Lscala.Tuple2;@632a46d1
[Lscala.Tuple2;@6c8f7633
[Lscala.Tuple2;@3e9945f3
[Lscala.Tuple2;@40bf74a0
[Lscala.Tuple2;@5981d595
[Lscala.Tuple2;@5aed571b
[Lscala.Tuple2;@13f1dc40
[Lscala.Tuple2;@6bb2f7fa
[Lscala.Tuple2;@32b67553
[Lscala.Tuple2;@68d0b627
[Lscala.Tuple2;@8493285
How can I fix this?
Further details
What I want to achieve is to count how many times particular Words occur in the Text for a given Title.
The follow-up code I wrote is:
val wordcountperfile = file.map(line => (line.split("\t")(1).split(" ").flatMap(word => word), line.split("\t")(0))).map(word => (word, 1)).reduceByKey(_ + _)
But it does not work. Please feel free to share your input. Thanks!
So... in Spark you work with a distributed data structure called an RDD. RDDs provide functionality similar to Scala's collection types.
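Because RDDs mirror the collection API, the difference between `map` and `flatMap` can be reproduced without a cluster. The sketch below uses a plain `List` in place of an RDD; the object name `MapVsFlatMap`, the helper `pairsOf`, and the second sample row are made up for illustration. With `map`, every line becomes one `Array` of tuples, and printing an array on the JVM shows its default `[Lscala.Tuple2;@...` representation, which matches the output in the question. With `flatMap`, those arrays are flattened into individual pairs.

```scala
object MapVsFlatMap {
  // Sample rows in the Title<\t>Text format (the second row is invented).
  val lines: List[String] = List("ABC\tHello World", "XYZ\tFoo Bar Baz")

  // Hypothetical helper: split one row into (word, title) pairs.
  def pairsOf(line: String): Array[(String, String)] = {
    val fields = line.split("\t")
    val title = fields(0)
    fields(1).split(" ").map(word => (word, title))
  }

  // map keeps one Array per input line: List[Array[(String, String)]]
  val nested: List[Array[(String, String)]] = lines.map(line => pairsOf(line))

  // flatMap flattens the per-line arrays into a single list of pairs
  val flat: List[(String, String)] = lines.flatMap(line => pairsOf(line))
}
```

Printing the elements of `MapVsFlatMap.flat` shows the individual pairs, whereas printing the elements of `MapVsFlatMap.nested` shows array references.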
val fileRdd = sc.textFile("s3n://file.txt")
// RDD[ String ]

val splitRdd = fileRdd.map( line => line.split("\t") )
// RDD[ Array[ String ] ]

val yourRdd = splitRdd.flatMap( arr => {
  val title = arr( 0 )
  val text = arr( 1 )
  val words = text.split( " " )
  words.map( word => ( word, title ) )
} )
// RDD[ ( String, String ) ]

// Now, if you want to print this...
yourRdd.foreach( { case ( word, title ) => println( s"{ $word, $title }" ) } )

// If you want to count (this counts all words per title, duplicates included),
val countRdd = yourRdd
  .groupBy( { case ( word, title ) => title } ) // group by title
  .map( { case ( title, iter ) => ( title, iter.size ) } ) // count for every title
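The question also asks how often a particular word occurs for a given title, which is a count per (word, title) pair, not per title. A minimal sketch of both counts on a plain Scala `List` follows; the object name `WordCounts` and the sample pairs are invented for illustration. On a pair RDD, the per-pair count would more typically be written as `yourRdd.map( pair => ( pair, 1 ) ).reduceByKey( _ + _ )`, which is not run here.

```scala
object WordCounts {
  // Invented (word, title) pairs standing in for the contents of the RDD.
  val pairs: List[(String, String)] = List(
    ("Hello", "ABC"), ("World", "ABC"), ("Hello", "ABC"), ("Hello", "XYZ")
  )

  // How many times each word occurs under each title.
  val perWordAndTitle: Map[(String, String), Int] =
    pairs
      .groupBy(identity)
      .map { case (pair, occurrences) => (pair, occurrences.size) }

  // Total number of words per title, duplicates included.
  val perTitle: Map[String, Int] =
    pairs
      .groupBy { case (_, title) => title }
      .map { case (title, group) => (title, group.size) }
}
```

Grouping a large RDD by key moves every value for that key across the network, so for big inputs a combining operation such as `reduceByKey` is usually the safer choice.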