Posts by Esm*_*edi

RDD filter in Scala Spark

I have a dataset, and I want to extract the review/text of the records whose review/time falls between x and y, for example (1183334400 < time < 1185926400).

Here is part of my data:

product/productId: B000278ADA
product/title: Jobst Ultrasheer 15-20 Knee-High Silky Beige Large
product/price: 46.34
review/userId: A17KXW1PCUAIIN
review/profileName: Mark Anthony "Mark"
review/helpfulness: 4/4
review/score: 5.0
review/time: 1174435200
review/summary: Jobst UltraSheer Knee High Stockings
review/text: Does a very good job of relieving fatigue.

product/productId: B000278ADB
product/title: Jobst Ultrasheer 15-20 Knee-High Silky Beige Large
product/price: 46.34
review/userId: A9Q3932GX4FX8
review/profileName: Trina Wehle
review/helpfulness: 1/1
review/score: 3.0
review/time: 1352505600
review/summary: Delivery was very long wait.....
review/text: It took almost 3 weeks to recieve the …
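A minimal sketch of one way to do this (not the asker's code): parse each blank-line-separated record into a field map, keep the review/text when review/time lies strictly between x and y. A plain `Seq` stands in for the RDD so the logic is self-contained; on a real Spark RDD the same function would be applied with `rdd.flatMap`.

```scala
object FilterByTime {
  // Returns the review/text if the record's review/time is in (x, y).
  def textInRange(record: String, x: Long, y: Long): Option[String] = {
    // Build a field map from "key: value" lines.
    val fields = record.linesIterator.flatMap { line =>
      line.split(": ", 2) match {
        case Array(k, v) => Some(k -> v)
        case _           => None
      }
    }.toMap
    for {
      t    <- fields.get("review/time").flatMap(_.trim.toLongOption)
      if t > x && t < y
      text <- fields.get("review/text")
    } yield text
  }

  def main(args: Array[String]): Unit = {
    val records = Seq(
      "review/time: 1184000000\nreview/text: In range.",
      "review/time: 1352505600\nreview/text: Out of range."
    )
    // On Spark: rdd.flatMap(r => textInRange(r, x, y))
    val kept = records.flatMap(r => textInRange(r, 1183334400L, 1185926400L))
    println(kept)  // List(In range.)
  }
}
```

To get one RDD element per record rather than per line, a common approach is to set Hadoop's `textinputformat.record.delimiter` to a blank line (`"\n\n"`) when reading the file.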

scala apache-spark

Score: 8 · Solutions: 1 · Views: 20,000

Text preprocessing in Spark-Scala

I want to apply preprocessing stages to a large amount of text data in Spark-Scala, such as lemmatization, stop-word removal (using TF-IDF), and POS tagging. Is there any way to implement these in Spark-Scala?

For example, this is a sample of my data:

The perfect fit for my iPod photo. Great sound for a great price. I use it everywhere. it is very usefulness for me.

After preprocessing:

perfect fit iPod photo great sound great price use everywhere very useful

along with POS tags, e.g. (iPod,NN) (photo,NN).

Is there a POS tagger (sister.arizona) that works with Spark?
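A minimal sketch of the stop-word-removal step in plain Scala, under assumptions: the stop-word list is illustrative, and case is folded to lowercase (the asker's expected output preserves "iPod"). Lemmatization and POS tagging require an NLP library (Stanford CoreNLP is a common choice on the JVM) and are not shown here.

```scala
object Preprocess {
  // Illustrative stop-word list; a real pipeline would use a fuller one.
  val stopWords: Set[String] =
    Set("the", "a", "for", "my", "i", "it", "is", "me", "of", "to")

  // Lowercase, tokenize on non-alphanumerics, drop stop words.
  def clean(text: String): String =
    text.toLowerCase
      .split("[^a-z0-9']+")
      .filter(w => w.nonEmpty && !stopWords(w))
      .mkString(" ")

  def main(args: Array[String]): Unit = {
    val sample = "The perfect fit for my iPod photo."
    println(Preprocess.clean(sample))  // perfect fit ipod photo
  }
}
```

Because `clean` is a pure `String => String` function, it can be applied to a text RDD directly with `rdd.map(Preprocess.clean)`; TF-IDF weighting is available separately via Spark MLlib's `HashingTF` and `IDF`.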

text preprocessor scala text-mining apache-spark

Score: -3 · Solutions: 1 · Views: 5,153

Tag statistics

apache-spark ×2

scala ×2

preprocessor ×1

text ×1

text-mining ×1