I have a dataset, and I want to extract the review/text values whose review/time lies between x and y, for example 1183334400 < time < 1185926400.
Here is part of my data:
product/productId: B000278ADA
product/title: Jobst Ultrasheer 15-20 Knee-High Silky Beige Large
product/price: 46.34
review/userId: A17KXW1PCUAIIN
review/profileName: Mark Anthony "Mark"
review/helpfulness: 4/4
review/score: 5.0
review/time: 1174435200
review/summary: Jobst UltraSheer Knee High Stockings
review/text: Does a very good job of relieving fatigue.
product/productId: B000278ADB
product/title: Jobst Ultrasheer 15-20 Knee-High Silky Beige Large
product/price: 46.34
review/userId: A9Q3932GX4FX8
review/profileName: Trina Wehle
review/helpfulness: 1/1
review/score: 3.0
review/time: 1352505600
review/summary: Delivery was very long wait.....
review/text: It took almost 3 weeks to recieve the …

I want to apply preprocessing steps to a large amount of text data in Spark-Scala, such as lemmatization, stop-word removal (using Tf-Idf), and POS tagging. Is there any way to implement them in Spark-Scala?
For example, here is a sample of my data:
The perfect fit for my iPod photo. Great sound for a great price. I use it everywhere. it is very usefulness for me.
After preprocessing:
perfect fit iPod photo great sound great price use everywhere very useful
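The tokenization and stop-word part of the transformation above could be sketched in plain Scala roughly as follows. This is only an illustration under assumptions: the object name `SimplePreprocess`, its function names, and the stop-word list are hypothetical and chosen just to cover the sample sentence. It does not lemmatize (so "usefulness" stays as is), does not normalize case, and does not assign POS tags; those steps would need an NLP library.

```scala
import java.util.Locale

object SimplePreprocess {
  // Hypothetical stop-word list, chosen only to cover the sample sentence.
  val stopWords: Set[String] =
    Set("the", "for", "my", "a", "i", "it", "is", "me")

  // Split on any run of characters that are not letters or digits,
  // and drop empty tokens.
  def tokenize(text: String): Seq[String] =
    text.split("[^\\p{L}\\p{Nd}]+").filter(_.nonEmpty).toSeq

  // Compare case-insensitively, but keep the original casing (e.g. "iPod").
  def removeStopWords(tokens: Seq[String]): Seq[String] =
    tokens.filterNot(t => stopWords.contains(t.toLowerCase(Locale.ROOT)))

  // Tokenize, remove stop words, and join the remaining tokens with spaces.
  def preprocess(text: String): String =
    removeStopWords(tokenize(text)).mkString(" ")
}
```

Because `preprocess` is an ordinary function on strings, it could be applied per record with `map` on a Spark RDD of texts. Spark ML also ships `RegexTokenizer` and `StopWordsRemover` in `org.apache.spark.ml.feature`, which cover the same two steps on DataFrame columns.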
The tokens should also carry POS tags, for example (iPod,NN) (photo,NN).
There is a POS tagger (sister.arizona); can it be used with Spark?
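For the first question above (extracting review/text for a review/time range), a minimal plain-Scala sketch might look like the following. It assumes that every record starts with a `product/productId:` line and that each line has the form `key: value`. The names `ReviewTimeFilter`, `parseReviews`, and `textsBetween` are hypothetical, not part of any library.

```scala
import scala.util.Try

object ReviewTimeFilter {
  // Hypothetical record type: field name -> value for one review.
  type Review = Map[String, String]

  // Split "review/time: 1174435200" into ("review/time", "1174435200").
  def parseLine(line: String): Option[(String, String)] = {
    val idx = line.indexOf(": ")
    if (idx < 0) None
    else Some(line.substring(0, idx).trim -> line.substring(idx + 2).trim)
  }

  // Assumption: a new record begins at each "product/productId:" line.
  def parseReviews(lines: Seq[String]): Seq[Review] = {
    val grouped = lines.foldLeft(Vector.empty[Vector[String]]) { (acc, line) =>
      if (acc.isEmpty || line.startsWith("product/productId:")) acc :+ Vector(line)
      else acc.init :+ (acc.last :+ line)
    }
    grouped.map(rec => rec.flatMap(l => parseLine(l).toList).toMap)
  }

  // Keep review/text for records with lo < review/time < hi
  // (exclusive bounds, as written in the question).
  def textsBetween(reviews: Seq[Review], lo: Long, hi: Long): Seq[String] =
    reviews
      .filter { r =>
        r.get("review/time")
          .flatMap(s => Try(s.toLong).toOption)
          .exists(t => t > lo && t < hi)
      }
      .flatMap(r => r.get("review/text").toList)
}
```

With the two records shown in the question, `textsBetween(reviews, 1183334400L, 1185926400L)` would return nothing, since neither review/time falls in that range. In Spark, the same predicate could be passed to `filter` on an RDD of parsed records; how the multi-line records are grouped when reading the file is left open here.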