I have a spam dataset with the following data type:
pyspark.rdd.PipelinedRDD
When I run spams.take(3), I get:
[["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"],
 ['WINNER!! As a valued network customer you have been selected to receivea \xc2\xa3900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.'],
 ['Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030']]
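As you can see, each element of the list is wrapped in its own inner brackets, so every row is a one-element list. How can I get rid of these brackets? I have tried many ways to flatten it, but none of them seem to work.

You can use the RDD's flatMap method. It applies a function to each row and flattens the results, so a single input row can produce several output rows. With the identity function, each inner list is unpacked into its elements: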
spams.flatMap(lambda x:x).take(3)
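To see why the identity function is enough, here is a minimal pure-Python sketch of what flatMap does. It is only a local imitation that runs without Spark: the helper name flat_map and the shortened sample rows are made up for illustration and are not part of the PySpark API.

```python
from itertools import chain

# Stand-in for the RDD contents: each row is a one-element list.
# The messages are shortened versions of the ones in the question.
rows = [
    ["Free entry in 2 a wkly comp"],
    ["WINNER!! As a valued network customer"],
    ["Had your mobile 11 months or more?"],
]


def flat_map(func, data):
    """Hypothetical local imitation of RDD.flatMap.

    Applies func to every row, then concatenates the iterables it returns
    into one flat list (one level of nesting is removed).
    """
    return list(chain.from_iterable(func(row) for row in data))


# The identity function returns each inner list unchanged,
# so flattening simply unpacks the inner brackets.
flattened = flat_map(lambda x: x, rows)
print(flattened)
```

If every row is guaranteed to hold exactly one element, spams.map(lambda x: x[0]) should give the same result; flatMap also copes with rows that contain zero or several elements.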