Dav*_*ole 11 python-3.x apache-spark rdd pyspark
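I have a PySpark RDD imported from JSON files. The data elements contain a number of values with undesirable characters. For argument's sake, only characters in string.printable should appear in those JSON files.
Given that there are a large number of elements containing textual information, I have been trying to find a way of mapping the incoming RDD to a function that cleans the data and returns a cleaned RDD as output. I can find ways of printing a cleaned element from the RDD, but not of cleaning the whole collection of elements and then returning an RDD.
An example document might look like the one below; the bad characters could creep into the userAgent, marketingReference and pageTags elements, or into any other text element.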
{
"documentId": "abcdef12-1234-5678-fedc-cba9876543210",
"documentType": "contentSummary",
"dateTimeCreated": "2017-01-01T03:00:22.478Z",
"body": {
"requestUrl": "http://www.our-web-site.com/en-gb/line-of-business/product-category/irritating-guid/",
"requestMethod": "GET",
"responseCode": "200",
"userAgent": "Mozilla/5.0 etc",
"requestHeaders": {
"connection": "close",
"host": "www.our-web-site.com",
"accept-language": "en-gb",
"via": "1.1 www.our-web-site.com",
"user-agent": "Mozilla/5.0 etc",
"x-forwarded-proto": "https",
"clientIp": "99.99.99.99",
"referer": "http://www.our-web-site.com/en-gb/line-of-business/product-category/irritating-guid/",
"accept-encoding": "gzip, deflate",
"incap-client-ip": "99.99.99.99"
},
"body": {
"pageId": "/content/our-web-site/en-gb/holidays/interstitial",
"pageVersion": "1.0",
"pageClassification": "product-page",
"pageTags": "spark, python, rdd, other words",
"MarketingReference": "BUYMEPLEASE",
"referrer": "http://www.our-web-site.com/en-gb/line-of-business/product-category/irritating-guid/",
"webSessionId": "abcdef12-1234-5678-fedc-cba9876543210"
}
}
}
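For illustration, here is a minimal sketch of the kind of cleaning function described above. It assumes each RDD element has already been parsed into nested Python dicts and lists (for example with json.loads). The names PRINTABLE, clean_text and clean_element, and the sample record, are hypothetical and not part of the original data.

```python
import json
import string

# Characters considered acceptable, following the string.printable rule above.
PRINTABLE = frozenset(string.printable)


def clean_text(text):
    """Drop every character that is not in string.printable."""
    return "".join(ch for ch in text if ch in PRINTABLE)


def clean_element(value):
    """Recursively clean all strings inside a parsed JSON value."""
    if isinstance(value, str):
        return clean_text(value)
    if isinstance(value, dict):
        return {clean_text(k): clean_element(v) for k, v in value.items()}
    if isinstance(value, list):
        return [clean_element(item) for item in value]
    return value  # numbers, booleans and None pass through unchanged


# Hypothetical record with stray characters in two nested text fields.
dirty = json.loads(
    '{"body": {"userAgent": "Mozilla/5.0\\u00a0etc\\u2603",'
    ' "body": {"pageTags": "spark,\\u200b python", "pageVersion": "1.0"}}}'
)
cleaned = clean_element(dirty)
print(cleaned["body"]["userAgent"])
print(cleaned["body"]["body"]["pageTags"])
```

Because clean_element is a plain function of one element, it could be passed to the RDD directly, e.g. cleaned_rdd = rdd.map(clean_element), assuming rdd holds parsed documents; if it holds raw JSON strings, rdd.map(json.loads).map(clean_element) would be one option. Either call returns a new RDD rather than printing individual elements.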