Posts by Sac*_*rma

Approaches to data cleaning in Spark

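I am new to data engineering and machine learning and am teaching myself. While working through a sample problem, I came across the following data cleaning tasks: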

1. Remove extra whitespace (keep a single space between words, collapsing any longer runs) and punctuation

2. Convert all words to lower case and remove stop words (list from NLTK)

3. Remove duplicate words in ASSEMBLY_NAME column
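The three tasks above can be sketched as one plain-Python function applied to a single string value. This is only an illustration under stated assumptions: `clean_text` and `STOP_WORDS` are hypothetical names, the stop-word set is a tiny stand-in for the NLTK list, and punctuation is replaced by a space (so "Engine-Mount" becomes two words) rather than deleted outright.

```python
import string

# Stand-in stop-word list (assumption, not the NLTK list). With NLTK installed
# and its corpus downloaded, nltk.corpus.stopwords.words("english") could
# supply the real list instead.
STOP_WORDS = frozenset({"a", "an", "and", "for", "of", "the", "with"})

# Translation table mapping every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text, stop_words=STOP_WORDS):
    """Apply the three cleaning tasks to one string; None passes through."""
    if text is None:
        return None
    # Task 1 (punctuation) and task 2 (lower case).
    lowered = text.translate(_PUNCT_TO_SPACE).lower()
    # split() without arguments collapses any run of whitespace (task 1).
    words = lowered.split()
    seen = set()
    kept = []
    for word in words:
        # Task 2: skip stop words. Task 3: skip words already emitted.
        if word in stop_words or word in seen:
            continue
        seen.add(word)
        kept.append(word)
    # Re-join with exactly one space, keeping first-occurrence order.
    return " ".join(kept)


print(clean_text("  The Front,  FRONT   bumper of the Car!! "))
```

In Spark, a function like this could be wrapped with `pyspark.sql.functions.udf` and applied to the ASSEMBLY_NAME column. Built-ins such as `lower`, `regexp_replace`, `trim`, `array_distinct` and `pyspark.ml.feature.StopWordsRemover` may be a faster route, since they avoid Python UDF overhead.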

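Although I wrote code for tasks like these in university assignments, I have never done so in a real project, and I am looking for guidance from experts who can point me to the best way to accomplish these tasks (in Python or Scala).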

Work completed so far:

1. Read the data from the Parquet file

partFitmentDF = sqlContext.read.parquet("/mnt/blob/devdatasciencesto/pga-parts-forecast/raw/parts-fits/")

display(partFitmentDF)

2. Create a table from the DataFrame

partFitmentDF.createOrReplaceTempView("partsFits")
partFitmentDF.write.mode("overwrite").format("delta").saveAsTable("partsFitsTable")

3. Rearrange the Fits_Assembly_name data in the table so that, for each distinct Itemno, all Fits_Assembly_Name and Fits_Assembly_ID values are rolled up into a single row

%sql

select itemno, concat_ws(' | ' , collect_set(cast(fits_assembly_id as int))) as fits_assembly_id, concat_ws(' | ' …
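To make the roll-up in step 3 concrete, here is a minimal plain-Python sketch of what the `concat_ws(' | ', collect_set(...))` part of the query does for the `fits_assembly_id` column. The function name `roll_up` and the sample rows are hypothetical, and the truncated remainder of the query is not reconstructed. Values are sorted only to make the sketch deterministic; Spark's `collect_set` gives no ordering guarantee.

```python
from collections import defaultdict


def roll_up(rows, sep=" | "):
    """Group (itemno, fits_assembly_id) pairs into one joined string per itemno."""
    grouped = defaultdict(set)
    for itemno, assembly_id in rows:
        # Like collect_set, skip nulls and keep each distinct value once.
        if assembly_id is not None:
            grouped[itemno].add(int(assembly_id))  # mirrors cast(... as int)
    # Sorted for reproducibility in this sketch only.
    return {
        itemno: sep.join(str(value) for value in sorted(ids))
        for itemno, ids in grouped.items()
    }


# Hypothetical sample rows, not taken from the original data.
sample_rows = [("A1", "10"), ("A1", "10"), ("A1", "7"), ("B2", "3"), ("B2", None)]
print(roll_up(sample_rows))
```

Each distinct `itemno` ends up with a single delimited string, which is the shape the cleaning tasks would then operate on.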

scala nltk python-3.x apache-spark pyspark

Sac*_*rma

2019 11-01

Score: 0 · Answers: 1 · Views: 4452
