Ann*_*e C 1 r dplyr data-cleaning
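I have a CSV that appears to be the output of an Excel pivot table, in which group names are nested as row labels above their repeated groups of rows. I want to clean the data so that each row label is repeated in its own column, preferably using dplyr.
The data looks like this: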
dd <- data.frame(variables = c("Abington", "Number of Sales","YTD Number of Sales","Median Sale Price","YTD Median Sale Price", "Acton", "Number of Sales","YTD Number of Sales","Median Sale Price","YTD Median Sale Price"), Year1 = c(" ", 16, 50,415000,413500," ",23,60,799900,704000), Year2 = c(" ",8,13,583000,575000," ",9,39,995000,800000))
dd
variables Year1 Year2
Abington
Number of Sales 16 8
YTD Number of Sales 50 13
Median Sale Price 415000 583000
YTD Median Sale Price 413500 575000
Acton
Number of Sales 23 9
YTD Number of Sales 60 39
Median Sale Price 799900 995000
YTD Median Sale Price 704000 800000
I would like it to look like this:
Town variables Year1 Year2
Abington Number of Sales 16 8
Abington YTD Number of Sales 50 13
Abington Median Sale Price 415000 583000
Abington YTD Median Sale Price 413500 575000
Acton Number of Sales 23 9
Acton YTD Number of Sales 60 39
Acton Median Sale Price 799900 995000
Acton YTD Median Sale Price 704000 800000
We can do this with tidyverse (or just dplyr & tidyr):
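library(tidyverse)
dd %>%
  # Rows where both year columns are blank are town headers
  mutate(Town = ifelse(Year1 == " " & Year2 == " ", variables, NA)) %>%
  # Carry each town name down over the rows that belong to it
  fill(Town, .direction = "down") %>%
  # Drop the header rows themselves
  filter(Town != variables) %>%
  # Move Town to the first column
  relocate(Town)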
This results in:
Town variables Year1 Year2
1 Abington Number of Sales 16 8
2 Abington YTD Number of Sales 50 13
3 Abington Median Sale Price 415000 583000
4 Abington YTD Median Sale Price 413500 575000
5 Acton Number of Sales 23 9
6 Acton YTD Number of Sales 60 39
7 Acton Median Sale Price 799900 995000
8 Acton YTD Median Sale Price 704000 8e+05
Note that the empty values in Year1 and Year2 are actually a single space (" "), not empty strings or NA.
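Because those blanks are strings, Year1 and Year2 are character columns, which is why 800000 prints as 8e+05 in the result. If the real CSV might contain empty strings or NA instead of a single space, a slightly more defensive variant could look like the sketch below. The helper is_blank() and the name cleaned are made up for this illustration, and it assumes dplyr 1.0 or later (for across() and relocate()).

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
dd <- data.frame(
  variables = c("Abington", "Number of Sales", "YTD Number of Sales",
                "Median Sale Price", "YTD Median Sale Price",
                "Acton", "Number of Sales", "YTD Number of Sales",
                "Median Sale Price", "YTD Median Sale Price"),
  Year1 = c(" ", 16, 50, 415000, 413500, " ", 23, 60, 799900, 704000),
  Year2 = c(" ", 8, 13, 583000, 575000, " ", 9, 39, 995000, 800000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a cell is NA, empty, or whitespace only
is_blank <- function(x) is.na(x) | trimws(x) == ""

cleaned <- dd %>%
  # Header rows are those where every year column is blank
  mutate(Town = if_else(is_blank(Year1) & is_blank(Year2),
                        variables, NA_character_)) %>%
  fill(Town, .direction = "down") %>%
  # Drop the header rows before converting, so as.numeric() sees only numbers
  filter(!(is_blank(Year1) & is_blank(Year2))) %>%
  mutate(across(c(Year1, Year2), as.numeric)) %>%
  relocate(Town)

cleaned
```

Filtering on the blank check rather than on Town != variables also keeps a row whose label happens to equal a town name, and converting to numeric afterwards makes the year columns usable for arithmetic.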