jl7*_*795 6 pivot pivot-table r tidyr data-cleaning
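I am trying to pivot a table that has headers and sub-headers, so that the headers go into a "date" column and the sub-headers become two columns instead of repeating.
Here is a sample of my data.
It was generated using dput(). In the original Excel file, each date spans two sub-headers ("blue" and "green"); in R, the resulting blank header cells are renamed X.1, X.2, etc.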
table <- " X X.1 X02.Jul.12 X.2 X03.Jul.12 X.3 X04.Jul.12 X.4
1 category number blue green blue green blue green
2 G 1 1 0 1 0 1 0
3 G 2 2 99 2 99 1 99
4 G 3 1 1 1 99 1 99
5 G 4 1 1 1 1 2 99
6 G 5 1 0 1 0 1 99
7 G 6 1 99 1 1 1 99
8 G 7 1 0 1 0 1 0
9 G 8 1 1 1 1 1 99
10 G 9 1 1 1 1 1 1
11 H 1 1 1 1 1 1 1
12 H 2 1 99 1 0 1 0
13 H 3 1 1 1 1 1 99
14 H 4 1 99 1 2 1 99
15 H 5 1 1 1 1 1 1
16 H 6 1 0 1 0 1 99
17 H 7 1 1 2 1 1 99
18 H 8 2 0 2 0 1 1
19 H 9 2 0 2 0 1 1"
#Create a dataframe with the above table
df <- read.table(text=table, header = TRUE)
df
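For context, names like X, X.1 and X02.Jul.12 are what base R's default name repair produces. The following is a minimal sketch, assuming the blank header cells arrive as empty strings; `raw_header` is a hypothetical stand-in for the first row of the spreadsheet.

```r
# Hypothetical header row: blank cells are empty strings, dates are text.
raw_header <- c("", "", "02.Jul.12", "", "03.Jul.12", "", "04.Jul.12", "")

# make.names() is what read.table() applies when check.names = TRUE:
# it prefixes invalid names with "X" and de-duplicates with numeric suffixes.
repaired <- make.names(raw_header, unique = TRUE)
repaired
```

The repaired names match the column names in the sample above, which is why the link between each date and its "green" column is lost once the file has been read.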
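Here is what it looks like in Excel:
And here is the desired output I am trying to achieve:
Although this could be done manually in Excel, I have multiple files with more than 100 dates/columns, so I would prefer to find a way to clean it up in R.
Any help would be greatly appreciated!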
Here is a representation of the dataset as if it had been read from Excel without any name repair:
# Define the dataset.
df_excel <- structure(
list(
c("category", "G", "G", "G", "G", "G", "G", "G", "G", "G", "H", "H", "H", "H", "H", "H", "H", "H", "H"),
c("number", "1", "2", "3", "4", "5", "6", "7", "8", "9", "1", "2", "3", "4", "5", "6", "7", "8", "9"),
`02.Jul.12` = c("blue", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2"),
c("green", "0", "99", "1", "1", "0", "99", "0", "1", "1", "1", "99", "1", "99", "1", "0", "1", "0", "0"),
`03.Jul.12` = c("blue", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2"),
c("green", "0", "99", "99", "1", "0", "1", "0", "1", "1", "1", "0", "1", "2", "1", "0", "1", "0", "0"),
`04.Jul.12` = c("blue", "1", "1", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"),
c("green", "0", "99", "99", "99", "99", "99", "0", "99", "1", "1", "0", "99", "99", "1", "99", "99", "1", "1")
),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19")
)
# Save dataset in Excel file ('reprex.xlsx') for reproducibility.
openxlsx::write.xlsx(x = df_excel, file = "./reprex.xlsx")
Here is a tidyverse solution that can handle duplicate column names (such as blue) without relying on string manipulation of those names:
First load the tidyverse and locate the Excel file:
# Load the tidyverse.
library(tidyverse)
# Filepath to the Excel file.
filepath <- "reprex.xlsx"
Then read the three relevant parts of the Excel file: the date row (at the very top), the header (with duplicate names), and the dataset itself.
# Extract the date row and fill in the blanks.
dates <- readxl::read_excel(path = filepath, col_names = FALSE, skip = 0, n_max = 1) %>%
# Convert everything to dates where possible; leave blanks (NAs) elsewhere.
mutate(across(.cols = everything(), .fns = lubridate::as_datetime)) %>%
# Treat date row as a column.
as.double() %>% lubridate::as_datetime() %>% as_tibble() %>%
# Fill in the blanks with the preceding dates.
fill(1, .direction = "down") %>%
# Treat the result as a vector of dates.
.[[1]]
# Extract the header...
names <- readxl::read_excel(path = filepath, col_names = FALSE, skip = 1, n_max = 1) %>%
# ...as a vector of column names (with duplicates).
as.character()
# Extract the (unnamed) dataset.
df <- readxl::read_excel(path = filepath, col_names = FALSE, skip = 2, n_max = Inf)
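The fill step in the date extraction above can be looked at in isolation. This is a minimal sketch on a hand-built date column rather than the real Excel row; the `header`, `filled` and `value_dates` names are hypothetical, and the dates are assumed to have been parsed already.

```r
library(tidyr)

# Hypothetical date row, transposed into a column: the two ID columns have
# no date, and each date is followed by one blank (merged) cell.
header <- data.frame(
  date = as.Date(c(NA, NA, "2012-07-02", NA, "2012-07-03", NA, "2012-07-04", NA))
)

# Carry each date forward into the blank cell that follows it.
filled <- fill(header, date, .direction = "down")$date

# Drop the leading NAs (the ID columns) to get one date per value column.
value_dates <- filled[!is.na(filled)]
value_dates
```

After this, `value_dates` holds one entry per blue/green column, in the same left-to-right order as the spreadsheet, which is what the later positional lookup relies on.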
Finally, use this workflow to properly name and pivot the data.
# Cut out the headers from the data.
df <- df %>%
# Properly name the dataset.
set_names(nm = names) %>%
# Pivot the color columns.
pivot_longer(cols = !c(category, number), names_to = "color") %>%
# Convert to the proper datatypes.
mutate(
category = as.character(category),
number = as.integer(number),
value = as.numeric(value)
) %>%
# Identify each "clump" of colors by the one row from which it originated;
# where {'category', 'number'} uniquely identify each such row.
group_by(category, number) %>%
# Map the date names to each clump.
mutate(
# Index the entries in each clump.
date = row_number(),
# Map each date to its corresponding entry.
date = dates[!is.na(dates)][date],
# Ensure homogeneity as date objects.
date = lubridate::as_datetime(date)
) %>% ungroup() %>%
# Pivot the colors into consolidated columns: one for each color.
pivot_wider(names_from = color, values_from = value) %>%
# Sort as desired.
arrange(date, category, number)
Given a reprex.xlsx like the one you described here:
When I import an Excel .xlsx file instead of a .csv file, the dates become numbers (e.g. 41092)
this solution should produce the following df:
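The positional date lookup and the final pivot_wider() can be checked on a toy example. This is a sketch under stated assumptions, not the workflow itself: `long` is a hypothetical, already-lengthened table with two rows and two dates, and `value_dates` stands in for `dates[!is.na(dates)]`.

```r
library(dplyr)
library(tidyr)

# Hypothetical long data: for each (category, number) row of the sheet,
# the values appear in spreadsheet order: blue, green, blue, green.
long <- tibble(
  category = "G",
  number   = rep(1:2, each = 4),
  color    = rep(c("blue", "green"), times = 4),
  value    = c(1, 0, 1, 0, 2, 99, 2, 99)
)

# One date per value column, in spreadsheet order (already filled).
value_dates <- as.Date(c("2012-07-02", "2012-07-02", "2012-07-03", "2012-07-03"))

result <- long %>%
  group_by(category, number) %>%
  # The i-th value within a clump came from the i-th value column.
  mutate(date = value_dates[row_number()]) %>%
  ungroup() %>%
  # One column per color, one row per (category, number, date).
  pivot_wider(names_from = color, values_from = value) %>%
  arrange(date, category, number)

result
```

Because the date is assigned by position rather than parsed out of a column name, the approach does not depend on how the duplicate blue and green names would otherwise be repaired.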
# A tibble: 54 x 5
category number date blue green
<chr> <int> <dttm> <dbl> <dbl>
1 G 1 2012-07-02 00:00:00 1 0
2 G 2 2012-07-02 00:00:00 2 99
3 G 3 2012-07-02 00:00:00 1 1
4 G 4 2012-07-02 00:00:00 1 1
5 G 5 2012-07-02 00:00:00 1 0
6 G 6 2012-07-02 00:00:00 1 99
7 G 7 2012-07-02 00:00:00 1 0
8 G 8 2012-07-02 00:00:00 1 1
9 G 9 2012-07-02 00:00:00 1 1
10 H 1 2012-07-02 00:00:00 1 1
11 H 2 2012-07-02 00:00:00 1 99
12 H 3 2012-07-02 00:00:00 1 1
13 H 4 2012-07-02 00:00:00 1 99
14 H 5 2012-07-02 00:00:00 1 1
15 H 6 2012-07-02 00:00:00 1 0
16 H 7 2012-07-02 00:00:00 1 1
17 H 8 2012-07-02 00:00:00 2 0
18 H 9 2012-07-02 00:00:00 2 0
19 G 1 2012-07-03 00:00:00 1 0
20 G 2 2012-07-03 00:00:00 2 99
21 G 3 2012-07-03 00:00:00 1 99
22 G 4 2012-07-03 00:00:00 1 1
23 G 5 2012-07-03 00:00:00 1 0
24 G 6 2012-07-03 00:00:00 1 1
25 G 7 2012-07-03 00:00:00 1 0
26 G 8 2012-07-03 00:00:00 1 1
27 G 9 2012-07-03 00:00:00 1 1
28 H 1 2012-07-03 00:00:00 1 1
29 H 2 2012-07-03 00:00:00 1 0
30 H 3 2012-07-03 00:00:00 1 1
31 H 4 2012-07-03 00:00:00 1 2
32 H 5 2012-07-03 00:00:00 1 1
33 H 6 2012-07-03 00:00:00 1 0
34 H 7 2012-07-03 00:00:00 2 1
35 H 8 2012-07-03 00:00:00 2 0
36 H 9 2012-07-03 00:00:00 2 0
37 G 1 2012-07-04 00:00:00 1 0
38 G 2 2012-07-04 00:00:00 1 99
39 G 3 2012-07-04 00:00:00 1 99
40 G 4 2012-07-04 00:00:00 2 99
41 G 5 2012-07-04 00:00:00 1 99
42 G 6 2012-07-04 00:00:00 1 99
43 G 7 2012-07-04 00:00:00 1 0
44 G 8 2012-07-04 00:00:00 1 99
45 G 9 2012-07-04 00:00:00 1 1
46 H 1 2012-07-04 00:00:00 1 1
47 H 2 2012-07-04 00:00:00 1 0
48 H 3 2012-07-04 00:00:00 1 99
49 H 4 2012-07-04 00:00:00 1 99
50 H 5 2012-07-04 00:00:00 1 1
51 H 6 2012-07-04 00:00:00 1 99
52 H 7 2012-07-04 00:00:00 1 99
53 H 8 2012-07-04 00:00:00 1 1
54 H 9 2012-07-04 00:00:00 1 1
As with openxlsx::convertToDate(), the readxl functions used here automatically convert Excel date numbers into proper dates in R.
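If a date does arrive as a raw Excel serial number such as 41092, it can also be converted by hand. This is a minimal base R sketch, assuming the workbook uses Excel's default 1900 date system, for which the conventional origin is 1899-12-30; `excel_serial` is a hypothetical variable name.

```r
# Hypothetical serial number as it might appear after a raw import.
excel_serial <- 41092

# Excel's 1900 date system counts days from 1899-12-30.
converted <- as.Date(excel_serial, origin = "1899-12-30")
format(converted)
```

Workbooks saved with the 1904 date system use a different origin, so the setting is worth checking if the converted dates come out roughly four years off.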