Pivot a data frame to keep column headers and sub-headers in R

jl7*_*795 6 pivot pivot-table r tidyr data-cleaning

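I am trying to pivot a table that has headers and sub-headers, so that the headers go into a "date" column and the sub-headers become two columns instead of repeating.

Here is a sample of my data.

It was generated with dput(). In the original Excel file each date spans two sub-headers ("blue" and "green"); in R, the resulting blank header cells were renamed X.1, X.2, etc.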

table <- "          X    X.1 X02.Jul.12   X.2 X03.Jul.12   X.3 X04.Jul.12   X.4
1  category number       blue green       blue green       blue green
2         G      1          1     0          1     0          1     0
3         G      2          2    99          2    99          1    99
4         G      3          1     1          1    99          1    99
5         G      4          1     1          1     1          2    99
6         G      5          1     0          1     0          1    99
7         G      6          1    99          1     1          1    99
8         G      7          1     0          1     0          1     0
9         G      8          1     1          1     1          1    99
10        G      9          1     1          1     1          1     1
11        H      1          1     1          1     1          1     1
12        H      2          1    99          1     0          1     0
13        H      3          1     1          1     1          1    99
14        H      4          1    99          1     2          1    99
15        H      5          1     1          1     1          1     1
16        H      6          1     0          1     0          1    99
17        H      7          1     1          2     1          1    99
18        H      8          2     0          2     0          1     1
19        H      9          2     0          2     0          1     1"

#Create a dataframe with the above table
df <- read.table(text=table, header = TRUE)
df
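For context, the X, X.1, X02.Jul.12 names most likely come from R's default name repair, which kicks in when a header cell is blank or starts with a digit. A minimal base R sketch of that behaviour follows; the `raw_header` vector is a hypothetical stand-in for the Excel header row, not something taken from the original file.

```r
# Hypothetical raw header row: blank cells above the two id columns and
# above the second column under each merged date cell.
raw_header <- c("", "", "02.Jul.12", "", "03.Jul.12", "", "04.Jul.12", "")

# make.names() turns invalid names into valid ones by prefixing "X";
# unique = TRUE then de-duplicates the repeated "X" with numeric suffixes.
repaired <- make.names(raw_header, unique = TRUE)
print(repaired)
```

This is why the date information survives only in every other column name, and why the blue/green labels end up in the first data row.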

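Here is the example as it looks in Excel:

Current data

This is the desired output I want to achieve:

Desired output

Although this could be done manually in Excel, I have multiple files with more than 100 dates/columns, so I would prefer to find a way to clean it in R.

Any help would be much appreciated!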

Excel reprex

Below is a representation of the dataset as if it had been read from Excel without any name repair:

# Define the dataset.
df_excel <- structure(
  list(
    c("category", "G", "G", "G", "G", "G", "G", "G", "G", "G", "H", "H", "H", "H", "H", "H", "H", "H", "H"),
    c("number", "1", "2", "3", "4", "5", "6", "7", "8", "9", "1", "2", "3", "4", "5", "6", "7", "8", "9"),
    `02.Jul.12` = c("blue", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2"),
    c("green", "0", "99", "1", "1", "0", "99", "0", "1", "1", "1", "99", "1", "99", "1", "0", "1", "0", "0"),
    `03.Jul.12` = c("blue", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2"),
    c("green", "0", "99", "99", "1", "0", "1", "0", "1", "1", "1", "0", "1", "2", "1", "0", "1", "0", "0"),
    `04.Jul.12` = c("blue", "1", "1", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"),
    c("green", "0", "99", "99", "99", "99", "99", "0", "99", "1", "1", "0", "99", "99", "1", "99", "99", "1", "1")
  ),
  class = "data.frame",
  row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19")
)

# Save dataset in Excel file ('reprex.xlsx') for reproducibility.
openxlsx::write.xlsx(x = df_excel, file = "./reprex.xlsx")

Gre*_*reg 2

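Here is a tidyverse solution that handles duplicate column names (such as blue) without relying on concatenating those names:

Solution

First load the tidyverse and locate the Excel file: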

# Load the tidyverse.
library(tidyverse)


# Filepath to the Excel file.
filepath <- "reprex.xlsx"

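Then read the three relevant parts of the Excel file: the date row (at the very top), the header (with its duplicate names), and the dataset itself.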

# Extract the date row and fill in the blanks.
dates <- readxl::read_excel(path = filepath, col_names = FALSE, skip = 0, n_max = 1) %>%
  # Convert everything to dates where possible; leave blanks (NAs) elsewhere.
  mutate(across(.cols = everything(), .fns = lubridate::as_datetime)) %>%
  # Treat date row as a column.
  as.double() %>% lubridate::as_datetime() %>% as_tibble() %>%
  # Fill in the blanks with the preceding dates.
  fill(1, .direction = "down") %>%
  # Treat the result as a vector of dates.
  .[[1]]


# Extract the header...
names <- readxl::read_excel(path = filepath, col_names = FALSE, skip = 1, n_max = 1) %>%
  # ...as a vector of column names (with duplicates).
  as.character()


# Extract the (unnamed) dataset.
df <- readxl::read_excel(path = filepath, col_names = FALSE, skip = 2, n_max = Inf)
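The fill step is what spreads each date across both of its color columns. As a rough illustration of that idea only, here is a base R sketch of a fill-forward on a toy date row; `date_row` and `fill_down()` are hypothetical names introduced for this example, and the answer itself uses `tidyr::fill()` instead.

```r
# Hypothetical date row as read from the sheet: NA where the cell was blank.
date_row <- as.Date(c(NA, NA, "2012-07-02", NA, "2012-07-03", NA, "2012-07-04", NA))

# Carry the last non-missing value forward.
fill_down <- function(x) {
  # Position of the most recent non-NA entry at each index (0 if none yet).
  idx <- cummax(seq_along(x) * (!is.na(x)))
  # Leading blanks have nothing to inherit, so they stay NA.
  idx[idx == 0] <- NA
  x[idx]
}

filled <- fill_down(date_row)
print(filled)
```

The two leading NAs correspond to the category and number columns, which have no date above them.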

Finally, use this workflow to name and pivot the data properly.

# Cut out the headers from the data.
df <- df %>%
  # Properly name the dataset.
  set_names(nm = names) %>%
  
  # Pivot the color columns.
  pivot_longer(cols = !c(category, number), names_to = "color") %>%

  # Convert to the proper datatypes.
  mutate(
    category = as.character(category),
    number = as.integer(number),
    value = as.numeric(value)
  ) %>%
  
  # Identify each "clump" of colors by the one row from which it originated;
  # where {'category', 'number'} uniquely identify each such row.
  group_by(category, number) %>%
  # Map the date names to each clump.
  mutate(
    # Index the entries in each clump.
    date = row_number(),
    # Map each date to its corresponding entry.
    date = dates[!is.na(dates)][date],
    # Ensure homogeneity as date objects.
    date = lubridate::as_datetime(date)
  ) %>% ungroup() %>%
  
  # Pivot the colors into consolidated columns: one for each color.
  pivot_wider(names_from = color, values_from = value) %>%
  
  # Sort as desired.
  arrange(date, category, number)
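To make the reshaping idea easier to follow, here is a small base R sketch on toy data. It is not the workflow above, just an illustration of the same mapping under stated assumptions: `dates`, `header`, and `wide` are hypothetical objects that mimic the filled date row, the duplicated header, and the unnamed dataset.

```r
# Hypothetical toy inputs mirroring the sheet layout (not the answer's objects).
dates  <- as.Date(c(NA, NA, "2012-07-02", "2012-07-02", "2012-07-03", "2012-07-03"))
header <- c("category", "number", "blue", "green", "blue", "green")
wide <- data.frame(
  V1 = c("G", "H"), V2 = c(1L, 1L),
  V3 = c(1, 2), V4 = c(0, 99),
  V5 = c(3, 4), V6 = c(5, 6)
)

# Columns with no date above them are the identifier columns.
is_id <- is.na(dates)
ids <- wide[, is_id, drop = FALSE]
names(ids) <- header[is_id]

# Build one block of rows per distinct date, then stack the blocks.
unique_dates <- unique(dates[!is_id])
blocks <- lapply(seq_along(unique_dates), function(i) {
  cols <- which(!is_id & dates == unique_dates[i])
  vals <- wide[, cols, drop = FALSE]
  names(vals) <- header[cols]  # "blue", "green"
  cbind(ids, date = unique_dates[i], vals)
})
long <- do.call(rbind, blocks)
print(long)
```

Each block reuses the same blue and green names, so the duplicated header never has to be made unique.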

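Result

Given a reprex.xlsx like the one you describe here:

Dates become numbers (e.g. 41092) when I import an Excel .xlsx file instead of a .csv file

this solution should produce the following df: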

# A tibble: 54 x 5
   category number date                 blue green
   <chr>     <int> <dttm>              <dbl> <dbl>
 1 G             1 2012-07-02 00:00:00     1     0
 2 G             2 2012-07-02 00:00:00     2    99
 3 G             3 2012-07-02 00:00:00     1     1
 4 G             4 2012-07-02 00:00:00     1     1
 5 G             5 2012-07-02 00:00:00     1     0
 6 G             6 2012-07-02 00:00:00     1    99
 7 G             7 2012-07-02 00:00:00     1     0
 8 G             8 2012-07-02 00:00:00     1     1
 9 G             9 2012-07-02 00:00:00     1     1
10 H             1 2012-07-02 00:00:00     1     1
11 H             2 2012-07-02 00:00:00     1    99
12 H             3 2012-07-02 00:00:00     1     1
13 H             4 2012-07-02 00:00:00     1    99
14 H             5 2012-07-02 00:00:00     1     1
15 H             6 2012-07-02 00:00:00     1     0
16 H             7 2012-07-02 00:00:00     1     1
17 H             8 2012-07-02 00:00:00     2     0
18 H             9 2012-07-02 00:00:00     2     0
19 G             1 2012-07-03 00:00:00     1     0
20 G             2 2012-07-03 00:00:00     2    99
21 G             3 2012-07-03 00:00:00     1    99
22 G             4 2012-07-03 00:00:00     1     1
23 G             5 2012-07-03 00:00:00     1     0
24 G             6 2012-07-03 00:00:00     1     1
25 G             7 2012-07-03 00:00:00     1     0
26 G             8 2012-07-03 00:00:00     1     1
27 G             9 2012-07-03 00:00:00     1     1
28 H             1 2012-07-03 00:00:00     1     1
29 H             2 2012-07-03 00:00:00     1     0
30 H             3 2012-07-03 00:00:00     1     1
31 H             4 2012-07-03 00:00:00     1     2
32 H             5 2012-07-03 00:00:00     1     1
33 H             6 2012-07-03 00:00:00     1     0
34 H             7 2012-07-03 00:00:00     2     1
35 H             8 2012-07-03 00:00:00     2     0
36 H             9 2012-07-03 00:00:00     2     0
37 G             1 2012-07-04 00:00:00     1     0
38 G             2 2012-07-04 00:00:00     1    99
39 G             3 2012-07-04 00:00:00     1    99
40 G             4 2012-07-04 00:00:00     2    99
41 G             5 2012-07-04 00:00:00     1    99
42 G             6 2012-07-04 00:00:00     1    99
43 G             7 2012-07-04 00:00:00     1     0
44 G             8 2012-07-04 00:00:00     1    99
45 G             9 2012-07-04 00:00:00     1     1
46 H             1 2012-07-04 00:00:00     1     1
47 H             2 2012-07-04 00:00:00     1     0
48 H             3 2012-07-04 00:00:00     1    99
49 H             4 2012-07-04 00:00:00     1    99
50 H             5 2012-07-04 00:00:00     1     1
51 H             6 2012-07-04 00:00:00     1    99
52 H             7 2012-07-04 00:00:00     1    99
53 H             8 2012-07-04 00:00:00     1     1
54 H             9 2012-07-04 00:00:00     1     1

Note

Like openxlsx::convertToDate(), the readxl functions used here automatically convert Excel date numbers to proper dates in R.
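If a date does arrive as a raw serial number such as 41092, it can also be converted by hand in base R. The sketch below assumes the workbook uses Excel's 1900 date system, for which the matching origin in R is 1899-12-30; `excel_serial` is a name introduced for illustration.

```r
# Excel (1900 date system) stores dates as day counts; in R the matching
# origin is 1899-12-30.
excel_serial <- 41092
converted <- as.Date(excel_serial, origin = "1899-12-30")
print(converted)
# [1] "2012-07-02"
```

Workbooks saved with the 1904 date system would need a different origin, so it is worth checking one known date before converting a whole column.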