pivot_longer for repeated-measures data spread across multiple columns

tcv*_*992 1 pivot r dplyr tidyr

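I am trying to convert my data to long format using the pivot_longer function from the tidyr package (loaded with the tidyverse). The current wide data contain three repeated measurements per patient of age, systolic blood pressure, and whether antihypertensive medication is used (med_hypt), plus the time-invariant variable sex.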

Example data and what I have tried:

library(tidyverse)
library(dplyr)
library(magrittr)

wide_data <- structure(list(id = c(12002, 17001, 17002, 42001, 66001, 82002, 166002, 177001, 177002, 240001), 
                            sex = structure(c(2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L), 
                                            .Label = c("men", "women"), class = "factor"), 
                            time1_age = c(71.2, 67.9, 66.5, 57.7, 57.1, 60.9, 80.9, 59.7, 58.2, 66.6), 
                            time1_systolicBP = c(102, 152, NA_real_, 170, 151, 135, 162, 133, 131, 117), 
                            time1_med_hypt = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
                            time2_age = c(74.2, 69.2, 67.8, 58.9, 58.4, 62.5, 82.2, 61, 59.5, 67.8), 
                            time2_systolicBP = c(NA_real_, 146, NA_real_, 151, 129, 129, 137, 144, NA_real_, 132), 
                            time2_med_hypt = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), 
                            time3_age = c(78, 74.2, 72.8, 64.1, 63.3, 67.7, 87.1, 66, 64.5, 72.9), 
                            time3_systolicBP = c(NA_real_, 160.5, NA_real_, 171, 135, 160, 151, 166, 129, 150.5), 
                            time3_med_hypt = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), 
                       row.names = c(NA, 10L), class = "data.frame")

# Pivoting to a longer format
long_data <- wide_data %>% 
  pivot_longer(
    cols=!id,
    names_to=c(".value", "time"), 
    names_sep="_", 
    values_drop_na=FALSE
  )

This produces the following tibble:

# A tibble: 40 x 6
      id time       sex   time1 time2 time3
   <dbl> <chr>      <fct> <dbl> <dbl> <dbl>
 1 12002 NA         women  NA    NA    NA  
 2 12002 age        NA     71.2  74.2  78  
 3 12002 systolicBP NA    102    NA    NA  
 4 12002 med        NA      0     0     0  
 5 17001 NA         men    NA    NA    NA  
 6 17001 age        NA     67.9  69.2  74.2
 7 17001 systolicBP NA    152   146   160. 
 8 17001 med        NA      0     0     0  
 9 17002 NA         women  NA    NA    NA  
10 17002 age        NA     66.5  67.8  72.8
# ... with 30 more rows

The column names I want are id, time, age, sex, systolicBP, and med_hypt, with 3 rows per patient corresponding to the 3 repeated measurements.

Ano*_*n R 5

This may not add anything new to the solutions already posted; the only difference is the regex used for the names_pattern argument.

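  • As you may have noticed, some column names contain one _ and others contain two. \\w+ matches one or more word characters (letters, digits, and the underscore). By requiring it to be followed by digits with \\d+, as in the time3 of time3_age, we tell pivot_longer to store that part of the column name (time3) in the time column. The rest of each column name then supplies the names of the measured variables: age, systolicBP, and med_hypt.
  • Note that with \\w+\\d+ as the first group, rather than \\w+ alone, everything after the underscore that follows the digits is captured as the column name, whether it contains an underscore (med_hypt) or not (systolicBP). With \\w+ alone, the first group could also swallow med, and the resulting column would be hypt instead of med_hypt.
  • Finally, since I defined two capture groups, I had to set names_pattern (or names_sep) in a way that specifies how the two parts are defined and separated.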
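The difference between the two patterns can be checked on the column names alone, without pivoting anything. The following is a minimal sketch in base R: capture_groups is a hypothetical helper written for this illustration, and regexec with perl = TRUE only stands in for the matching that names_pattern performs, so treat it as an approximation rather than a description of tidyr's internals.

```r
# Measurement column names for the first time point, as in the question's wide_data
cols <- c("time1_age", "time1_systolicBP", "time1_med_hypt")

# Hypothetical helper (illustration only): return the capture groups of
# `pattern` for each name as a character matrix, one row per name
capture_groups <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  out <- do.call(rbind, lapply(m, function(g) g[-1]))  # drop the full match
  rownames(out) <- x
  out
}

# First group must end in digits, so it stops at "time1"
anchored <- capture_groups("(\\w+\\d+)_(\\w+)", cols)

# First group is greedy and \w also matches "_", so it can run up to the last underscore
greedy <- capture_groups("(\\w+)_(\\w+)", cols)

print(anchored)
print(greedy)
```

For time1_med_hypt, the first pattern splits the name into time1 and med_hypt, while the second splits it into time1_med and hypt, which is the behaviour described in the second bullet.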
library(dplyr)
library(tidyr)  # pivot_longer() is exported by tidyr, not dplyr

wide_data %>%
  pivot_longer(!c(id, sex), names_to = c("time", ".value"), 
               names_pattern = "(\\w+\\d+)_(\\w+)")

# A tibble: 30 x 6
      id sex   time    age systolicBP med_hypt
   <dbl> <fct> <chr> <dbl>      <dbl>    <dbl>
 1 12002 women time1  71.2       102         0
 2 12002 women time2  74.2        NA         0
 3 12002 women time3  78          NA         0
 4 17001 men   time1  67.9       152         0
 5 17001 men   time2  69.2       146         0
 6 17001 men   time3  74.2       160.        0
 7 17002 women time1  66.5        NA         0
 8 17002 women time2  67.8        NA         0
 9 17002 women time3  72.8        NA         0
10 42001 men   time1  57.7       170         0
# ... with 20 more rows
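As a possible follow-up, the result can be checked against the layout requested in the question (3 rows per patient, columns id, sex, time, age, systolicBP, med_hypt). The sketch below uses a two-patient subset named toy, with values copied from wide_data; the names toy, long_toy and rows_per_patient are made up for this example. Converting time to an integer is an extra step the question does not ask for, shown only in case a numeric time index is more convenient.

```r
library(dplyr)
library(tidyr)

# Two-patient subset in the same shape as wide_data (values copied from the question)
toy <- data.frame(
  id = c(12002, 17001),
  sex = factor(c("women", "men")),
  time1_age = c(71.2, 67.9),
  time1_systolicBP = c(102, 152),
  time1_med_hypt = c(0, 0),
  time2_age = c(74.2, 69.2),
  time2_systolicBP = c(NA, 146),
  time2_med_hypt = c(0, 0),
  time3_age = c(78, 74.2),
  time3_systolicBP = c(NA, 160.5),
  time3_med_hypt = c(0, 0)
)

long_toy <- toy %>%
  pivot_longer(!c(id, sex), names_to = c("time", ".value"),
               names_pattern = "(\\w+\\d+)_(\\w+)") %>%
  # Optional: strip the "time" prefix so that time becomes 1, 2, 3
  mutate(time = as.integer(sub("^time", "", time)))

# One row per patient with the number of long-format rows it produced
rows_per_patient <- count(long_toy, id)

print(long_toy)
print(rows_per_patient)
```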