我的数据如下所示:
# A tibble: 6 x 4
name val time x1
<chr> <dbl> <date> <dbl>
1 C Farolillo 7 2016-04-20 51.5
2 C Farolillo 3 2016-04-21 56.3
3 C Farolillo 7 2016-04-22 56.3
4 C Farolillo 13 2016-04-23 57.9
5 C Farolillo 7 2016-04-24 58.7
6 C Farolillo 9 2016-04-25 59.0
Run Code Online (Sandbox Code Playgroud)
我正在尝试使用该pivot_wider函数根据name列扩展数据。我使用以下代码:
yy <- d %>%
pivot_wider(., names_from = name, values_from = val)
Run Code Online (Sandbox Code Playgroud)
这给了我以下警告消息:
Warning message:
Values in `val` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list(val = list)` to suppress this warning.
* Use `values_fn = list(val = length)` to identify where the duplicates arise
* Use `values_fn = list(val = summary_fun)` to summarise duplicates
Run Code Online (Sandbox Code Playgroud)
输出看起来像:
time x1 out1 out2
2016-04-20 51.50000 <dbl> <dbl>
2 2016-04-21 56.34615 <dbl> <dbl>
3 2016-04-22 56.30000 <dbl> <dbl>
4 2016-04-23 57.85714 <dbl> <dbl>
5 2016-04-24 58.70968 <dbl> <dbl>
6 2016-04-25 58.96774 <dbl> <dbl>
Run Code Online (Sandbox Code Playgroud)
我知道这里提到了这个问题,为了解决这个问题,他们建议使用汇总统计。但是我有时间序列数据,因此不想使用汇总统计数据,因为每天都有一个值(而不是多个值)。
我知道问题是因为该val列有重复项(即在上面的例子中 7 出现了 3 次。
关于如何pivot_wider 和克服这个问题的任何建议?
数据:
d <- structure(list(name = c("C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica"), val = c(7, 3, 7, 13, 7,
9, 20, 19, 4, 5, 5, 2, 6, 6, 16, 13, 7, 6, 3, 3, 6, 10, 5, 3,
5, 3, 4, 4, 10, 11, 4, 13, 8, 2, 8, 10, 3, 10, 14, 4, 2, 4, 6,
6, 8, 8, 3, 3, 13, 10, 13, 32, 25, 31, 34, 26, 33, 35, 43, 22,
22, 21, 10, 33, 33, 48, 47, 27, 23, 11, 13, 25, 31, 20, 16, 10,
9, 23, 11, 23, 26, 16, 34, 17, 4, 24, 21, 10, 26, 32, 10, 5,
9, 19, 14, 27, 27, 10, 8, 28, 32, 25), time = structure(c(16911,
16912, 16913, 16914, 16915, 16916, 16917, 16918, 16919, 16920,
16921, 16922, 16923, 16923, 16924, 16925, 16926, 16927, 16928,
16929, 16930, 16931, 16932, 16933, 16934, 16935, 16936, 16937,
16938, 16939, 16940, 16941, 16942, 16943, 16944, 16945, 16946,
16947, 16948, 16949, 16950, 16951, 16952, 16953, 16954, 16955,
16956, 16957, 16958, 16959, 16960, 16911, 16912, 16913, 16914,
16915, 16916, 16917, 16918, 16919, 16920, 16921, 16922, 16923,
16923, 16924, 16925, 16926, 16927, 16928, 16929, 16930, 16931,
16932, 16933, 16934, 16935, 16936, 16937, 16938, 16939, 16940,
16941, 16942, 16943, 16944, 16945, 16946, 16947, 16948, 16949,
16950, 16951, 16952, 16953, 16954, 16955, 16956, 16957, 16958,
16959, 16960), class = "Date"), x1 = c(51.5, 56.3461538461538,
56.3, 57.8571428571429, 58.7096774193548, 58.9677419354839, 64.4615384615385,
61.9310344827586, 60.3214285714286, 59.4137931034483, 59.5806451612903,
57.3448275862069, 64.0333333333333, 64.0333333333333, 70.15625,
71.3636363636364, 62.8125, 56.4375, 56.4516129032258, 51.741935483871,
52.84375, 53.09375, 52.969696969697, 54, 54.3870967741936, 60.3870967741936,
64.4516129032258, 66.2903225806452, 68.2333333333333, 69.7741935483871,
70.5806451612903, 73.8275862068966, 72.8181818181818, 64.6764705882353,
64.4838709677419, 68.7741935483871, 62.1764705882353, 68.969696969697,
70.1935483870968, 59.6774193548387, 59.9677419354839, 63.125,
67.5882352941177, 71.4705882352941, 73.8529411764706, 76.1935483870968,
72.6451612903226, 76.0645161290323, 76.4193548387097, 81.7741935483871,
85.0645161290323, 51.5, 56.3461538461538, 56.3, 57.8571428571429,
58.7096774193548, 58.9677419354839, 64.4615384615385, 61.9310344827586,
60.3214285714286, 59.4137931034483, 59.5806451612903, 57.3448275862069,
64.0333333333333, 64.0333333333333, 70.15625, 71.3636363636364,
62.8125, 56.4375, 56.4516129032258, 51.741935483871, 52.84375,
53.09375, 52.969696969697, 54, 54.3870967741936, 60.3870967741936,
64.4516129032258, 66.2903225806452, 68.2333333333333, 69.7741935483871,
70.5806451612903, 73.8275862068966, 72.8181818181818, 64.6764705882353,
64.4838709677419, 68.7741935483871, 62.1764705882353, 68.969696969697,
70.1935483870968, 59.6774193548387, 59.9677419354839, 63.125,
67.5882352941177, 71.4705882352941, 73.8529411764706, 76.1935483870968,
72.6451612903226, 76.0645161290323, 76.4193548387097, 81.7741935483871,
85.0645161290323)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-102L))
Run Code Online (Sandbox Code Playgroud)
Ron*_*hah 63
为每个创建一个唯一的标识符行name,然后使用pivot_wider
library(dplyr)
d %>%
group_by(name) %>%
mutate(row = row_number()) %>%
tidyr::pivot_wider(names_from = name, values_from = val) %>%
select(-row)
# A tibble: 51 x 4
# time x1 `C Farolillo` `Plaza Eliptica`
# <date> <dbl> <dbl> <dbl>
# 1 2016-04-20 51.5 7 32
# 2 2016-04-21 56.3 3 25
# 3 2016-04-22 56.3 7 31
# 4 2016-04-23 57.9 13 34
# 5 2016-04-24 58.7 7 26
# 6 2016-04-25 59.0 9 33
# 7 2016-04-26 64.5 20 35
# 8 2016-04-27 61.9 19 43
# 9 2016-04-28 60.3 4 22
#10 2016-04-29 59.4 5 22
# … with 41 more rows
Run Code Online (Sandbox Code Playgroud)
Ame*_*eer 10
通常错误
Warning message:
Values in `val` are not uniquely identified; output will contain list-cols.
Run Code Online (Sandbox Code Playgroud)
最常见的原因是数据中的重复行(排除 val 列之后),而不是 val 列中的重复行。
which(duplicated(d))
# [1] 14 65
Run Code Online (Sandbox Code Playgroud)
OP 的数据似乎有两个重复的行,这导致了这个问题。删除重复的行也可以消除错误。
yy <- d %>% distinct() %>% pivot_wider(., names_from = name, values_from = val)
yy
Run Code Online (Sandbox Code Playgroud)
# A tibble: 50 x 4
time x1 `C Farolillo` `Plaza Eliptica`
<date> <dbl> <dbl> <dbl>
1 2016-04-20 51.5 7 32
2 2016-04-21 56.3 3 25
3 2016-04-22 56.3 7 31
4 2016-04-23 57.9 13 34
5 2016-04-24 58.7 7 26
6 2016-04-25 59.0 9 33
7 2016-04-26 64.5 20 35
8 2016-04-27 61.9 19 43
9 2016-04-28 60.3 4 22
10 2016-04-29 59.4 5 22
# ... with 40 more rows
Run Code Online (Sandbox Code Playgroud)
尽管在OP示例中不可见,但在某些情况下,当没有必要时,接受的答案会重复行。这种方法在某些情况下避免了:
d %>%
pivot_wider(names_from = name, values_from = val
, values_fn = list) %>%
unnest(cols = everything() )
Run Code Online (Sandbox Code Playgroud)
如果存在列表,为了避免警告和错误,请保留values_fn = list
例子:
d1 <- tail(d)[1:5,]
d5<-d1
d5$name<-"some"
withlist <- tibble(d1, l = list(c(1,2),c(1,2),c(1,2),c(1,2),c(1,2) ) )
withlist2 <- tibble(d5, l = list( list(1,2),list(1,2),list(1,2,3),list(1,2),list(1,2) ) )
withl <- rbind(withlist,withlist2)
res<-withl %>%
pivot_wider(names_from = name, values_from = l
, values_fn = list)
as.data.frame(res)
# val time x1 Plaza Eliptica some
#1 27 2016-06-03 76.19 1, 2 1, 2
#2 10 2016-06-04 72.65 1, 2 1, 2
#3 8 2016-06-05 76.06 1, 2 1, 2, 3
#4 28 2016-06-06 76.42 1, 2 1, 2
#5 32 2016-06-07 81.77 1, 2 1, 2
Run Code Online (Sandbox Code Playgroud)
问题是由于您想要传播/旋转更广泛的数据具有重复标识符这一事实引起的。虽然上面的两个建议,即从行号创建一个唯一的人工 ID mutate(row = row_number()),或者只过滤distinct行将允许你更广泛地旋转,但它们会改变你的表的结构,这可能会出现一个逻辑的、组织的问题下次你尝试加入任何东西时。
id_cols显式使用参数是一个更好的做法,以查看您实际上希望在广泛旋转后必须是唯一的,如果遇到问题,请先重新组织原始表。当然,您可能会找到过滤不同行或添加新 ID 的原因,很可能您希望在代码中避免重复。