Reshaping a data.frame from wide to long format

mro*_*opa 142 r reshape dataframe r-faq

I am having some trouble converting my data.frame from wide to long format. At the moment it looks like this:

Code Country        1950    1951    1952    1953    1954
AFG  Afghanistan    20,249  21,352  22,532  23,557  24,555
ALB  Albania        8,097   8,986   10,058  11,123  12,246

Now I would like to transform this data.frame into a long data.frame. Something like this:

Code Country        Year    Value
AFG  Afghanistan    1950    20,249
AFG  Afghanistan    1951    21,352
AFG  Afghanistan    1952    22,532
AFG  Afghanistan    1953    23,557
AFG  Afghanistan    1954    24,555
ALB  Albania        1950    8,097
ALB  Albania        1951    8,986
ALB  Albania        1952    10,058
ALB  Albania        1953    11,123
ALB  Albania        1954    12,246

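I have looked at and already tried the melt() and reshape() functions, as some people suggested in similar questions. However, so far I only get messy results.

If possible, I would like to do it with the reshape() function, since it looks a little nicer to handle.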

Jaa*_*aap 129

Three alternative solutions:

1: With data.table

You can use the same melt function as in the reshape2 package (the data.table version is an extended and improved implementation). melt from data.table also has more parameters than the melt function from reshape2. For example, you can also specify the name of the variable column:

library(data.table)
long <- melt(setDT(wide), id.vars = c("Code","Country"), variable.name = "year")

which gives:

> long
    Code     Country year  value
 1:  AFG Afghanistan 1950 20,249
 2:  ALB     Albania 1950  8,097
 3:  AFG Afghanistan 1951 21,352
 4:  ALB     Albania 1951  8,986
 5:  AFG Afghanistan 1952 22,532
 6:  ALB     Albania 1952 10,058
 7:  AFG Afghanistan 1953 23,557
 8:  ALB     Albania 1953 11,123
 9:  AFG Afghanistan 1954 24,555
10:  ALB     Albania 1954 12,246

Some alternative notations that give the same result:

melt(setDT(wide), id.vars = 1:2, variable.name = "year")
melt(setDT(wide), measure.vars = 3:7, variable.name = "year")
melt(setDT(wide), measure.vars = as.character(1950:1954), variable.name = "year")

2: With tidyr

library(tidyr)
long <- wide %>% gather(year, value, -c(Code, Country))

Some alternative notations:

wide %>% gather(year, value, -Code, -Country)
wide %>% gather(year, value, -1:-2)
wide %>% gather(year, value, -(1:2))
wide %>% gather(year, value, -1, -2)
wide %>% gather(year, value, 3:7)
wide %>% gather(year, value, `1950`:`1954`)

3: With reshape2

library(reshape2)
long <- melt(wide, id.vars = c("Code", "Country"))

Some alternative notations that give the same result:

# you can also define the id-variables by column number
melt(wide, id.vars = 1:2)

# as an alternative you can also specify the measure-variables
# all other variables will then be used as id-variables
melt(wide, measure.vars = 3:7)
melt(wide, measure.vars = as.character(1950:1954))

If you want to exclude NA values, you can add na.rm = TRUE to the melt function as well as to the gather function.

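As a minimal sketch of the na.rm note, assuming a hypothetical variant of the sample data (wide_na, which is not part of the original post) in which one cell is missing, data.table's melt drops the corresponding row from the long result:

```r
library(data.table)

# Hypothetical data: same layout as `wide`, but Albania's 1951 value is missing.
wide_na <- data.table(
  Code    = c("AFG", "ALB"),
  Country = c("Afghanistan", "Albania"),
  `1950`  = c(20249, 8097),
  `1951`  = c(21352, NA),
  `1952`  = c(22532, 10058)
)

# Default (na.rm = FALSE): the missing cell is kept as a row with value NA.
long_all <- melt(wide_na, id.vars = c("Code", "Country"), variable.name = "year")

# na.rm = TRUE: rows whose value is NA are dropped from the result.
long_complete <- melt(wide_na, id.vars = c("Code", "Country"),
                      variable.name = "year", na.rm = TRUE)

nrow(long_all)       # 6
nrow(long_complete)  # 5
```

Here data.table() is used instead of setDT() only so that the sketch does not modify an existing object by reference.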

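Another problem with the data is that R will read the values as character values (as a result of the , in the numbers). You can repair that with gsub and as.numeric: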

long$value <- as.numeric(gsub(",", "", long$value))

Or directly with data.table or dplyr:

# data.table
long <- melt(setDT(wide),
             id.vars = c("Code","Country"),
             variable.name = "year")[, value := as.numeric(gsub(",", "", value))]

# tidyr and dplyr
long <- wide %>% gather(year, value, -c(Code,Country)) %>% 
  mutate(value = as.numeric(gsub(",", "", value)))

Data:

wide <- read.table(text="Code Country        1950    1951    1952    1953    1954
AFG  Afghanistan    20,249  21,352  22,532  23,557  24,555
ALB  Albania        8,097   8,986   10,058  11,123  12,246", header=TRUE, check.names=FALSE)

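  • According to the [tidyverse blog](https://www.tidyverse.org/blog/2019/09/tidyr-1-0-0/), `gather` is now retired and has been replaced by `pivot_longer`. They state: "New `pivot_longer()` and `pivot_wider()` provide modern alternatives to `spread()` and `gather()`. They have been carefully redesigned to be easier to learn and remember, and include many new features. spread() and gather() won't go away, but they've been retired which means that they're no longer under active development." (4 upvotes)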

Ani*_*iko 82

reshape() takes a while to get used to, just like melt/cast. Here is a solution with reshape, assuming your data frame is called d:

reshape(d, 
        direction = "long",
        varying = list(names(d)[3:7]),
        v.names = "Value",
        idvar = c("Code", "Country"),
        timevar = "Year",
        times = 1950:1954)

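The reshape() call above leaves composite row names and character values in place. A minimal base R sketch of one way to finish the job, assuming the sample data from the question is read with check.names = FALSE (the clean-up steps are an illustration, not part of the original answer):

```r
# Sample data from the question; check.names = FALSE keeps "1950" etc. as column names.
d <- read.table(text = "Code Country 1950 1951 1952 1953 1954
AFG Afghanistan 20,249 21,352 22,532 23,557 24,555
ALB Albania 8,097 8,986 10,058 11,123 12,246",
                header = TRUE, check.names = FALSE)

long <- reshape(d,
                direction = "long",
                varying = list(names(d)[3:7]),
                v.names = "Value",
                idvar = c("Code", "Country"),
                timevar = "Year",
                times = 1950:1954)

# The values contain thousands separators, so strip the commas and convert to numbers.
long$Value <- as.numeric(gsub(",", "", long$Value))

# Order by country and year to match the layout asked for in the question.
long <- long[order(long$Code, long$Year), ]

# reshape() builds row names such as "AFG.Afghanistan.1950"; reset them to 1..n.
rownames(long) <- NULL

long
```

Resetting the row names last keeps them in sequence after the reordering.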

Sha*_*ane 33

Using the reshape package:

#data
x <- read.table(textConnection(
"Code Country        1950    1951    1952    1953    1954
AFG  Afghanistan    20,249  21,352  22,532  23,557  24,555
ALB  Albania        8,097   8,986   10,058  11,123  12,246"), header=TRUE)

library(reshape)

x2 <- melt(x, id = c("Code", "Country"), variable_name = "Year")
x2[,"Year"] <- as.numeric(gsub("X", "" , x2[,"Year"]))


akr*_*run 32

With tidyr_1.0.0, another option is pivot_longer:

library(tidyr)
pivot_longer(df1, -c(Code, Country), values_to = "Value", names_to = "Year")
# A tibble: 10 x 4
#   Code  Country     Year  Value 
#   <fct> <fct>       <chr> <fct> 
# 1 AFG   Afghanistan 1950  20,249
# 2 AFG   Afghanistan 1951  21,352
# 3 AFG   Afghanistan 1952  22,532
# 4 AFG   Afghanistan 1953  23,557
# 5 AFG   Afghanistan 1954  24,555
# 6 ALB   Albania     1950  8,097 
# 7 ALB   Albania     1951  8,986 
# 8 ALB   Albania     1952  10,058
# 9 ALB   Albania     1953  11,123
#10 ALB   Albania     1954  12,246

data

df1 <- structure(list(Code = structure(1:2, .Label = c("AFG", "ALB"), class = "factor"), 
    Country = structure(1:2, .Label = c("Afghanistan", "Albania"
    ), class = "factor"), `1950` = structure(1:2, .Label = c("20,249", 
    "8,097"), class = "factor"), `1951` = structure(1:2, .Label = c("21,352", 
    "8,986"), class = "factor"), `1952` = structure(2:1, .Label = c("10,058", 
    "22,532"), class = "factor"), `1953` = structure(2:1, .Label = c("11,123", 
    "23,557"), class = "factor"), `1954` = structure(2:1, .Label = c("12,246", 
    "24,555"), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

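In the output above, Year is still a character column and Value still holds text with commas. A hedged sketch of converting both during the pivot, assuming tidyr 1.1.0 or later (where pivot_longer accepts names_transform and values_transform) and a hypothetical character-column version of the data called wide_chr:

```r
library(tidyr)

# Hypothetical sample data with character columns (not the factor data used above).
wide_chr <- data.frame(
  Code    = c("AFG", "ALB"),
  Country = c("Afghanistan", "Albania"),
  `1950`  = c("20,249", "8,097"),
  `1951`  = c("21,352", "8,986"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

long <- pivot_longer(
  wide_chr,
  cols = -c(Code, Country),
  names_to = "Year",
  values_to = "Value",
  # Convert the former column names to integers.
  names_transform = list(Year = as.integer),
  # Strip the thousands separators, then convert the values to numbers.
  values_transform = list(Value = function(x) as.numeric(gsub(",", "", x)))
)

long
```

The same conversion can also be done afterwards with as.numeric(gsub(",", "", long$Value)) if an older tidyr is in use.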
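  • This needs more upvotes. According to the [Tidyverse blog](https://www.tidyverse.org/blog/2019/09/tidyr-1-0-0/), `gather` is being retired and `pivot_longer` is now the correct way to accomplish this. (6 upvotes)
  • @EvanRosica Until they decide to change the function again :p (3 upvotes)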

A5C*_*2T1 13

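Since this answer is tagged with r-faq, I felt it would be useful to share another alternative from base R: stack.

Note, however, that stack does not work with factors; it only works if is.vector is TRUE, and from the documentation for is.vector we find that:

is.vector returns TRUE if x is a vector of the specified mode having no attributes other than names. It returns FALSE otherwise.

I'm using the sample data from @Jaap's answer, where the values in the year columns are factors.

Here's the stack approach: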

cbind(wide[1:2], stack(lapply(wide[-c(1, 2)], as.character)))
##    Code     Country values  ind
## 1   AFG Afghanistan 20,249 1950
## 2   ALB     Albania  8,097 1950
## 3   AFG Afghanistan 21,352 1951
## 4   ALB     Albania  8,986 1951
## 5   AFG Afghanistan 22,532 1952
## 6   ALB     Albania 10,058 1952
## 7   AFG Afghanistan 23,557 1953
## 8   ALB     Albania 11,123 1953
## 9   AFG Afghanistan 24,555 1954
## 10  ALB     Albania 12,246 1954

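To illustrate the is.vector point and finish the stack result, here is a minimal base R sketch. It assumes the sample data is read as in @Jaap's answer; the renaming and type conversion at the end are an addition for illustration, not part of the original answer:

```r
# A factor carries class and levels attributes, so is.vector() is FALSE for it.
f <- factor(c("20,249", "8,097"))
is.vector(f)                # FALSE
is.vector(as.character(f))  # TRUE

wide <- read.table(text = "Code Country 1950 1951 1952 1953 1954
AFG Afghanistan 20,249 21,352 22,532 23,557 24,555
ALB Albania 8,097 8,986 10,058 11,123 12,246",
                   header = TRUE, check.names = FALSE)

# as.character() makes this work whether the year columns were read as
# factors (older R defaults) or as character vectors (R >= 4.0 defaults).
stacked <- stack(lapply(wide[-c(1, 2)], as.character))

# The two id columns are recycled to the length of the stacked values.
long <- cbind(wide[1:2], stacked)

# stack() names its columns "values" and "ind"; rename and convert them.
names(long)[3:4] <- c("Value", "Year")
long$Value <- as.numeric(gsub(",", "", long$Value))
long$Year  <- as.integer(as.character(long$Year))

long
```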

Mar*_*son 10

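Here is another example showing the use of gather from tidyr. You can select the columns to gather either by removing them individually (as I do here) or by explicitly including the years you want.

Note that, to handle the commas (and the X's that are added if check.names = FALSE is not set), I am also using dplyr's mutate with parse_number from readr to convert the text values back to numbers. These are all part of the tidyverse and so can be loaded together with library(tidyverse).

library(tidyverse)

wide %>%
  gather(Year, Value, -Code, -Country) %>%
  mutate(Year = parse_number(Year),
         Value = parse_number(Value))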

Returns:

   Code     Country Year Value
1   AFG Afghanistan 1950 20249
2   ALB     Albania 1950  8097
3   AFG Afghanistan 1951 21352
4   ALB     Albania 1951  8986
5   AFG Afghanistan 1952 22532
6   ALB     Albania 1952 10058
7   AFG Afghanistan 1953 23557
8   ALB     Albania 1953 11123
9   AFG Afghanistan 1954 24555
10  ALB     Albania 1954 12246

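The remark about check.names can be checked with a small base R sketch. The variable names here (txt, default_names, kept_names) are hypothetical and the data is a two-year subset of the sample:

```r
txt <- "Code Country 1950 1951
AFG Afghanistan 20,249 21,352
ALB Albania 8,097 8,986"

# By default read.table() makes column names syntactically valid,
# which prefixes names that start with a digit with "X".
default_names <- names(read.table(text = txt, header = TRUE))

# With check.names = FALSE the names are kept exactly as in the header.
kept_names <- names(read.table(text = txt, header = TRUE, check.names = FALSE))

default_names  # "Code" "Country" "X1950" "X1951"
kept_names     # "Code" "Country" "1950" "1951"

# If the X prefix is already there, strip it before converting to a number.
as.integer(sub("^X", "", default_names[3:4]))  # 1950 1951
```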

M--*_*M-- 6

Here is a sqldf solution:

library(sqldf)

sqldf("Select Code, Country, '1950' As Year, `1950` As Value From wide
        Union All
       Select Code, Country, '1951' As Year, `1951` As Value From wide
        Union All
       Select Code, Country, '1952' As Year, `1952` As Value From wide
        Union All
       Select Code, Country, '1953' As Year, `1953` As Value From wide
        Union All
       Select Code, Country, '1954' As Year, `1954` As Value From wide;")

To build the query without typing everything out, you can use the following:

Thanks to G. Grothendieck for the implementation.

ValCol <- tail(names(wide), -2)

s <- sprintf("Select Code, Country, '%s' As Year, `%s` As Value from wide", ValCol, ValCol)
mquery <- paste(s, collapse = "\n Union All\n")

cat(mquery) #just to show the query
 #> Select Code, Country, '1950' As Year, `1950` As Value from wide
 #>  Union All
 #> Select Code, Country, '1951' As Year, `1951` As Value from wide
 #>  Union All
 #> Select Code, Country, '1952' As Year, `1952` As Value from wide
 #>  Union All
 #> Select Code, Country, '1953' As Year, `1953` As Value from wide
 #>  Union All
 #> Select Code, Country, '1954' As Year, `1954` As Value from wide

sqldf(mquery)
 #>    Code     Country Year  Value
 #> 1   AFG Afghanistan 1950 20,249
 #> 2   ALB     Albania 1950  8,097
 #> 3   AFG Afghanistan 1951 21,352
 #> 4   ALB     Albania 1951  8,986
 #> 5   AFG Afghanistan 1952 22,532
 #> 6   ALB     Albania 1952 10,058
 #> 7   AFG Afghanistan 1953 23,557
 #> 8   ALB     Albania 1953 11,123
 #> 9   AFG Afghanistan 1954 24,555
 #> 10  ALB     Albania 1954 12,246

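Unfortunately, I don't think PIVOT and UNPIVOT work with SQLite in R. If you want to write your query in a more sophisticated manner, you can also take a look at these posts: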