mro*_*opa 142 r reshape dataframe r-faq
I'm having some trouble converting my data.frame from a wide table to a long table. At the moment it looks like this:
Code Country 1950 1951 1952 1953 1954
AFG Afghanistan 20,249 21,352 22,532 23,557 24,555
ALB Albania 8,097 8,986 10,058 11,123 12,246
Now I would like to transform this data.frame into a long data.frame. Something like this:
Code Country Year Value
AFG Afghanistan 1950 20,249
AFG Afghanistan 1951 21,352
AFG Afghanistan 1952 22,532
AFG Afghanistan 1953 23,557
AFG Afghanistan 1954 24,555
ALB Albania 1950 8,097
ALB Albania 1951 8,986
ALB Albania 1952 10,058
ALB Albania 1953 11,123
ALB Albania 1954 12,246
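I have already looked at and tried the melt() and reshape() functions, as some people suggested in similar questions. However, so far I only get messy results.
If possible, I would like to do it with the reshape() function, since it looks a little nicer to handle.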
Jaa*_*aap 129
Three alternative solutions:
1: With data.table
library(data.table)
long <- melt(setDT(wide), id.vars = c("Code","Country"), variable.name = "year")
This gives:
> long
Code Country year value
1: AFG Afghanistan 1950 20,249
2: ALB Albania 1950 8,097
3: AFG Afghanistan 1951 21,352
4: ALB Albania 1951 8,986
5: AFG Afghanistan 1952 22,532
6: ALB Albania 1952 10,058
7: AFG Afghanistan 1953 23,557
8: ALB Albania 1953 11,123
9: AFG Afghanistan 1954 24,555
10: ALB Albania 1954 12,246
Some alternative notations that give the same result:
melt(setDT(wide), id.vars = 1:2, variable.name = "year")
melt(setDT(wide), measure.vars = 3:7, variable.name = "year")
melt(setDT(wide), measure.vars = as.character(1950:1954), variable.name = "year")
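As with the reshape2 package, data.table provides a melt function (an extended and improved implementation). The melt from data.table also has more parameters than the melt function from reshape2; for example, you can also specify the name of the variable column, as variable.name = "year" does above.
2: With tidyr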
library(tidyr)
long <- wide %>% gather(year, value, -c(Code, Country))
Some alternative notations:
wide %>% gather(year, value, -Code, -Country)
wide %>% gather(year, value, -1:-2)
wide %>% gather(year, value, -(1:2))
wide %>% gather(year, value, -1, -2)
wide %>% gather(year, value, 3:7)
wide %>% gather(year, value, `1950`:`1954`)
3: With reshape2
library(reshape2)
long <- melt(wide, id.vars = c("Code", "Country"))
Some alternative notations:
# you can also define the id-variables by column number
melt(wide, id.vars = 1:2)
# as an alternative you can also specify the measure-variables
# all other variables will then be used as id-variables
melt(wide, measure.vars = 3:7)
melt(wide, measure.vars = as.character(1950:1954))
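If you want to exclude NA values, you can add na.rm = TRUE to the melt as well as the gather functions.
Another problem with the data is that R will read the values as character values (as a result of the , in the numbers). You can repair that with gsub and as.numeric: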
long$value <- as.numeric(gsub(",", "", long$value))
Or directly with data.table or dplyr:
# data.table
long <- melt(setDT(wide),
id.vars = c("Code","Country"),
variable.name = "year")[, value := as.numeric(gsub(",", "", value))]
# tidyr and dplyr
long <- wide %>% gather(year, value, -c(Code,Country)) %>%
mutate(value = as.numeric(gsub(",", "", value)))
Data:
wide <- read.table(text="Code Country 1950 1951 1952 1953 1954
AFG Afghanistan 20,249 21,352 22,532 23,557 24,555
ALB Albania 8,097 8,986 10,058 11,123 12,246", header=TRUE, check.names=FALSE)
Ani*_*iko 82
reshape() takes a while to get used to, just like melt/cast. Here is a solution with reshape, assuming your data frame is called d:
reshape(d,
direction = "long",
varying = list(names(d)[3:7]),
v.names = "Value",
idvar = c("Code", "Country"),
timevar = "Year",
times = 1950:1954)
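As a minimal, self-contained sketch of the reshape() call above, using only base R: the input d is rebuilt here from the question's sample rows, which is an assumption about how the data was read. The last three steps (stripping the commas, sorting, and dropping the generated row names) are optional additions to match the layout asked for in the question and are not part of the original answer.

```r
# Sample data from the question; check.names = FALSE keeps the year
# column names as "1950" ... "1954" instead of "X1950" ... "X1954".
d <- read.table(text = "Code Country 1950 1951 1952 1953 1954
AFG Afghanistan 20,249 21,352 22,532 23,557 24,555
ALB Albania 8,097 8,986 10,058 11,123 12,246",
  header = TRUE, check.names = FALSE, stringsAsFactors = FALSE)

# Wide to long: the five year columns are stacked into one "Value" column,
# and "Year" records which column each value came from.
long <- reshape(d,
                direction = "long",
                varying = list(names(d)[3:7]),
                v.names = "Value",
                idvar = c("Code", "Country"),
                timevar = "Year",
                times = 1950:1954)

# Optional clean-up (an addition, not part of the answer above):
long$Value <- as.numeric(gsub(",", "", long$Value))  # "20,249" -> 20249
long <- long[order(long$Code, long$Year), ]          # group rows by country
rownames(long) <- NULL                               # drop "AFG.Afghanistan.1950"-style names
long
```

Without the clean-up, reshape() returns the rows year by year and labels them with row names built from the id variables and the time value.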
Sha*_*ane 33
Using the reshape package:
#data
x <- read.table(textConnection(
"Code Country 1950 1951 1952 1953 1954
AFG Afghanistan 20,249 21,352 22,532 23,557 24,555
ALB Albania 8,097 8,986 10,058 11,123 12,246"), header=TRUE)
library(reshape)
x2 <- melt(x, id = c("Code", "Country"), variable_name = "Year")
x2[,"Year"] <- as.numeric(gsub("X", "" , x2[,"Year"]))
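The gsub("X", ...) step above is needed because read.table, by default, rewrites column names that are not syntactically valid R names. A small base R sketch of that behaviour follows; the object names txt, x_default, and x_asis are made up for illustration.

```r
txt <- "Code Country 1950 1951 1952 1953 1954
AFG Afghanistan 20,249 21,352 22,532 23,557 24,555
ALB Albania 8,097 8,986 10,058 11,123 12,246"

# Default: check.names = TRUE passes the header through make.names(),
# which prefixes names that start with a digit with "X".
x_default <- read.table(text = txt, header = TRUE)

# With check.names = FALSE the header is kept exactly as written.
x_asis <- read.table(text = txt, header = TRUE, check.names = FALSE)

names(x_default)[3:7]
names(x_asis)[3:7]

# Recover numeric years from the prefixed names; anchoring the pattern
# with "^" removes only a leading "X".
years <- as.numeric(sub("^X", "", names(x_default)[3:7]))
years
```

Reading the data with check.names = FALSE avoids the prefix in the first place, at the cost of having to quote the column names with backticks when referring to them directly.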
akr*_*run 32
With tidyr_1.0.0, another option is pivot_longer
library(tidyr)
pivot_longer(df1, -c(Code, Country), values_to = "Value", names_to = "Year")
# A tibble: 10 x 4
# Code Country Year Value
# <fct> <fct> <chr> <fct>
# 1 AFG Afghanistan 1950 20,249
# 2 AFG Afghanistan 1951 21,352
# 3 AFG Afghanistan 1952 22,532
# 4 AFG Afghanistan 1953 23,557
# 5 AFG Afghanistan 1954 24,555
# 6 ALB Albania 1950 8,097
# 7 ALB Albania 1951 8,986
# 8 ALB Albania 1952 10,058
# 9 ALB Albania 1953 11,123
#10 ALB Albania 1954 12,246
Data:
df1 <- structure(list(Code = structure(1:2, .Label = c("AFG", "ALB"), class = "factor"),
Country = structure(1:2, .Label = c("Afghanistan", "Albania"
), class = "factor"), `1950` = structure(1:2, .Label = c("20,249",
"8,097"), class = "factor"), `1951` = structure(1:2, .Label = c("21,352",
"8,986"), class = "factor"), `1952` = structure(2:1, .Label = c("10,058",
"22,532"), class = "factor"), `1953` = structure(2:1, .Label = c("11,123",
"23,557"), class = "factor"), `1954` = structure(2:1, .Label = c("12,246",
"24,555"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
A5C*_*2T1 13
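Since this answer is tagged with r-faq, I felt it would be useful to share another alternative from base R: stack.
Note, however, that stack does not work with factors. It only works if is.vector is TRUE, and in the documentation for is.vector we find:
is.vector returns TRUE if x is a vector of the specified mode having no attributes other than names. It returns FALSE otherwise.
I'm using the sample data from @Jaap's answer, where the values in the year columns are factors.
Here is the stack approach: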
cbind(wide[1:2], stack(lapply(wide[-c(1, 2)], as.character)))
## Code Country values ind
## 1 AFG Afghanistan 20,249 1950
## 2 ALB Albania 8,097 1950
## 3 AFG Afghanistan 21,352 1951
## 4 ALB Albania 8,986 1951
## 5 AFG Afghanistan 22,532 1952
## 6 ALB Albania 10,058 1952
## 7 AFG Afghanistan 23,557 1953
## 8 ALB Albania 11,123 1953
## 9 AFG Afghanistan 24,555 1954
## 10 ALB Albania 12,246 1954
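A possible follow-up to the stack approach, as a sketch in base R only: stack names its output columns values and ind, so the steps below rename them, convert both to numbers, and reorder the columns to match the layout in the question. These steps are an illustrative addition, not part of the answer above. The as.character call is kept so the code works whether the year columns were read as factors or as character vectors.

```r
wide <- read.table(text = "Code Country 1950 1951 1952 1953 1954
AFG Afghanistan 20,249 21,352 22,532 23,557 24,555
ALB Albania 8,097 8,986 10,058 11,123 12,246",
  header = TRUE, check.names = FALSE)

# stack() returns a data frame with columns "values" and "ind";
# the two id columns are recycled to match its 10 rows.
long <- cbind(wide[1:2], stack(lapply(wide[-c(1, 2)], as.character)))

# Rename, convert, and reorder (illustrative clean-up).
names(long)[3:4] <- c("Value", "Year")
long$Value <- as.numeric(gsub(",", "", long$Value))   # "21,352" -> 21352
long$Year  <- as.integer(as.character(long$Year))     # factor -> 1950L, ...
long <- long[, c("Code", "Country", "Year", "Value")]
long
```

Converting Year through as.character first matters because ind is a factor; calling as.integer on it directly would return the level codes 1 to 5 rather than the years.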
Mar*_*son 10
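Here is another example showing the use of gather from tidyr. You can select the columns to gather either by removing them individually (as I do here) or by including the years you want explicitly.
Note that, to handle the commas (and the X's added if check.names = FALSE is not set), I am also using dplyr's mutate with parse_number from readr to convert the text values back to numbers. These are all part of the tidyverse and so can be loaded together with library(tidyverse)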
wide %>%
gather(Year, Value, -Code, -Country) %>%
mutate(Year = parse_number(Year)
, Value = parse_number(Value))
Returns:
Code Country Year Value
1 AFG Afghanistan 1950 20249
2 ALB Albania 1950 8097
3 AFG Afghanistan 1951 21352
4 ALB Albania 1951 8986
5 AFG Afghanistan 1952 22532
6 ALB Albania 1952 10058
7 AFG Afghanistan 1953 23557
8 ALB Albania 1953 11123
9 AFG Afghanistan 1954 24555
10 ALB Albania 1954 12246
Here is a sqldf solution:
library(sqldf)
sqldf("Select Code, Country, '1950' As Year, `1950` As Value From wide
Union All
Select Code, Country, '1951' As Year, `1951` As Value From wide
Union All
Select Code, Country, '1952' As Year, `1952` As Value From wide
Union All
Select Code, Country, '1953' As Year, `1953` As Value From wide
Union All
Select Code, Country, '1954' As Year, `1954` As Value From wide;")
To build the query without typing everything in, you can use the following:
Thanks to G. Grothendieck for implementing it.
ValCol <- tail(names(wide), -2)
s <- sprintf("Select Code, Country, '%s' As Year, `%s` As Value from wide", ValCol, ValCol)
mquery <- paste(s, collapse = "\n Union All\n")
cat(mquery) #just to show the query
#> Select Code, Country, '1950' As Year, `1950` As Value from wide
#> Union All
#> Select Code, Country, '1951' As Year, `1951` As Value from wide
#> Union All
#> Select Code, Country, '1952' As Year, `1952` As Value from wide
#> Union All
#> Select Code, Country, '1953' As Year, `1953` As Value from wide
#> Union All
#> Select Code, Country, '1954' As Year, `1954` As Value from wide
sqldf(mquery)
#> Code Country Year Value
#> 1 AFG Afghanistan 1950 20,249
#> 2 ALB Albania 1950 8,097
#> 3 AFG Afghanistan 1951 21,352
#> 4 ALB Albania 1951 8,986
#> 5 AFG Afghanistan 1952 22,532
#> 6 ALB Albania 1952 10,058
#> 7 AFG Afghanistan 1953 23,557
#> 8 ALB Albania 1953 11,123
#> 9 AFG Afghanistan 1954 24,555
#> 10 ALB Albania 1954 12,246
Unfortunately, I don't think that PIVOT and UNPIVOT would work for R SQLite. If you want to write your query in a more sophisticated manner, you can also take a look at these posts: