Reshaping a data frame to long format with multiple sets of measure columns

use*_*854 3 r reshape dataframe reshape2

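I have an R data frame that I scraped from the internet using readHTMLTable() from the XML package. The table looks like the following excerpt and contains multiple variables/columns for population and year. (Note that the years are not duplicated across columns and serve as a unique identifier for each population value.)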

        year1   pop1      year2   pop2     year3   pop3     
1                                                        
2       16XX    4675,0    1900    6453,0    1930   9981,2       
3       17XX    4739,3    1901    6553,5    1931   ...      
4       17XX    4834,0    1902    6684,0    1932   
5       180X    4930,0    1903    6818,0    1933        
6       180X    5029,0    1904    6955,0    1934        
7       181X    5129,0    1905    7094,0    1935
8       181X    5231,9    1906    7234,7    1936
9       182X    5297,0    1907    7329,0    1937
10      182X    5362,0    1908    7422,0    1938

I would like to reorganize the data into just two columns, one for year and one for population, like this:

        year    pop     
1                                                        
2       16XX    4675,0
3       17XX    4739,3  
4       17XX    4834,0  
5       180X    4930,0
6       180X    5029,0  
7       181X    5129,0
8       181X    5231,9  
9       182X    5297,0
10      182X    5362,0  
11      1900    6453,0
12      1901    6553,5
13      1902    6684,0
...     ...     ...
21      1930    9981,2
22      ... 

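The values from the variables/columns year2 and year3 are appended below those of year1, and the corresponding population values are appended in the same way.

I have considered the following:

(1) Looping over the population and year columns (n>2) and adding their values as new observations to year1 and pop1 would work, but this seems unnecessarily cumbersome.

(2) I tried melt as follows, but either it cannot handle id variables that span multiple columns, or I did not implement it correctly.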

df.melt <- melt(df, id=c("year1", "year2",...))

(3) Finally, I considered treating each year column as its own vector and appending the vectors to one another, like this:

year.all <- c(df$year1, df$year2,...)

However, the above returns the following for year.all

[1]  1  2  3  3  4  4  5  5  6  6  7  8  8  9  9  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24  1  1  2 ...

rather than this

[1] 16XX 17XX 17XX 180X 180X 181X 181X 182X 182X 1900 1901 1902...

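If there is a straightforward way to accomplish this kind of reorganization, I would love to learn it. Thank you very much for your help.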
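A note on attempt (3): the integer output is what one would expect if the scraped columns are factors, which is an assumption here since the question does not show str(df). In older R versions, c() on factors drops the labels and returns the underlying integer codes; as far as I know, R 4.1.0 and later return a factor instead. A minimal sketch with a small stand-in data frame (not the original table), converting to character before concatenating:

```r
# Assumption: the year columns were read in as factors.
df <- data.frame(
  year1 = factor(c("16XX", "17XX", "17XX")),
  year2 = factor(c("1900", "1901", "1902"))
)

# The integer codes behind each factor, similar to the unexpected output.
codes <- c(as.integer(df$year1), as.integer(df$year2))

# Converting each column to character first keeps the labels.
year.all <- c(as.character(df$year1), as.character(df$year2))
print(year.all)
```

The as.character() conversion behaves the same regardless of R version, so it avoids depending on how c() treats factors.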

Aru*_*run 7

Using the new features of melt in data.table v1.9.5+:

require(data.table) # v1.9.5+
melt(setDT(df), measure = patterns("^year", "^pop"), value.name = c("year", "pop"))

You can find the rest of the vignettes here.
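To make the shape of the result concrete, here is a small self-contained sketch of the same melt() call on the first two rows and two column pairs of the question's table. The two-row data frame and the final removal of the variable column are illustrative choices, not part of the answer above.

```r
library(data.table)

# Two rows and two year/pop pairs taken from the question's excerpt.
df <- data.frame(
  year1 = c("16XX", "17XX"), pop1 = c("4675,0", "4739,3"),
  year2 = c("1900", "1901"), pop2 = c("6453,0", "6553,5"),
  stringsAsFactors = FALSE
)

# Each pattern selects one group of columns; the groups are stacked in parallel.
long <- melt(as.data.table(df),
             measure.vars = patterns("^year", "^pop"),
             value.name = c("year", "pop"))

# melt() adds a 'variable' column recording which column pair a row came from.
long[, variable := NULL]
print(long)
```

The rows from year1/pop1 come first, followed by those from year2/pop2, which matches the layout requested in the question.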


akr*_*run 6

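If the 'year' and 'pop' columns alternate, we can subset with c(TRUE, FALSE) to get columns 1, 3, 5, ... and with c(FALSE, TRUE) to get columns 2, 4, 6, ..., because the logical index is recycled. Then we unlist the columns and create a new data.frame.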

 df2 <- data.frame(year=unlist(df1[c(TRUE, FALSE)]), 
                  pop=unlist(df1[c(FALSE, TRUE)]))
 row.names(df2) <- NULL
 head(df2)
 #   year    pop
 #1            
 #2 16XX 4675,0
 #3 17XX 4739,3
 #4 17XX 4834,0
 #5 180X 4930,0
 #6 180X 5029,0
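The recycling step can be checked in isolation. The sketch below uses a two-row stand-in for df1 (values from the question's excerpt) and prints which column names each logical index selects before stacking them; the helper names year_cols and pop_cols are illustrative.

```r
# Two rows and two year/pop pairs, all character, in alternating order.
df1 <- data.frame(
  year1 = c("16XX", "17XX"), pop1 = c("4675,0", "4739,3"),
  year2 = c("1900", "1901"), pop2 = c("6453,0", "6553,5"),
  stringsAsFactors = FALSE
)

# A length-2 logical index is recycled across all four columns.
year_cols <- names(df1)[c(TRUE, FALSE)]   # odd positions
pop_cols  <- names(df1)[c(FALSE, TRUE)]   # even positions
print(year_cols)
print(pop_cols)

# unlist() concatenates the selected columns one after another.
df2 <- data.frame(year = unlist(df1[year_cols], use.names = FALSE),
                  pop  = unlist(df1[pop_cols], use.names = FALSE),
                  stringsAsFactors = FALSE)
print(df2)
```

This only works when the columns strictly alternate; if the order differs, selecting by name (for example with grep("^year", names(df1))) is a safer choice.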

Or another option is

library(splitstackshape)
merged.stack(transform(df1, id=1:nrow(df1)), var.stubs=c('year', 'pop'), 
        sep='var.stubs')[order(.time_1), 3:4, with=FALSE]
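If installing splitstackshape is not an option, base R's reshape() can stack the column pairs in a similar way. This is a different technique from the merged.stack call above, shown as a rough sketch on a two-row stand-in for df1; the explicit id column is added here only so that reshape() has an identifier to work with.

```r
# Two rows and two year/pop pairs from the question's excerpt.
df1 <- data.frame(
  year1 = c("16XX", "17XX"), pop1 = c("4675,0", "4739,3"),
  year2 = c("1900", "1901"), pop2 = c("6453,0", "6553,5"),
  stringsAsFactors = FALSE
)
df1$id <- seq_len(nrow(df1))  # explicit row identifier (illustrative)

# Each element of 'varying' lists the wide columns that feed one long column.
long <- reshape(df1, direction = "long", idvar = "id",
                varying = list(c("year1", "year2"), c("pop1", "pop2")),
                v.names = c("year", "pop"))

# Keep only the stacked columns and drop the generated row names.
res <- long[, c("year", "pop")]
row.names(res) <- NULL
print(res)
```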

Data

df1 <- structure(list(year1 = c("", "16XX", "17XX", "17XX", "180X", 
"180X", "181X", "181X", "182X", "182X"), pop1 = c("", "4675,0", 
"4739,3", "4834,0", "4930,0", "5029,0", "5129,0", "5231,9", "5297,0", 
"5362,0"), year2 = c(NA, 1900L, 1901L, 1902L, 1903L, 1904L, 1905L, 
1906L, 1907L, 1908L), pop2 = c("", "6453,0", "6553,5", "6684,0", 
"6818,0", "6955,0", "7094,0", "7234,7", "7329,0", "7422,0"), 
year3 = c(NA, 1930L, 1931L, 1932L, 1933L, 1934L, 1935L, 1936L, 
1937L, 1938L), pop3 = c("", "9981,2", "", "", "", "", "", 
"", "", "")), .Names = c("year1", "pop1", "year2", "pop2", 
"year3", "pop3"), class = "data.frame", row.names = c(NA, -10L))