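I have an R data frame that I scraped from the internet using readHTMLTable() from the XML package. The table looks like the following excerpt, with multiple variables/columns for population and year. (Note that the years do not repeat across columns and act as a unique identifier for each population value.)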
   year1   pop1 year2   pop2 year3   pop3
1
2   16XX 4675,0  1900 6453,0  1930 9981,2
3   17XX 4739,3  1901 6553,5  1931    ...
4   17XX 4834,0  1902 6684,0  1932
5   180X 4930,0  1903 6818,0  1933
6   180X 5029,0  1904 6955,0  1934
7   181X 5129,0  1905 7094,0  1935
8   181X 5231,9  1906 7234,7  1936
9   182X 5297,0  1907 7329,0  1937
10  182X 5362,0  1908 7422,0  1938
I would like to reorganize the data into just two columns, one for year and one for population, like this:
   year    pop
1
2  16XX 4675,0
3  17XX 4739,3
4  17XX 4834,0
5  180X 4930,0
6  180X 5029,0
7  181X 5129,0
8  181X 5231,9
9  182X 5297,0
10 182X 5362,0
11 1900 6453,0
12 1901 6553,5
13 1902 6684,0
... ...    ...
21 1930 9981,2
22  ...
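The values from the variables/columns year2 and year3 are appended below year1, as are the corresponding population values.
I have considered the following:
(1) Looping over the population and year columns (n > 2) and adding their values as new observations to year1 and pop1 would work, but that seems unnecessarily cumbersome.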
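(2) I tried melt as follows, but either it cannot handle an id variable that spans multiple columns, or I did not implement it correctly.
df.melt <- melt(df, id = c("year1", "year2", ...))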
(3) Finally, I considered treating each year column as its own vector and appending the vectors to one another:
year.all <- c(df$year1, df$year2,...)
However, the above returns the following for year.all:
[1] 1 2 3 3 4 4 5 5 6 6 7 8 8 9 9 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 1 1 2 ...
rather than this:
[1] 16XX 17XX 17XX 180X 180X 181X 181X 182X 182X 1900 1901 1902...
If there is a straightforward way to accomplish this kind of restructuring, I would love to learn it. Many thanks for your help.
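A note on attempt (3): the integers in that output look like factor codes, which would be the case if readHTMLTable() returned the year columns as factors. That is an assumption, since the question does not show str(df). The sketch below uses two small hypothetical vectors, not the real scraped columns, to show the integer codes behind a factor and how converting to character first keeps the labels.

```r
# Hypothetical stand-ins for two scraped year columns, stored as factors
year1 <- factor(c("16XX", "17XX", "17XX", "180X"))
year2 <- factor(c("1900", "1901", "1902", "1903"))

# A factor stores integer codes that point at its sorted levels
codes <- as.integer(year1)

# Converting each column to character before combining keeps the labels
year.all <- c(as.character(year1), as.character(year2))
print(codes)
print(year.all)
```

If the columns do turn out to be factors, wrapping each one in as.character() before c() should give the year labels instead of the codes.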
Using the new feature in melt from data.table v1.9.5+:
require(data.table) # v1.9.5+
melt(setDT(df), measure = patterns("^year", "^pop"), value.name = c("year", "pop"))
You can find the rest of the vignettes here.
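To make the shape of the result concrete, here is a minimal sketch on a small stand-in table. The name wide and its four values per column group are illustrative assumptions, with all columns kept as character so that melt does not need to coerce types. Each pattern selects one group of columns, and the groups are stacked in column order.

```r
library(data.table)  # assumes data.table v1.9.6 or later is installed

# Small stand-in for the scraped table (hypothetical subset, all character)
wide <- data.table(
  year1 = c("16XX", "17XX"), pop1 = c("4675,0", "4739,3"),
  year2 = c("1900", "1901"), pop2 = c("6453,0", "6553,5")
)

# Each pattern picks one set of columns; value.name labels the stacked result
long <- melt(
  wide,
  measure.vars = patterns("^year", "^pop"),
  value.name   = c("year", "pop")
)

# `variable` records which column group (1, 2, ...) a row came from; drop it
long[, variable := NULL]
print(long)
```

The rows from year1/pop1 come first, followed by those from year2/pop2, which matches the layout asked for in the question.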
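If the 'year' and 'pop' columns alternate, we can subset with c(TRUE, FALSE) to get columns 1, 3, 5, ... and with c(FALSE, TRUE) to get columns 2, 4, 6, ..., because the logical index is recycled. Then we unlist the columns and create a new data.frame.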
df2 <- data.frame(year=unlist(df1[c(TRUE, FALSE)]),
pop=unlist(df1[c(FALSE, TRUE)]))
row.names(df2) <- NULL
head(df2)
# year pop
#1
#2 16XX 4675,0
#3 17XX 4739,3
#4 17XX 4834,0
#5 180X 4930,0
#6 180X 5029,0
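The recycling step can be checked on a smaller table. The sketch below uses a hypothetical four-column stand-in called wide, and as an optional extra it converts the decimal-comma populations to numeric, something the answer above does not do.

```r
# Small stand-in for the scraped table (hypothetical subset, all character)
wide <- data.frame(
  year1 = c("16XX", "17XX"), pop1 = c("4675,0", "4739,3"),
  year2 = c("1900", "1901"), pop2 = c("6453,0", "6553,5"),
  stringsAsFactors = FALSE
)

# c(TRUE, FALSE) is recycled over the four columns as TRUE, FALSE, TRUE, FALSE
year_cols <- names(wide)[c(TRUE, FALSE)]
pop_cols  <- names(wide)[c(FALSE, TRUE)]

# unlist() concatenates the selected columns one after another
long <- data.frame(
  year = unlist(wide[year_cols], use.names = FALSE),
  pop  = unlist(wide[pop_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Optional: the populations use a decimal comma, so swap it before converting
long$pop <- as.numeric(sub(",", ".", long$pop, fixed = TRUE))
print(long)
```

The year column stays character here because values such as "16XX" are not valid numbers.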
Another option is:
library(splitstackshape)
merged.stack(transform(df1, id=1:nrow(df1)), var.stubs=c('year', 'pop'),
sep='var.stubs')[order(.time_1), 3:4, with=FALSE]
The data used in the examples above:
df1 <- structure(list(year1 = c("", "16XX", "17XX", "17XX", "180X",
"180X", "181X", "181X", "182X", "182X"), pop1 = c("", "4675,0",
"4739,3", "4834,0", "4930,0", "5029,0", "5129,0", "5231,9", "5297,0",
"5362,0"), year2 = c(NA, 1900L, 1901L, 1902L, 1903L, 1904L, 1905L,
1906L, 1907L, 1908L), pop2 = c("", "6453,0", "6553,5", "6684,0",
"6818,0", "6955,0", "7094,0", "7234,7", "7329,0", "7422,0"),
year3 = c(NA, 1930L, 1931L, 1932L, 1933L, 1934L, 1935L, 1936L,
1937L, 1938L), pop3 = c("", "9981,2", "", "", "", "", "",
"", "", "")), .Names = c("year1", "pop1", "year2", "pop2",
"year3", "pop3"), class = "data.frame", row.names = c(NA, -10L))
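For comparison, base R's reshape() can do the same stacking without extra packages. This is a different technique from the ones in the answers, sketched here on a small hypothetical stand-in (wide) rather than the full df1. The names block and row for the time and id columns are arbitrary choices made for this example.

```r
# Small stand-in for the scraped table (hypothetical subset, all character)
wide <- data.frame(
  year1 = c("16XX", "17XX"), pop1 = c("4675,0", "4739,3"),
  year2 = c("1900", "1901"), pop2 = c("6453,0", "6553,5"),
  stringsAsFactors = FALSE
)

# Each element of `varying` lists the columns that collapse into one output
# column; `v.names` gives the names of those output columns
long <- reshape(
  wide,
  direction = "long",
  varying   = list(c("year1", "year2"), c("pop1", "pop2")),
  v.names   = c("year", "pop"),
  timevar   = "block",  # which column group a row came from
  idvar     = "row"     # original row number, created by reshape()
)

# Keep only the two columns of interest and reset the row names
long <- long[, c("year", "pop")]
row.names(long) <- NULL
print(long)
```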