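This question is related to a post with a similar title (Replace NA in an R vector with adjacent values). I want to scan a column in a data frame and replace each NA with the value from an adjacent cell. In the post above, the solution did not replace NA with a value from an adjacent vector (e.g., an adjacent element in the data matrix); it was a conditional replacement with a fixed value. Here is a reproducible example of my problem: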
UNIT <- c(NA,NA, 200, 200, 200, 200, 200, 300, 300, 300,300)
STATUS <-c('ACTIVE','INACTIVE','ACTIVE','ACTIVE','INACTIVE','ACTIVE','INACTIVE','ACTIVE','ACTIVE',
'ACTIVE','INACTIVE')
TERMINATED <- c('1999-07-06' , '2008-12-05' , '2000-08-18' , '2000-08-18' ,'2000-08-18' ,'2008-08-18',
'2008-08-18','2006-09-19','2006-09-19' ,'2006-09-19' ,'1999-03-15')
START <- c('2007-04-23','2008-12-06','2004-06-01','2007-02-01','2008-04-19','2010-11-29','2010-12-30',
'2007-10-29','2008-02-05','2008-06-30','2009-02-07')
STOP <- c('2008-12-05','4712-12-31','2007-01-31','2008-04-18','2010-11-28','2010-12-29','4712-12-31',
'2008-02-04','2008-06-29','2009-02-06','4712-12-31')
TEST <- data.frame(UNIT, STATUS, TERMINATED, START, STOP)
TEST
UNIT STATUS TERMINATED START STOP
1 NA ACTIVE 1999-07-06 2007-04-23 2008-12-05
2 NA INACTIVE 2008-12-05 2008-12-06 4712-12-31
3 200 ACTIVE 2000-08-18 2004-06-01 2007-01-31
4 200 ACTIVE 2000-08-18 2007-02-01 2008-04-18
5 200 INACTIVE 2000-08-18 2008-04-19 2010-11-28
6 200 …
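One possible approach, assuming "adjacent" means the next non-missing value further down the same column (the question is cut off, so this is a guess): walk the vector from the bottom up and copy each value into the NA directly above it. The helper name `fill_from_next` is made up for this sketch.

```r
# Hypothetical helper: replace each NA with the next non-NA value below it.
fill_from_next <- function(x) {
  # Walk from the second-to-last element up to the first.
  for (i in rev(seq_len(max(length(x) - 1, 0)))) {
    if (is.na(x[i])) x[i] <- x[i + 1]
  }
  x
}

UNIT <- c(NA, NA, 200, 200, 200, 200, 200, 300, 300, 300, 300)
UNIT_filled <- fill_from_next(UNIT)
UNIT_filled
```

Trailing NAs are left untouched because there is nothing below them to copy. If the zoo package is available, `na.locf(UNIT, fromLast = TRUE)` should give the same fill for this example.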
Suppose there are many data frames that need the same operation performed on them. For example:
prefix <- c("Mrs.","Mrs.","Mr","Dr.","Mrs.","Mr.","Mrs.","Ms","Ms","Mr")
measure <- rnorm(10)
df1 <- data.frame(prefix,measure)
df1$gender[df1$prefix=="Mrs."] <- "F"
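An indicator variable named gender is created when the value in the adjacent row is "Mrs.". The general approach for looping over string variables in R was adapted from here, with the as.name() function added to strip the quotes from "i":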
dflist <- c("df1","df2","df3","df4","df5")
for (i in dflist) {
as.name(i)$gender[as.name(i)$prefix=="Ms."] <- "F"
}
Unfortunately, this doesn't work. Any suggestions?
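A likely reason for the failure: `as.name(i)` returns a symbol, not the data frame itself, so `$` cannot be applied to it. A sketch of one way around this, assuming the data frames can be kept in a named list (the `make_df` helper and its data are invented for illustration):

```r
set.seed(1)

# Hypothetical stand-ins for df1, df2, ...
make_df <- function() {
  data.frame(prefix = c("Mrs.", "Mr.", "Ms", "Dr."), measure = rnorm(4))
}
dfs <- list(df1 = make_df(), df2 = make_df())

# Apply the same recoding to every data frame in the list.
dfs <- lapply(dfs, function(d) {
  d$gender <- NA_character_
  d$gender[d$prefix == "Mrs."] <- "F"
  d
})

dfs$df1
```

Data frames that already exist in the workspace could be gathered into such a list with `mget(dflist)`, and each result is then available as `dfs$df1`, `dfs$df2`, and so on.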
Suppose there is a time series of daily animal activity at a zoo over many years. A subset of a very large dataset might look like this:
library(data.table)
type <- c(rep('giraffe',90),rep('monkey',90),rep('anteater',90))
status <- as.factor(c(rep('display',31),rep('caged',28),rep('display',31),
rep('caged',25), rep('display',35),rep('caged',30),rep('caged',10),
rep('display',10),rep('caged',10),rep('display',60)))
date <- rep(seq.Date( as.Date("2001-01-01"), as.Date("2001-03-31"), "day" ),3)
'type' is the animal type, and 'status' indicates what the animal was doing that day, e.g., caged or on display.
animals <- data.table(type,status,date);animals
type status date
1: giraffe display 2001-01-01
2: giraffe display 2001-01-02
3: giraffe display 2001-01-03
4: giraffe display 2001-01-04
5: giraffe display 2001-01-05
---
266: anteater display 2001-03-27
267: anteater display 2001-03-28
268: anteater display 2001-03-29
269: anteater display 2001-03-30
270: anteater display 2001-03-31
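Suppose we want to aggregate this into a monthly series that lists each animal's status information for the whole month. In the new series, 'status' reflects the animal's status at the start of the month. 'fullmonth' is a binary variable (1 = TRUE, 0 = FALSE) indicating whether that status persisted for the entire month, and 'anydisp' is a binary variable (1 = TRUE, 0 = FALSE) indicating whether the animal was on display at any time during the month (>= 1 day). So, because the giraffe was on display for all of January and March but caged for February, it is flagged accordingly.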
date <- rep(seq.Date( …
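A sketch of the monthly aggregation described above, assuming data.table is installed and that the desired output has one row per animal and month (the target table is cut off, so the `month` column and its "YYYY-MM" format are assumptions):

```r
library(data.table)

type <- c(rep('giraffe', 90), rep('monkey', 90), rep('anteater', 90))
status <- as.factor(c(rep('display', 31), rep('caged', 28), rep('display', 31),
                      rep('caged', 25), rep('display', 35), rep('caged', 30),
                      rep('caged', 10), rep('display', 10), rep('caged', 10),
                      rep('display', 60)))
date <- rep(seq.Date(as.Date("2001-01-01"), as.Date("2001-03-31"), "day"), 3)
animals <- data.table(type, status, date)

# Sort by animal and day, then summarise each animal-month group.
monthly <- animals[order(type, date), .(
  status    = as.character(status[1]),                 # status on the first day
  fullmonth = as.integer(uniqueN(status) == 1),        # one status all month?
  anydisp   = as.integer(any(status == "display"))     # displayed at least once?
), by = .(type, month = format(date, "%Y-%m"))]

monthly
```

Grouping on a formatted month string keeps the example short; on a very large table a precomputed integer month key might be faster.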
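I am trying to order a vector of states. I understand this should be simple, but I can't work it out. I have looked at other posts that propose complicated solutions using vapply(...), but that seems unnecessary. I have the following: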
state.vec = c("AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "CD", "FL", "AG", "HI", "DI",
"IL", "IN", "AI", "KS", "KY", "AL", "EM", "DM", "AM", "IM", "MN", "MS", "MO", "MT", "EN", "NV",
"HN", "JN", "MN", "NY", "CN", "DN", "HO", "KO", "OR", "AP", "PR", "IR", "CS", "DS", "NT", "TX",
"TU", "TV", "IV", "AV", "AW","VW", "IW", "WY", "GU")
Unfortunately, order converts the values to an integer ordering:
order(state.vec)
2 1 4 3 5 6 7 9 8 10 11 54 12 16 13 14 15 17 18 19 22 21 …
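For context, `order()` returns the permutation of positions that would sort the vector, not the sorted values themselves. A short sketch using a shortened, made-up vector (`states` is not from the question):

```r
# Hypothetical shortened vector, for illustration only.
states <- c("NY", "AL", "TX", "CA")

idx <- order(states)          # positions that would sort the vector
sorted_by_index <- states[idx]
sorted_direct <- sort(states) # same result in one call

sorted_direct
```

Indexing with the result of `order()` is mainly useful when other vectors or data frame rows must be rearranged in the same way; for a single vector, `sort()` is enough.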
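I am working on overlaying several density plots to create a single figure in ggplot2. Suppose I have data points for each December across a series of years (2004-2012 in this case), and I want to plot the density function for each December month+year and overlay them. I want to highlight the density line for one particular month by making it dashed, while all the other density lines are solid. I have a reproducible example as follows: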
#Create vector of data for December
set.seed(12345)
dec_emas = as.matrix(rnorm(496, 122, 250))
#create indicators for Dec04 ... Dec11, then attach to data frame w/ estimates
declab = c('Dec04', 'Dec05', 'Dec06', 'Dec07', 'Dec08', 'Dec09', 'Dec10', 'Dec11')
declabs = rep(declab, 62)
rownames(dec_emas) = declabs
colnames(dec_emas) = 'EMA'
#add in factor ID for the 8 levels
dec04 = as.numeric(rownames(dec_emas) == 'Dec04')
dec05 = as.numeric(rownames(dec_emas) == 'Dec05')
dec06 = as.numeric(rownames(dec_emas) == 'Dec06')
dec07 = as.numeric(rownames(dec_emas) == 'Dec07')
dec08 = as.numeric(rownames(dec_emas) == 'Dec08')
dec09 …
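I am working with a large dataset that has n covariates. Many of the rows are duplicates. To identify the duplicates, I need to create an identifying variable from a subset of the covariates. That is, (n-x) covariates are irrelevant. I want to concatenate the values of the x covariates to uniquely identify observations and eliminate duplicates.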
set.seed(1234)
UNIT <- c(1,1,1,1,2,2,2,3,3,3,4,4,4,5,6,6,6)
DATE <- c("1/1/2010","1/1/2010","1/1/2010","1/2/2012","1/2/2009","1/2/2004","1/2/2005","1/2/2005",
"1/1/2011","1/1/2011","1/1/2011","1/1/2009","1/1/2008","1/1/2008","1/1/2012","1/1/2013",
"1/1/2012")
OUT1 <- c(300,400,400,400,600,700,700,800,800,800,900,700,700,100,100,100,500)
JUNK1 <- c(rnorm(17,0,1))
JUNK2 <- c(rnorm(17,0,1))
test = data.frame(UNIT,DATE,OUT1,JUNK1,JUNK2)
'test' is a sample data frame. The variables I need to use to uniquely identify observations are 'UNIT', 'DATE', and 'OUT1'. For example,
head(test)
UNIT DATE OUT1 JUNK1 JUNK2
1 1 1/1/2010 300 -1.2070657 -0.9111954
2 1 1/1/2010 400 0.2774292 -0.8371717
3 1 1/1/2010 400 1.0844412 2.4158352
4 1 1/2/2012 400 -2.3456977 0.1340882
5 2 1/2/2009 600 0.4291247 -0.4906859
6 2 1/2/2004 700 0.5060559 -0.4405479
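Observations 1 and 4 are not duplicated in the dataset. Observations 2 and 3 are duplicates. The new dataset I want to create would keep observations 1 and 4 and only one of 2 and 3. The solution I tried was: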
subset(test, !duplicated(c(UNIT,DATE,OUT1)))
Unfortunately, this does not solve the problem:
UNIT DATE OUT1 JUNK1 JUNK2
1 1 …
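A probable cause: `c(UNIT, DATE, OUT1)` glues the three columns end to end into one long vector, so `duplicated()` compares individual values rather than rows. A sketch of a row-wise alternative, using only the first six rows of the example (the name `key_cols` is made up here):

```r
# First six rows of the example data; JUNK columns omitted for brevity.
test <- data.frame(
  UNIT = c(1, 1, 1, 1, 2, 2),
  DATE = c("1/1/2010", "1/1/2010", "1/1/2010", "1/2/2012", "1/2/2009", "1/2/2004"),
  OUT1 = c(300, 400, 400, 400, 600, 700)
)

# Columns that together identify an observation.
key_cols <- c("UNIT", "DATE", "OUT1")

# duplicated() on a data frame compares whole rows of the selected columns.
dedup <- test[!duplicated(test[, key_cols]), ]
dedup
```

This keeps the first occurrence of each key combination, so row 3 is dropped and row 2 is kept; no concatenated identifier column is needed.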
Suppose I am interested in concatenating two variables. I start with a dataset like this:
#what I have
A <- rep(paste("125"),50)
B <- rep(paste("48593"),50)
C <- rep(paste("99"),50)
D <- rep(paste("1233"),50)
one <- append(A,C)
two <- append(B,D)
have <- data.frame(one,two); head(have)
one two
1 125 48593
2 125 48593
3 125 48593
4 125 48593
5 125 48593
6 125 48593
A simple paste command handles the concatenation:
#half way there
half <- paste(one,two,sep="-");head(half)
[1] "125-48593" "125-48593" "125-48593" "125-48593" "125-48593" "125-48593"
But I actually want a dataset that looks like this:
#what I desire
E <- rep(paste("00125"),50)
F <- rep(paste("0048593"),50)
G <- rep(paste("00099"),50)
H <- rep(paste("0001233"),50)
three <- append(E,G)
four …
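One way to get the leading zeros, sketched under the assumption that the first variable should be padded to width 5 and the second to width 7 (widths inferred from the vectors E through H above) and that every value is a string of digits:

```r
one <- c(rep("125", 50), rep("99", 50))
two <- c(rep("48593", 50), rep("1233", 50))

# Convert to integer so sprintf can left-pad with zeros to a fixed width.
three <- sprintf("%05d", as.integer(one))
four  <- sprintf("%07d", as.integer(two))

# Padded values joined with the same separator as the half-way attempt.
joined <- paste(three, four, sep = "-")
head(joined)
```

Values containing non-digit characters would become NA under `as.integer()`, so this only suits purely numeric codes.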
Suppose I have a dataset of units that can change their activity status from active to inactive. Each time a unit changes activity, I want to record the switch from active to inactive. A reproducible example:
UNIT <- c(100,100, 200, 200, 200, 200, 200, 300, 300, 300,300)
STATUS <- c('ACTIVE','INACTIVE','ACTIVE','ACTIVE','INACTIVE','ACTIVE','INACTIVE','ACTIVE','ACTIVE',
'ACTIVE','INACTIVE')
TERMINATED <- c('1999-07-06' , '2008-12-05' , '2000-08-18' , '2000-08-18' ,'2000-08-18' ,'2008-08-18',
'2008-08-18','2006-09-19','2006-09-19' ,'2006-09-19' ,'1999-03-15')
START <- c('2007-04-23','2008-12-06','2004-06-01','2007-02-01','2008-04-19','2010-11-29','2010-12-30',
'2007-10-29','2008-02-05','2008-06-30','2009-02-07')
STOP <- c('2008-12-05','4712-12-31','2007-01-31','2008-04-18','2010-11-28','2010-12-29','4712-12-31',
'2008-02-04','2008-06-29','2009-02-06','4712-12-31')
DAT <- data.frame(UNIT,STATUS,TERMINATED,START,STOP)
DAT
UNIT STATUS TERMINATED START STOP
1 100 ACTIVE 1999-07-06 2007-04-23 2008-12-05
2 100 INACTIVE 2008-12-05 2008-12-06 4712-12-31
3 200 ACTIVE 2000-08-18 2004-06-01 2007-01-31
4 200 ACTIVE 2000-08-18 2007-02-01 2008-04-18
5 200 INACTIVE 2000-08-18 2008-04-19 2010-11-28
6 200 ACTIVE …
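A sketch of one way to flag the switches, assuming the rows are already sorted by UNIT and START and that a "switch" means an INACTIVE row whose previous row for the same unit was ACTIVE (the desired output is cut off, so the `SWITCH` column is an assumption):

```r
UNIT <- c(100, 100, 200, 200, 200, 200, 200, 300, 300, 300, 300)
STATUS <- c('ACTIVE', 'INACTIVE', 'ACTIVE', 'ACTIVE', 'INACTIVE', 'ACTIVE',
            'INACTIVE', 'ACTIVE', 'ACTIVE', 'ACTIVE', 'INACTIVE')

n <- length(STATUS)
# TRUE where the row belongs to the same unit as the row before it.
same_unit <- c(FALSE, UNIT[-1] == UNIT[-n])
# Status of the previous row (NA for the very first row).
prev_status <- c(NA, STATUS[-n])

# 1 where a unit goes from ACTIVE to INACTIVE, 0 otherwise.
SWITCH <- as.integer(same_unit & prev_status == "ACTIVE" & STATUS == "INACTIVE")
data.frame(UNIT, STATUS, SWITCH)
```

The first row of each unit is never flagged, because there is no earlier status for that unit to compare against.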
Many posts (like this one) claim that the ff package is superior to bigmemory because it can handle objects with both atomic and non-atomic components, but how? For example:
UNIT <- c(100,100, 200, 200, 200, 200, 200, 300, 300, 300,300)
STATUS <- c('ACTIVE','INACTIVE','ACTIVE','ACTIVE','INACTIVE','ACTIVE','INACTIVE','ACTIVE',
'ACTIVE','ACTIVE','INACTIVE')
TERMINATED <- as.Date(c('1999-07-06','2008-12-05','2000-08-18','2000-08-18','2000-08-18',
'2008-08-18','2008-08-18','2006-09-19','2006-09-19','2006-09-19',
'1999-03-15'))
START <- as.Date(c('2007-04-23','2008-12-06','2004-06-01','2007-02-01','2008-04-19',
'2010-11-29','2010-12-30','2007-10-29','2008-02-05','2008-06-30',
'2009-02-07'))
STOP <- as.Date(c('2008-12-05','2012-12-31','2007-01-31','2008-04-18','2010-11-28',
'2010-12-29','2012-12-31','2008-02-04','2008-06-29','2009-02-06',
'2012-12-31'))
TEST <- data.frame(UNIT,STATUS,TERMINATED,START,STOP)
TEST
#install.packages('ff')
library('ff')
TEST2 <- ffdf(TEST)
Error in ffdf(TEST) : ffdf components must be atomic ff objects
What can I do to make this work?