Partitioning a data set in R based on multiple classes of observations

Dan*_*nny 15 random partitioning r

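I'm trying to partition a data set in R, with 2/3 for training and 1/3 for testing. I have one classification variable and seven numerical variables. Each observation is classified as A, B, C, or D.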

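For simplicity, say the classification variable, cl, is A for the first 100 observations, B for observations 101 to 200, C for 201 to 300, and D for 301 to 400. I'm trying to get a partition that holds 2/3 of the observations for each of A, B, C, and D (as opposed to simply taking 2/3 of the observations from the whole data set, which would likely not contain equal amounts of each classification).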

When I try to sample from a subset of the data, for example sample(subset(data, cl=='A')), the columns are reordered instead of the rows.
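
The column reordering follows from how sample() sees a data frame: it is a list of columns, so sample() permutes those. A minimal sketch of the difference, using a small made-up data frame (the column names num1 and num2 and the sample size of 3 are only illustrative):

```r
# toy data frame standing in for the real data (names are hypothetical)
set.seed(1)  # arbitrary seed, only so the result can be reproduced
data <- data.frame(
    cl   = rep(c("A", "B"), each = 5),
    num1 = rnorm(10),
    num2 = rnorm(10)
)

a <- subset(data, cl == "A")

# sample() treats a data frame as a list of columns, so this returns
# all rows of a with the columns in a random order
shuffled.cols <- sample(a)

# sampling row numbers and subsetting by row picks random rows instead
picked.rows <- a[sample(nrow(a), size = 3), ]
```

Here shuffled.cols still has every row of a, while picked.rows keeps the column order and holds 3 randomly chosen rows.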

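To summarize, my goal is to draw 67 random observations from each of A, B, C, and D as my training data, and to store the remaining 33 observations from each of A, B, C, and D as testing data. I found a question very similar to mine, but it did not account for multiple variables.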

Ste*_*son 17

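There's actually a nice package, caret, for dealing with machine learning problems, and it contains a function createDataPartition() that pretty much samples 2/3rds from each level of a supplied factor: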

#2/3rds for training
library(caret)
inTrain = createDataPartition(df$yourFactor, p = 2/3, list = FALSE)
dfTrain=df[inTrain,]
dfTest=df[-inTrain,]
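
A quick way to check such a split is to tabulate the factor in each part. The sketch below is not part of the answer above: it uses the built-in iris data in place of df, the names irisTrain and irisTest are made up, and it assumes caret is installed (the guard simply skips the demo if it is not).

```r
# Sketch: check a createDataPartition() split on the built-in iris data.
# Assumption: the caret package is installed; otherwise nothing is run.
set.seed(1)  # arbitrary seed, only so the split can be reproduced

if (requireNamespace("caret", quietly = TRUE)) {
    inTrain <- caret::createDataPartition(iris$Species, p = 2/3, list = FALSE)
    irisTrain <- iris[inTrain, ]
    irisTest  <- iris[-inTrain, ]

    # per-class counts in each part; every class should be split roughly 2:1
    print(table(irisTrain$Species))
    print(table(irisTest$Species))
}
```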


Ant*_*ico 5

This may be longer, but I think it's more intuitive, and it can be done in base R ;)

# create the data frame you've described
x <-
    data.frame(
        cl = 
            c( 
                rep( 'A' , 100 ) ,
                rep( 'B' , 100 ) ,
                rep( 'C' , 100 ) ,
                rep( 'D' , 100 ) 
            ) ,

        othernum1 = rnorm( 400 ) ,
        othernum2 = rnorm( 400 ) ,
        othernum3 = rnorm( 400 ) ,
        othernum4 = rnorm( 400 ) ,
        othernum5 = rnorm( 400 ) ,
        othernum6 = rnorm( 400 ) ,
        othernum7 = rnorm( 400 ) 
    )

# sample 67 training rows within classification groups
training.rows <-
    tapply( 
        # numeric vector containing the numbers
        # 1 to nrow( x )
        1:nrow( x ) , 

        # break the sample function out by
        # the classification variable
        x$cl , 

        # use the sample function within
        # each classification variable group
        sample , 

        # send the size = 67 parameter
        # through to the sample() function
        size = 67 
    )

# convert your list back to a numeric vector
tr <- unlist( training.rows )

# split your original data frame into two:

# all the records sampled as training rows
training.df <- x[ tr , ]

# all other records (NOT sampled as training rows)
testing.df <- x[ -tr , ]
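
Note that size = 67 relies on every class having exactly 100 rows. If the classes are unequal, one possible variant (a sketch, not part of the answer above) draws a fraction of each class instead. The class sizes 100, 60, 30 and 10 and the name rows.by.class are made up for illustration:

```r
set.seed(1)  # arbitrary seed, only so the split can be reproduced

# hypothetical data with unequal class sizes
x <- data.frame(
    cl = rep(c("A", "B", "C", "D"), times = c(100, 60, 30, 10)),
    othernum1 = rnorm(200)
)

# row numbers grouped by class
rows.by.class <- split(seq_len(nrow(x)), x$cl)

# within each class, draw round(2/3 * class size) row numbers;
# indexing idx sidesteps sample()'s special case for a single number
tr <- unlist(lapply(rows.by.class, function(idx) {
    idx[sample.int(length(idx), size = round(length(idx) * 2 / 3))]
}))

training.df <- x[tr, ]
testing.df  <- x[-tr, ]

# per-class counts for each part
table(training.df$cl)
table(testing.df$cl)
```

With these class sizes the training part holds 67, 40, 20 and 7 rows for A, B, C and D, and the testing part holds the rest.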