How can I sample a large database and implement K-means and K-NN in R?

eri*_*hfw 13 r machine-learning large-data knn k-means

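I am a new R user trying to move away from SAS. I am asking here because I am a bit frustrated by all the packages and sources available for R, and I can't seem to get this working, mainly because of the data size.

I have the following:

A table named SOURCE in a local MySQL database, with 200 predictor features and one class variable. The table has 3 million records and is 3 GB in size. The number of instances per class is not equal.

I want to:

  1. Randomly sample the SOURCE database to create a smaller dataset with an equal number of instances per class.
  2. Split the sample into training and test sets.
  3. Perform k-means clustering on the training set to determine k centroids per class.
  4. Perform k-NN classification of the test data using the centroids.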
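
For step 1, one possible approach is to let MySQL do the sampling so that only the balanced sample is ever loaded into R. The sketch below is a minimal illustration, not a tested solution for this table: the column name `class`, the class labels, and the per-class size are assumptions, and `ORDER BY RAND()` can be slow on a table with millions of rows.

```r
# Build one sampling query per class.
# `table`, `class_col` and `classes` are placeholders (assumptions); the
# values are pasted into the SQL text, so only use trusted labels.
balanced_sample_queries <- function(table, class_col, classes, n_per_class) {
  vapply(classes, function(cl) {
    sprintf("SELECT * FROM %s WHERE %s = '%s' ORDER BY RAND() LIMIT %d",
            table, class_col, cl, as.integer(n_per_class))
  }, character(1), USE.NAMES = FALSE)
}

# Run every query on an open DBI connection and stack the results.
# A class with fewer than n_per_class rows simply returns fewer rows.
fetch_balanced_sample <- function(con, queries) {
  do.call(rbind, lapply(queries, function(q) DBI::dbGetQuery(con, q)))
}

queries <- balanced_sample_queries("SOURCE", "class", c("a", "b"), 1000)
```

With the DBI and RMySQL packages installed, this could be used roughly as `con <- DBI::dbConnect(RMySQL::MySQL(), dbname = "...")` followed by `smp <- fetch_balanced_sample(con, queries)`; the connection details depend on the local setup.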

Fra*_*Nut 0

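I can help with two of these: (1) stratified sampling and (2) splitting into training and test sets (i.e. calibration and validation).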

    # Toy data: an id, a numeric value n, and a grouping (class) variable s
    n  <- c(2.23, 3.5, 12, 2, 93, 57, 0.2, 33, 5, 2, 305, 5.3, 2, 3.9, 4)
    s  <- c("aa", "bb", "aa", "aa", "bb", "cc", "aa", "bb",
            "bb", "aa", "aa", "aa", "aa", "bb", "cc")
    id <- 1:15
    df <- data.frame(id, n, s)       # df is a data frame

    # Load the stratified() helper
    source("http://news.mrdwab.com/stratified")

    # Draw the same number of rows from every group
    selection <- stratified(df = df,
                            id = 1,     # column position of the ID variable;
                                        # create one if your data has none
                            group = 3,  # column position of the grouping (class) variable
                            size = 2,   # number of rows to draw per group
                            seed = "NULL")

    # Flag every selected row
    selection["cal_val"] <- 1

    # You now have a random selection stratified on column 3,
    # but it still has to be split into calibration and validation, so:
    selection_val <- stratified(df = selection,  # sample from the previous selection
                                id = 1,
                                group = 3,       # same grouping column as before
                                size = 1,        # half of the previous selection
                                seed = "NULL")
    selection_val["val"] <- 1

    # Merge the two selections; rows missing from selection_val get NA
    merged <- merge(selection, selection_val, all.x = TRUE, by = "id")
    merged[is.na(merged)] <- 0       # replace NA with 0

    # 1 = calibration, 2 = validation
    merged["calVal"] <- merged$cal_val.x + merged$cal_val.y

    # Keep only the useful columns
    final_sample <- data.frame(id = merged$id, n = merged$n.x, s = merged$s.x,
                               calval = merged$calVal)
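
The two remaining steps in the question (k centroids per class, then k-NN against those centroids) could be sketched as follows. This is only an illustration under stated assumptions: it uses the built-in `iris` data in place of the real sample, a 70/30 split, and 3 centroids per class, all of which are arbitrary choices. `kmeans()` comes from the base `stats` package and `knn()` from the `class` package.

```r
library(class)  # provides knn()

set.seed(42)
features <- names(iris)[1:4]   # numeric predictors
k_centroids <- 3               # centroids per class (assumed value)

# Stratified 70/30 split: sample row indices within each class
train_idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                           function(i) sample(i, size = floor(0.7 * length(i)))))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Run k-means separately on the training rows of each class
centroid_list <- lapply(split(train[, features], train$Species), function(x) {
  kmeans(x, centers = k_centroids, nstart = 10)$centers
})
centroids <- do.call(rbind, centroid_list)

# Label each centroid with the class it was computed from
centroid_class <- factor(rep(names(centroid_list), each = k_centroids),
                         levels = levels(train$Species))

# Classify each test row by its nearest centroid (1-NN on the centroids)
pred <- knn(train = centroids, test = test[, features],
            cl = centroid_class, k = 1)
accuracy <- mean(pred == test$Species)
```

Both k-means and k-NN rely on Euclidean distance, so with 200 predictors on different scales it is probably worth standardizing the features (for example with `scale()`, applying the training-set means and standard deviations to the test set) before clustering.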