Computing kmeans on training and test sets using dplyr and broom

Question

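I am using dplyr and broom to compute kmeans for my data. My data contains a test set and a training set of X and Y coordinates, grouped by some parameter value (lambda in this case):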

mds.test = data.frame()
for(l in seq(0.1, 0.9, by=0.2)) {
  new.dist <- run.distance.model(x, y, lambda=l)
  mds <- preform.mds(new.dist, ndim=2)
  mds.test <- rbind(mds.test, cbind(mds$space, design[,c(1,3,4,5)], lambda=rep(l, nrow(mds$space)), data="test"))
}

> head(mds.test)
                        Comp1       Comp2 Transcripts Genes Timepoint Run lambda data
7A_0_AAGCCTAGCGAC -0.06690476 -0.25519106       68125  9324     Day 0  7A    0.1 test
7A_0_AAATGACTGGCC -0.15292848  0.04310200       28443  6746     Day 0  7A    0.1 test
7A_0_CATCTCGTTCTA -0.12529445  0.13022908       27360  6318     Day 0  7A    0.1 test
7A_0_ACCGGCACATTC -0.33015913  0.14647857       23038  5709     Day 0  7A    0.1 test
7A_0_TATGTCGGAATG -0.25826098  0.05424976       22414  5878     Day 0  7A    0.1 test
7A_0_GAAAAAGGTGAT -0.24349387  0.08071162       21907  6766     Day 0  7A    0.1 test

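Above is the head of my test data set, but I also have a data frame named mds.train that contains my training data coordinates. My ultimate goal is to run k-means on both sets, grouped by lambda, and then compute within.ss, between.ss and total.ss for the test data against the training centers. Thanks to a great resource on broom, I can run kmeans for each lambda on the test set simply by doing the following: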

test.kclusts  = mds.test %>% 
  group_by(lambda) %>% 
  do(kclust=kmeans(cbind(.$Comp1, .$Comp2), centers=length(unique(design$Timepoint))))

I can then compute the centers of each cluster within each lambda:

test.clusters = test.kclusts %>% 
  group_by(lambda) %>%  
  do(tidy(.$kclust[[1]]))

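This is where I am stuck. How do I compute the point assignments as shown on the reference page (e.g. kclusts %>% group_by(k) %>% do(augment(.$kclust[[1]], points.matrix))), given that my points.matrix is mds.test, a data.frame with length(unique(mds.test$lambda)) times as many rows as it should have? And is there some way to use the training set centers with glance() to compute statistics based on the test assignments?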

Any help would be greatly appreciated! Thanks!

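EDIT: Progress update. I have figured out how to aggregate the test/training assignments, but I am still having trouble computing kmeans statistics across the two sets (training assignments on the test centers and test assignments on the training centers). The updated code is below: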

test.kclusts  = mds.test %>% group_by(lambda) %>% do(kclust=kmeans(cbind(.$Comp1, .$Comp2), centers=length(unique(design$Timepoint))))
test.clusters = test.kclusts %>% group_by(lambda) %>%  do(tidy(.$kclust[[1]])) 
test.clusterings = test.kclusts %>% group_by(lambda) %>% do(glance(.$kclust[[1]]))
test.assignments = left_join(test.kclusts, mds.test) %>% group_by(lambda) %>% do(augment(.$kclust[[1]], cbind(.$Comp1, .$Comp2)))

train.kclusts  = mds.train %>% group_by(lambda) %>% do(kclust=kmeans(cbind(.$Comp1, .$Comp2), centers=length(unique(design$Timepoint))))
train.clusters = train.kclusts %>% group_by(lambda) %>%  do(tidy(.$kclust[[1]])) 
train.clusterings = train.kclusts %>% group_by(lambda) %>% do(glance(.$kclust[[1]]))
train.assignments = left_join(train.kclusts, mds.train) %>% group_by(lambda) %>% do(augment(.$kclust[[1]], cbind(.$Comp1, .$Comp2)))

test.assignments$data = "test"
train.assignments$data = "train"
merge.assignments = rbind(test.assignments, train.assignments)
merge.assignments %>% filter(., data=='test') %>% group_by(lambda) ... ?

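I have attached a plot illustrating my progress so far. To reiterate, I want to compute the kmeans statistics (within sum of squares, between sum of squares and total sum of squares) for the training data centers on the test assignments/coordinates (the plots where the centers look off).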

Answer 1

Bry*_*way 3

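One approach would be to...

Extract the table of cluster centroids (built on the training set) via broom.
Calculate the distance from each point in the test set to each cluster centroid built on the training set. This can be done with the fuzzyjoin package.
The cluster centroid with the shortest Euclidean distance to a test point represents that point's assigned cluster.
From there you can calculate any metrics of interest.

See below, using a simpler dataset taken from the clustering example in tidymodels.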

library(tidyverse)
library(rsample)
library(broom)
library(fuzzyjoin)

# data and train / test set-up
set.seed(27)
centers <- tibble(
  cluster = factor(1:3), 
  num_points = c(100, 150, 50),  # number points in each cluster
  x1 = c(5, 0, -3),              # x1 coordinate of cluster center
  x2 = c(-1, 1, -2)              # x2 coordinate of cluster center
)

labelled_points <- 
  centers %>%
  mutate(
    x1 = map2(num_points, x1, rnorm),
    x2 = map2(num_points, x2, rnorm)
  ) %>% 
  select(-num_points) %>% 
  unnest(cols = c(x1, x2))

points <- 
  labelled_points %>% 
  select(-cluster)

set.seed(1234)

split <- rsample::initial_split(points)
train <- rsample::training(split)
test <- rsample::testing(split)

# Fit kmeans on train then assign clusters to test
kclust <- kmeans(train, centers = 3)

clust_centers <- kclust %>% 
  tidy() %>% 
  select(-c(size, withinss))

test_clusts <- fuzzyjoin::distance_join(mutate(test, index = row_number()), 
                         clust_centers,
                         max_dist = Inf,
                         method = "euclidean",
                         distance_col = "dist") %>% 
  group_by(index) %>% 
  filter(dist == min(dist)) %>% 
  ungroup()
#> Joining by: c("x1", "x2")

# resulting table
test_clusts
#> # A tibble: 75 x 7
#>     x1.x    x2.x index  x1.y  x2.y cluster  dist
#>    <dbl>   <dbl> <int> <dbl> <dbl> <fct>   <dbl>
#>  1  4.24 -0.946      1  5.07 -1.10 3       0.847
#>  2  3.54  0.287      2  5.07 -1.10 3       2.06 
#>  3  3.71 -1.67       3  5.07 -1.10 3       1.47 
#>  4  5.03 -0.788      4  5.07 -1.10 3       0.317
#>  5  6.57 -2.49       5  5.07 -1.10 3       2.04 
#>  6  4.97  0.233      6  5.07 -1.10 3       1.34 
#>  7  4.43 -1.89       7  5.07 -1.10 3       1.01 
#>  8  5.34 -0.0705     8  5.07 -1.10 3       1.07 
#>  9  4.60  0.196      9  5.07 -1.10 3       1.38 
#> 10  5.68 -1.55      10  5.07 -1.10 3       0.758
#> # ... with 65 more rows

# calc within clusts SS on test
test_clusts %>% 
  group_by(cluster) %>% 
  summarise(size = n(),
            withinss = sum(dist^2),
            withinss_avg = withinss / size)
#> # A tibble: 3 x 4
#>   cluster  size withinss withinss_avg
#>   <fct>   <int>    <dbl>        <dbl>
#> 1 1          11     32.7         2.97
#> 2 2          35     78.9         2.26
#> 3 3          29     62.0         2.14

# compare to on train
tidy(kclust) %>% 
  mutate(withinss_avg = withinss / size)
#> # A tibble: 3 x 6
#>        x1    x2  size withinss cluster withinss_avg
#>     <dbl> <dbl> <int>    <dbl> <fct>          <dbl>
#> 1 -3.22   -1.91    40     76.8 1               1.92
#> 2  0.0993  1.06   113    220.  2               1.95
#> 3  5.07   -1.10    72    182.  3               2.53

# plot of test and train points
test_clusts %>% 
  select(x1 = x1.x, x2 = x2.x, cluster) %>% 
  mutate(type = "test") %>% 
  bind_rows(
    augment(kclust, train) %>% 
      mutate(type = "train") %>% 
      rename(cluster = .cluster)
    ) %>% 
  ggplot(aes(x = x1, 
             y = x2, 
             color = as.factor(cluster)))+
  geom_point()+
  facet_wrap(~fct_rev(as.factor(type)))+
  coord_fixed()+
  labs(title = "Cluster Assignment on Training and Holdout Datasets",
       color = "Cluster")+
  theme_bw()

Created on 2021-08-19 by the reprex package (v2.0.0)

(See the comments on the OP for a link to a conversation about simplifying this in tidymodels.)
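
For the per-lambda case in the question, the same idea can be sketched in base R without fuzzyjoin. This is only a rough sketch under some assumptions: score_kmeans() is a hypothetical helper (it is not part of broom or stats), toy.train and toy.test are simulated stand-ins for mds.train and mds.test with the same Comp1, Comp2 and lambda columns, and betweenss is taken as totss - tot.withinss, following the convention of stats::kmeans(). Because the centers come from the training set rather than from the test points themselves, that difference is not a true between-cluster sum of squares and could even be negative.

```r
# Hypothetical helper: score new points against the centers of a fitted kmeans object.
score_kmeans <- function(fit, newdata) {
  x <- as.matrix(newdata)
  centers <- fit$centers
  # Squared Euclidean distance from every point (rows) to every center (columns)
  d2 <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(x, 2, centers[k, ])^2)
  })
  d2 <- matrix(d2, nrow = nrow(x))
  # Assign each point to its nearest training center
  cluster <- max.col(-d2, ties.method = "first")
  nearest <- d2[cbind(seq_len(nrow(x)), cluster)]
  # Per-cluster within sum of squares; clusters with no test points get 0
  withinss <- as.numeric(tapply(
    nearest, factor(cluster, levels = seq_len(nrow(centers))), sum
  ))
  withinss[is.na(withinss)] <- 0
  # Total sum of squares of the new points around their own mean
  totss <- sum(sweep(x, 2, colMeans(x))^2)
  tot.withinss <- sum(withinss)
  list(cluster = cluster, withinss = withinss, tot.withinss = tot.withinss,
       totss = totss,
       betweenss = totss - tot.withinss)  # assumption: kmeans() convention
}

# Simulated stand-ins for mds.train / mds.test (toy data, not the OP's data)
set.seed(1)
make_toy <- function(n) {
  do.call(rbind, lapply(c(0.1, 0.3), function(l) {
    data.frame(Comp1 = rnorm(n, mean = rep(c(-3, 3), length.out = n)),
               Comp2 = rnorm(n, mean = rep(c(-3, 3), length.out = n)),
               lambda = l)
  }))
}
toy.train <- make_toy(60)
toy.test <- make_toy(20)

coords <- c("Comp1", "Comp2")
# Fit on each lambda's training rows, then score the matching test rows
test.stats <- do.call(rbind, Map(function(tr, te) {
  fit <- kmeans(tr[, coords], centers = 2, nstart = 10)
  s <- score_kmeans(fit, te[, coords])
  data.frame(lambda = tr$lambda[1], totss = s$totss,
             tot.withinss = s$tot.withinss, betweenss = s$betweenss)
}, split(toy.train, toy.train$lambda), split(toy.test, toy.test$lambda)))
test.stats
```

Swapping the roles of the two data frames in the Map() call would give the reverse comparison (training points scored against test centers).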
