如何在R中以NEWICK格式附加群集（树）节点的自举值

Question

如何在R中以NEWICK格式附加群集（树）节点的自举值

New*_*e_R 4 tree r cluster-analysis dendrogram pvclust

我想使用交互式生命树基于Web的工具（iTOL）来制作树（集群）。作为输入文件（或字符串），此工具使用Newick格式，这是一种使用括号和逗号来表示具有边长的图论树的方法。除此之外，还可能支持其他信息，例如群集节点的自举值。

例如，在这里我使用包创建了用于聚类分析的数据集clusterGeneration：

library(clusterGeneration)
set.seed(1)    
tmp1 <- genRandomClust(numClust=3, sepVal=0.3, numNonNoisy=5,
        numNoisy=3, numOutlier=5, numReplicate=2, fileName="chk1")
data <- tmp1$datList[[2]]

Run Code Online (Sandbox Code Playgroud)

之后，我使用程序包通过引导进行了群集分析并评估了对群集节点的支持pvclust：

set.seed(2)    
y <- pvclust(data=data,method.hclust="average",method.dist="correlation",nboot=100)
plot(y)

Run Code Online (Sandbox Code Playgroud)

这是集群值和引导值：集群值和自举值

为了制作Newick文件，我使用了ape软件包：

library(ape)
yy<-as.phylo(y$hclust)
write.tree(yy,digits=2)

Run Code Online (Sandbox Code Playgroud)

write.tree 函数将以Newick格式打印树：

（（x2：0.45，x6：0.45）：0.043，（（x7：0.26，（x4：0.14，（x1：0.14，x3：0.14）：0.0064）：0.12）：0.22，（x5：0.28，x8：0.28 ）：0.2）：0.011）;

这些数字表示分支长度（群集的边缘长度）。按照iTOL帮助页面上的说明（“上传和使用您自己的树”部分），我将自举值手动添加到了Newick文件中（下面的附加值）：

（（x2：0.45，x6：0.45）74：0.043，（（（x7：0.26，（x4：0.14，（x1：0.14，x3：0.14）55：0.0064）68：0.12）100：0.22，（x5：0.28），x8：0.28）100：0.2）63：0.011）;

当我将字符串上传到iTOL时，效果很好。但是，我有一个庞大的集群，手工完成似乎很乏味...

问题：可以执行而不是手动键入的代码是什么？

引导值可以通过以下方式获得：

(round(y$edges,2)*100)[,1:2]

Run Code Online (Sandbox Code Playgroud)

可以通过以下方式获得用于形成Newick文件的分支长度：

yy$edge.length

Run Code Online (Sandbox Code Playgroud)

我试图弄清楚write.tree功能在调试后如何工作。但是，我注意到它在内部调用函数，.write.tree2而且我不明白如何有效地更改原始代码并在Newick文件中的适当位置获取自举值。

任何建议都欢迎。

Answer 1

pla*_*pus 5

这是为您提供的一种解决方案：类的对象phylo具有一个可用的插槽，称为node.label该插槽，可以适当地为您提供节点的标签。您可以使用它来存储引导值。如您在以下代码中所见，将在适当位置的Newick文件中写入.write.tree2：

> .write.tree2
function (phy, digits = 10, tree.prefix = "") 
{
    brl <- !is.null(phy$edge.length)
    nodelab <- !is.null(phy$node.label)

...

    if (is.null(phy$root.edge)) {
        cp(")")
        if (nodelab) 
            cp(phy$node.label[1])
        cp(";")
    }
    else {
        cp(")")
        if (nodelab) 
            cp(phy$node.label[1])
        cp(":")
        cp(sprintf(f.d, phy$root.edge))
        cp(";")
    }

...

Run Code Online (Sandbox Code Playgroud)

真正的困难是找到节点的正确顺序。我进行了搜索，但找不到找到后继正确顺序的方法 ....这意味着我们必须在从类hclust的对象到类的对象的转换中获取这些信息phylo。

幸运的是，如果您研究函数as.phylo.hclust，则有一个向量，其中包含与上一个hclust对象相对应的正确顺序的节点索引：

> as.phylo.hclust
function (x, ...) 
{
    N <- dim(x$merge)[1]
    edge <- matrix(0L, 2 * N, 2)
    edge.length <- numeric(2 * N)
    node <- integer(N)              #<-This one
...

Run Code Online (Sandbox Code Playgroud)

这意味着只要参数与对象中节点的顺序相同，我们就可以使用自己as.phylo.hclust的nodenames参数hclust（在您的示例中就是这种情况，因为在pvclust内部保持一致的顺序，即，hclust中节点的顺序为与您选择引导程序的表相同）：

# NB: in the following function definition I only modified the commented lines
as.phylo.hclust.with.nodenames <- function (x, nodenames, ...) #We add a nodenames argument
{
    N <- dim(x$merge)[1]
    edge <- matrix(0L, 2 * N, 2)
    edge.length <- numeric(2 * N)
    node <- integer(N)
    node[N] <- N + 2L
    cur.nod <- N + 3L
    j <- 1L
    for (i in N:1) {
        edge[j:(j + 1), 1] <- node[i]
        for (l in 1:2) {
            k <- j + l - 1L
            y <- x$merge[i, l]
            if (y > 0) {
                edge[k, 2] <- node[y] <- cur.nod
                cur.nod <- cur.nod + 1L
                edge.length[k] <- x$height[i] - x$height[y]
            }
            else {
                edge[k, 2] <- -y
                edge.length[k] <- x$height[i]
            }
        }
        j <- j + 2L
    }
    if (is.null(x$labels)) 
        x$labels <- as.character(1:(N + 1))
    node.lab <- nodenames[order(node)] #Here we define our node labels
    obj <- list(edge = edge, edge.length = edge.length/2, tip.label = x$labels, 
        Nnode = N, node.label = node.lab) #And you put them in the final object
    class(obj) <- "phylo"
    reorder(obj)
}

Run Code Online (Sandbox Code Playgroud)

最后，这是在案例研究中如何使用此新功能的方法：

bootstraps <- (round(y$edges,2)*100)[,1:2]
yy<-as.phylo.hclust.with.nodenames(y$hclust, nodenames=bootstraps[,2])
write.tree(yy,tree.names=TRUE,digits=2)
[1] "((x5:0.27,x8:0.27)100:0.24,((x7:0.25,(x4:0.14,(x1:0.13,x3:0.13)61:0.014)99:0.11)100:0.23,(x2:0.46,x6:0.46)56:0.022)61:0.027)100;"
#See the bootstraps    ^^^ here for instance
plot(yy,show.node.label=TRUE) #To show that the order is correct
plot(y) #To compare with (here I used the yellow value)

Run Code Online (Sandbox Code Playgroud)

在此处输入图片说明

归档时间：	12 年，1 月前
查看次数：	2329 次
最近记录：	11 年，10 月前