Why is DFS slower in one tree and faster in the other?

Question

jsg*_*guy 30 c++ algorithm tree performance caching

UPDATE: It turns out there was a bug in the parser that generates the trees. More in the final edit.

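Let T be a binary tree such that every internal node has exactly two children. For this tree, we want to code a function that for every node v in T finds the number of nodes in the subtree defined by v.

Example

Input

Desired output

The numbers we want to compute are shown in red. The nodes of the tree will be stored in an array, let's call it TreeArray, following the preorder layout.

For the example above, TreeArray will contain the following objects: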

10, 11, 0, 12, 13, 2, 7, 3, 14, 1, 15, 16, 4, 8, 17, 18, 5, 9, 6

The nodes of the tree are described by the following struct:

struct tree_node{

    long long int id; //id of the node, randomly generated
    int numChildren; //number of children, it is 2 but for the leafs it's 0
    int size; //size of the subtree rooted at the current node,
    // what we want to compute

    int pos; //position in TreeArray where the node is stored
    int lpos; //position of the left child
    int rpos; //position of the right child

    tree_node(){
        id = -1;
        size = 1;
        pos = lpos = rpos = -1;
        numChildren = 0;
    }

};

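As a minimal sketch of what "preorder layout" means here, the following flattens a pointer-based tree into an array so that every node is stored before both of its subtrees. The names PtrNode, FlatNode and flattenPreorder are hypothetical and only mirror the pos/lpos/rpos fields of tree_node; this is not the parser used in the experiments.

```cpp
#include <vector>

// Hypothetical pointer-based node, used only for this illustration.
struct PtrNode {
    PtrNode* left = nullptr;
    PtrNode* right = nullptr;
};

// Hypothetical array node mirroring the pos/lpos/rpos fields of tree_node.
struct FlatNode {
    int numChildren = 0;
    int size = 1;
    int pos = -1, lpos = -1, rpos = -1;
};

// Appends the subtree rooted at v to out in preorder and returns the
// index at which v itself was stored.
int flattenPreorder(const PtrNode* v, std::vector<FlatNode>& out) {
    int pos = static_cast<int>(out.size());
    out.emplace_back();
    out[pos].pos = pos;
    if (v->left != nullptr && v->right != nullptr) {
        // The left subtree is written right after v, the right subtree
        // right after the whole left subtree.
        int l = flattenPreorder(v->left, out);
        int r = flattenPreorder(v->right, out);
        // Index again instead of keeping a reference: the vector may
        // have reallocated during the recursive calls.
        out[pos].numChildren = 2;
        out[pos].lpos = l;
        out[pos].rpos = r;
    }
    return pos;
}
```

With this layout the left child of an internal node at index i always sits at i + 1, while the right child sits after the entire left subtree.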

The function that computes all the size values is the following:

void testCache(int cur){

    if(treeArray[cur].numChildren == 0){
        treeArray[cur].size = 1;
        return;
    }

    testCache(treeArray[cur].lpos);
    testCache(treeArray[cur].rpos);

    treeArray[cur].size = treeArray[treeArray[cur].lpos].size + 
    treeArray[treeArray[cur].rpos].size + 1;

}

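For comparison, here is a sketch of a non-recursive alternative that relies on the preorder layout: both children are stored after their parent, so a single backward sweep over the array sees every child before its parent. Node and computeSizesReverse are hypothetical names. Note that this changes the memory access pattern, so it is not a drop-in replacement for the experiment in this question.

```cpp
#include <vector>

// Hypothetical node type with the fields the sweep needs.
struct Node {
    int numChildren = 0;
    int size = 1;
    int lpos = -1, rpos = -1;
};

// Computes all subtree sizes in one backward pass. Assumes the array is
// in preorder, i.e. lpos > i and rpos > i for every internal node i.
void computeSizesReverse(std::vector<Node>& tree) {
    for (int i = static_cast<int>(tree.size()) - 1; i >= 0; --i) {
        Node& v = tree[i];
        if (v.numChildren == 0) {
            v.size = 1;
        } else {
            v.size = tree[v.lpos].size + tree[v.rpos].size + 1;
        }
    }
}
```

Because there is no recursion, the depth of the tree does not matter for this variant.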

I would like to understand why this function is faster when T looks like this (almost like a left going chain):

and slower when T looks like this (almost like a right going chain):

The following experiments were run on Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz with 8 GB of RAM, L1 cache 256 KB, L2 cache 1 MB, L3 cache 6 MB.

Each dot in the graphs is the result of the following for loop (the parameters are defined by the axis):

for (int i = 0; i < 100; i++) {
        testCache(0);
}


n corresponds to the total number of nodes, and time is measured in seconds. As n grows, the function is clearly much faster when the tree looks like a left going chain, even though the number of nodes is exactly the same in both cases.
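The measurement of one data point can be sketched with std::chrono as below. This is an assumption-laden variant: timeRepeated is a hypothetical helper, and std::chrono::steady_clock measures wall-clock time, whereas the full program in this post uses clock_gettime with CLOCK_PROCESS_CPUTIME_ID (CPU time).

```cpp
#include <chrono>

// Runs work() the given number of times and returns the elapsed
// wall-clock time in seconds.
template <typename F>
double timeRepeated(F&& work, int repetitions) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i) {
        work();
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

A data point would then be obtained with something like timeRepeated([] { testCache(0); }, 100).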

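Now let's try to find where the bottleneck is. I used the PAPI library to count interesting hardware counters.

The first counter is the instructions: how many instructions do we actually spend? Is there any difference when the trees look different?

The difference is not significant. It looks like for large inputs the left going chain requires fewer instructions, but the difference is so small that I think it is safe to assume they both require the same number of instructions.

Seeing that we have stored the tree in a nice preorder layout inside treeArray, it makes sense to see what is happening in the cache. Unfortunately my computer does not provide any counters for the L1 cache, but I have L2 and L3.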

Let's look at the accesses to L2 cache. The accesses to L2 cache happen when we get a miss in L1 cache, so that is an indirect counter for L1 misses as well.

As we can see the right going tree requires fewer L1 misses, so it seems that it uses the cache efficiently.

Same for L2 misses: the right going tree seems to be more efficient. Still nothing indicates why the right going trees are so much slower. Let's look at L3.

In L3 things explode for the right going trees. So the problem seems to be in L3 cache. Unfortunately I could not explain the reason behind this behavior. Why do things get messed up in L3 cache for the right going trees?

Here is the entire code together with the experiment:

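#include <iostream>
#include <fstream>
#include <string>   // std::string
#include <cstdlib>  // atoi
#include <ctime>    // clock_gettime, timespec
#define BILLION  1000000000LL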

using namespace std;


/*
 *
 * Timing functions
 *
 */

timespec startT, endT;

void startTimer(){
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &startT);
}

double endTimer(){
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endT);
    return endT.tv_sec * BILLION + endT.tv_nsec - (startT.tv_sec * BILLION + startT.tv_nsec);
}

/*
 *
 * tree node
 *
 */

//this struct is used for creating the first tree after reading it from the external file, for this we need left and right child pointers

struct tree_node_temp{

    long long int id; //id of the node, randomly generated
    int numChildren; //number of children, it is 2 but for the leafs it's 0
    int size; //size of the subtree rooted at the current node
    tree_node_temp *leftChild;
    tree_node_temp *rightChild;

    tree_node_temp(){
        id = -1;
        size = 1;
        leftChild = nullptr;
        rightChild = nullptr;
        numChildren = 0;
    }

};

struct tree_node{

    long long int id; //id of the node, randomly generated
    int numChildren; //number of children, it is 2 but for the leafs it's 0
    int size; //size of the subtree rooted at the current node

    int pos; //position in TreeArray where the node is stored
    int lpos; //position of the left child
    int rpos; //position of the right child

    tree_node(){
        id = -1;
        size = 1; //a single node is a subtree of size 1, as in tree_node_temp
        pos = lpos = rpos = -1;
        numChildren = 0;
    }

};

/*
 *
 * Tree parser. The input is a file containing the tree in the newick format.
 *
 */

string treeNewickStr; //string storing the newick format of a tree that we read from a file
int treeCurSTRindex; //index to the current position we are in while reading the newick string
int treeNumLeafs; //number of leafs in current tree
tree_node ** treeArrayReferences; //stack of references to free node objects
tree_node *treeArray; //array of node objects
int treeStackReferencesTop; //the top index to the references stack
int curpos; //used to find pos,lpos and rpos when creating the pre order layout tree


//helper function for readNewick
tree_node_temp* readNewickHelper() {

    int i;
    if(treeCurSTRindex == treeNewickStr.size())
        return nullptr;

    tree_node_temp * leftChild;
    tree_node_temp * rightChild;

    if(treeNewickStr[treeCurSTRindex] == '('){
        //create a left child
        treeCurSTRindex++;
        leftChild = readNewickHelper();
    }

    if(treeNewickStr[treeCurSTRindex] == ','){
        //create a right child
        treeCurSTRindex++;
        rightChild = readNewickHelper();
    }

    if(treeNewickStr[treeCurSTRindex] == ')' || treeNewickStr[treeCurSTRindex] == ';'){
        treeCurSTRindex++;
        tree_node_temp * cur = new tree_node_temp();
        cur->numChildren = 2;
        cur->leftChild = leftChild;
        cur->rightChild = rightChild;
        cur->size = 1 + leftChild->size + rightChild->size;
        return cur;
    }

    //we are about to read a label, keep reading until we read a "," ")" or "(" (we assume that the newick string has the right format)
    i = 0;
    char treeLabel[20]; //buffer used for the label
    while(treeNewickStr[treeCurSTRindex]!=',' && treeNewickStr[treeCurSTRindex]!='(' && treeNewickStr[treeCurSTRindex]!=')'){
        treeLabel[i] = treeNewickStr[treeCurSTRindex];
        treeCurSTRindex++;
        i++;
    }

    treeLabel[i] = '\0';
    tree_node_temp * cur = new tree_node_temp();
    cur->numChildren = 0;
    cur->id = atoi(treeLabel)-1;
    treeNumLeafs++;

    return cur;
}

//create the pre order tree, curRoot in the first call points to the root of the first tree that was given to us by the parser
void treeInit(tree_node_temp * curRoot){

    tree_node * curFinalRoot = treeArrayReferences[curpos];

    curFinalRoot->pos = curpos;

    if(curRoot->numChildren == 0) {
        curFinalRoot->id = curRoot->id;
        return;
    }

    //add left child
    tree_node * cnode = treeArrayReferences[treeStackReferencesTop];
    curFinalRoot->lpos = curpos + 1;
    curpos = curpos + 1;
    treeStackReferencesTop++;
    cnode->id = curRoot->leftChild->id;
    treeInit(curRoot->leftChild);

    //add right child
    curFinalRoot->rpos = curpos + 1;
    curpos = curpos + 1;
    cnode = treeArrayReferences[treeStackReferencesTop];
    treeStackReferencesTop++;
    cnode->id = curRoot->rightChild->id;
    treeInit(curRoot->rightChild);

    curFinalRoot->id = curRoot->id;
    curFinalRoot->numChildren = 2;
    curFinalRoot->size = curRoot->size;

}

//the ids of the leafs are determined by the newick file, for the internal nodes we just incrementally assign ids in the order of the dfs traversal
void updateInternalNodeIDs(int cur){

    tree_node* curNode = treeArrayReferences[cur];

    if(curNode->numChildren == 0){
        return;
    }
    curNode->id = treeNumLeafs++;
    updateInternalNodeIDs(curNode->lpos);
    updateInternalNodeIDs(curNode->rpos);

}

//frees the memory of the first tree generated by the parser
void treeFreeMemory(tree_node_temp* cur){

    if(cur->numChildren == 0){
        delete cur;
        return;
    }
    treeFreeMemory(cur->leftChild);
    treeFreeMemory(cur->rightChild);

    delete cur;

}

//reads the tree stored in "file" under the newick format and creates it in the main memory. The output (what the function returns) is a pointer to the root of the tree.
//this tree is scattered anywhere in the memory.

tree_node* readNewick(string& file){

    treeCurSTRindex = -1;
    treeNewickStr = "";
    treeNumLeafs = 0;

    ifstream treeFin;

    treeFin.open(file, ios_base::in);
    //read the newick format of the tree and store it in a string
    treeFin>>treeNewickStr;
    //initialize index for reading the string
    treeCurSTRindex = 0;
    //create the tree in main memory
    tree_node_temp* root = readNewickHelper();

    //store the tree in an array following the pre order layout
    treeArray = new tree_node[root->size];
    treeArrayReferences = new tree_node*[root->size];
    int i;
    for(i=0;i<root->size;i++)
        treeArrayReferences[i] = &treeArray[i];
    treeStackReferencesTop = 0;

    tree_node* finalRoot = treeArrayReferences[treeStackReferencesTop];
    curpos = treeStackReferencesTop;
    treeStackReferencesTop++;
    finalRoot->id = root->id;
    treeInit(root);

    //update the internal node ids (the leaf ids are defined by the ids stored in the newick string)
    updateInternalNodeIDs(0);
    //close the file
    treeFin.close();

    //free the memory of initial tree
    treeFreeMemory(root);
    //return the pre order tree
    return finalRoot;

}

/*
 *
 *
 * DOT FORMAT OUTPUT --- BEGIN
 *
 *
 */

void treeBstPrintDotAux(tree_node* node, ofstream& treeFout) {

    if(node->numChildren == 0) return;

    treeFout<<"    "<<node->id<<" -> "<<treeArrayReferences[node->lpos]->id<<";\n";
    treeBstPrintDotAux(treeArrayReferences[node->lpos], treeFout);

    treeFout<<"    "<<node->id<<" -> "<<treeArrayReferences[node->rpos]->id<<";\n";
    treeBstPrintDotAux(treeArrayReferences[node->rpos], treeFout);

}

void treePrintDotHelper(tree_node* cur, ofstream& treeFout){
    treeFout<<"digraph BST {\n";
    treeFout<<"    node [fontname=\"Arial\"];\n";

    if(cur == nullptr){
        treeFout<<"\n";
    }
    else if(cur->numChildren == 0){
        treeFout<<"    "<<cur->id<<";\n";
    }
    else{
        treeBstPrintDotAux(cur, treeFout);
    }

    treeFout<<"}\n";
}

void treePrintDot(string& file, tree_node* root){

    ofstream treeFout;
    treeFout.open(file, ios_base::out);
    treePrintDotHelper(root, treeFout);
    treeFout.close();

}

/*
 *
 *
 * DOT FORMAT OUTPUT --- END
 *
 *
 */

/*
 * experiments
 *
 */

tree_node* T;
int n;

void testCache(int cur){

    if(treeArray[cur].numChildren == 0){
        treeArray[cur].size = 1;
        return;
    }

    testCache(treeArray[cur].lpos);
    testCache(treeArray[cur].rpos);

    treeArray[cur].size = treeArray[treeArray[cur].lpos].size + treeArray[treeArray[cur].rpos].size + 1;

}


int main(int argc, char* argv[]){

    string Tnewick = argv[1];
    T = readNewick(Tnewick);

    n = T->size;
    double tt;

    startTimer();
    for (int i = 0; i < 100; i++) {
        testCache(0);
    }

    tt = endTimer();
    cout << tt / BILLION << '\t' << T->size;
    cout<<endl;

    return 0;
}


Compile by typing g++ -O3 -std=c++11 file.cpp. Run by typing ./executable tree.txt. In tree.txt we store the tree in the Newick format.

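Here is a left going tree with 10^5 leafs

Here is a right going tree with 10^5 leafs

The running times I get: ~0.07 seconds for left trees and ~0.12 seconds for right trees.

I apologize for the long post, but given how narrow the problem seems to be, I couldn't find a better way to describe it.

Thank you in advance!

EDIT:

This is a follow-up edit after MrSmith42's answer. I understand that locality plays a very important role, but I am not sure I understand that this is the case here.

For the two example trees above, let's see how we access the memory over time.

For the left going tree:

For the right going tree:

To me it seems that in both cases we have local access patterns.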
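To inspect the access sequence for ideal chains without going through a parser, the two shapes can be built directly in preorder and the visited indices recorded. This is only a sketch: Node, makeLeftChain, makeRightChain, traceSizes and maxJump are hypothetical names, the recorded order is one plausible reading of the accesses testCache makes, and the largest jump between consecutive indices is a crude indicator, not a cache model.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct Node {
    int numChildren = 0;
    int size = 1;
    int lpos = -1, rpos = -1;
};

// Left going chain with leaves >= 2, in preorder: internal nodes occupy
// indices 0..leaves-2 and every right child is a leaf.
std::vector<Node> makeLeftChain(int leaves) {
    std::vector<Node> t(2 * leaves - 1);
    for (int i = 0; i + 1 < leaves; ++i) {
        t[i].numChildren = 2;
        t[i].lpos = i + 1;
        t[i].rpos = 2 * leaves - 2 - i;
    }
    return t;
}

// Right going chain with leaves >= 2, in preorder: internal nodes occupy
// the even indices and every left child is a leaf.
std::vector<Node> makeRightChain(int leaves) {
    std::vector<Node> t(2 * leaves - 1);
    for (int i = 0; i + 1 < leaves; ++i) {
        t[2 * i].numChildren = 2;
        t[2 * i].lpos = 2 * i + 1;
        t[2 * i].rpos = 2 * i + 2;
    }
    return t;
}

// Same computation as testCache, but also appends the index of every
// touched array element to trace.
void traceSizes(std::vector<Node>& t, int cur, std::vector<int>& trace) {
    trace.push_back(cur);
    if (t[cur].numChildren == 0) {
        t[cur].size = 1;
        return;
    }
    traceSizes(t, t[cur].lpos, trace);
    traceSizes(t, t[cur].rpos, trace);
    trace.push_back(t[cur].lpos);
    trace.push_back(t[cur].rpos);
    trace.push_back(cur);
    t[cur].size = t[t[cur].lpos].size + t[t[cur].rpos].size + 1;
}

// Largest distance between two consecutively accessed indices.
int maxJump(const std::vector<int>& trace) {
    int best = 0;
    for (std::size_t i = 1; i < trace.size(); ++i) {
        best = std::max(best, std::abs(trace[i] - trace[i - 1]));
    }
    return best;
}
```

For these ideal chains the right going tree never jumps by more than 2 positions, while the left going tree alternates between a low and a high region of the array while unwinding, with jumps of up to 2 * leaves - 2 positions. Both regions move slowly, so large jumps alone do not have to mean many cache misses.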

EDIT:

Here is a plot of the number of conditional branches:

Here is a plot of the number of branch mispredictions:

Here is a left going tree with 10^6 leafs

Here is a right going tree with 10^6 leafs

FINAL EDIT:

I would like to apologize for wasting everyone's time. The parser I was using had a parameter controlling how "left" or "right" going the generated tree should be. It was a floating-point number: close to 0 made the tree left going, close to 1 made it right going. However, to make the tree look like a chain the value had to be extremely close to 0 or 1, like 0.000000001 or 0.999999999. For small inputs the tree looked like a chain even for values like 0.0001. I thought this number was small enough that it would also give a chain for larger trees; however, as I will show, that isn't the case. If you use numbers like 0.000000001, the parser stops working due to floating point problems.

vadikrobot's answer showed that we have locality issues. Inspired by his experiment I decided to generalize the access pattern diagram above to see how it behaves not just in the example trees, but in any trees.

I modified vadikrobot's code to look like this:

void testCache(int cur, FILE *f) {

    if(treeArray[cur].numChildren == 0){
        fprintf(f, "%d\t", tim++);
        fprintf (f, "%d\n", cur);
        treeArray[cur].size = 1;
        return;
    }

    fprintf(f, "%d\t", tim++);
    fprintf (f, "%d\n", cur);
    testCache(treeArray[cur].lpos, f);
    fprintf(f, "%d\t", tim++);
    fprintf (f, "%d\n", cur);
    testCache(treeArray[cur].rpos, f);
    fprintf(f, "%d\t", tim++);
    fprintf (f, "%d\n", cur);
    fprintf(f, "%d\t", tim++);
    fprintf (f, "%d\n", treeArray[cur].lpos);
    fprintf(f, "%d\t", tim++);
    fprintf (f, "%d\n", treeArray[cur].rpos);
    treeArray[cur].size = treeArray[treeArray[cur].lpos].size + 
    treeArray[treeArray[cur].rpos].size + 1;
}


Access patterns generated by the wrong parser

Let's look at a left tree with 10 leafs.

Looks very nice, as predicted in the diagrams above (I only forgot in the above diagrams the fact that when we find the size of a node, we also access the size parameter of that node, cur in the source code above).

Let's look at a left tree with 100 leafs.

Looks as expected. What about 1000 leafs?

This is definitely not expected. There is a small triangle in the top right corner. The reason is that the tree is not actually a left going chain: a small subtree hangs off somewhere near the end. The problem becomes even larger when the number of leafs is 10^4.
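A quick way to catch this kind of generator problem is to check the height of the tree before running any experiment. The sketch below uses hypothetical names (Node, treeHeight, hasChainHeight) and assumes a preorder array in which every internal node has exactly two children. A tree with L leafs then has 2L - 1 nodes, and its height can reach L - 1 only if every internal node has at least one leaf child, which is the chain-like case. The check does not tell left going chains from right going ones.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Node {
    int numChildren = 0;
    int lpos = -1, rpos = -1;
};

// Height in edges. A forward sweep suffices because, in preorder, a
// parent is always stored before its children.
int treeHeight(const std::vector<Node>& t) {
    std::vector<int> depth(t.size(), 0);
    int height = 0;
    for (std::size_t i = 0; i < t.size(); ++i) {
        if (t[i].numChildren == 0) continue;
        depth[t[i].lpos] = depth[i] + 1;
        depth[t[i].rpos] = depth[i] + 1;
        height = std::max(height, depth[i] + 1);
    }
    return height;
}

// True if the tree is as deep as a chain with the same number of leafs.
bool hasChainHeight(const std::vector<Node>& t) {
    int leaves = static_cast<int>((t.size() + 1) / 2);
    return treeHeight(t) == leaves - 1;
}
```

Any tree for which hasChainHeight returns false is shallower than a chain and therefore contains at least one internal node whose children are both internal.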

Let's look at what happens with right trees. When the leafs are 10:

Looks nice, what about 100 leafs?

Looks good too. This is why I questioned the locality of right trees: to me both seemed local, at least in theory. Now if you try increasing the size, something interesting happens:

For 1000 leafs:

For 10^4 leafs things get even messier:

Access patterns generated by the correct parser

Instead of using that general parser I created one for this specific question:

#include <iostream>
#include <fstream>
#include <cstdlib>  // atoi

using namespace std;

int main(int argc, char* argv[]){

    if(argc!=4){
        cout<<"type ./executable n{number of leafs} type{l:left going, r:right going} outputFile"<<endl;
        return 0;
    }

    int i;

    int n = atoi(argv[1]);

    if(n <= 2){cout<<"leafs must be at least 3"<<endl; return 0;}

    char c = argv[2][0];

    ofstream fout;
    fout.open(argv[3], ios_base::out);

    if(c == 'r'){

        for(i=0;i<n-1;i++){

            fout<<"("<<i<<",";

        }
        fout<<i;
        for(i=0;i<n-1;i++){ //close exactly the n-1 parentheses opened above
            fout<<")";
        }
        fout<<";"<<endl;

    }
    else{

        for(i=0;i<n-1;i++){
            fout<<"(";
        }

        fout<<1<<","<<n<<")";

        for(i=n-1;i>1;i--){
            fout<<","<<i<<")";
        }
        fout<<";"<<endl;

    }

    fout.close();


return 0;
}


Now the access patterns look as expected.

For left trees with 10^4 leafs:

In the black part we go from a low place to a high place, but the distance between the previous low and the current low is small, and the same holds for the previous high and the current high. Hence the cache must be smart enough to hold two blocks, one for the low places and one for the high places, giving a small number of cache misses.

For right trees with 10^4 leafs:

The original experiments again. This time I could only try up to 10^5 leafs, because as Mysticial noticed, we will get a stack overflow because of the height of the trees, which wasn't the case in the previous experiments since the height was smaller than the one expected.
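If the recursion depth is the limiting factor, the same traversal order can be kept with an explicit stack on the heap. This is a sketch under assumptions, not the code used for the plots: Node and computeSizesIterative are hypothetical names, and the extra bookkeeping will change the instruction and branch counts compared to the recursive testCache.

```cpp
#include <utility>
#include <vector>

struct Node {
    int numChildren = 0;
    int size = 1;
    int lpos = -1, rpos = -1;
};

// Post-order size computation with an explicit stack. Each entry holds a
// node index and a flag telling whether its children were already pushed.
void computeSizesIterative(std::vector<Node>& t, int root) {
    std::vector<std::pair<int, bool> > stack;
    stack.push_back(std::make_pair(root, false));
    while (!stack.empty()) {
        int cur = stack.back().first;
        bool expanded = stack.back().second;
        stack.pop_back();
        Node& v = t[cur];
        if (v.numChildren == 0) {
            v.size = 1;
        } else if (expanded) {
            // Both subtrees are finished, combine their sizes.
            v.size = t[v.lpos].size + t[v.rpos].size + 1;
        } else {
            // Revisit cur after its children; push right first so the
            // left child is processed first, as in the recursive DFS.
            stack.push_back(std::make_pair(cur, true));
            stack.push_back(std::make_pair(v.rpos, false));
            stack.push_back(std::make_pair(v.lpos, false));
        }
    }
}
```

The stack still grows with the height of the tree, but it lives in a std::vector, so chains with far more than 10^5 leafs no longer depend on the size of the call stack.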

Time wise they seem to perform the same, however cache and branch wise not. The right trees beat the left trees in branch predictions, the left trees beat the right trees in cache.

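Maybe my PAPI usage was wrong. Output from perf:

Left trees:

Right trees:

I might have messed something up again, and I apologize for that. I am including my attempts in case anyone wants to continue investigating.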

Answer 1

vad*_*bot 2

UPDATE:

I plotted the index of the accessed array element over time:

void testCache(int cur, FILE *f) {
   if(treeArray[cur].numChildren == 0){
       fprintf (f, "%d\n", cur);
       treeArray[cur].size = 1;
       return;
   }

   fprintf (f, "%d\n", cur);
   testCache(treeArray[cur].lpos, f);
   fprintf (f, "%d\n", cur);
   testCache(treeArray[cur].rpos, f);

   fprintf (f, "%d\n", treeArray[cur].lpos);
   fprintf (f, "%d\n", treeArray[cur].rpos);
   treeArray[cur].size = treeArray[treeArray[cur].lpos].size + treeArray[treeArray[cur].rpos].size + 1;
}


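As a result, I plotted 999990 elements of the resulting text file:

You can see that for the left tree all elements are accessed locally, but for the right tree the accesses are non-uniform.

OLD:

I tried to count the number of memory reads using valgrind. For the right one: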

valgrind --tool=callgrind --cache-sim=yes ./a.out right
==11493== I   refs:      427,444,674
==11493== I1  misses:          2,288
==11493== LLi misses:          2,068
==11493== I1  miss rate:        0.00%
==11493== LLi miss rate:        0.00%
==11493== 
==11493== D   refs:      213,159,341  (144,095,416 rd + 69,063,925 wr)
==11493== D1  misses:     15,401,346  ( 12,737,497 rd +  2,663,849 wr)
==11493== LLd misses:        329,337  (      7,935 rd +    321,402 wr)
==11493== D1  miss rate:         7.2% (        8.8%   +        3.9%  )
==11493== LLd miss rate:         0.2% (        0.0%   +        0.5%  )
==11493== 
==11493== LL refs:        15,403,634  ( 12,739,785 rd +  2,663,849 wr)
==11493== LL misses:         331,405  (     10,003 rd +    321,402 wr)
==11493== LL miss rate:          0.1% (        0.0%   +        0.5%  )


For the left one:

valgrind --tool=callgrind --cache-sim=yes ./a.out left

==11496== I   refs:      418,204,722
==11496== I1  misses:          2,327
==11496== LLi misses:          2,099
==11496== I1  miss rate:        0.00%
==11496== LLi miss rate:        0.00%
==11496== 
==11496== D   refs:      204,114,971  (135,076,947 rd + 69,038,024 wr)
==11496== D1  misses:     19,470,268  ( 12,661,123 rd +  6,809,145 wr)
==11496== LLd misses:        306,948  (      7,935 rd +    299,013 wr)
==11496== D1  miss rate:         9.5% (        9.4%   +        9.9%  )
==11496== LLd miss rate:         0.2% (        0.0%   +        0.4%  )
==11496== 
==11496== LL refs:        19,472,595  ( 12,663,450 rd +  6,809,145 wr)
==11496== LL misses:         309,047  (     10,034 rd +    299,013 wr)
==11496== LL miss rate:          0.0% (        0.0%   +        0.4%  )


As you can see, the number of memory reads ('rd') in the 'right' case is larger than in the left one.
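The miss rates printed by valgrind are simply misses divided by references, so they can be recomputed from the raw counts when comparing runs. A small sketch, where missRatePercent is a hypothetical helper:

```cpp
#include <cstdint>

// Miss rate in percent, or 0 if there were no references at all.
double missRatePercent(std::uint64_t misses, std::uint64_t refs) {
    if (refs == 0) return 0.0;
    return 100.0 * static_cast<double>(misses) / static_cast<double>(refs);
}
```

With the counts above, the D1 miss rate is 15,401,346 / 213,159,341, about 7.2%, for the right run and 19,470,268 / 204,114,971, about 9.5%, for the left run, which matches the percentages valgrind reports. The left run therefore has fewer data references overall but more D1 misses.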

Archived:	9 years, 5 months ago
Views:	562
Last activity:	9 years, 5 months ago