迭代C++中的链表比使用类似内存访问的Go慢

ram*_*owl 10 c++ optimization go

在各种情况下,我观​​察到链表迭代在C++中始终比在Go中慢10-15%.我在Stack Overflow上解决这个谜团的第一次尝试就在这里.我编码的例子有问题,因为:

1)由于堆分配,内存访问是不可预测的,并且

2)因为没有实际工作,一些人的编译器正在优化主循环.

为解决这些问题,我有一个新程序,包含C++和Go中的实现.C++版本需要1.75秒,而Go版本需要1.48秒.这次,我在计时开始之前做了一个大的堆分配,并用它来操作一个对象池,我从该对象池中释放并获取链表的节点.这样,内存访问应该在两个实现之间完全类似.

希望这使得神秘更具可重复性!

C++:

#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <vector>
#include <boost/timer.hpp>

using namespace std;

struct Node {
    Node *next; // 8 bytes
    int age;   // 4 bytes
};

// Object pool, where every free slot points to the previous free slot
template<typename T, int n>
struct ObjPool
{
    typedef T*       pointer;
    typedef pointer* metapointer;

    ObjPool() :
        _top(NULL),
        _size(0)
    {
        pointer chunks = new T[n];
        for (int i=0; i < n; i++) {
            release(&chunks[i]);
        }
    }

    // Giver an available pointer to the object pool
    void release(pointer ptr)
    {
        // Store the current pointer at the given address
        *(reinterpret_cast<metapointer>(ptr)) = _top;

        // Advance the pointer
        _top = ptr;

        // Increment the size
        ++_size;
    }

    // Pop an available pointer off the object pool for program use
    pointer acquire(void)
    {
        if(_size == 0){throw std::out_of_range("");}

        // Pop the top of the stack
        pointer retval = _top;

        // Step back to the previous address
        _top = *(reinterpret_cast<metapointer>(_top));

        // Decrement the size
        --_size;

        // Return the next free address
        return retval;
    }

    unsigned int size(void) const {return _size;}

protected:
    pointer _top;

    // Number of free slots available
    unsigned int _size;
};

Node *nodes = nullptr;
ObjPool<Node, 1000> p;

void processAge(int age) {
    // If the object pool is full, pop off the head of the linked list and release
    // it from the pool
    if (p.size() == 0) {
        Node *head = nodes;
        nodes = nodes->next;
        p.release(head);
    }

    // Insert the new Node with given age in global linked list. The linked list is sorted by age, so this requires iterating through the nodes.
    Node *node = nodes;
    Node *prev = nullptr;
    while (true) {
        if (node == nullptr || age < node->age) {
            Node *newNode = p.acquire();
            newNode->age = age;
            newNode->next = node;

            if (prev == nullptr) {
                nodes = newNode;
            } else {
                prev->next = newNode;
            }

            return;
        }

        prev = node;
        node = node->next;
    }
}

int main() {
    Node x = {};
    std::cout << "Size of struct: " << sizeof(x) << "\n"; // 16 bytes

    boost::timer t;
    for (int i=0; i<1000000; i++) {
        processAge(i);
    }

    std::cout << t.elapsed() << "\n";
}
Run Code Online (Sandbox Code Playgroud)

走:

package main

import (
    "time"
    "fmt"
    "unsafe"
)

type Node struct {
    next *Node // 8 bytes
    age int32 // 4 bytes
}

// Every free slot points to the previous free slot
type NodePool struct {
    top *Node
    size int
}

func NewPool(n int) NodePool {
    p := NodePool{nil, 0}
    slots := make([]Node, n, n)
    for i := 0; i < n; i++ {
        p.Release(&slots[i])
    }

    return p
}

func (p *NodePool) Release(l *Node) {
    // Store the current top at the given address
    *((**Node)(unsafe.Pointer(l))) = p.top
    p.top = l
    p.size++
}

func (p *NodePool) Acquire() *Node {
    if p.size == 0 {
        fmt.Printf("Attempting to pop from empty pool!\n")
    }
    retval := p.top

    // Step back to the previous address in stack of addresses
    p.top = *((**Node)(unsafe.Pointer(p.top)))
    p.size--
    return retval
}

func processAge(age int32) {
    // If the object pool is full, pop off the head of the linked list and release
    // it from the pool
    if p.size == 0 {
        head := nodes
        nodes = nodes.next
        p.Release(head)
    }

    // Insert the new Node with given age in global linked list. The linked list is sorted by age, so this requires iterating through the nodes.
    node := nodes
    var prev *Node = nil
    for true {
        if node == nil || age < node.age {
            newNode := p.Acquire()
            newNode.age = age
            newNode.next = node

            if prev == nil {
                nodes = newNode
            } else {
                prev.next = newNode
            }
            return
        }

        prev = node
        node = node.next
    }
}

// Linked list of nodes, in ascending order by age
var nodes *Node = nil
var p NodePool = NewPool(1000)

func main() {
    x := Node{};
    fmt.Printf("Size of struct: %d\n", unsafe.Sizeof(x)) // 16 bytes

    start := time.Now()
    for i := 0; i < 1000000; i++ {
        processAge(int32(i))
    }

    fmt.Printf("Time elapsed: %s\n", time.Since(start))
}
Run Code Online (Sandbox Code Playgroud)

输出:

clang++ -std=c++11 -stdlib=libc++ minimalPool.cpp -O3; ./a.out
Size of struct: 16
1.7548

go run minimalPool.go
Size of struct: 16
Time elapsed: 1.487930629s
Run Code Online (Sandbox Code Playgroud)

aba*_*ert 13

您的两个程序之间的最大区别在于,您的Go代码会忽略错误(如果您很幸运,如果清空池,则会出现恐慌或段错误),而C++代码会通过异常传播错误.相比:

if p.size == 0 {
    fmt.Printf("Attempting to pop from empty pool!\n")
}
Run Code Online (Sandbox Code Playgroud)

if(_size == 0){throw std::out_of_range("");}
Run Code Online (Sandbox Code Playgroud)

至少有三种方法1,使比较公平:

  1. 可以更改C++代码以忽略错误,就像在Go中一样
  2. 将两个版本更改为panic/ abort出错.
  3. 更改Go版本以惯用处理错误2,就像在C++中一样.

那么,让我们做所有这些并比较结果3:

  • C++忽略错误:1.059329s wall,1.050000s user + 0.000000s system = 1.050000s CPU(99.1%)
  • C++中止错误:1.081585s wall,1.060000s user + 0.000000s system = 1.060000s CPU(98.0%)
  • 惊慌失措:时间流逝:1.152942427s
  • 忽略错误:已过去的时间:1.196426068s
  • 去惯用语错误处理:经过的时间:1.322005119s
  • C++异常:1.373458s wall,1.360000s user + 0.000000s system = 1.360000s CPU(99.0%)

所以:

  • 没有错误处理,C++比Go快.
  • 恐慌,Go变得更快,4但仍然没有C++快.
  • 通过惯用的错误处理,C++比Go慢得多.

为什么?在您的测试运行中实际上不会发生此异常,因此实际的错误处理代码永远不会以任何一种语言运行.但clang无法证明它不会发生.并且,由于您从未catch在任何地方发生异常,这意味着它必须为堆栈中的每个非删除帧发出异常处理程序和堆栈unwinders.所以,它做更多的工作,每个函数调用和返回,没有多少更多的工作,但那么你的函数是做了不必要的额外工作增加了那么一点实际的工作.


1.您还可以更改C++版本以执行C样式的错误处理,或使用Option类型,以及可能的其他可能性.

2.这当然需要更多的变化:您需要导入errors,改变的返回类型Acquire(*Node, error),改变的返回类型processAgeerror,改变所有的return报表,并添加至少两个if err != nil { … }检查.但这对Go来说应该是一件好事,对吗?

3.当我参与其中时,我用你的遗产取代boost::timerboost::auto_cpu_timer,所以我们现在看到挂钟时间(和Go一样)以及CPU时间.

我不会试图解释原因,因为我不明白.从快速浏览一下装配,它显然优化了一些检查,但我不明白为什么它不能优化那些相同的检查没有panic.