优化算术编码器

Question

优化算术编码器

Yrl*_*lec 18 c++ optimization performance x86 assembly

我正在优化名为PackJPG的C++库的编码步骤

我用英特尔VTune描述了代码,发现当前的瓶颈是PackJPG使用的算术编码器中的以下功能:

void aricoder::encode( symbol* s )
{   
    // update steps, low count, high count
    unsigned int delta_plus_one = ((chigh - clow) + 1);
    cstep = delta_plus_one / s->scale;
    chigh = clow + ( cstep * s->high_count ) - 1;
    clow  = clow + ( cstep * s->low_count );

    // e3 scaling is performed for speed and to avoid underflows
    // if both, low and high are either in the lower half or in the higher half
    // one bit can be safely shifted out
    while ( ( clow >= CODER_LIMIT050 ) || ( chigh < CODER_LIMIT050 ) ) {
        if ( chigh < CODER_LIMIT050 ) { // this means both, high and low are below, and 0 can be safely shifted out
            // write 0 bit
            write_zero();
            // shift out remaing e3 bits
            write_nrbits_as_one();

        }
        else { // if the first wasn't the case, it's clow >= CODER_LIMIT050
            // write 1 bit
            write_one();
            clow  &= CODER_LIMIT050 - 1;
            chigh &= CODER_LIMIT050 - 1;
            // shift out remaing e3 bits

            write_nrbits_as_zeros();
        }
        clow  <<= 1;
        chigh = (chigh << 1) | 1;

    }

    // e3 scaling, to make sure that theres enough space between low and high
    while ( ( clow >= CODER_LIMIT025 ) && ( chigh < CODER_LIMIT075 ) ) {
        ++nrbits;
        clow  &= CODER_LIMIT025 - 1;
        chigh ^= CODER_LIMIT025 + CODER_LIMIT050;
        // clow  -= CODER_LIMIT025;
        // chigh -= CODER_LIMIT025;
        clow  <<= 1;
        chigh = (chigh << 1) | 1;

    }
}

Run Code Online (Sandbox Code Playgroud)

这个函数似乎借用了一些想法:http://paginas.fe.up.pt/~vinhoza/itpa/bodden-07-arithmetic-TR.pdf.我已经设法在某种程度上优化了功能(主要是通过加快写入位),但现在我被卡住了.

现在最大的瓶颈似乎是开始时的分裂.来自VTune的此屏幕截图显示了结果所需的时间以及创建的程序集(右侧的蓝色程序集对应于左侧所选源代码中的行).

s-> scale不一定是2的偶数幂,因此除法不能用模运算代替.

代码是使用MSVC(来自Visual Studio 2013)编译的,具有以下设置:

/GS /Qpar- /GL /analyze- /W3 /Gy- /Zc:wchar_t /Zi /Gm- /Ox /sdl /Fd"Release\vc120.pdb" /fp:precise /D "WIN32" /D "NDEBUG" /D "_WINDOWS" /D "_USRDLL" /D "PACKJPG_EXPORTS" /D "_CRT_SECURE_NO_WARNINGS" /D "BUILD_DLL" /D "_WINDLL" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /arch:IA32 /Gd /Oy- /Oi /MT /Fa"Release\" /EHsc /nologo /Fo"Release\" /Ot /Fp"Release\PackJPG.pch"

Run Code Online (Sandbox Code Playgroud)

关于如何进一步优化这一点的任何想法？

更新1 我现在已经尝试了所有建议,这是现在最快的版本:

void aricoder::encode( symbol* s )
{   
    unsigned int clow_copy = clow;
    unsigned int chigh_copy = chigh;
    // update steps, low count, high count
    unsigned int delta_plus_one = ((chigh_copy - clow_copy) + 1);
    unsigned register int cstep = delta_plus_one / s->scale;

    chigh_copy = clow_copy + (cstep * s->high_count) - 1;
    clow_copy = clow_copy + (cstep * s->low_count);

    // e3 scaling is performed for speed and to avoid underflows
    // if both, low and high are either in the lower half or in the higher half
    // one bit can be safely shifted out
    while ((clow_copy >= CODER_LIMIT050) || (chigh_copy < CODER_LIMIT050)) {
        if (chigh_copy < CODER_LIMIT050) {  // this means both, high and low are below, and 0 can be safely shifted out
            // write 0 bit
            write_zero();
            // shift out remaing e3 bits
            write_nrbits_as_one();

        }
        else { // if the first wasn't the case, it's clow >= CODER_LIMIT050
            // write 1 bit
            write_one();
            clow_copy &= CODER_LIMIT050 - 1;
            chigh_copy &= CODER_LIMIT050 - 1;
            // shift out remaing e3 bits

            write_nrbits_as_zeros();
        }
        clow_copy <<= 1;
        chigh_copy = (chigh_copy << 1) | 1;

    }

    // e3 scaling, to make sure that theres enough space between low and high
    while ((clow_copy >= CODER_LIMIT025) & (chigh_copy < CODER_LIMIT075)){
        ++nrbits;
        clow_copy &= CODER_LIMIT025 - 1;
        chigh_copy ^= CODER_LIMIT025 + CODER_LIMIT050;
        // clow  -= CODER_LIMIT025;
        // chigh -= CODER_LIMIT025;
        clow_copy <<= 1;
        chigh_copy = (chigh_copy << 1) | 1;

    }
    clow = clow_copy;
    chigh = chigh_copy;
}

Run Code Online (Sandbox Code Playgroud)

以下是此版本的更新VTune结果: 此新版本包括以下更改:

在最后一个while循环中使用&而不是&&来避免一个分支(该技巧在第一个循环中没有帮助).
将类字段复制到局部变量.

下面的建议可惜没没有提高性能:

用带有goto语句的开关替换第一个while循环.
使用定点算术进行除法(它创建了舍入误差).
在s->比例上进行切换并进行位移而不是除以2的偶数幂.

@example表明,除了该分区的操作数之一的内存访问速度之外,它不是分区.这似乎是正确的.根据VTune的说法,我们经常会遇到缓存丢失问题.关于如何解决这个问题的任何建议？

Answer 1

Man*_*mar 4

根据 VTune 的说法，我们经常在这里遇到缓存未命中的情况。关于如何解决这个问题有什么建议吗？

我们组织数据的方式直接影响数据局部性的性能，因此缓存机制的行为方式取决于此。因此，为了实现这一目标，我们的程序应该尝试尽可能进行线性内存访问，并且应该避免任何间接内存读/写（基于指针的数据结构）。这确实会受到缓存机制的喜爱，因为内存拥有 L1 缓存的概率会明显更高。

在查看代码和 VTune 报告时，看起来最重要的数据是传递给该特定函数的参数。该对象的各种数据成员正在该特定函数中使用（内存读取）。

void aricoder::encode( symbol* s )

Run Code Online (Sandbox Code Playgroud)

现在，有以下代码，程序正在访问该对象的数据成员：

s->scale
s->high_count
s->low_count

Run Code Online (Sandbox Code Playgroud)

从两个 VTune 报告中，我们可以验证所有三个内存访问都有不同的时序。这表明这些数据位于该特定对象的不同偏移处。当访问其中一个（s->high_count）时，它会从 L1 缓存中出去，因此需要更多时间，因为它必须将数据带入缓存。因此，s->low_count受益，因为它现在位于 L1 缓存中。从这些数据我可以想到以下几点：

将最常访问的数据成员放入对象内的热区。这意味着我们应该将所有这些成员放在对象的第一个/顶部。通过这种方式，我们将更有可能将我们的对象放入对象的第一个缓存行中。因此，我们应该尝试根据其数据成员访问来重新组织对象内存布局。我假设您没有处理该对象中的虚拟表，因为它们从缓存机制来看不太好。
您的整个程序可能以这样的方式组织：围绕这一点（即该函数的执行），L1 缓存已满，因此程序尝试从 L2 访问它，并且此转换，将会有更多CPU 周期（尖峰）。在这种情况下，我认为我们无能为力，因为这是机器的限制，从某种意义上说，我们过度扩展了边界并试图处理太低级别的东西。
您的对象s似乎是 POD 类型，因此会有线性访问。这很好，没有改进的余地。然而我们的分配方式可能会对缓存机制产生影响。如果每次都分配它，则在当前函数内执行时可能会产生影响。

除此之外，我认为我们还应该参考以下 SO 帖子，其中详细讨论了这些概念（数据缓存/指令缓存）。这些帖子也有很好的链接，其中有关于此的深入分析和信息。

什么是“缓存友好”代码？

如何用C++编写指令缓存友好的程序？

我建议，您应该尝试参考这些帖子。它们对于理解这些概念的内部结构非常有帮助，尽管它可能无法帮助您优化当前的代码。可能你的程序已经优化了，我们对此无能为力:)。

归档时间：	11 年，7 月前
查看次数：	1616 次
最近记录：	11 年，7 月前