我们能以多快的速度制作特定的tr？

Question

我们能以多快的速度制作特定的tr？

我不得不用另一个字符(我任意选择@)替换文件中的所有空字节,并且非常惊讶,tr '\00' '@'大约是速度的1/4 gzip:

$ pv < lawl | gzip > /dev/null
^C13MiB 0:00:04 [28.5MiB/s] [====>                             ] 17% ETA 0:00:18
$ pv < lawl | tr '\00' '@' > /dev/null
^C58MiB 0:00:08 [7.28MiB/s] [==>                               ]  9% ETA 0:01:20

Run Code Online (Sandbox Code Playgroud)

我的真实数据文件是3GB gzip压缩,耗时50分钟tr,我实际上需要在许多这样的文件上执行此操作,因此这不是一个完全学术问题.请注意,从磁盘读取(这里pv是一个相当快的SSD),或者,在任何一种情况下都不是瓶颈; 双方gzip并tr使用100%的CPU,并且cat速度要快得多:

$ pv < lawl | cat > /dev/null
 642MiB 0:00:00 [1.01GiB/s] [================================>] 100%

Run Code Online (Sandbox Code Playgroud)

这段代码:

#include <stdio.h>

int main() {
    int ch;
    while ((ch = getchar()) != EOF) {
        if (ch == '\00') {
            putchar('@');
        } else {
            putchar(ch);
        }
    }
}

Run Code Online (Sandbox Code Playgroud)

编译时clang -O3有点快:

$ pv < lawl | ./stupidtr > /dev/null
^C52MiB 0:00:06 [ 8.5MiB/s] [=>                                ]  8% ETA 0:01:0

Run Code Online (Sandbox Code Playgroud)

使用gcc -O4 -mtune=native -march=native(4.8.4)进行编译是可比的,可能会稍微快一些.添加-march=native到clang(Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn))会生成相同的二进制文件.

这可能只是因为替换的通用处理代码tr被常量替换,并且可以编译检查.LLVM IR(clang -S -O3 stupidtr.c)看起来很不错.

我想gzip必须更快,因为它正在做一些SIMD指令或其他东西.有可能达到gzip速度吗？

一些规范,如果它们是相关的:

该文件是CSV; 空字节只能出现在某个字段中,但其他一些字段是可变长度的,因此您不能随意搜索.大多数行在该字段中都有一个空字节.我想这意味着你可以进行Boyer-Moore搜索,\00,,如果有帮助的话.一旦找到一个空字节,它也保证不会有一百个字节左右的另一个字节.
一个典型的文件大约是20 GiB未压缩,但如果相关,则在磁盘上压缩bz2.
您可以根据需要进行并行化,但是gzip这样做可以并行化,因此不需要.我将在运行OSX的四核i7或运行Linux的两个vCPU云服务器上运行它.
我可能运行的两台机器都有16GB的RAM.

Answer 1

chq*_*lie 4

将各种答案的想法与一些额外的技巧相结合，这是一个优化版本：

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUFFER_SIZE  16384
#define REPLACE_CHAR  '@'

int main(void) {
    /* define buffer as uint64_t to force alignment */
    /* make it one slot longer to allow for loop guard */
    uint64_t buffer[BUFFER_SIZE/8 + 1];
    ssize_t size, chunk;
    uint64_t *p, *p_end;
    uint64_t rep8 = (uint8_t)REPLACE_CHAR * 0x0101010101010101ULL;

    while ((size = read(0, buffer, BUFFER_SIZE)) != 0) {
        if (size < 0) {
            if (errno == EINTR) continue;
            fprintf(stderr, "read error: %s\n", strerror(errno));
            return 1;
        }
        p = buffer;
        p_end = p + ((size + 7) >> 3);
        *p_end = 0ULL; /* force a 0 at the end */
        for (;; p++) {
#define LOWBITS   0x0101010101010101ULL
#define HIGHBITS  0x8080808080808080ULL
            uint64_t m = ((*p - LOWBITS) & ~*p & HIGHBITS);
            if (m != 0) {
                if (p >= p_end) break;
                m |= m >> 1;
                m |= m >> 2;
                m |= m >> 4;
                *p |= m & rep8;
            }
        }
        for (unsigned char *pc = (unsigned char *)buffer;
             (chunk = write(1, pc, (size_t)size)) != size;
             pc += chunk, size -= chunk) {
            if (chunk < 0) {
                if (errno == EINTR) continue;
                fprintf(stderr, "write error: %s\n", strerror(errno));
                return 2;
            }
        }
    }
    return 0;
}

Run Code Online (Sandbox Code Playgroud)

归档时间：	10 年，4 月前
查看次数：	130 次
最近记录：	9 年，8 月前