Boyer Moore 算法简单版本中的未定义行为

pap*_*uma 2 c undefined-behavior boyer-moore

我正在尝试编写没有前缀函数的博耶摩尔算法的更简单版本。它必须打印与模式进行比较的符号的所有位置。在本地它通过了测试,但是当我将其提交到gitlab时它失败了。我在这里无法发现未定义的行为。我最后有垃圾。

#include <stdio.h>

#define MAX_PATTERN_LEN 16
#define BUF_SIZE 69
#define ALPH_SIZE 128

int read_str(int max_len, unsigned char *str_place) {
    int count_read = 0;
    for (int i = 0, ch; i < max_len; i++) {
        if ((ch = getchar()) == '\n') {
            str_place[i] = '\0';
            break;
        }
        str_place[i] = (char)ch;
        count_read++;
    }
    return count_read;
}

void calculate_shifts(const unsigned char *str, int len_str, int *badchar) {
    for (int i = 0; i < ALPH_SIZE; i++)
        badchar[i] = len_str;
    for (int i = 0; i < len_str - 1; i++)
        badchar[str[i]] = len_str - 1 - i;
}

void search(const unsigned char *str, const unsigned char *patt, int len_patt, int len_str) {
    int badchar[ALPH_SIZE];
    calculate_shifts(patt, len_patt, badchar);
    int shift = 0;
    while (shift <= (len_str - len_patt)) {
        int j = len_patt - 1;
        for (; j >= 0 && patt[j] == str[shift + j]; j--)
            printf("%d ", shift + j + 1);
        if (j < 0) {
            shift += ((shift + len_patt) < len_str) ? badchar[patt[len_patt - 1]] : 1;
        } else {
            printf("%d ", shift + j + 1);
            int shift_addition = badchar[str[shift + j]];
            if ((shift_addition == len_patt) && (j < len_patt - 1) && (patt[len_patt - 1] == patt[0]))
                shift_addition--;
            shift += shift_addition;
        }
    }
}

int main(void) {
    unsigned char str[BUF_SIZE + 1];
    unsigned char patt[MAX_PATTERN_LEN + 1];
    int len_patt = read_str(MAX_PATTERN_LEN + 1, patt);
    int len_str = read_str(BUF_SIZE + 1, str);
    if (!len_patt || !len_str)
        return 0;
    search(str, patt, len_patt, len_str);
    return 0;
} 
Run Code Online (Sandbox Code Playgroud)

考试:

example
this is simple example
Run Code Online (Sandbox Code Playgroud)

正确的输出:

7 14 13 12 11 10 20 22 21 20 19 18 17 16
Run Code Online (Sandbox Code Playgroud)

实际输出:

7 14 13 12 11 10 20 22 21 20 19 18 17 16 28 ..
Run Code Online (Sandbox Code Playgroud)

pts*_*pts 5

差异(输出末尾的额外数字 28)可能是由于\n输入最后一行末尾缺少换行符 ( ) 造成的。我能够在本地重现您的两个输出(带和不带 28)。

$ gcc -g -O2 -W -Wall -o t t.c
$ printf 'example\nthis is simple example; \n' | ./t; echo
7 14 13 12 11 10 20 22 21 20 19 18 17 16
$ printf 'example\nthis is simple example; ' | ./t; echo
7 14 13 12 11 10 20 22 21 20 19 18 17 16 28 
Run Code Online (Sandbox Code Playgroud)

TL;DR(ch = getchar()) == '\n'通过更改为(ch = getchar()) == EOF || ch == '\n'和 更改#define ALPH_SIZE 128为 来修复代码中的错误#define ALPH_SIZE 256

可能还有别的bug,我没检查过。如果您遇到任何其他错误,请在 StackOverflow 上的单独问题中提出。


这个答案的其余部分解释了我如何诊断这个错误。

gcc -fsanitize=address如果换行符丢失,AddressSanitizer )会显示内存访问问题:

$ gcc -g -O2 -W -Wall -fsanitize=address -o t t.c
$ printf 'example\nthis is simple example; \n' | ./t; echo
7 14 13 12 11 10 20 22 21 20 19 18 17 16 
$ printf 'example\nthis is simple example; ' | ./t; echo
ASAN:DEADLYSIGNAL
=================================================================
==19294==ERROR: AddressSanitizer: SEGV on unknown address 0x7ffe25fbe264 (pc 0x563d6542b204 bp 0x7ffe25fbe264 sp 0x7ffe9a1bbc70 T0)
==19294==The signal is caused by a READ memory access.
    #0 0x563d6542b203 in search /tmp/t.c:33
    #1 0x563d6542acb1 in main /tmp/t.c:54
    #2 0x7f617ff96c86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
    #3 0x563d6542ad99 in _start (/tmp/t+0xd99)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /tmp/t.c:33 in search
==19294==ABORTING
Run Code Online (Sandbox Code Playgroud)

请注意,我必须多次重新运行它才能触发 AddressSanitizer 错误消息。

大多数情况下,这是由程序中的错误引起的,并且通常是未定义的行为。in search /tmp/t.c:33AddressSanitizer 输出中的子句指示源代码中发生错误读取的位置(即第 33 行, function )search

我在您的代码中添加了一些额外的printfs 来查看第 33 行中发生的情况。输出为:

$ printf 'example\nthis is simple example; ' | ./t; echo
patt[6] == 101
str[6] == 115
patt[6] == 101
str[13] == 101
patt[5] == 108
str[12] == 108
patt[4] == 112
str[11] == 112
patt[3] == 109
str[10] == 109
patt[2] == 97
str[9] == 105
patt[6] == 101
str[19] == 112
patt[6] == 101
str[21] == 101
patt[5] == 108
str[20] == 108
patt[4] == 112
str[19] == 112
patt[3] == 109
str[18] == 109
patt[2] == 97
str[17] == 97
patt[1] == 120
str[16] == 120
patt[0] == 101
str[15] == 101
patt[6] == 101
str[27] == 255
patt[6] == 101
str[-369011759] == ??
ASAN:DEADLYSIGNAL
=================================================================
==19906==ERROR: AddressSanitizer: SEGV on unknown address 0x7ffc6e5c0c71 (pc 0x5621e2363094 bp 0x0000ea0153d1 sp 0x7ffc845ab540 T0)
==19906==The signal is caused by a READ memory access.
    #0 0x5621e2363093 in gett /tmp/t.c:33
    #1 0x5621e236349b in search /tmp/t.c:42
    #2 0x5621e2362e61 in main /tmp/t.c:63
    #3 0x7f7063ee8c86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
    #4 0x5621e2362f49 in _start (/tmp/t+0xf49)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /tmp/t.c:33 in gett
==19906==ABORTING
Run Code Online (Sandbox Code Playgroud)

罪魁祸首是最后一次读取访问:str[-369011759] == ??,这表明shift + j变为-369011759,这显然是不正确的,因为正确的值是小的非负整数。

请注意,我必须多次重新运行它才能触发 Address Sanitizer 错误消息,并且数字 -369011759 每次都不同。

它看起来j有一个合理的值 6。要找出实际的错误,下一步是检查从哪里shift得到它的大负值。

此时输出线str[27] == 255对我来说是可疑的。我检查了你的代码是否正确处理这么大的值str,然后我发现了一个错误。

修复这个新发现的错误的一种方法是ALPH_SIZE从更改128256. (我已经验证它可以使地址消毒器错误消失。)原因如下。第 39 行包含 的读取badchar[str[shift + j]]。为了使其有效,我们需要0 <= str[shift + j] && str[shift + j] < ALPH_SIZE,因为ALPH_SIZE是数组的大小badchar。然而, 的最大值str[...]是 255(因为它是一个无符号字符),因此我们需要ALPH_SIZE >= 256.

事实上,越界数组访问(读或写)在 C 中是未定义的行为。为了证明这确实发生,我将第 39 行更改为

        printf("%d ", shift + j + 1); fprintf(stderr, "RBC %d == %d\n", shift + j, str[shift + j]);
Run Code Online (Sandbox Code Playgroud)

,我还更改了第 28 行以使用 malloc:

    int *badchar = malloc(sizeof(int) * ALPH_SIZE);
Run Code Online (Sandbox Code Playgroud)

,我free(badchar)后来也补充了。凭借我所拥有的一切:

$ gcc -g -O2 -W -Wall -include stdlib.h -fsanitize=address -o t t.c
$ printf 'example\nthis is simple example; \n' | ./t; echo
RBC 6 == 115
RBC 9 == 105
RBC 19 == 112
7 14 13 12 11 10 20 22 21 20 19 18 17 16 
$ printf 'example\nthis is simple example; ' | ./t; echo
RBC 6 == 115
RBC 9 == 105
RBC 19 == 112
RBC 27 == 255
=================================================================
==23463==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61500000047c at pc 0x563915b3754a bp 0x7fff5f340710 sp 0x7fff5f340700
READ of size 4 at 0x61500000047c thread T0
    #0 0x563915b37549 in search /tmp/t.c:39
    #1 0x563915b36dc1 in main /tmp/t.c:54
    #2 0x7f1a63d6dc86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
    #3 0x563915b36ea9 in _start (/tmp/t+0xea9)

Address 0x61500000047c is a wild pointer.
SUMMARY: AddressSanitizer: heap-buffer-overflow /tmp/t.c:39 in search
Shadow bytes around the buggy address:
  0x0c2a7fff8030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c2a7fff8040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c2a7fff8050: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2a7fff8060: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2a7fff8070: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c2a7fff8080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa[fa]
  0x0c2a7fff8090: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2a7fff80a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2a7fff80b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2a7fff80c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c2a7fff80d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==23463==ABORTING
Run Code Online (Sandbox Code Playgroud)

事实上,AddressSanitizer 的输出表明了第 39 行中的错误,并且额外调用的输出fprintf确认了它是越界数组访问:读取数组的第 255 个元素badchar,该数组只有 128 个元素。

我们最终如何str[27] == 255(因此阅读badchar[255])?这很简单:如果读取 时没有\n在文件末尾终止str,则getchar()返回 EOF,当分配给unsigned char. (如上所述,明显的修复方法是在 EOF 处停止,并更改ALPH_SIZE为 256。)

badchar为什么在更改为使用(报告为堆缓冲区溢出)之前,AddressSanitizer 没有在第 39 行报告读取越界数组索引问题( malloc作为堆栈缓冲区溢出)?我还不太明白这一点。AddressSanitizer 应该能够可靠地报告堆栈缓冲区溢出。可能这取决于 GCC 和 Clang 版本,因为更改命令gcc可以clang使堆栈缓冲区溢出错误可靠地出现。升级到最新的 GCC 和 Clang 可能会有所帮助。Valgrind 在检测堆栈缓冲区溢出读取方面不如 AddressSanitizer 可靠。