Implementing my own "strings" tool: missing sequences that GNU strings finds

Question

Phi*_*lip 0 c linux string binary shell

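I want to programmatically read the text/strings in a binary file.

The exact equivalent of what I am after is the strings shell command on Linux.

When I run the shell command strings -n 4 /bin/dd, it prints 818 lines of text.

How can I find all the strings in a binary the way the strings command does?

My code uses read instead of fgetc, and it has an extra print block for any text still collected when EOF is reached.

It finds 813 words in /bin/dd, but strings finds 818. What is the difference?

Another question: can you suggest performance improvements for this code? I suspect that calling read for one byte at a time is not the fastest approach.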

Latest updated code

#include <stdio.h>
#include <stdbool.h>
#include <unistd.h>
#include <fcntl.h>

bool isPrintable(unsigned char c)
{
    if(c >= 0x20 && c <= 0x7e || c == 0x09)
    {
        return true;
    }
    return false;
}

int main(int argc, char * argv [])
{
    char buffer[300];
    char *p = buffer;
    char ch;
    int fd;

    if(argc < 2)
    {
        fprintf(stderr, "Usage: %s file\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_RDONLY);
    if(0 <= fd)
    {
        while(1 == read(fd, &ch, 1))
        {
            if(isPrintable(ch) && (p - buffer < sizeof(buffer) - 3))
            {
                *p++ = ch;
            }
            else
            {
                if(p - buffer >= 4) // print collected text
                {
                    *p++ = '\n';
                    *p++ = '\0';
                    printf("%s", buffer);
                }
                p = buffer;
            }
        }
        if(p - buffer >= 4) // print the rest, if any
        {
            *p++ = '\n';
            *p++ = '\0';
            printf("%s", buffer);
        }
        close(fd);
    }
    else
    {
        printf("Could not open %s\n", argv[1]);
        return 1;
    }

    return 0;
}

Below is a performance comparison of mystrings and strings. strings finds more text in less time.

$ time ./mystrings /lib/i386-linux-gnu/libc-2.27.so | wc -l
11852
real    0m0,917s
user    0m0,271s
sys 0m0,629s

$ time strings /lib/i386-linux-gnu/libc-2.27.so | wc -l
12026
real    0m0,028s
user    0m0,027s
sys 0m0,000s

Even when I use fopen, fread and fclose, it is still not as fast:

$ time ./mystrings2 /lib/i386-linux-gnu/libc-2.27.so | wc -l
11852
real    0m0,084s
user    0m0,070s
sys 0m0,004s

I am also open to any suggestions for improving performance.
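On the performance side, read(fd, &ch, 1) issues one system call per byte, which is consistent with the large sys time in the first measurement. A minimal sketch of one way to avoid that is to read large chunks and scan them in memory. The names scan_fd, CHUNK_SIZE and MAX_MIN_LEN are hypothetical and chosen only for this illustration; this is not how GNU strings is implemented, and no speed is claimed for it.

```c
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK_SIZE  65536  /* assumption: any reasonably large chunk size works */
#define MAX_MIN_LEN 64     /* assumption: upper bound on the minimum length */

/* Same test as isPrintable in the question: printable ASCII or tab. */
static bool is_printable(unsigned char c)
{
    return (c >= 0x20 && c <= 0x7e) || c == 0x09;
}

/* Reads fd in large chunks and writes every run of at least min_len
 * printable bytes to out, one run per line.
 * Returns the number of runs written, or -1 on a bad argument or read error. */
long scan_fd(int fd, size_t min_len, FILE *out)
{
    static unsigned char chunk[CHUNK_SIZE];
    unsigned char pending[MAX_MIN_LEN]; /* first min_len bytes of the current run */
    size_t run = 0;                     /* length of the current run, capped at min_len */
    long found = 0;
    ssize_t n;

    if (min_len == 0 || min_len > MAX_MIN_LEN)
        return -1;

    while ((n = read(fd, chunk, sizeof chunk)) > 0) {
        for (ssize_t i = 0; i < n; i++) {
            unsigned char c = chunk[i];
            if (is_printable(c)) {
                if (run < min_len) {
                    pending[run++] = c;
                    if (run == min_len) {   /* run is now long enough to print */
                        fwrite(pending, 1, min_len, out);
                        found++;
                    }
                } else {
                    fputc(c, out);          /* run already started printing */
                }
            } else {
                if (run >= min_len)
                    fputc('\n', out);       /* terminate the printed run */
                run = 0;
            }
        }
    }
    if (run >= min_len)                     /* run that reaches end of file */
        fputc('\n', out);
    return n < 0 ? -1 : found;
}
```

Because the run state lives outside the chunk loop, a string that straddles two chunks is still reported as one line, and runs of any length are printed without a fixed-size line buffer. A caller would open the file as in the original main and call scan_fd(fd, 4, stdout).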

Answer 1

tha*_*guy 5

You have to include tab characters. Their hex code is 0x09.

You can fix it by adding that to the printable test:

if(c >= 0x20 && c <= 0x7e || c == 0x09)

Ten minutes earlier:

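Oh wow, I have no idea why this person's program finds 813 words in /bin/dd while strings finds 818. Why would anyone think I would?

However, I do have a compiler and a Unix system, so I can do some research to try to find out.

First I tried it on my system: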

$ ./yourprogram /bin/dd > yours && wc -l yours
807 yours

$ strings -n 4 /bin/dd > theirs && wc -l theirs
812 theirs

OK, different numbers, but there is still a difference. Then I looked at the diff:

$ diff -u yours theirs
--- yours       2018-07-17 15:13:27.188357492 -0700
+++ theirs      2018-07-17 15:13:56.905429280 -0700
@@ -182,7 +182,7 @@
 ATUH
 t9[]A\
 []A\
-[]A\
+8      []A\
 AUAT1
 []A\A]
 HiD$
@@ -210,7 +210,9 @@
 XZL;t$
 \$ I
 AUATI
+;'u    H
 []A\A]
+       v*H

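It's messy, but it shows that you find []A\ while strings finds 8, a gap, then []A\. Examining the file shows that the gap is a tab character. I can then create a test case: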

$ printf 'hello\tworld' > file

$ strings file
hello    world

$ ./yourprogram file
hello
world

So the program doesn't seem to recognize a tab, while strings does. Why doesn't the program consider it printable?
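The same test case can be reproduced in memory. The following is a minimal sketch, not the asker's program: count_runs and is_string_char are hypothetical helpers that count runs of at least min_len accepted bytes, with a flag that decides whether a tab (0x09) is accepted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Printable ASCII range 0x20..0x7e, optionally also accepting tab (0x09). */
static bool is_string_char(unsigned char c, bool allow_tab)
{
    return (c >= 0x20 && c <= 0x7e) || (allow_tab && c == 0x09);
}

/* Counts runs of at least min_len consecutive accepted bytes in data. */
size_t count_runs(const unsigned char *data, size_t len,
                  size_t min_len, bool allow_tab)
{
    size_t run = 0;    /* length of the run currently being scanned */
    size_t found = 0;  /* number of runs that reached min_len */

    for (size_t i = 0; i < len; i++) {
        if (is_string_char(data[i], allow_tab)) {
            run++;
        } else {
            if (run >= min_len)
                found++;
            run = 0;
        }
    }
    if (run >= min_len)  /* run that ends at the end of the buffer */
        found++;
    return found;
}
```

For the 11 bytes of "hello\tworld" with min_len 4, count_runs returns 2 when allow_tab is false (the tab splits the text into hello and world) and 1 when it is true, which matches the two outputs in the test case.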

I looked it up in man ascii:

Oct   Dec   Hex   Char
───────────────────────────────────────
011   9     09    HT  '\t' (horizontal tab)

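I compared this with what the code looks for. I could run it in a debugger or add printf statements to try to determine why it doesn't recognize 0x09, but I can already see that it requires a character to be at least 0x20 before it counts as printable.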

I updated isPrintable to add this as a special case:

    if(c >= 0x20 && c <= 0x7e || c == 0x09)

Then I recompiled and reran it:

$ ./yourprogram /bin/dd | wc -l
812

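Now that the counts match, I can post this as an answer and pretend that I used some Harry Potter mending charm or secret level-locked ability, instead of just research and testing.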
