Parsing a string into an array by either spaces or "double-quoted strings"

Air*_*lla 3 c parsing split strtok

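I'm trying to take a user input string and parse it into an array declared as char *entire_line[100]; where each word goes at a different index of the array, but if part of the string is enclosed in quotes, that part should go in a single index. So, if I have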

char buffer[1024]={0,};
fgets(buffer, 1024, stdin);

Example input: word filename.txt "this is a string that should take up one index in the output array"

tokenizer=strtok(buffer," ");//break up by spaces
        do{
            if(strchr(tokenizer,'"')){//check is a word starts with a "
            is_string=YES;
            entire_line[i]=tokenizer;// if so, put that word into current index
            tokenizer=strtok(NULL,"\""); //should get rest of string until end "
            strcat(entire_line[i],tokenizer); //append the two together, ill take care of the missing space once i figure out this issue

              }  
        entire_line[i]=tokenizer;
        i++;
        }while((tokenizer=strtok(NULL," \n"))!=NULL);

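This obviously doesn't work: it only works when the double-quoted string is at the end of the input string, but I could also have input like: word "this is text that will be entered by the user" filename.txt. I've been trying to figure this out for a while and always get stuck somewhere. Thanks.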

tor*_*rek 9

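The strtok function is a terrible way to tokenize in C, except for one (admittedly common) case: simple whitespace-separated words. (Even then it's still not great, because it is neither re-entrant nor usable recursively, which is why we invented strsep for BSD.)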
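To illustrate the one case where strtok is adequate, and why it falls short for the input in the question, here is a minimal sketch. The helper name split_on_whitespace and its parameters are made up for this example and are not part of the original answer.

```c
#include <string.h>

/* Hypothetical helper: split `line` in place on spaces, tabs and newlines
 * using strtok, storing up to `max` token pointers in `out`.
 * Returns the number of tokens stored. */
size_t split_on_whitespace(char *line, char *out[], size_t max)
{
    size_t n = 0;
    char *tok;

    for (tok = strtok(line, " \t\n"); tok != NULL && n < max;
         tok = strtok(NULL, " \t\n")) {
        out[n++] = tok;   /* each pointer refers into `line` itself */
    }
    return n;
}
```

Given the line `word "two words" file.txt`, this yields four tokens: `word`, `"two`, `words"` and `file.txt`. The quoted phrase is split at its inner space, because strtok has no notion of quoting.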

In this case, your best bet is to build your own simple state machine:

char *p, *start_of_word;
int c;
enum states { DULL, IN_WORD, IN_STRING } state = DULL;

for (p = buffer; *p != '\0'; p++) {
    c = (unsigned char) *p; /* convert to unsigned char for is* functions */
    switch (state) {
    case DULL: /* not in a word, not in a double quoted string */
        if (isspace(c)) {
            /* still not in a word, so ignore this char */
            continue;
        }
        /* not a space -- if it's a double quote we go to IN_STRING, else to IN_WORD */
        if (c == '"') {
            state = IN_STRING;
            start_of_word = p + 1; /* word starts at *next* char, not this one */
            continue;
        }
        state = IN_WORD;
        start_of_word = p; /* word starts here */
        continue;

    case IN_STRING:
        /* we're in a double quoted string, so keep going until we hit a close " */
        if (c == '"') {
            /* word goes from start_of_word to p-1 */
            /* ... do something with the word ... */
            state = DULL; /* back to "not in word, not in string" state */
        }
        continue; /* either still IN_STRING or we handled the end above */

    case IN_WORD:
        /* we're in a word, so keep going until we get to a space */
        if (isspace(c)) {
            /* word goes from start_of_word to p-1 */
            /* ... do something with the word ... */
            state = DULL; /* back to "not in word, not in string" state */
        }
        continue; /* either still IN_WORD or we handled the end above */
    }
}

Note that this does not account for the possibility of a double quote inside a word, e.g.:

"some text in quotes" plus four simple words p"lus something strange"

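With the state machine above, you will see that "some text in quotes" turns into a single token (with the double quotes dropped), but p"lus is also a single token (including the quote), something is a single token, and strange" is a token. Whether you want this, and how you want to handle it, is up to you. For more complex but thorough lexical tokenization, you may want to use a code-building tool like flex.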

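Also, when the for loop exits, if state is not DULL, you need to handle the final word (I left this out of the code above) and decide what to do if state is IN_STRING (meaning there was no closing double quote).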
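As one possible way to settle both open points (quotes inside a word, and input that ends inside a quoted part), here is a sketch of a shell-like variant. It is an assumption about what you might want, not the behaviour of the state machine above: a double quote toggles quoted mode anywhere in a token and is dropped from the result, and an unclosed quote is reported as an error. The names split_quoted, out and max are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical variant: a double quote toggles "quoted" mode anywhere in a
 * token and is itself dropped, so p"lus something" becomes the single token
 * `plus something`. Tokens are compacted into `buffer` in place.
 * Returns the number of tokens stored in `out` (at most `max`),
 * or -1 if the input ends while a quote is still open. */
int split_quoted(char *buffer, char *out[], int max)
{
    char *r = buffer;   /* read position */
    char *w = buffer;   /* write position, never ahead of r */
    int n = 0;

    while (n < max) {
        int in_quotes = 0;

        while (isspace((unsigned char) *r))
            r++;                        /* skip whitespace between tokens */
        if (*r == '\0')
            break;                      /* no more tokens */

        out[n++] = w;                   /* token text is written starting at w */
        while (*r != '\0' && (in_quotes || !isspace((unsigned char) *r))) {
            if (*r == '"')
                in_quotes = !in_quotes; /* toggle and drop the quote */
            else
                *w++ = *r;
            r++;
        }
        if (in_quotes)
            return -1;                  /* end of input inside a quoted part */
        if (*r != '\0')
            r++;                        /* step past the delimiter first */
        *w++ = '\0';                    /* then terminate the token */
    }
    return n;
}
```

With the example line above, this produces six tokens, the last being `plus something strange`. Returning -1 for an unclosed quote is only one choice; treating the rest of the line as a final token would be equally defensible.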


Hil*_*ill 7

The parsing code in Torek's answer is excellent, but it needs a little more work.

For my own purposes, I finished it into a complete C function.
Here I share my work, which is based on Torek's code:

#include <stdio.h>
#include <string.h>
#include <ctype.h>
size_t split(char *buffer, char *argv[], size_t argv_size)
{
    char *p, *start_of_word;
    int c;
    enum states { DULL, IN_WORD, IN_STRING } state = DULL;
    size_t argc = 0;

    for (p = buffer; argc < argv_size && *p != '\0'; p++) {
        c = (unsigned char) *p;
        switch (state) {
        case DULL:
            if (isspace(c)) {
                continue;
            }

            if (c == '"') {
                state = IN_STRING;
                start_of_word = p + 1; 
                continue;
            }
            state = IN_WORD;
            start_of_word = p;
            continue;

        case IN_STRING:
            if (c == '"') {
                *p = 0;
                argv[argc++] = start_of_word;
                state = DULL;
            }
            continue;

        case IN_WORD:
            if (isspace(c)) {
                *p = 0;
                argv[argc++] = start_of_word;
                state = DULL;
            }
            continue;
        }
    }

    if (state != DULL && argc < argv_size)
        argv[argc++] = start_of_word;

    return argc;
}
void test_split(const char *s)
{
    char buf[1024];
    size_t i, argc;
    char *argv[20];

    strcpy(buf, s);
    argc = split(buf, argv, 20);
    printf("input: '%s'\n", s);
    for (i = 0; i < argc; i++)
        printf("[%zu] '%s'\n", i, argv[i]); /* %zu is the size_t conversion */
}
int main(int ac, char *av[])
{
    test_split("\"some text in quotes\" plus four simple words p\"lus something strange\"");
    return 0;
}

See the program output:

input: '"some text in quotes" plus four simple words p"lus something strange"'
[0] 'some text in quotes'
[1] 'plus'
[2] 'four'
[3] 'simple'
[4] 'words'
[5] 'p"lus'
[6] 'something'
[7] 'strange"'