Air*_*lla 3 c parsing split strtok
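I am trying to take a user input string and parse it into an array declared as char *entire_line[100]; where each word goes into a separate index of the array, but any part of the string enclosed in quotes should go into a single index. So, if I have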
char buffer[1024]={0,};
fgets(buffer, 1024, stdin);
Example input: word filename.txt "this is a string that should take up one index in the output array"
tokenizer=strtok(buffer," ");//break up by spaces
do{
    if(strchr(tokenizer,'"')){//check is a word starts with a "
        is_string=YES;
        entire_line[i]=tokenizer;// if so, put that word into current index
        tokenizer=strtok(NULL,"\""); //should get rest of string until end "
        strcat(entire_line[i],tokenizer); //append the two together, ill take care of the missing space once i figure out this issue
    }
    entire_line[i]=tokenizer;
    i++;
}while((tokenizer=strtok(NULL," \n"))!=NULL);
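This obviously does not work in general. It only works if the double-quoted string is at the end of the input, but the input could also look like: word "this is text that would be user input" filename.txt. I have been trying to figure this out for a while and always get stuck somewhere. Thanks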
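The strtok function is a terrible way to tokenize in C, except for one (admittedly common) case: simple whitespace-separated words. (Even then it is still not great, because it is neither re-entrant nor usable recursively, which is why we invented strsep for BSD way back when.)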
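To make the re-entrancy point concrete, here is a minimal sketch of nested tokenizing. It uses the POSIX function strtok_r (a different re-entrant alternative from the strsep mentioned above), and the helper name split_fields and the ';' and ',' separators are made up for illustration. Each scan keeps its position in its own state variable, so the inner loop does not disturb the outer one.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes strtok_r on strict compilers */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: splits text into records on ';' and each record into
 * fields on ','. Stores pointers to the fields in fields[] and returns how
 * many were stored. The text is modified in place.
 */
size_t split_fields(char *text, char *fields[], size_t max_fields)
{
    char *outer_state, *inner_state;
    size_t n = 0;

    assert(text != NULL && fields != NULL);
    for (char *rec = strtok_r(text, ";", &outer_state);
         rec != NULL;
         rec = strtok_r(NULL, ";", &outer_state)) {
        /* The inner scan has its own state, so the outer scan survives it. */
        for (char *f = strtok_r(rec, ",", &inner_state);
             f != NULL && n < max_fields;
             f = strtok_r(NULL, ",", &inner_state)) {
            fields[n++] = f;
        }
    }
    return n;
}
```

Written with plain strtok, the inner calls would overwrite the single hidden scan position, and the outer loop would typically stop after the first record.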
In this case, your best bet is to build your own simple state machine:
/* needs <ctype.h> for isspace(); buffer holds the input line */
char *p, *start_of_word = NULL;
int c;
enum states { DULL, IN_WORD, IN_STRING } state = DULL;
for (p = buffer; *p != '\0'; p++) {
    c = (unsigned char) *p; /* convert to unsigned char for is* functions */
    switch (state) {
    case DULL: /* not in a word, not in a double quoted string */
        if (isspace(c)) {
            /* still not in a word, so ignore this char */
            continue;
        }
        /* not a space -- if it's a double quote we go to IN_STRING, else to IN_WORD */
        if (c == '"') {
            state = IN_STRING;
            start_of_word = p + 1; /* word starts at *next* char, not this one */
            continue;
        }
        state = IN_WORD;
        start_of_word = p; /* word starts here */
        continue;
    case IN_STRING:
        /* we're in a double quoted string, so keep going until we hit a close " */
        if (c == '"') {
            /* word goes from start_of_word to p-1 */
            /* ... do something with the word ... */
            state = DULL; /* back to "not in word, not in string" state */
        }
        continue; /* either still IN_STRING or we handled the end above */
    case IN_WORD:
        /* we're in a word, so keep going until we get to a space */
        if (isspace(c)) {
            /* word goes from start_of_word to p-1 */
            /* ... do something with the word ... */
            state = DULL; /* back to "not in word, not in string" state */
        }
        continue; /* either still IN_WORD or we handled the end above */
    }
}
Note that this does not account for the possibility of a double quote inside a word, e.g.:
"some text in quotes" plus four simple words p"lus something strange"
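With the state machine above, "some text in quotes" becomes a single token (without the double quotes), but p"lus is also a single token (including the quote), something is a single token, and strange" is a token. Whether you want this, and how you want to handle it, is up to you. For more complex but thorough lexical tokenization, you may want to use a code-building tool such as flex.
Also, when the for loop exits, if state is not DULL you need to handle the final word (I left this out of the code above), and decide what to do if state is IN_STRING (meaning there was no closing double quote).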
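If shell-like behaviour is what you want for embedded quotes, one possible approach is sketched below. This is an assumption about the desired semantics, not something prescribed above: a double quote anywhere in a token toggles a "quoted" mode and is itself dropped, so p"lus something strange" comes out as the single token plus something strange. The function name split_shell_like is hypothetical. Tokens are compacted in place, and an unterminated quote simply runs to the end of the input.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical variant: splits buffer in place into at most argv_size tokens.
 * Whitespace separates tokens unless it appears between double quotes; the
 * quotes themselves are removed. Returns the number of tokens stored.
 */
size_t split_shell_like(char *buffer, char *argv[], size_t argv_size)
{
    char *src = buffer;   /* next character to read */
    char *dst = buffer;   /* next position to write; lags behind or equals src */
    size_t argc = 0;

    assert(buffer != NULL && argv != NULL);
    while (argc < argv_size) {
        int in_quotes = 0;

        while (isspace((unsigned char) *src))
            src++;                      /* skip separators between tokens */
        if (*src == '\0')
            break;
        argv[argc++] = dst;
        while (*src != '\0' && (in_quotes || !isspace((unsigned char) *src))) {
            if (*src == '"')
                in_quotes = !in_quotes; /* toggle mode, drop the quote */
            else
                *dst++ = *src;          /* keep every other character */
            src++;
        }
        if (*src != '\0')
            src++;                      /* step past the separator first */
        *dst++ = '\0';                  /* then terminate the token */
    }
    return argc;
}
```

Because characters are only ever copied to an earlier or equal position, no extra buffer is needed; the trade-off is that the original input is destroyed.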
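Torek's bits of parsing code are excellent, but they need a little more work before they can be used.
For my own purposes, I finished it into a complete C function.
Here I share my work, which is based on Torek's code.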
#include <stdio.h>
#include <string.h>
#include <ctype.h>
/* Split buffer in place into at most argv_size tokens; returns the count. */
size_t split(char *buffer, char *argv[], size_t argv_size)
{
    char *p, *start_of_word = NULL;
    int c;
    enum states { DULL, IN_WORD, IN_STRING } state = DULL;
    size_t argc = 0;
    for (p = buffer; argc < argv_size && *p != '\0'; p++) {
        c = (unsigned char) *p;
        switch (state) {
        case DULL:
            if (isspace(c)) {
                continue;
            }
            if (c == '"') {
                state = IN_STRING;
                start_of_word = p + 1;
                continue;
            }
            state = IN_WORD;
            start_of_word = p;
            continue;
        case IN_STRING:
            if (c == '"') {
                *p = 0; /* overwrite the closing quote to terminate the token */
                argv[argc++] = start_of_word;
                state = DULL;
            }
            continue;
        case IN_WORD:
            if (isspace(c)) {
                *p = 0; /* overwrite the separator to terminate the token */
                argv[argc++] = start_of_word;
                state = DULL;
            }
            continue;
        }
    }
    /* input ended inside a word or an unterminated string: keep what we have */
    if (state != DULL && argc < argv_size)
        argv[argc++] = start_of_word;
    return argc;
}
void test_split(const char *s)
{
    char buf[1024];
    size_t i, argc;
    char *argv[20];
    strcpy(buf, s); /* s must be shorter than sizeof buf */
    argc = split(buf, argv, 20);
    printf("input: '%s'\n", s);
    for (i = 0; i < argc; i++)
        printf("[%zu] '%s'\n", i, argv[i]); /* %zu is the format for size_t */
}
int main(void)
{
    test_split("\"some text in quotes\" plus four simple words p\"lus something strange\"");
    return 0;
}
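See the program output:
input: '"some text in quotes" plus four simple words p"lus something strange"'
[0] 'some text in quotes'
[1] 'plus'
[2] 'four'
[3] 'simple'
[4] 'words'
[5] 'p"lus'
[6] 'something'
[7] 'strange"'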