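I'm writing a lexer for a scripting language with ocamllex, but I've run into a conflict with my comment rule.
I want to allow command arguments to be unquoted as long as they contain only alphanumeric characters and slashes ("/"). For example: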
echo "quoted argument !@#%" /this/second/argument/is/unquoted
In addition, one of my requirements is support for C++-style comments introduced by "//":
//this is a comment
echo hello world
The problem arises with input like this:
echo foo//comment
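I'd like my lexer to produce a "foo" token while leaving the "//" untouched, so that it is consumed the next time I ask the lexer for a token. Is that possible? The reason is that the input buffer may not yet contain the end of the comment, and I'd rather return the "foo" token immediately than block unnecessarily while eagerly trying to consume the comment.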
Here is a small lexer that matches only echo, quoted and unquoted strings, and comments, and prints out the resulting tokens:
{
  type token = NEWLINE | ECHO | QUOTED of string | UNQUOTED of string | COMMENT of string
  exception Eof
  type state = CODE | LINE_COMMENT
  let state = ref CODE
}

let newline = '\n'
let alphanum = [ 'A'-'Z' 'a'-'z' '0'-'9' '_' ]
let comment_line = "//"([^ '\n' ]+)
let space = [ ' ' '\t' ]
let quoted = '"'([^ '"' ]+)'"'
let unquoted = ('/'?(alphanum+'/'?)+)

rule code = parse
    space+ { code lexbuf }
  | newline { code lexbuf }
  | "echo" { ECHO }
  | quoted { QUOTED (Lexing.lexeme lexbuf) }
  | "//" { line_comment "" lexbuf }
  | ('/'|alphanum+) { unquoted (Lexing.lexeme lexbuf) lexbuf }
  | eof { raise Eof }

and unquoted buff = parse
    newline { UNQUOTED buff }
  | "//" { state := LINE_COMMENT; if buff = "" then line_comment "" lexbuf else UNQUOTED buff }
  | ('/'|alphanum+) { unquoted (buff ^ Lexing.lexeme lexbuf) lexbuf }
  | space+ { UNQUOTED buff }
  | eof { raise Eof }

and line_comment buff = parse
    newline { state := CODE; COMMENT buff }
  | _ { line_comment (buff ^ Lexing.lexeme lexbuf) lexbuf }

{
  let lexer lb =
    match !state with
      CODE -> code lb
    | LINE_COMMENT -> line_comment "" lb

  let _ =
    try
      let lexbuf = Lexing.from_channel stdin in
      while true do
        let () =
          match lexer lexbuf with
            ECHO -> Printf.printf "ECHO\n"
          | QUOTED s -> Printf.printf "QUOTED(%s)\n" s
          | UNQUOTED s -> Printf.printf "UNQUOTED(%s)\n" s
          | COMMENT s -> Printf.printf "COMMENT(%s)\n" s
          | NEWLINE -> Printf.printf "\n"
        in flush stdout
      done
    with Eof -> exit 0
}
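This is a trick I use in one of my projects to work around the same limitation in ocamllex (compared to the original C lex program, which lets you match a pattern in "look-ahead mode"). Basically, it splits the ambiguous rules into their distinct stems and switches the lexer to a different sub-lexer accordingly. It also keeps track of the sub-lexer currently in use and of the next entry point.
In your case, the only states it needs to track are the default one (CODE) and comment mode (LINE_COMMENT). This could be extended to support other states if needed.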
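To make the mode-switching idea easier to see in isolation, here is a minimal hand-written sketch in plain OCaml rather than ocamllex. It is not the lexer above: the record type lexer and the functions make, is_word_char, starts_comment, read_comment and next_token are hypothetical names introduced only for this illustration, and it handles just unquoted words and "//" comments. As in the ocamllex version, the "//" is consumed when it ends a word, but the word is returned immediately and the stored state tells the next call to resume inside the comment.

```ocaml
(* Illustrative sketch only: a hand-written analogue of the state trick.
   All names below are hypothetical and not part of ocamllex. *)
type token = UNQUOTED of string | COMMENT of string
type state = CODE | LINE_COMMENT

type lexer = { src : string; mutable pos : int; mutable state : state }

let make src = { src; pos = 0; state = CODE }

(* Characters allowed in an unquoted argument. *)
let is_word_char = function
  | 'A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '_' | '/' -> true
  | _ -> false

(* True when the two characters starting at index [i] are "//". *)
let starts_comment lx i =
  i + 1 < String.length lx.src && lx.src.[i] = '/' && lx.src.[i + 1] = '/'

(* Read up to (not including) the next newline, then go back to CODE. *)
let read_comment lx =
  let n = String.length lx.src in
  let start = lx.pos in
  let stop =
    match String.index_from_opt lx.src start '\n' with
    | Some i -> i
    | None -> n
  in
  lx.pos <- min (stop + 1) n;
  lx.state <- CODE;
  Some (COMMENT (String.sub lx.src start (stop - start)))

(* Return the next token, or None at end of input. *)
let rec next_token lx =
  let n = String.length lx.src in
  match lx.state with
  | LINE_COMMENT -> read_comment lx
  | CODE ->
    if lx.pos >= n then None
    else if starts_comment lx lx.pos then begin
      (* A comment at the start of a token: switch mode and continue. *)
      lx.pos <- lx.pos + 2;
      lx.state <- LINE_COMMENT;
      next_token lx
    end
    else if is_word_char lx.src.[lx.pos] then begin
      let start = lx.pos in
      let i = ref start in
      while !i < n && is_word_char lx.src.[!i] && not (starts_comment lx !i) do
        incr i
      done;
      (* If the word was cut short by "//", consume it and remember the
         mode switch; the word itself is still returned right away. *)
      if starts_comment lx !i then begin
        lx.pos <- !i + 2;
        lx.state <- LINE_COMMENT
      end else lx.pos <- !i;
      Some (UNQUOTED (String.sub lx.src start (!i - start)))
    end
    else begin
      (* Skip whitespace and anything else this sketch does not handle. *)
      lx.pos <- lx.pos + 1;
      next_token lx
    end
```

With this sketch, calling next_token repeatedly on make "foo//comment\nbar" yields the unquoted word foo first, then the comment text, then bar. Supporting another mode would mean adding a constructor to state and a matching branch in next_token, which mirrors adding a new rule and a new case in the lexer function of the ocamllex version.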