How does the Windows Command Interpreter (CMD.EXE) parse scripts?

Question

Ben*_*oit 132 windows parsing cmd batch-file variable-expansion

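I ran into ss64.com, which provides good help on how to write batch scripts that the Windows Command Interpreter will run.

However, I have been unable to find a good explanation of the grammar of batch scripts, how things expand or do not expand, and how to escape things.

Here are sample questions that I have not been able to solve:

How is the quoting system managed? I made a TinyPerl script
(foreach $i (@ARGV) { print '*' . $i ; }), compiled it, and called it this way:
- my_script.exe "a ""b"" c" → output is *a "b*c
- my_script.exe """a b c""" → output is *"a*b*c"
How does the internal echo command work? What is expanded inside that command?
Why do I have to use for [...] %%I in script files, but for [...] %I in interactive sessions?
What are the escape characters, and in what context? How do I escape a percent sign? For example, how can I echo %PROCESSOR_ARCHITECTURE% literally? I found that echo.exe %""PROCESSOR_ARCHITECTURE% works; is there a better solution?
How do pairs of % match? Example:
- set b=a, echo %a %b% c% → %a a c%
- set a =b, echo %a %b% c% → bb c%
How do I ensure that a variable is passed to a command as a single argument if the variable contains double quotes?
How are variables stored when using the set command? For example, if I do set a=a" b and then echo.%a%, I obtain a" b. If, however, I use echo.exe from UnxUtils, I get a b. Why does %a% expand differently?

Thank you for your insights.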

Answer 1

jeb*_*jeb 179

I performed many experiments to investigate the grammar of batch scripts. I also investigated the differences between batch and command line mode.

The Batch Line Parser:

Processing a line of code in a batch file involves multiple phases.

Here is a brief overview of the various phases:

Phase 0) Read Line:

Phase 1) Percent Expansion:

Phase 1.5) Remove <CR>: Remove all Carriage Return (0x0D) characters

Phase 2) Process special characters, tokenize, and build a cached command block: This is a complex process that is affected by things such as quotes, special characters, token delimiters, and caret escapes.

Phase 3) Echo the parsed command(s) Only if the command block did not begin with @, and ECHO was ON at the start of the preceding step.

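Phase 4) FOR %X variable expansion: Only if a FOR command is active and the commands after DO are being processed.

Phase 5) Delayed Expansion: Only if delayed expansion is enabled

Phase 5.3) Pipe processing: Only if commands are on either side of a pipe

Phase 5.5) Execute Redirection:

Phase 6) CALL processing/Caret doubling: Only if the command token is CALL

Phase 7) Execute: The command is executed

Here are details for each phase:

Note that the phases described below are only a model of how the batch parser works. The actual cmd.exe internals may not reflect these phases. But this model is effective at predicting the behavior of batch scripts.

Phase 0) Read Line: Read a line of input.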

When reading a line to be parsed as a command, <Ctrl-Z> (0x1A) is read as <LF> (LineFeed 0x0A)
When GOTO or CALL reads lines while scanning for a :label, <Ctrl-Z> is treated as itself - it is not converted to <LF>

Phase 1) Percent Expansion:

A double %% is replaced by a single %
Expansion of argument variables (%*, %1, %2, etc.)
Expansion of %var%, if var does not exist replace it with nothing
For a complete explanation, read the first half of dbenham's answer in this same thread: Percent Phase
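
The three rules above can be sketched as a small Python model. This is only an illustration of the listed rules for batch mode, not cmd.exe's actual algorithm: the function name `percent_expand` is hypothetical, `%~` modifiers and substring syntax are left out, `%*` is approximated by joining the arguments with spaces, and dropping an unpaired `%` is an assumption that goes beyond the list.

```python
def percent_expand(line, args, env):
    """Simplified model of batch-mode percent expansion (phase 1)."""
    # Environment variable names are case-insensitive, so normalize the keys.
    env = {name.upper(): value for name, value in env.items()}
    out = []
    i, n = 0, len(line)
    while i < n:
        if line[i] != "%":
            out.append(line[i])
            i += 1
        elif line.startswith("%%", i):
            out.append("%")                  # "%%" collapses to one literal "%"
            i += 2
        elif line.startswith("%*", i):
            out.append(" ".join(args[1:]))   # simplification: real %* keeps the original delimiters
            i += 2
        elif i + 1 < n and line[i + 1] in "0123456789":
            index = int(line[i + 1])
            out.append(args[index] if index < len(args) else "")
            i += 2
        else:
            end = line.find("%", i + 1)
            if end == -1:
                i += 1                       # assumption: an unpaired "%" is dropped
            else:
                # %var%: an undefined variable expands to nothing
                out.append(env.get(line[i + 1:end].upper(), ""))
                i = end + 1
    return "".join(out)


print(percent_expand("echo %%PROCESSOR_ARCHITECTURE%% is %PROCESSOR_ARCHITECTURE%",
                     ["script.bat"], {"PROCESSOR_ARCHITECTURE": "AMD64"}))
# prints: echo %PROCESSOR_ARCHITECTURE% is AMD64
```

In this model, doubling the percent signs is what lets a batch file echo %PROCESSOR_ARCHITECTURE% literally.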

Phase 1.5) Remove <CR>: Remove all Carriage Returns (0x0D) from the line

Phase 2) Process special characters, tokenize, and build a cached command block: This is a complex process that is affected by things such as quotes, special characters, token delimiters, and caret escapes. What follows is an approximation of this process.

There are some concepts that are important throughout this phase.

A token is simply a string of characters that is treated as a unit.
Tokens are separated by token delimiters. The standard token delimiters are <space> <tab> ; , = <0x0B> <0x0C> and <0xFF>
Consecutive token delimiters are treated as one - there are no empty tokens between token delimiters
There are no token delimiters within a quoted string. The entire quoted string is always treated as part of a single token. A single token may consist of a combination of quoted strings and unquoted characters.
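
As a rough illustration of these token concepts, here is a minimal Python sketch. The name `tokenize` is hypothetical, `<0xFF>` is represented by the character `\xff`, and the sketch covers only the standard delimiters and the quote flag; carets, special characters, and command-specific handling are ignored.

```python
# Standard token delimiters: space, tab, ";", ",", "=", 0x0B, 0x0C and 0xFF
DELIMITERS = set(" \t;,=\x0b\x0c\xff")


def tokenize(text):
    """Split text into tokens; delimiters inside quotes do not split."""
    tokens, current = [], []
    in_quotes = False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes   # a quote toggles the quote flag
            current.append(ch)          # the quote stays part of the token
        elif ch in DELIMITERS and not in_quotes:
            if current:                 # consecutive delimiters yield no empty tokens
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


print(tokenize('copy,a.txt;;b.txt = "c d".txt'))
# prints: ['copy', 'a.txt', 'b.txt', '"c d".txt']
```

The last token shows a single token built from a quoted string plus unquoted characters.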

The following characters may have special meaning in this phase, depending on context: ^ ( @ & | < > <LF> <space> <tab> ; , = <0x0B> <0x0C> <0xFF>

Look at each character from left to right:

If it is a caret (^), the next character is escaped, and the escaping caret is removed. Escaped characters lose all special meaning (except for <LF>).
If it is a quote ("), toggle the quote flag. If the quote flag is active, then only " and <LF> are special. All other characters lose their special meaning until the next quote toggles the quote flag off. It is not possible to escape the closing quote. All quoted characters are always within the same token.
<LF> always turns off the quote flag. Other behaviors vary depending on context, but quotes never alter the behavior of <LF>.
- Escaped <LF>
  - <LF> is stripped
  - The next character is escaped. If at the end of line buffer, then the next line is read and appended to the current one before escaping the next character. If the next character is <LF>, then it is treated as a literal, meaning this process is not recursive.
- Unescaped <LF> not within parentheses
  - <LF> is stripped and parsing of the current line is terminated.
  - Any remaining characters in the line buffer are simply ignored.
- Unescaped <LF> within a FOR IN parenthesized block
  - <LF> is converted into a <space>
  - If at the end of the line buffer, then the next line is read and appended to the current one.
- Unescaped <LF> within a parenthesized command block
  - <LF> is converted into <LF><space>, and the <space> is treated as part of the next line of the command block.
  - If at the end of line buffer, then the next line is read and appended to the space.
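
The caret and quote rules can be modeled in a few lines of Python. This is a sketch under simplifying assumptions: `scan_special` is a hypothetical name, only the operators & | < > are tracked as "special", and the <LF> handling (including caret line continuation) is not modeled.

```python
def scan_special(line):
    """Return (text, active): the text after caret removal and the
    indexes in that text of & | < > characters that are still special."""
    out, active = [], []
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == '"':
            in_quotes = not in_quotes      # toggle the quote flag
            out.append(ch)
        elif ch == "^" and not in_quotes:
            # The caret is removed and the next character becomes a plain literal.
            if i + 1 < len(line):
                i += 1
                out.append(line[i])
        elif ch in "&|<>" and not in_quotes:
            active.append(len(out))        # unescaped, unquoted: still special
            out.append(ch)
        else:
            out.append(ch)                 # includes carets and operators inside quotes
        i += 1
    return "".join(out), active


text, active = scan_special('echo a^&b "c&d" & dir')
print(text)     # prints: echo a&b "c&d" & dir
print(active)   # prints: [15]
```

Only the last & survives as a command separator: the first is escaped by a caret and the second is protected by quotes.
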
If it is one of the special characters & | < or >, split the line at this point in order to handle pipes, command concatenation, and redirection.
- In the case of a pipe (|), each side is a separate command (or command block) that gets special handling in phase 5.3
- In the case of &, &&, or || command concatenation, each side of the concatenation is treated as a separate command.
- In the case of <, <<, >, or >> redirection, the redirection clause is parsed, temporarily removed, and then appended to the end of the current command. A redirection clause consists of an optional file handle digit, the redirection operator, and the redirection destination token.
  - If the token that precedes the redirection operator is a single digit, then the digit specifies the file handle to be redirected. If the handle token is not found, then output redirection defaults to 1 (stdout), and input redirection defaults to 0 (stdin).
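
To illustrate how a redirection clause is recognized, here is a hedged Python sketch. `split_redirection` is a hypothetical helper; it assumes the command contains no quotes or carets, treats whitespace as the only token delimiter, and simply returns the clauses instead of re-appending them to the command.

```python
import re

# Optional handle digit (only when it stands alone as a token),
# the redirection operator, then the destination token.
_REDIRECTION = re.compile(r"(?:(?<!\S)([0-9]))?(>>|<<|>|<)\s*(\S+)")


def split_redirection(command):
    """Return (remainder, clauses); each clause is (handle, operator, target)."""
    clauses = []

    def collect(match):
        digit, operator, target = match.groups()
        if digit is not None:
            handle = int(digit)
        else:
            # Defaults: 0 (stdin) for input, 1 (stdout) for output
            handle = 0 if operator.startswith("<") else 1
        clauses.append((handle, operator, target))
        return ""                       # the clause is removed from the command

    remainder = _REDIRECTION.sub(collect, command)
    return remainder, clauses


print(split_redirection("echo hi 2>err.txt >out.txt")[1])
# prints: [(2, '>', 'err.txt'), (1, '>', 'out.txt')]
print(split_redirection("echo a1>x"))
# prints: ('echo a1', [(1, '>', 'x')])
```

In the second call the 1 is part of the token a1, not a single-digit token, so the handle falls back to the default.
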
If the very first token for this command (prior to moving redirection to the end) begins with @, then the @ has special meaning. (@ is not special in any other context)
- The special @ is removed.
- If ECHO is ON, then this command, along with any following concatenated commands on this line, is excluded from the phase 3 echo. If the @ is before an opening (, then the entire parenthesized block is excluded from the phase 3 echo.
Process parentheses (provides for compound statements across multiple lines):
- If the parser is not looking for a command token, then ( is not special.
- If the parser is looking for a command token and finds (, then start a new compound statement and increment the parenthesis counter
- If the parenthesis counter is > 0 then ) terminates the compound statement and decrements the parenthesis counter.
- If the line end is reached and the parenthesis counter is > 0 then the next line will be appended to the compound statement (starts again with phase 0)
- If the parenthesis counter is 0 and the parser is looking for a command, then ) functions similar to a REM statement as long as it is immediately followed by a token delimiter, special character, newline, or end-of-file
  - All special characters lose their meaning except ^ (line concatenation is possible)
  - Once the end of the logical line is reached, the entire "command" is discarded.
Each command is parsed into a series of tokens. The first token is always treated as a command token (after special @ have been stripped and redirection moved to the end).
- Leading token delimiters prior to the command token are stripped
- When parsing the command token, ( functions as a command token delimiter, in addition to the standard token delimiters
- The handling of subsequent tokens depends on the command.
Most commands simply concatenate all arguments after the command token into a single argument token. All argument token delimiters are preserved. Argument options are typically not parsed until phase 7.
Three commands get special handling - IF, FOR, and REM
- IF is split into two or three distinct parts that are processed independently. A syntax error in the IF construction will result in a fatal syntax error.
  - The comparison operation is the actual command that flows all the way through to phase 7
    - All IF options are fully parsed in phase 2.
    - Consecutive token delimiters collapse into a single space.
    - Depending on the comparison operator, there will be one or two value tokens that are identified.
  - The True command block is the set of commands after the condition, and is parsed like any other command block. If ELSE is to be used, then the True block must be parenthesized.
  - The optional False command block is the set of commands after ELSE. Again, this command block is parsed normally.
  - The True and False command blocks do not automatically flow into the subsequent phases. Their subsequent processing is controlled by phase 7.
- FOR is split in two after the DO. A syntax error in the FOR construction will result in a fatal syntax error.
  - The portion through DO is the actual FOR iteration command that flows all the way through phase 7
    - All FOR options are fully parsed in phase 2.
    - The IN parenthesized clause treats <LF> as <space>. After the IN clause is parsed, all tokens are concatenated together to form a single token.
    - Consecutive unescaped/unquoted token delimiters collapse into a single space throughout the FOR command through DO.
  - The portion after DO is a command block that is parsed normally. Subsequent processing of the DO command block is controlled by the iteration in phase 7.
- REM detected in phase 2 is treated dramatically different than all other commands.
  - Only one argument token is parsed - the parser ignores characters after the first argument token.
  - If there is only one argument token that ends with an unescaped ^ that ends the line, then the argument token is thrown away, and the subsequent line is parsed and appended to the REM. This repeats until there is more than one token, or the last character is not ^.
  - The REM command may appear in phase 3 output, but the command is never executed, and the original argument text is echoed - escaping carets are not removed.
If the command token begins with :, and this is the first round of phase 2 (not a restart due to CALL in phase 6) then
- The token is normally treated as an Unexecuted Label.
  - The remainder of the line is parsed, however ), <, >, & and | no longer have special meaning. The entire remainder of the line is considered to be part of the label "command".
  - The ^ continues to be special, meaning that line continuation can be used to append the subsequent line to the label.
  - An Unexecuted Label within a parenthesized block will result in a fatal syntax error unless it is immediately followed by a command or Executed Label on the next line.
    - Note that ( no longer has special meaning for the first command that follows the Unexecuted Label in this context.
  - The command is aborted after label parsing is complete. Subsequent phases do not take place for the label
- There are three exceptions that can cause a label found in phase 2 to be treated as an Executed Label that continues parsing through phase 7.
  - There is redirection that precedes the label token, and there is a | pipe or &, &&, or || command concatenation on the line.
  - There is redirection that precedes the label token, and the command is within a parenthesized block.
  - The label token is the very first command on a line within a parenthesized block, and the line above ended with an Unexecuted Label.
- The following occurs when an Executed Label is discovered in phase 2
  - The label, its arguments, and its redirection are all excluded from any echo output in phase 3
  - Any subsequent concatenated commands on the line are fully parsed and executed.
- For more information about Executed Labels vs. Unexecuted Labels, see https://www.dostips.com/forum/viewtopic.php?f=3&t=3803&p=55405#p55405

Phase 3) Echo the parsed command(s) Only if the command block did not begin with @, and ECHO was ON at the start of the preceding step.

Phase 4) FOR %X variable expansion: Only if a FOR command is active and the commands after DO are being processed.

At this point, phase 1 of batch processing will have already converted a FOR variable like %%X into %X. The command line has different percent expansion rules for phase 1. This is the reason that command lines use %X but batch files use %%X for FOR variables.
FOR variable names are case sensitive, but ~modifiers are not case sensitive.
~modifiers take precedence over variable names. If a character following ~ is both a modifier and a valid FOR variable name, and there exists a subsequent character that is an active FOR variable name, then the character is interpreted as a modifier.
FOR variable names are global, but only within the context of a DO clause. If a routine is CALLed from within a FOR DO clause, then the FOR variables are not expanded within the CALLed routine. But if the routine has its own FOR command, then all currently defined FOR variables are accessible to the inner DO commands.
FOR variable names can be reused within nested FORs. The inner FOR value takes precedence, but once the INNER FOR closes, then the outer FOR value is restored.
If ECHO was ON at the start of this phase, then phase 3) is repeated to show the parsed DO commands after the FOR variables have been expanded.
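
The case sensitivity and nesting rules can be pictured with a small Python sketch. `expand_for_vars` and the list-of-dicts representation of nested FOR loops are assumptions made for illustration; ~modifiers are not modeled.

```python
def expand_for_vars(text, scopes):
    """Expand %X for active FOR variables.

    scopes lists one dict per nested FOR loop, outermost first,
    mapping a single-character variable name to its current value.
    """
    active = {}
    for scope in scopes:
        active.update(scope)            # an inner FOR overrides an outer one
    out = []
    i = 0
    while i < len(text):
        if text[i] == "%" and i + 1 < len(text) and text[i + 1] in active:
            out.append(active[text[i + 1]])   # names are matched case-sensitively
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


outer = {"a": "1", "b": "2"}
inner = {"a": "3"}
print(expand_for_vars("echo %a %A %b", [outer, inner]))
# prints: echo 3 %A 2
print(expand_for_vars("echo %a %A %b", [outer]))
# prints: echo 1 %A 2
```

The second call models the state after the inner FOR has closed: the outer value of %a is visible again.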

---- From this point onward, each command identified in phase 2 is processed separately.
---- Phases 5 through 7 are completed for one command before moving on to the next.

Phase 5) Delayed Expansion: Only if delayed expansion is on

If the command is within a parenthesized block on either side of a pipe, then skip this step.
Each token for a command is parsed for delayed expansion independently.
- Most commands parse two or more tokens - the command token, the arguments token, and each redirection destination token.
- The FOR command parses the IN clause token only.
- The IF command parses the comparison values only - either one or two, depending on the comparison operator.
For each parsed token, first check if it contains any !. If not, then the token is not parsed - important for ^ characters. If the token does contain !, then scan each character from left to right:
- If it is a caret (^) the next character has no special meaning, the caret itself is removed
- If it is an exclamation mark, search for the next exclamation mark (carets are not observed anymore), expand to the value of the variable.
  - Consecutive opening ! are collapsed into a single !
  - Any remaining ! that cannot be paired is removed
- Important: At this phase quotes and other special characters are ignored
- Expanding vars at this stage is "safe", because special characters are not detected anymore (even <CR> or <LF>)
- For a more complete explanation, read the second half of dbenham's answer in this same thread: Exclamation Point Phase
- There are some edge cases where these rules seem to fail:
  See Delayed expansion fails in some cases
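
The per-token scan described above can be sketched in Python. This is a model of the stated rules only: `delayed_expand` is a hypothetical name, the token is taken as it arrives at phase 5 (after phase 2 has already consumed its own carets), undefined variables are assumed to expand to nothing as in batch mode, and the edge cases mentioned above are not covered.

```python
def delayed_expand(token, env):
    """Simplified model of delayed expansion (phase 5) for one token."""
    if "!" not in token:
        return token                    # token is not parsed: carets are kept
    env = {name.upper(): value for name, value in env.items()}
    out = []
    i, n = 0, len(token)
    while i < n:
        ch = token[i]
        if ch == "^":
            if i + 1 < n:
                out.append(token[i + 1])   # next character is literal, caret removed
            i += 2
        elif ch == "!":
            while i + 1 < n and token[i + 1] == "!":
                i += 1                     # collapse consecutive opening "!"
            end = token.find("!", i + 1)   # carets are not observed here
            if end == -1:
                i += 1                     # unpaired "!" is removed
            else:
                out.append(env.get(token[i + 1:end].upper(), ""))
                i = end + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(delayed_expand("a^b", {"x": "1"}))            # prints: a^b
print(delayed_expand("!x! ^! a^b !", {"x": "1"}))   # prints: 1 ! ab
```

The first call shows why the initial check matters: without any ! in the token, carets survive this phase untouched.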

Phase 5.3) Pipe processing: Only if commands are on either side of a pipe
Each side of the pipe is processed independently.

If dealing with a parenthesized command block, then all <LF> with a command before and after are converted to <space>&. Other <LF> are stripped.
The command (or command block) is executed asynchronously in a new cmd.exe thread via %comspec% /S /D /c" commandBlock". This means the command block gets a phase restart, but this time in command line mode.
This is the end of processing for the pipe commands.
For more info on how pipes are parsed and processed, look at this question and answers: Why does delayed expansion fail when inside a piped block of code?

Phase 5.5) Execute Redirection: Any redirection that was discovered in phase 2 is now executed.

The results of phases 4 and 5 can impact the redirection that was discovered in phase 2.
If the redirection fails, then the remainder of the command is aborted. Note that failed redirection does not set ERRORLEVEL to 1 unless || is used.

Phase 6) CALL processing/Caret doubling: Only if the command token is CALL, or if the text before the first occurring standard token delimiter is CALL. If CALL is parsed from a larger command token, then the unused portion is prepended to the arguments token bef

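Hello jeb, thank you for your insight... It might be hard to understand, but I will try to think it through! You seem to have performed a lot of tests! Thank you for translating (http://www.administrator.de/Die_Geheimnisse_des_Batch_Zeilen_Interpreters.html) (4 upvotes)
Jeb - perhaps phase 0 could be moved and combined with phase 6? That makes more sense to me, or is there a reason to keep them separate? (3 upvotes)
Updated phases 2 and 5 (3 upvotes)
Batch phase 5) - %%a will already have been changed to %a in phase 1, so the FOR loop expansion really expands %a. Also, I added a more detailed explanation of batch phase 1 in an answer below (I don't have edit privileges) (2 upvotes)
@dbenham - You are right, I never liked phase 0 (2 upvotes)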

Answer 2

Mik*_*ark 60

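When invoking a command from a command window, tokenization of the command line arguments is not done by cmd.exe (a.k.a. "the shell"). Most often the tokenization is done by the newly formed process's C/C++ runtime, but this is not necessarily so: for example, the new process may not have been written in C/C++, or it may choose to ignore argv and process the raw command line itself (e.g. with GetCommandLine()). At the OS level, Windows passes command lines untokenized, as a single string, to new processes. This is in contrast to most *nix shells, where the shell tokenizes arguments in a consistent, predictable way before passing them to the newly formed process. All this means that you may experience wildly divergent argument tokenization behavior across different programs on Windows, as individual programs often take argument tokenization into their own hands.

If it sounds like anarchy, it kind of is. However, since a large number of Windows programs do use the Microsoft C/C++ runtime's argv, it is generally useful to understand how the MSVCRT tokenizes arguments. Here is an excerpt:

Arguments are delimited by white space, which is either a space or a tab.
A string surrounded by double quotation marks is interpreted as a single argument, regardless of white space contained within. A quoted string can be embedded in an argument. Note that the caret (^) is not recognized as an escape character or delimiter.
A double quotation mark preceded by a backslash, \", is interpreted as a literal double quotation mark (").
Backslashes are interpreted literally, unless they immediately precede a double quotation mark.
If an even number of backslashes is followed by a double quotation mark, then one backslash (\) is placed in the argv array for every pair of backslashes (\\), and the double quotation mark (") is interpreted as a string delimiter.
If an odd number of backslashes is followed by a double quotation mark, then one backslash (\) is placed in the argv array for every pair of backslashes (\\), and the double quotation mark is interpreted as an escape sequence by the remaining backslash, causing a literal double quotation mark (") to be placed in argv.

The Microsoft "batch language" (.bat) is no exception to this anarchic environment, and it has developed its own unique rules for tokenization and escaping. It also looks like cmd.exe's command prompt does some preprocessing of the command line arguments (mostly for variable substitution and escaping) before passing them to the newly executing process. You can read more about the low-level details of the batch language and cmd escaping in the excellent answers by jeb and dbenham on this page.

Let's build a simple command-line utility in C and see what it says about your test cases: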
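
The excerpted rules are mechanical enough to model directly. The Python sketch below implements only the rules listed above; `split_msvcrt` is a hypothetical name, every argument (including the program name) is treated the same way, and any special treatment of doubled quotes ("") inside a quoted string is deliberately left out, so it will not necessarily reproduce what a particular runtime does with the question's test cases.

```python
def split_msvcrt(cmdline):
    """Split a command line using only the backslash/quote rules listed above."""
    args, current = [], []
    in_quotes = False
    have_arg = False                    # lets an empty "" still count as an argument
    i, n = 0, len(cmdline)
    while i < n:
        ch = cmdline[i]
        if ch == "\\":
            j = i
            while j < n and cmdline[j] == "\\":
                j += 1                  # measure the run of backslashes
            count = j - i
            if j < n and cmdline[j] == '"':
                current.append("\\" * (count // 2))   # one backslash per pair
                if count % 2:
                    current.append('"')  # odd count: the quote is a literal
                    j += 1               # consume the escaped quote
                # even count: the quote is handled as a delimiter on the next pass
            else:
                current.append("\\" * count)           # backslashes are literal
            have_arg = True
            i = j
        elif ch == '"':
            in_quotes = not in_quotes
            have_arg = True
            i += 1
        elif ch in " \t" and not in_quotes:
            if have_arg:
                args.append("".join(current))
                current, have_arg = [], False
            i += 1
        else:
            current.append(ch)
            have_arg = True
            i += 1
    if have_arg:
        args.append("".join(current))
    return args


print(split_msvcrt(r'"a b c" d e'))
# prints: ['a b c', 'd', 'e']
```

Note that a caret passes through unchanged, which matches the remark that ^ is not an escape character at this level.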

#include <stdio.h>

/* Print each argument exactly as the C runtime delivered it in argv. */
int main(int argc, char* argv[]) {
    int i;
    for (i = 0; i < argc; i++) {
        printf("argv[%d][%s]\n", i, argv[i]);
    }
    return 0;
}
