为我的语法编写递归后代解析器

Ste*_*TNT 3 java parsing builder recursive-descent

我必须使用Recursive Descendent Parser Builder构建表达式.

这是我的语法:

----RULES----
<cond> ? <termb> [OR <termb>]*
<termb>?<factb>[AND <factb>]*
<factb>?<expr> RELOP <expr> | NOT <factb> | OPAR <cond> CPAR
<expr> ? [PLUS | MINUS] <term> [(PLUS <term>) | (MINUS <term>)]*
<term> ? <termp> [(MULT <termp>) | (DIV <termp>) | (REM <termp>)]*
<termp> ? <fact> [POWER <fact>]*
<fact> ? ID | NUM | OPAR1 <expr> CPAR1
----TERMINALS----
ID ? ("A" | ... | "Z" | "a" | ...| "z") [("A"| ... | "Z" | "a" | ...| "z" | "0" | ... | "9")]*
NUM ? ("0" | ... | "9") [("0" | ... | "9")]*
OPAR ? "("
CPAR ? ")"
OPAR1 ? "["
CPAR1 ? "]"
RELOP ? EQ | NEQ | GT | GE | LT | LE
EQ ? "= ="
NEQ ? "!="
GT ? ">"
GE ? ">="
LT ? "<"
LE ? "<="
POWER ? "^"
DIV ? "/"
REM ? "%"
MULT ? "*"
MINUS ? "?"
PLUS ? "+"
AND ? “and” or “&&”
OR ? “or” or “||”
NOT ? “not” or “!”
Run Code Online (Sandbox Code Playgroud)

现在,我有一个复合结构,每个终端都有一个类,我正在尝试按照上面的语法规则来构建它.

我的想法是每个语法规则都需要一个方法,每个方法都必须构建表达式树的一部分.

虽然它使用更简单的语法(例如只允许布尔表达式的语法),但我遇到了一些问题.

主要问题来自于expr规则,即使使用我的+和 - 的一元版本,我也可以通过允许表达式来使用它+3-4.这需要我找出何时必须将操作数视为一元和二进制!

这是我的Builder类的代码.请注意,有些东西是意大利语,但我已经使用评论解释了所有问题,甚至是我的问题.另请注意,有一行我使用了伪代码,所以这个不可编译!

package espressioniCondizionali;

import espressioniCondizionali.espressioni.*;

/**
 *
 * @author Stefano
 */
public class Builder6 {

    /**
     * This one is used just to parse the input String.
     * It has a prossimoSimbolo() method (something like a nextSymbol()) that returns a String with the next Symbol in the String
     */
    private GestoreInput gestoreInput;
    /**
     * Espressione is the Composite structure that represents and expression
     */
    private Espressione root = null;
    private String symbol = null;

    public Builder6(GestoreInput gestoreInput) {
        this.gestoreInput = gestoreInput;
    }

    public Espressione build() {
        buildCond();
        return root;
    }

    private void buildCond() {
        buildTermb();
        //I'm unsing a class named Simboli that holds every terminal symbol of my grammar. Symbol names are the same defined in the grammar.
        while (symbol.equalsIgnoreCase(Simboli.OR1) || symbol.equalsIgnoreCase(Simboli.OR2)) {
            Or or = new Or();
            or.setLeft(root);
            buildTermb();
            or.setRight(root);
            root = or;
        }
    }

    private void buildTermb() {
        buildFactb();
        while (symbol.equalsIgnoreCase(Simboli.AND1) || symbol.equalsIgnoreCase(Simboli.AND2)) {
            And and = new And();
            and.setLeft(root);
            buildFactb();
            and.setRight(root);
            root = and;
        }
    }

    private void buildFactb() {
        buildExpr();
        if (symbol.equalsIgnoreCase(Simboli.EQ) || symbol.equalsIgnoreCase(Simboli.NEQ) || symbol.equalsIgnoreCase(Simboli.GT) || symbol.equalsIgnoreCase(Simboli.LT) || symbol.equalsIgnoreCase(Simboli.LE) || symbol.equalsIgnoreCase(Simboli.GE)) {
            Operatore op = null;
            switch (symbol) {
                case Simboli.EQ: {
                    op = new Eq();
                    break;
                }
                case Simboli.NEQ: {
                    op = new Neq();
                    break;
                }
                case Simboli.GT: {
                    op = new Gt();
                    break;
                }
                case Simboli.GE: {
                    op = new Ge();
                    break;
                }
                case Simboli.LT: {
                    op = new Lt();
                    break;
                }
                case Simboli.LE: {
                    op = new Le();
                    break;
                }
                default: {
                    //Symbol not recognized, abort!
                    throw new RuntimeException("\"" + symbol + "\" non è un simbolo valido.");
                }
            }
            op.setLeft(root);
            buildExpr();
            op.setRight(root);
            root = op;
        } else if (symbol.equalsIgnoreCase(Simboli.NOT1) || symbol.equals(Simboli.NOT2)) {
            Not not = new Not();
            buildFactb();
            not.setRight(root);
            root = not;
        } else if (symbol.equalsIgnoreCase(Simboli.OPAR)) {
            buildCond();
            if (!symbol.equalsIgnoreCase(Simboli.CPAR)) { //If there's no closgin square bracket it means that our expression is not well formed
                throw new RuntimeException("Espressione non valida. Attesa una \")\", trovato \"" + symbol + "\"");
            }
        }
    }

    private void buildExpr() {
        Operatore op = null;
        if (symbol != null) { //Let's check if our expression starts with a + or a -
            if (symbol.equalsIgnoreCase(Simboli.PLUS)) {
                op = new Plus();
            } else if (symbol.equalsIgnoreCase(Simboli.MINUS)) {
                op = new Minus();
            }
        }
        buildTerm();
        //If our op is still null, we didn't found a + or a - so our operand will be a binary one
        //If op != null, our + or - is unary and we've got to manage it! A unary operand doesn't have a left son!
        if (op != null) {
            op.setRight(root);
            root = op;
        }
        //Since we can have something like -3+2+s-x we've got to check it
        while (symbol.equalsIgnoreCase(Simboli.PLUS) || symbol.equalsIgnoreCase(Simboli.MINUS)) {
            Operatore op1 = null; //We define a new Operatore that will be a + or a -
            switch (symbol) {
                case Simboli.PLUS: {
                    op1 = new Plus();
                    break;
                }
                case Simboli.MINUS: {
                    op1 = new Minus();
                    break;
                }
            }
            /*
             * Here comes a BIG problem. We used the first if/else to check if
             * our unary operand was at the beginning of the string, but now
             * we've got to see if our current operand is either binary or
             * unary! That's because we can have both a==-1+d and a==3-1+d. In
             * the first case, the - is unary, while is binary in the second.
             * So, how do we choose it?
             * 
             * EXAMPLE : (-a>2 || -b-12>0)
             * This one will be evaluated to -a > 2 || -12 > 0 that's clearly wrong!
             * -b is missing before -12. That's because the -12 is used as unary 
             * and so it won't have a left child (the -b part)
             */
            //--PSEUDOCODE
            if (op1 is not unary) {
                op1.setLeft(root);
            }
            //--END PSEUDOCODE
            //CURRENT IMPLEMENTATION OF THE PSEUDOCODE PART
            if (root != null && (root.getClass().equals(Num.class) || root.getClass().equals(Id.class))) { //It is binary if the previous value is a constant or a variable but not if it is an operand!                  
                op1.setLeft(root);
            }
            //END OF CURRENT IMPLEMENTATION OF THE PSEUDOCODE PART
            //Setting the right child must be done in both cases
            buildTerm();
            op1.setRight(root);
            root = op1;
        }
    }

    private void buildTerm() {
        buildTermp();
        while (symbol.equalsIgnoreCase(Simboli.MULT) || symbol.equalsIgnoreCase(Simboli.DIV) || symbol.equalsIgnoreCase(Simboli.REM)) {
            Operatore op = null;
            switch (symbol) {
                case Simboli.MULT: {
                    op = new Mult();
                    break;
                }
                case Simboli.DIV: {
                    op = new Div();
                    break;
                }
                case Simboli.REM: {
                    op = new Rem();
                    break;
                }
            }
            op.setLeft(root);
            buildTermp();
            op.setRight(root);
            root = op;
        }
    }

    private void buildTermp() {
        buildFact();
        while (symbol.equalsIgnoreCase(Simboli.POWER)) {
            Power p = new Power();
            p.setLeft(root);
            buildFact();
            p.setRight(root);
            root = p;
        }
    }

    private void buildFact() {
        symbol = gestoreInput.prossimoSimbolo();
        if (symbol.equalsIgnoreCase(Simboli.OPAR1)) { //Sottoespressione            
            buildExpr();
            if (!symbol.equalsIgnoreCase(Simboli.CPAR1)) { //Se non c'è una parentesi chiusa vuol dire che l'espressione non valida
                throw new RuntimeException("Espressione non valida. Attesa una \"]\", trovato \"" + symbol + "\"");
            }
        } else if (symbol.matches("[A-Za-z][A-Za-z | 0-9]*")) { //Nome di una variabile!
            root = new Id(symbol);
            symbol = gestoreInput.prossimoSimbolo();
        } else if (symbol.matches("[0-9]*")) { //E' una costante!
            root = new Num(symbol);
            symbol = gestoreInput.prossimoSimbolo();
        }
    }
}
Run Code Online (Sandbox Code Playgroud)

已知问题:

(a<=b && c>1) || a==4 评估为 a <= b && c > 1

a==[-4] 评估为 a == a - 4

-4+3>c-2 评估为 +3 > c - 2

可能还有一些错误,但这些错误最常见.

所以这是我的问题:

首先,您认为此代码存在一些逻辑问题吗?我的意思是,它是做了语法所说的,还是做了一些非常错误的事情?您如何修复该expr方法以使其适用于一元和二元+或 - 操作数?

如果我的代码完全错误,是否有人可以帮助我编写工作版本?正如您在类名中看到的,这是我编写的第六个实现,我真的不知道下一步该做什么!

谢谢.

Dao*_*Wen 5

第1步:以不同的方式思考语法

我认为你在实现你的递归下降解析器时遇到了问题,因为你的语法非常复杂.[ ... ]*表单所代表的任意迭代很难解决.试着这样思考:

<cond> ? <termb> <cond-tail>
<cond-tail> ? OR <termb> <cond-tail> | EPSILON
<termb> ? <factb> <termb-tail>
<termb-tail> ? AND <factb> <termb-tail> | EPSILON
<factb> ? <expr> RELOP <expr> | NOT <factb> | OPAR <cond> CPAR
<expr> ? <unary-op> <term> <expr-tail>
<unary-op> ? PLUS | MINUS | EPSILON
<expr-tail> ? ((PLUS <term>) | (MINUS <term>)) <expr-tail> | EPSILON
<term> ? <termp> <term-tail>
<term-tail> ? ((MULT <termp>) | (DIV <termp>) | (REM <termp>)) <term-tail> | EPSILON
<termp> ? <fact> <termp-tail>
<termp-tail> ? POWER <fact> <termp-tail> | EPSILON
<fact> ? ID | NUM | OPAR1 <expr> CPAR1
EPSILON ? ""
Run Code Online (Sandbox Code Playgroud)

这个语法与你发布的语法相同,但我把[ ... ]*规则分成了自己的非终端.为了做到这一点,我添加了EPSILON规则,允许非终端匹配""(空字符串).这基本上与您的[ ... ]规则相同,可能会或可能不会实际匹配某些内容.epsilon规则只是最后一种选择,即"现在停止尝试匹配".例如,如果您当前正在解析a,cond-tail那么您首先要检查OR输入.如果你没有看到,OR那么你将转向下一个替代方案,即EPSILON.这意味着OR此条件链中没有更多条件,因此cond-tail在空字符串上匹配.

您可以通过在运营商一起编组进一步简化语法expr-tailterm-tail:

<cond> ? <termb> <cond-tail>
<cond-tail> ? OR <termb> <cond-tail> | EPSILON
<termb> ? <factb> <termb-tail>
<termb-tail> ? AND <factb> <termb-tail> | EPSILON
<factb> ? <expr> RELOP <expr> | NOT <factb> | OPAR <cond> CPAR
<expr> ? <unary-op> <term> <expr-tail>
<unary-op> ? PLUS | MINUS | EPSILON
<expr-tail> ? (PLUS | MINUS) <term> <expr-tail> | EPSILON
<term> ? <termp> <term-tail>
<term-tail> ? (MULT | DIV | REM) <termp> <term-tail> | EPSILON
<termp> ? <fact> <termp-tail>
<termp-tail> ? POWER <fact> <termp-tail> | EPSILON
<fact> ? ID | NUM | OPAR1 <expr> CPAR1
EPSILON ? ""
Run Code Online (Sandbox Code Playgroud)

现在,您应该返回并重构您的解决方案,为每个新规则添加新方法.(您可以选择将新规则的代码保留在原始规则中,但至少应使用注释清楚地标记不同的部分,以帮助您跟踪代码的哪些部分正在执行哪些操作.)

现在,应该是在何种情况下非常明显+,并-有一元运算符,因为他们会通过新的解析buildUnaryOp方法(对应于新unary-op的语法规则).

至于新*-tail规则,你可以选择严格递归地实现它们,或者你可以在函数体内用循环实现它们,如果你担心在长表达式上烧掉你的堆栈.

第2步:修复状态跟踪

这是你的主要问题:

/**
 * Espressione is the Composite structure that represents and expression
 */
private Espressione root = null;
private String symbol = null;
Run Code Online (Sandbox Code Playgroud)

您正在使用实例级字段来存储递归下降解析器的状态.我从来没有见过一个递归下降的解析器,其方法的返回类型为void-at至少,如果它实际上正在构建一个抽象语法树.维基百科上的示例void返回类型,但这是因为它所做的只是检查输入并丢弃结果,而不是实际从输入构建AST.

由于您将状态存储在类级别字段中,因此在每个兄弟调用中都会覆盖该字段,buildExpr()并且您将丢失数据.无论如何,那是我的理论.你可以通过制作一长串exprs 来测试它"1+2+3+4+5".它应该最终放弃前面的所有条款.

我建议在功能上构建解析器(不使用任何.set*()方法),其中每个build*()方法从AST返回解析的节点.这将涉及更改所有构造函数以接受子节点.如果您对此更清楚,您仍然可以使用变异.这是重构的buildCond:

private Espressione buildCond() {
      Espressione leftHandSide = buildTermb();
      if (symbol.equalsIgnoreCase(Simboli.OR1) || symbol.equalsIgnoreCase(Simboli.OR2)) {
          Or or = new Or();
          or.setLeft(leftHandSide);
          or.setRight(buildTermb());
          return or;
      }
      return leftHandSide;
  }
Run Code Online (Sandbox Code Playgroud)

这是新的build()样子:

public Espressione build() {
    return buildCond();
}
Run Code Online (Sandbox Code Playgroud)

如果您重新使用其余代码以这种方式构建树(而不是使用实例级字段来保存您的状态),那么它应该工作得更好.

祝好运!