如何在Java中浏览Java字符串文字?

zig*_*tar 68 java string escaping

我正在使用Java处理一些Java源代码.我正在提取字符串文字并将它们提供给一个带字符串的函数.问题是我需要将未转义的String版本传递给函数(即这意味着转换\n为换行符,转换为\\单行\等).

Java API中是否有一个函数可以执行此操作?如果没有,我可以从某个库获得这样的功能吗?显然,Java编译器必须进行此转换.

如果有人想知道,我试图在反编译的混淆Java文件中取消混淆字符串文字.

tch*_*ist 97

问题

这里org.apache.commons.lang.StringEscapeUtils.unescapeJava()给出的另一个答案实际上是很少的帮助.

  • 它忘记\0了null.
  • 它不处理八.
  • 它无法处理由其java.util.regex.Pattern.compile()使用的各种逃逸,包括\a,\e尤其是\cX.
  • 它不支持按编号分配逻辑Unicode代码点,仅适用于UTF-16.
  • 这看起来像UCS-2代码,而不是UTF-16代码:它们使用折旧的charAt接口而不是codePoint接口,从而揭示了Java char保证拥有Unicode字符的错觉.不是.他们只是逃避这一点,因为没有UTF-16代理人会找他们正在寻找的任何东西.

解决方案

我写了一个字符串unescaper,解决了OP的问题,没有Apache代码的所有烦恼.

/*
 *
 * unescape_perl_string()
 *
 *      Tom Christiansen <tchrist@perl.com>
 *      Sun Nov 28 12:55:24 MST 2010
 *
 * It's completely ridiculous that there's no standard
 * unescape_java_string function.  Since I have to do the
 * damn thing myself, I might as well make it halfway useful
 * by supporting things Java was too stupid to consider in
 * strings:
 * 
 *   => "?" items  are additions to Java string escapes
 *                 but normal in Java regexes
 *
 *   => "!" items  are also additions to Java regex escapes
 *   
 * Standard singletons: ?\a ?\e \f \n \r \t
 * 
 *      NB: \b is unsupported as backspace so it can pass-through
 *          to the regex translator untouched; I refuse to make anyone
 *          doublebackslash it as doublebackslashing is a Java idiocy
 *          I desperately wish would die out.  There are plenty of
 *          other ways to write it:
 *
 *              \cH, \12, \012, \x08 \x{8}, \u0008, \U00000008
 *
 * Octal escapes: \0 \0N \0NN \N \NN \NNN
 *    Can range up to !\777 not \377
 *    
 *      TODO: add !\o{NNNNN}
 *          last Unicode is 4177777
 *          maxint is 37777777777
 *
 * Control chars: ?\cX
 *      Means: ord(X) ^ ord('@')
 *
 * Old hex escapes: \xXX
 *      unbraced must be 2 xdigits
 *
 * Perl hex escapes: !\x{XXX} braced may be 1-8 xdigits
 *       NB: proper Unicode never needs more than 6, as highest
 *           valid codepoint is 0x10FFFF, not maxint 0xFFFFFFFF
 *
 * Lame Java escape: \[IDIOT JAVA PREPROCESSOR]uXXXX must be
 *                   exactly 4 xdigits;
 *
 *       I can't write XXXX in this comment where it belongs
 *       because the damned Java Preprocessor can't mind its
 *       own business.  Idiots!
 *
 * Lame Python escape: !\UXXXXXXXX must be exactly 8 xdigits
 * 
 * TODO: Perl translation escapes: \Q \U \L \E \[IDIOT JAVA PREPROCESSOR]u \l
 *       These are not so important to cover if you're passing the
 *       result to Pattern.compile(), since it handles them for you
 *       further downstream.  Hm, what about \[IDIOT JAVA PREPROCESSOR]u?
 *
 */

public final static
String unescape_perl_string(String oldstr) {

    /*
     * In contrast to fixing Java's broken regex charclasses,
     * this one need be no bigger, as unescaping shrinks the string
     * here, where in the other one, it grows it.
     */

    StringBuffer newstr = new StringBuffer(oldstr.length());

    boolean saw_backslash = false;

    for (int i = 0; i < oldstr.length(); i++) {
        int cp = oldstr.codePointAt(i);
        if (oldstr.codePointAt(i) > Character.MAX_VALUE) {
            i++; /****WE HATES UTF-16! WE HATES IT FOREVERSES!!!****/
        }

        if (!saw_backslash) {
            if (cp == '\\') {
                saw_backslash = true;
            } else {
                newstr.append(Character.toChars(cp));
            }
            continue; /* switch */
        }

        if (cp == '\\') {
            saw_backslash = false;
            newstr.append('\\');
            newstr.append('\\');
            continue; /* switch */
        }

        switch (cp) {

            case 'r':  newstr.append('\r');
                       break; /* switch */

            case 'n':  newstr.append('\n');
                       break; /* switch */

            case 'f':  newstr.append('\f');
                       break; /* switch */

            /* PASS a \b THROUGH!! */
            case 'b':  newstr.append("\\b");
                       break; /* switch */

            case 't':  newstr.append('\t');
                       break; /* switch */

            case 'a':  newstr.append('\007');
                       break; /* switch */

            case 'e':  newstr.append('\033');
                       break; /* switch */

            /*
             * A "control" character is what you get when you xor its
             * codepoint with '@'==64.  This only makes sense for ASCII,
             * and may not yield a "control" character after all.
             *
             * Strange but true: "\c{" is ";", "\c}" is "=", etc.
             */
            case 'c':   {
                if (++i == oldstr.length()) { die("trailing \\c"); }
                cp = oldstr.codePointAt(i);
                /*
                 * don't need to grok surrogates, as next line blows them up
                 */
                if (cp > 0x7f) { die("expected ASCII after \\c"); }
                newstr.append(Character.toChars(cp ^ 64));
                break; /* switch */
            }

            case '8':
            case '9': die("illegal octal digit");
                      /* NOTREACHED */

    /*
     * may be 0 to 2 octal digits following this one
     * so back up one for fallthrough to next case;
     * unread this digit and fall through to next case.
     */
            case '1':
            case '2':
            case '3':
            case '4':
            case '5':
            case '6':
            case '7': --i;
                      /* FALLTHROUGH */

            /*
             * Can have 0, 1, or 2 octal digits following a 0
             * this permits larger values than octal 377, up to
             * octal 777.
             */
            case '0': {
                if (i+1 == oldstr.length()) {
                    /* found \0 at end of string */
                    newstr.append(Character.toChars(0));
                    break; /* switch */
                }
                i++;
                int digits = 0;
                int j;
                for (j = 0; j <= 2; j++) {
                    if (i+j == oldstr.length()) {
                        break; /* for */
                    }
                    /* safe because will unread surrogate */
                    int ch = oldstr.charAt(i+j);
                    if (ch < '0' || ch > '7') {
                        break; /* for */
                    }
                    digits++;
                }
                if (digits == 0) {
                    --i;
                    newstr.append('\0');
                    break; /* switch */
                }
                int value = 0;
                try {
                    value = Integer.parseInt(
                                oldstr.substring(i, i+digits), 8);
                } catch (NumberFormatException nfe) {
                    die("invalid octal value for \\0 escape");
                }
                newstr.append(Character.toChars(value));
                i += digits-1;
                break; /* switch */
            } /* end case '0' */

            case 'x':  {
                if (i+2 > oldstr.length()) {
                    die("string too short for \\x escape");
                }
                i++;
                boolean saw_brace = false;
                if (oldstr.charAt(i) == '{') {
                        /* ^^^^^^ ok to ignore surrogates here */
                    i++;
                    saw_brace = true;
                }
                int j;
                for (j = 0; j < 8; j++) {

                    if (!saw_brace && j == 2) {
                        break;  /* for */
                    }

                    /*
                     * ASCII test also catches surrogates
                     */
                    int ch = oldstr.charAt(i+j);
                    if (ch > 127) {
                        die("illegal non-ASCII hex digit in \\x escape");
                    }

                    if (saw_brace && ch == '}') { break; /* for */ }

                    if (! ( (ch >= '0' && ch <= '9')
                                ||
                            (ch >= 'a' && ch <= 'f')
                                ||
                            (ch >= 'A' && ch <= 'F')
                          )
                       )
                    {
                        die(String.format(
                            "illegal hex digit #%d '%c' in \\x", ch, ch));
                    }

                }
                if (j == 0) { die("empty braces in \\x{} escape"); }
                int value = 0;
                try {
                    value = Integer.parseInt(oldstr.substring(i, i+j), 16);
                } catch (NumberFormatException nfe) {
                    die("invalid hex value for \\x escape");
                }
                newstr.append(Character.toChars(value));
                if (saw_brace) { j++; }
                i += j-1;
                break; /* switch */
            }

            case 'u': {
                if (i+4 > oldstr.length()) {
                    die("string too short for \\u escape");
                }
                i++;
                int j;
                for (j = 0; j < 4; j++) {
                    /* this also handles the surrogate issue */
                    if (oldstr.charAt(i+j) > 127) {
                        die("illegal non-ASCII hex digit in \\u escape");
                    }
                }
                int value = 0;
                try {
                    value = Integer.parseInt( oldstr.substring(i, i+j), 16);
                } catch (NumberFormatException nfe) {
                    die("invalid hex value for \\u escape");
                }
                newstr.append(Character.toChars(value));
                i += j-1;
                break; /* switch */
            }

            case 'U': {
                if (i+8 > oldstr.length()) {
                    die("string too short for \\U escape");
                }
                i++;
                int j;
                for (j = 0; j < 8; j++) {
                    /* this also handles the surrogate issue */
                    if (oldstr.charAt(i+j) > 127) {
                        die("illegal non-ASCII hex digit in \\U escape");
                    }
                }
                int value = 0;
                try {
                    value = Integer.parseInt(oldstr.substring(i, i+j), 16);
                } catch (NumberFormatException nfe) {
                    die("invalid hex value for \\U escape");
                }
                newstr.append(Character.toChars(value));
                i += j-1;
                break; /* switch */
            }

            default:   newstr.append('\\');
                       newstr.append(Character.toChars(cp));
           /*
            * say(String.format(
            *       "DEFAULT unrecognized escape %c passed through",
            *       cp));
            */
                       break; /* switch */

        }
        saw_backslash = false;
    }

    /* weird to leave one at the end */
    if (saw_backslash) {
        newstr.append('\\');
    }

    return newstr.toString();
}

/*
 * Return a string "U+XX.XXX.XXXX" etc, where each XX set is the
 * xdigits of the logical Unicode code point. No bloody brain-damaged
 * UTF-16 surrogate crap, just true logical characters.
 */
 public final static
 String uniplus(String s) {
     if (s.length() == 0) {
         return "";
     }
     /* This is just the minimum; sb will grow as needed. */
     StringBuffer sb = new StringBuffer(2 + 3 * s.length());
     sb.append("U+");
     for (int i = 0; i < s.length(); i++) {
         sb.append(String.format("%X", s.codePointAt(i)));
         if (s.codePointAt(i) > Character.MAX_VALUE) {
             i++; /****WE HATES UTF-16! WE HATES IT FOREVERSES!!!****/
         }
         if (i+1 < s.length()) {
             sb.append(".");
         }
     }
     return sb.toString();
 }

private static final
void die(String foa) {
    throw new IllegalArgumentException(foa);
}

private static final
void say(String what) {
    System.out.println(what);
}
Run Code Online (Sandbox Code Playgroud)

如果它对其他人有帮助,那么欢迎你这样做 - 没有任何附加条件.如果你改进它,我会很乐意将你的增强功能邮寄给我,但你当然不必这样做.

  • 我看到你的坏男孩态度为你赢得了赞成,但这并不能使你的解决方案变得更好.不正确处理它意味着它不适合生产,这种情况与其他任何情况一样,无论你喜欢与否 (5认同)
  • 为什么你的例程被称为unescape_*perl*_string?另外,对规范未定义的事物进行所有额外的转义难道不是一个错误吗,因为 java 本身不会以这种方式解释文字?只是确保我没有遗漏这里的任何内容 - 代码足够复杂,我有点担心所有额外的位。 (3认同)
  • Apache Commons Lang3 修复了其中的一些问题(至少是八进制位,这让我很满意)https://issues.apache.org/jira/browse/LANG-646 (3认同)

pol*_*nts 48

您可以使用String unescapeJava(String)的方法StringEscapeUtils阿帕奇共享郎.

这是一个示例代码段:

    String in = "a\\tb\\n\\\"c\\\"";

    System.out.println(in);
    // a\tb\n\"c\"

    String out = StringEscapeUtils.unescapeJava(in);

    System.out.println(out);
    // a    b
    // "c"
Run Code Online (Sandbox Code Playgroud)

实用程序类具有用于Java,Java脚本,HTML,XML和SQL的转义和转义字符串的方法.它还具有直接写入a的重载java.io.Writer.


注意事项

看起来像StringEscapeUtils处理Unicode转义与一个u,但不是八进制转义,或Unicode转义与无关us.

    /* Unicode escape test #1: PASS */

    System.out.println(
        "\u0030"
    ); // 0
    System.out.println(
        StringEscapeUtils.unescapeJava("\\u0030")
    ); // 0
    System.out.println(
        "\u0030".equals(StringEscapeUtils.unescapeJava("\\u0030"))
    ); // true

    /* Octal escape test: FAIL */

    System.out.println(
        "\45"
    ); // %
    System.out.println(
        StringEscapeUtils.unescapeJava("\\45")
    ); // 45
    System.out.println(
        "\45".equals(StringEscapeUtils.unescapeJava("\\45"))
    ); // false

    /* Unicode escape test #2: FAIL */

    System.out.println(
        "\uu0030"
    ); // 0
    System.out.println(
        StringEscapeUtils.unescapeJava("\\uu0030")
    ); // throws NestableRuntimeException:
       //   Unable to parse unicode value: u003
Run Code Online (Sandbox Code Playgroud)

来自JLS的引用:

提供了用于使用C兼容性八路逃逸,但是只能表示Unicode值\u0000通过\u00FF,所以Unicode转义通常是优选的.

如果您的字符串可以包含八进制转义符,您可能希望首先将它们转换为Unicode转义符,或使用其他方法.

无关的u内容也记录如下:

Java编程语言指定了一种将用Unicode编写的程序转换为ASCII的标准方法,该程序将程序更改为可由基于ASCII的工具处理的形式.改造涉及转换任何Unicode在程序的源文本以ASCII逸出通过增加一个额外u-例如,\uxxxx变得\uuxxxx-while同时将非ASCII字符中的源文本以Unicode转义含有单一u各自.

这个转换版本对于Java编程语言的编译器同样可以接受,并且代表完全相同的程序.通过将u存在多个转义序列的每个转义序列转换为一个较少的Unicode字符序列u,同时将每个转义序列转换u为相应的单个Unicode字符,可以从此ASCII格式恢复确切的Unicode源.

如果您的字符串可以包含无关的Unicode转义u,那么您可能还需要在使用前对其进行预处理StringEscapeUtils.

或者,您可以尝试从头开始编写自己的Java字符串文字unescaper,确保遵循确切的JLS规范.

参考


Udo*_*ski 11

遇到了类似的问题,对提出的解决方案并不满意,并自己实现了这个.

也可以在Github上作为Gist :

/**
 * Unescapes a string that contains standard Java escape sequences.
 * <ul>
 * <li><strong>&#92;b &#92;f &#92;n &#92;r &#92;t &#92;" &#92;'</strong> :
 * BS, FF, NL, CR, TAB, double and single quote.</li>
 * <li><strong>&#92;X &#92;XX &#92;XXX</strong> : Octal character
 * specification (0 - 377, 0x00 - 0xFF).</li>
 * <li><strong>&#92;uXXXX</strong> : Hexadecimal based Unicode character.</li>
 * </ul>
 * 
 * @param st
 *            A string optionally containing standard java escape sequences.
 * @return The translated string.
 */
public String unescapeJavaString(String st) {

    StringBuilder sb = new StringBuilder(st.length());

    for (int i = 0; i < st.length(); i++) {
        char ch = st.charAt(i);
        if (ch == '\\') {
            char nextChar = (i == st.length() - 1) ? '\\' : st
                    .charAt(i + 1);
            // Octal escape?
            if (nextChar >= '0' && nextChar <= '7') {
                String code = "" + nextChar;
                i++;
                if ((i < st.length() - 1) && st.charAt(i + 1) >= '0'
                        && st.charAt(i + 1) <= '7') {
                    code += st.charAt(i + 1);
                    i++;
                    if ((i < st.length() - 1) && st.charAt(i + 1) >= '0'
                            && st.charAt(i + 1) <= '7') {
                        code += st.charAt(i + 1);
                        i++;
                    }
                }
                sb.append((char) Integer.parseInt(code, 8));
                continue;
            }
            switch (nextChar) {
            case '\\':
                ch = '\\';
                break;
            case 'b':
                ch = '\b';
                break;
            case 'f':
                ch = '\f';
                break;
            case 'n':
                ch = '\n';
                break;
            case 'r':
                ch = '\r';
                break;
            case 't':
                ch = '\t';
                break;
            case '\"':
                ch = '\"';
                break;
            case '\'':
                ch = '\'';
                break;
            // Hex Unicode: u????
            case 'u':
                if (i >= st.length() - 5) {
                    ch = 'u';
                    break;
                }
                int code = Integer.parseInt(
                        "" + st.charAt(i + 2) + st.charAt(i + 3)
                                + st.charAt(i + 4) + st.charAt(i + 5), 16);
                sb.append(Character.toChars(code));
                i += 5;
                continue;
            }
            i++;
        }
        sb.append(ch);
    }
    return sb.toString();
}
Run Code Online (Sandbox Code Playgroud)

  • 为了格外小心,首先进行空检查,然后在这种情况下返回空。不过谢谢! (3认同)

Las*_*olt 10

http://commons.apache.org/lang/看到:

StringEscapeUtils

StringEscapeUtils.unescapeJava(String str)


Dao*_*Wen 8

我知道这个问题很老,但是我想要一个不包含JRE6之外的库的解决方案(即Apache Commons是不可接受的),我想出了一个使用内置的简单解决方案java.io.StreamTokenizer:

import java.io.*;

// ...

String literal = "\"Has \\\"\\\\\\\t\\\" & isn\\\'t \\\r\\\n on 1 line.\"";
StreamTokenizer parser = new StreamTokenizer(new StringReader(literal));
String result;
try {
  parser.nextToken();
  if (parser.ttype == '"') {
    result = parser.sval;
  }
  else {
    result = "ERROR!";
  }
}
catch (IOException e) {
  result = e.toString();
}
System.out.println(result);
Run Code Online (Sandbox Code Playgroud)

输出:

Has "\  " & isn't
 on 1 line.
Run Code Online (Sandbox Code Playgroud)


Ale*_*lex 7

Java 13 添加了一个执行此操作的方法:String#translateEscapes.

它是 Java 13 和 14 中的预览功能,但在 Java 15 中升级为完整功能。


Cha*_*etz 6

我在这方面有点晚了,但我想我会提供我的解决方案,因为我需要相同的功能.我决定使用Java Compiler API,这会让它变慢,但会使结果准确.基本上我活着创建一个类然后返回结果.这是方法:

public static String[] unescapeJavaStrings(String... escaped) {
    //class name
    final String className = "Temp" + System.currentTimeMillis();
    //build the source
    final StringBuilder source = new StringBuilder(100 + escaped.length * 20).
            append("public class ").append(className).append("{\n").
            append("\tpublic static String[] getStrings() {\n").
            append("\t\treturn new String[] {\n");
    for (String string : escaped) {
        source.append("\t\t\t\"");
        //we escape non-escaped quotes here to be safe 
        //  (but something like \\" will fail, oh well for now)
        for (int i = 0; i < string.length(); i++) {
            char chr = string.charAt(i);
            if (chr == '"' && i > 0 && string.charAt(i - 1) != '\\') {
                source.append('\\');
            }
            source.append(chr);
        }
        source.append("\",\n");
    }
    source.append("\t\t};\n\t}\n}\n");
    //obtain compiler
    final JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
    //local stream for output
    final ByteArrayOutputStream out = new ByteArrayOutputStream();
    //local stream for error
    ByteArrayOutputStream err = new ByteArrayOutputStream();
    //source file
    JavaFileObject sourceFile = new SimpleJavaFileObject(
            URI.create("string:///" + className + Kind.SOURCE.extension), Kind.SOURCE) {
        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) throws IOException {
            return source;
        }
    };
    //target file
    final JavaFileObject targetFile = new SimpleJavaFileObject(
            URI.create("string:///" + className + Kind.CLASS.extension), Kind.CLASS) {
        @Override
        public OutputStream openOutputStream() throws IOException {
            return out;
        }
    };
    //file manager proxy, with most parts delegated to the standard one 
    JavaFileManager fileManagerProxy = (JavaFileManager) Proxy.newProxyInstance(
            StringUtils.class.getClassLoader(), new Class[] { JavaFileManager.class },
            new InvocationHandler() {
                //standard file manager to delegate to
                private final JavaFileManager standard = 
                    compiler.getStandardFileManager(null, null, null); 
                @Override
                public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
                    if ("getJavaFileForOutput".equals(method.getName())) {
                        //return the target file when it's asking for output
                        return targetFile;
                    } else {
                        return method.invoke(standard, args);
                    }
                }
            });
    //create the task
    CompilationTask task = compiler.getTask(new OutputStreamWriter(err), 
            fileManagerProxy, null, null, null, Collections.singleton(sourceFile));
    //call it
    if (!task.call()) {
        throw new RuntimeException("Compilation failed, output:\n" + 
                new String(err.toByteArray()));
    }
    //get the result
    final byte[] bytes = out.toByteArray();
    //load class
    Class<?> clazz;
    try {
        //custom class loader for garbage collection
        clazz = new ClassLoader() { 
            protected Class<?> findClass(String name) throws ClassNotFoundException {
                if (name.equals(className)) {
                    return defineClass(className, bytes, 0, bytes.length);
                } else {
                    return super.findClass(name);
                }
            }
        }.loadClass(className);
    } catch (ClassNotFoundException e) {
        throw new RuntimeException(e);
    }
    //reflectively call method
    try {
        return (String[]) clazz.getDeclaredMethod("getStrings").invoke(null);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}
Run Code Online (Sandbox Code Playgroud)

它需要一个数组,因此您可以分批进行.所以下面的简单测试成功:

public static void main(String[] meh) {
    if ("1\02\03\n".equals(unescapeJavaStrings("1\\02\\03\\n")[0])) {
        System.out.println("Success");
    } else {
        System.out.println("Failure");
    }
}
Run Code Online (Sandbox Code Playgroud)


Jen*_*gsa 6

org.apache.commons.lang3.StringEscapeUtils来自 commons-lang3 的版本现已标记为已弃用。你可以用org.apache.commons.text.StringEscapeUtils#unescapeJava(String)它代替。它需要额外的Maven 依赖项

        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-text</artifactId>
            <version>1.4</version>
        </dependency>
Run Code Online (Sandbox Code Playgroud)

并且似乎可以处理一些更特殊的情况,例如它会转义:

  • 转义反斜杠、单引号和双引号
  • 转义八进制和 unicode 值
  • \\b, \\n, \\t, \\f,\\r


Tva*_*roh 5

作为记录,如果您使用 Scala,您可以执行以下操作:

StringContext.treatEscapes(escaped)
Run Code Online (Sandbox Code Playgroud)