Tokenizing a String but ignoring delimiters within quotes

Question

I would like the following string

!cmd 45 90 "An argument" Another AndAnother "Another one in quotes"

to become the following array

{ "!cmd", "45", "90", "An argument", "Another", "AndAnother", "Another one in quotes" }

I have tried

new StringTokenizer(cmd, "\"")

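but this returns "Another" and "AndAnother" as the single token " Another AndAnother ", which is not the desired effect.

Thanks.

EDIT: I have changed the example yet again. This time I believe it explains the situation best, although it is no different from the second example.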

Answer 1

pol*_*nts 52

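In these scenarios it is much easier to use a java.util.regex.Matcher and call find() than to use any kind of split.

That is, instead of defining a pattern for the delimiter between the tokens, you define a pattern for the tokens themselves.

Here's an example: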

    String text = "1 2 \"333 4\" 55 6    \"77\" 8 999";
    // 1 2 "333 4" 55 6    "77" 8 999

    String regex = "\"([^\"]*)\"|(\\S+)";

    Matcher m = Pattern.compile(regex).matcher(text);
    while (m.find()) {
        if (m.group(1) != null) {
            System.out.println("Quoted [" + m.group(1) + "]");
        } else {
            System.out.println("Plain [" + m.group(2) + "]");
        }
    }

The above prints (as seen on ideone.com):

Plain [1]
Plain [2]
Quoted [333 4]
Plain [55]
Plain [6]
Quoted [77]
Plain [8]
Plain [999]

The pattern is essentially:

"([^"]*)"|(\S+)
 \_____/  \___/
    1       2

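There are 2 alternates:

The first alternate matches the opening double quote, a sequence of anything but a double quote (captured in group 1), then the closing double quote
The second alternate matches any sequence of non-whitespace characters, captured in group 2
The order of the alternates matters in this pattern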
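
To tie this back to the question, here is a minimal sketch that wraps the same pattern in a helper returning a list. The class and method names (QuoteAwareSplitter, tokenize) are made up for illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareSplitter {
    // Alternate 1: a double-quoted run, contents captured in group 1.
    // Alternate 2: a run of non-whitespace characters, captured in group 2.
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // Exactly one of the two groups is non-null for each match.
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String cmd = "!cmd 45 90 \"An argument\" Another AndAnother \"Another one in quotes\"";
        // prints [!cmd, 45, 90, An argument, Another, AndAnother, Another one in quotes]
        System.out.println(tokenize(cmd));
    }
}
```

If the alternates were swapped, \S+ would be tried first and would consume "An (opening quote included) as a plain token, which is why the quoted alternate has to come first.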

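Note that this does not handle escaped double quotes within quoted segments. If you need that, the pattern becomes more complicated, but the Matcher solution still works.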
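
As a rough sketch of what "more complicated" could look like, the variant below assumes that a backslash escapes the next character inside a quoted segment (so \" is a literal quote). That escaping convention is an assumption for illustration; the question itself does not specify one, and the names EscapedQuoteTokenizer and tokenize are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteTokenizer {
    // Regex: "((?:\\.|[^"\\])*)"|(\S+)
    // Inside quotes, accept either a backslash followed by any character,
    // or any character that is neither a quote nor a backslash.
    private static final Pattern TOKEN =
            Pattern.compile("\"((?:\\\\.|[^\"\\\\])*)\"|(\\S+)");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (m.group(1) != null) {
                // Drop the escaping backslash and keep the escaped character.
                tokens.add(m.group(1).replaceAll("\\\\(.)", "$1"));
            } else {
                tokens.add(m.group(2));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The input text is: say "he said \"hi\"" now
        System.out.println(tokenize("say \"he said \\\"hi\\\"\" now"));
    }
}
```

The overall structure is unchanged from the simpler pattern: only the first alternate and the post-processing of group 1 differ.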

References

regular-expressions.info/Brackets for Grouping and Capturing, Alternation with Vertical Bar, Character Class, Repetition with Star and Plus

See also

regular-expressions.info/Examples - Programmer - Strings - for a pattern with escaped quotes

Appendix

Note that StringTokenizer is a legacy class. It's recommended to use java.util.Scanner or String.split, or of course java.util.regex.Matcher for the most flexibility.

Related questions

Difference between Deprecated and Legacy API?
Scanner vs. StringTokenizer vs. String.Split
Validating input using java.util.Scanner - has many examples

Answer 2

Gra*_*erB 7

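Do it the old-fashioned way. Write a function that looks at each character in a for loop. If the character is a space, take everything up to that point (excluding the space) and add it as an entry to the array. Note the position, then do the same again, adding the next part after the space to the array. When a double quote is encountered, set a boolean named 'inQuote' to true, and ignore spaces while inQuote is true. When you hit a quote while inQuote is true, set it back to false and go back to breaking things up whenever a space is encountered. You can then extend this as needed to support escape characters and so on.

Could this be done with a regex? I don't know, I guess so. But the whole function would take less time to write than this reply did.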
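
For what it's worth, here is one way that loop might look. This is only a sketch of the description above, not the answerer's actual code; the names ManualTokenizer, tokenize and hasToken are invented for illustration, and a StringBuilder is used instead of tracking positions.

```java
import java.util.ArrayList;
import java.util.List;

class ManualTokenizer {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuote = false;   // true while between an opening and closing quote
        boolean hasToken = false;  // true once the current token has started

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                // Toggle quote mode; the quote character itself is not kept.
                inQuote = !inQuote;
                hasToken = true;
            } else if (c == ' ' && !inQuote) {
                // A space outside quotes ends the current token, if there is one.
                if (hasToken) {
                    tokens.add(current.toString());
                    current.setLength(0);
                    hasToken = false;
                }
            } else {
                current.append(c);
                hasToken = true;
            }
        }
        if (hasToken) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String cmd = "!cmd 45 90 \"An argument\" Another AndAnother \"Another one in quotes\"";
        System.out.println(tokenize(cmd));
    }
}
```

Runs of several spaces produce no empty tokens here, and an unterminated quote simply runs to the end of the input; both are choices of this sketch rather than anything stated in the answer.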

Answer 3

mik*_*ent 5

Apache Commons to the rescue!

import org.apache.commons.text.StringTokenizer
import org.apache.commons.text.matcher.StringMatcher
import org.apache.commons.text.matcher.StringMatcherFactory
@Grab(group='org.apache.commons', module='commons-text', version='1.3')

def str = /is this   'completely "impossible"' or """slightly"" impossible" to parse?/

StringTokenizer st = new StringTokenizer( str )
StringMatcher sm = StringMatcherFactory.INSTANCE.quoteMatcher()
st.setQuoteMatcher( sm )

println st.tokenList

Output:

[is, this, completely "impossible", or, "slightly" impossible, to, parse?]

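A few notes:

This is written in Groovy... it is in fact a Groovy script. The @Grab line gives a clue to the sort of dependency line you need (e.g. in build.gradle)... or of course just include the .jar in your classpath
StringTokenizer here is NOT java.util.StringTokenizer... as the import line shows, it is org.apache.commons.text.StringTokenizer
The def str = ... line is a way to produce a String in Groovy that contains both single quotes and double quotes without having to escape them
StringMatcherFactory in apache commons-text 1.3 can be found here: as you can see, its INSTANCE can provide you with a bunch of different StringMatchers. You could even roll your own, but you'd need to examine the StringMatcherFactory source code to see how it's done.
YES! Not only can you include the "other type of quote", which is correctly interpreted as not being a token boundary... you can even escape the actual quote being used to switch off tokenising, by doubling it within the tokenisation-protected bit of the String! Try implementing that with a few lines of code... or rather, don't!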
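
Purely to spell out what the doubling rule means, here is a plain-Java sketch of those semantics. It is not the commons-text implementation and is not meant to replace it; the class and method names are hypothetical, and it covers only the behaviour visible in the example above (either quote character opens a protected section, and a doubled quote inside that section is a literal quote).

```java
import java.util.ArrayList;
import java.util.List;

class DoubledQuoteTokenizer {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char quote = 0;            // 0 when outside quotes, else the active quote character
        boolean hasToken = false;  // true once the current token has started

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (quote != 0) {
                if (c != quote) {
                    current.append(c);      // includes the "other" quote character
                } else if (i + 1 < input.length() && input.charAt(i + 1) == quote) {
                    current.append(c);      // doubled quote: keep one literal quote
                    i++;
                } else {
                    quote = 0;              // single quote character: section ends
                }
            } else if (c == '\'' || c == '"') {
                quote = c;
                hasToken = true;
            } else if (Character.isWhitespace(c)) {
                if (hasToken) {
                    tokens.add(current.toString());
                    current.setLength(0);
                    hasToken = false;
                }
            } else {
                current.append(c);
                hasToken = true;
            }
        }
        if (hasToken) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String str = "is this   'completely \"impossible\"' or \"\"\"slightly\"\" impossible\" to parse?";
        System.out.println(tokenize(str));
    }
}
```

Even this small version needs a lookahead and two extra state variables, which rather supports the point about not hand-rolling it.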

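PS Why is it better to use Apache Commons than any other solution? Apart from the fact that there is no point reinventing the wheel, I can think of at least two reasons:

The Apache engineers can be counted on to have anticipated all the gotchas and to have developed robust, comprehensively tested, reliable code
It means you don't clutter up your beautiful code with clumsy utility methods - you just have a nice, clean bit of code which does exactly what it says on the tin, leaving you to get on with the, um, interesting stuff...

PPS There is nothing obliging you to treat the Apache code as a mysterious "black box". The source is open, and it is written in Java that is usually perfectly "accessible". You are therefore free to examine how things are done to your heart's content. It's often quite instructive to do so.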

later

Sufficiently intrigued by ArtB's question, I had a look at the source:

In StringMatcherFactory.java we see:

private static final AbstractStringMatcher.CharSetMatcher QUOTE_MATCHER = new AbstractStringMatcher.CharSetMatcher(
            "'\"".toCharArray());

... rather dull ...

So that leads one to look at StringTokenizer.java:

public StringTokenizer setQuoteMatcher(final StringMatcher quote) {
        if (quote != null) {
            this.quoteMatcher = quote;
        }
        return this;
}

OK... and then, in the same java file:

private int readWithQuotes(final char[] srcChars ...

which contains the comment:

// If we've found a quote character, see if it's followed by a second quote. If so, then we need to actually put the quote character into the token rather than end the token.

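... I can't be bothered to follow the clues any further. You have a choice: either your "hackish" solution, where you systematically pre-process your strings before submitting them for tokenising, turning |\\\"|s into |\"\"|s ... (i.e. where you replace each |\"| with |""|)...
Or... you examine org.apache.commons.text.StringTokenizer.java to figure out how to tweak the code. It's a small file. I don't think it would be that difficult. Then you compile it, essentially creating a fork of the Apache code.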
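
The first ("hackish") option could be as small as the sketch below, which rewrites every backslash-escaped double quote as a doubled quote before the string is handed to the tokenizer. The helper name backslashToDoubled is made up, and the sketch assumes backslashes are only ever used to escape quotes; an escaped backslash directly before a quote would be mishandled.

```java
class EscapePreprocessor {
    // Replaces every \" (backslash followed by a double quote) with "".
    // String.replace works on literal text, so no regex escaping is needed.
    static String backslashToDoubled(String input) {
        return input.replace("\\\"", "\"\"");
    }

    public static void main(String[] args) {
        // Input text:  a "b \"c\" d"
        // Output text: a "b ""c"" d"
        System.out.println(backslashToDoubled("a \"b \\\"c\\\" d\""));
    }
}
```

The result can then be passed to the tokenizer unchanged, since doubled quotes inside a quoted section are already treated as literal quotes.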

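I don't think it can be configured. But if you found a code-tweak solution that made sense, you could submit it to Apache; it might then be accepted for the next iteration of the code, and your name would figure at least in the "features request" part of Apache: this could be a form of kleos through which you achieve programming immortality...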
