如何从给定的字符串列表中自动生成正则表达式?

pat*_*rit 24 regex algorithm

您将获得2个字符串列表 - A和B.找到匹配A中所有字符串的最短正则表达式,而不是B中的所有字符串.请注意,此正则表达式可以匹配/不匹配不在A中而不在B中的其他字符串.简单,我们可以假设我们的字母大小只有2个字符 - 0和1.也只允许这些运算符:

* - 0或更多
?- 0或1
+ - 1或更多
() - 括号

为简单起见,不允许使用正则表达式运算符.我不知道是否允许或运算符(|)会简化问题.A和B的路线没有共同的元素.这里有些例子:

A=[00,01,10]
B=[11]
answer = 1*0+1*
Run Code Online (Sandbox Code Playgroud)


A=[00,01,11]
B=[10]
answer = 0*1*
Run Code Online (Sandbox Code Playgroud)

Han*_*man 14

解决此问题的一种方法是使用遗传算法.我碰巧有一个遗传求解器,所以我用以下算法将它应用于你的问题:

  • 从所需的输入中获取不同的标记作为基因
  • 将正则表达式特色添加到基因中
  • 用于健身算法
    • 确保生成的字符串是有效的正则表达式
    • 根据匹配的所需内容以及匹配的不需要的内容获得适合度值
  • 直到找到成功的正则表达式
    • 从不同令牌的数量开始,并根据需要递增
    • 尝试生成一个通过健身要求的长度的正则表达式

这是我在C#中的实现

private static void GenerateRegex(IEnumerable<string> target, IEnumerable<string> dontMatch)
{
    string distinctSymbols = new String(target.SelectMany(x => x).Distinct().ToArray());
    string genes = distinctSymbols + "?*()+";

    Func<string, uint> calcFitness = str =>
        {
            if (str.Count(x => x == '(') != str.Count(x => x == ')'))
            {
                return Int32.MaxValue;
            }
            if ("?*+".Any(x => str[0] == x))
            {
                return Int32.MaxValue;
            }
            if ("?*+?*+".ToArray().Permute(2)
                .Any(permutation => str.IndexOf(new string(permutation.ToArray())) != -1))
            {
                return Int32.MaxValue;
            }
            Regex regex;
            try
            {
                regex = new Regex("^" + str + "$");
            }
            catch (Exception)
            {
                return Int32.MaxValue;
            }
            uint fitness = target.Aggregate<string, uint>(0, (current, t) => current + (regex.IsMatch(t) ? 0U : 1));
            uint nonFitness = dontMatch.Aggregate<string, uint>(0, (current, t) => current + (regex.IsMatch(t) ? 10U : 0));
            return fitness + nonFitness;
        };

    for (int targetGeneLength = distinctSymbols.Length; targetGeneLength < genes.Length * 2; targetGeneLength++)
    {
        string best = new GeneticSolver(50).GetBestGenetically(targetGeneLength, genes, calcFitness, true);
        if (calcFitness(best) != 0)
        {
            Console.WriteLine("-- not solved with regex of length " + targetGeneLength);
            continue;
        }
        Console.WriteLine("solved with: " + best);
        break;
    }
}
Run Code Online (Sandbox Code Playgroud)

并将其应用于您的样品:

public void Given_Sample_A()
{
    var target = new[] { "00", "01", "10" };
    var dontMatch = new[] { "11" };

    GenerateRegex(target, dontMatch);
}
Run Code Online (Sandbox Code Playgroud)

输出:

Generation  1 best: 10 (2)
Generation  2 best: 0+ (2)
Generation  5 best: 0* (2)
Generation  8 best: 00 (2)
Generation  9 best: 01 (2)
-- not solved with regex of length 2
Generation  1 best: 10* (2)
Generation  3 best: 00* (2)
Generation  4 best: 01+ (2)
Generation  6 best: 10+ (2)
Generation  9 best: 00? (2)
Generation 11 best: 00+ (2)
Generation 14 best: 0?1 (2)
Generation 21 best: 0*0 (2)
Generation 37 best: 1?0 (2)
Generation 43 best: 10? (2)
Generation 68 best: 01* (2)
Generation 78 best: 1*0 (2)
Generation 79 best: 0*1 (2)
Generation 84 best: 0?0 (2)
Generation 127 best: 01? (2)
Generation 142 best: 0+1 (2)
Generation 146 best: 0+0 (2)
Generation 171 best: 1+0 (2)
-- not solved with regex of length 3
Generation  1 best: 1*0+ (1)
Generation  2 best: 0+1* (1)
Generation 20 best: 1?0+ (1)
Generation 31 best: 1?0* (1)
-- not solved with regex of length 4
Generation  1 best: 1*00? (1)
Generation  2 best: 0*1?0 (1)
Generation  3 best: 1?0?0 (1)
Generation  4 best: 1?00? (1)
Generation  8 best: 1?00* (1)
Generation 12 best: 1*0?0 (1)
Generation 13 best: 1*00* (1)
Generation 41 best: 0*10* (1)
Generation 44 best: 1*0*0 (1)
-- not solved with regex of length 5
Generation  1 best: 0+(1)? (1)
Generation 36 best: 0+()1? (1)
Generation 39 best: 0+(1?) (1)
Generation 61 best: 1*0+1? (0)
solved with: 1*0+1?
Run Code Online (Sandbox Code Playgroud)

第二个样本:

public void Given_Sample_B()
{
    var target = new[] { "00", "01", "11" };
    var dontMatch = new[] { "10" };

    GenerateRegex(target, dontMatch);
}
Run Code Online (Sandbox Code Playgroud)

输出:

Generation  1 best: 00 (2)
Generation  2 best: 01 (2)
Generation  7 best: 0* (2)
Generation 12 best: 0+ (2)
Generation 33 best: 1+ (2)
Generation 36 best: 1* (2)
Generation 53 best: 11 (2)
-- not solved with regex of length 2
Generation  1 best: 00* (2)
Generation  2 best: 0+0 (2)
Generation  7 best: 0+1 (2)
Generation 12 best: 00? (2)
Generation 15 best: 01* (2)
Generation 16 best: 0*0 (2)
Generation 19 best: 01+ (2)
Generation 30 best: 0?0 (2)
Generation 32 best: 0*1 (2)
Generation 42 best: 11* (2)
Generation 43 best: 1+1 (2)
Generation 44 best: 00+ (2)
Generation 87 best: 01? (2)
Generation 96 best: 0?1 (2)
Generation 125 best: 11? (2)
Generation 126 best: 1?1 (2)
Generation 135 best: 11+ (2)
Generation 149 best: 1*1 (2)
-- not solved with regex of length 3
Generation  1 best: 0*1* (0)
solved with: 0*1*
Run Code Online (Sandbox Code Playgroud)


vtl*_*inh 1

如果这是一个家庭作业问题,那就像是“一份家庭作业,全班得 A”类型。我认为该问题中的某处缺少“或”运算符。

有一个明显的解决方案是 A0|A1|A2|...,但在尝试找到最短的时似乎更难。

我建议使用递归来尝试缩短正则表达式,但这不是一个理想的解决方案。