I'm trying to write a VBA parser; in order to create a ConstantNode, I need to be able to match all possible variations of a Const declaration.
These work beautifully:
Const foo = 123
Const foo$ = "123"
Const foo As String = "123"
Private Const foo = 123
Public Const foo As Integer = 123
Global Const foo% = 123
But I have two problems:
1. If there's a comment at the end of the declaration, I pick it up as part of the value:
Const foo = 123 'this comment is included as part of the value
2. If two or more constants are declared in the same instruction, I fail to match the entire instruction:
Const foo = 123, bar = 456
This is the regular expression I'm using:
/// <summary>
/// Gets a regular expression pattern for matching a constant declaration.
/// </summary>
/// <remarks>
/// Constants declared in class modules may only be <c>Private</c>.
/// Constants declared at procedure scope cannot have an access modifier.
/// </remarks>
public static string GetConstantDeclarationSyntax()
{
    return @"^((Private|Public|Global)\s)?Const\s(?<identifier>[a-zA-Z][a-zA-Z0-9_]*)(?<specifier>[%&@!#$])?(?<as>\sAs\s(?<reference>(((?<library>[a-zA-Z][a-zA-Z0-9_]*))\.)?(?<identifier>[a-zA-Z][a-zA-Z0-9_]*)))?\s\=\s(?<value>.*)$";
}
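Obviously both problems are caused by the (?<value>.*)$ part, which matches anything up to the end of the line. I made VariableNode support multiple declarations in a single instruction by enclosing the whole pattern in a capture group and adding an optional comma, but because constants have this value group, doing the same here makes the first constant capture all following declarations as part of its value. That brings me back to problem #1.
I wonder whether problem #1 can be solved with a regular expression at all, given that the value may be a string that contains an apostrophe, and possibly some escaped (doubled-up) double quotes.
I think I can solve it in the ConstantNode class itself, in the getter for Value: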
/// <summary>
/// Gets the constant's value. Strings include delimiting quotes.
/// </summary>
public string Value
{
    get
    {
        return RegexMatch.Groups["value"].Value;
    }
}
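I mean, I could implement some additional logic here to do what I can't do with a regex.
If problem #1 can be solved with a regex, then I believe problem #2 can be as well... or am I even on the right track? Should I ditch the [rather complex] regex patterns and think of another way? I'm not too familiar with greedy subexpressions, backreferences and other more advanced regex features - is that what's limiting me, or am I just using the wrong hammer for this nail?
Note: it doesn't matter that the patterns may match illegal syntax - this code will only run against compilable VBA code.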
Let me go ahead and add the disclaimer on this one. This is absolutely not a good idea (but it was a fun challenge). The regexes I'm about to present will parse the test cases in the question, but they obviously are not bulletproof. Using a parser will save you a lot of headaches later. I did try to find a parser for VBA, but came up empty-handed (and I'm assuming everyone else has too).
Regex
For this to work nicely, you need to have some control over the VBA code coming in. If you can't do this, then you truly need to be looking at writing a parser instead of using Regexes. However, judging from what you already said, you may have a little bit of control. So maybe this will help out.
So for this, I had to split the regex into two distinct regexes. The reason is that the named groups would sit inside a repeating group: .NET does record every capture, but each group keeps its own flat list, so optional pieces such as the specifier or the type are awkward to pair up with the declaration they belong to.
The first regex captures the line and places each declaration (with its value) into a single repeated group; the second regex then parses each of those captures. Just FYI, the regexes make use of negative lookaheads.
^(?:(?<Accessibility>Private|Public|Global)\s)?Const\s(?<variable>[a-zA-Z][a-zA-Z0-9_]*(?:[%&@!#$])?(?:\sAs)?\s(?:(?:[a-zA-Z][a-zA-Z0-9_]*)\s)?=\s[^',]+(?:(?:(?!").)+")?(?:,\s)?){1,}(?:'(?<comment>.+))?$
Here's the regex to parse the variables
(?<identifier>[a-zA-Z][a-zA-Z0-9_]*)(?<specifier>[%&@!#$])?(?:\sAs)?\s(?:(?<reference>[a-zA-Z][a-zA-Z0-9_]*)\s)?=\s(?<value>[^',]+(?:(?:(?!").)+")?),?
And here's some C# code you can toss in to test everything out. This should make it easy to try any edge cases you have. (In the verbatim strings below, each double quote in the pattern is written as "".)
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        List<String> test = new List<string> {
            "Const foo = 123",
            "Const foo$ = \"123\"",
            "Const foo As String = \"1'2'3\"",
            "Const foo As String = \"123\"",
            "Private Const foo = 123",
            "Public Const foo As Integer = 123",
            "Global Const foo% = 123",
            "Const foo = 123 'this comment is included as part of the value",
            "Const foo = 123, bar = 456",
            "'Const foo As String = \"123\"",
        };
        foreach (var str in test)
            Parse(str);
        Console.Read();
    }

    // First pass: matches the whole line; every declaration becomes one capture of the "variable" group.
    private static Regex parse = new Regex(@"^(?:(?<Accessibility>Private|Public|Global)\s)?Const\s(?<variable>[a-zA-Z][a-zA-Z0-9_]*(?:[%&@!#$])?(?:\sAs)?\s(?:(?:[a-zA-Z][a-zA-Z0-9_]*)\s)?=\s[^',]+(?:(?:(?!"").)+"")?(?:,\s)?){1,}(?:'(?<comment>.+))?$", RegexOptions.Compiled | RegexOptions.Singleline, new TimeSpan(0, 0, 20));
    // Second pass: splits a single declaration into identifier, specifier, reference and value.
    private static Regex variableRegex = new Regex(@"(?<identifier>[a-zA-Z][a-zA-Z0-9_]*)(?<specifier>[%&@!#$])?(?:\sAs)?\s(?:(?<reference>[a-zA-Z][a-zA-Z0-9_]*)\s)?=\s(?<value>[^',]+(?:(?:(?!"").)+"")?),?", RegexOptions.Compiled | RegexOptions.Singleline, new TimeSpan(0, 0, 20));

    public static void Parse(String str)
    {
        Console.WriteLine("Parsing: {0}", str);
        var match = parse.Match(str);
        if (match.Success)
        {
            // Private/Public/Global (empty when no modifier is present)
            var accessibility = match.Groups["Accessibility"].Value;
            // Since we defined this with at least one capture, there should always be something here.
            foreach (Capture variable in match.Groups["variable"].Captures)
            {
                var variableMatch = variableRegex.Match(variable.Value);
                if (variableMatch.Success)
                {
                    Console.WriteLine("Identifier: {0}", variableMatch.Groups["identifier"].Value);
                    if (variableMatch.Groups["specifier"].Success)
                        Console.WriteLine("specifier: {0}", variableMatch.Groups["specifier"].Value);
                    if (variableMatch.Groups["reference"].Success)
                        Console.WriteLine("reference: {0}", variableMatch.Groups["reference"].Value);
                    Console.WriteLine("value: {0}", variableMatch.Groups["value"].Value);
                    Console.WriteLine("");
                }
                else
                {
                    Console.WriteLine("FAILED VARIABLE: {0}", variable.Value);
                }
            }
            if (match.Groups["comment"].Success)
            {
                Console.WriteLine("Comment: {0}", match.Groups["comment"].Value);
            }
        }
        else
        {
            Console.WriteLine("FAILED: {0}", str);
        }
        Console.WriteLine("+++++++++++++++++++++++++++++++++++++++++++++");
        Console.WriteLine("");
    }
}
The C# code was just what I was using to test my theory, so I apologize for the craziness in it.
For completeness, here's a small sample of the output. If you run the code you'll get more output, but this directly shows that it can handle the situations you were asking about.
Parsing: Const foo = 123 'this comment is included as part of the value
Identifier: foo
value: 123
Comment: this comment is included as part of the value
Parsing: Const foo = 123, bar = 456
Identifier: foo
value: 123
Identifier: bar
value: 456
What it handles
Here are the major cases I can think of that you're probably interested in. It should still handle everything you had before as I just added to the regex you provided.
What it doesn't handle
The one thing I didn't really handle was spacing, but it shouldn't be hard to add that in yourself if you need it. For instance, if you declare multiple constants there MUST be a space after the comma, i.e. (VALID: foo = 123, foobar = 124) (INVALID: foo = 123,foobar = 124).
You won't get much leniency on the format from it, but there's not a whole lot you can do with that when using regexes.
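If you do need looser comma spacing, one possible tweak is to make the whitespace after the comma optional in the first regex, i.e. (?:,\s*)? instead of (?:,\s)?. This is only a minimal sketch checked against the example above, not against real modules; the SpacingDemo class and its field names are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; only the comma handling differs between the two patterns.
public static class SpacingDemo
{
    // Everything up to (but not including) the optional comma, taken from the first regex.
    private const string Head = @"^(?:(?<Accessibility>Private|Public|Global)\s)?Const\s(?<variable>[a-zA-Z][a-zA-Z0-9_]*(?:[%&@!#$])?(?:\sAs)?\s(?:(?:[a-zA-Z][a-zA-Z0-9_]*)\s)?=\s[^',]+(?:(?:(?!"").)+"")?";
    // Closes the repeated group and matches the optional trailing comment.
    private const string Tail = @"){1,}(?:'(?<comment>.+))?$";

    // Original behaviour: exactly one whitespace character after the comma.
    public static readonly Regex Strict = new Regex(Head + @"(?:,\s)?" + Tail);
    // Relaxed: any amount of whitespace (including none) after the comma.
    public static readonly Regex Relaxed = new Regex(Head + @"(?:,\s*)?" + Tail);

    public static void Main()
    {
        string line = "Const foo = 123,foobar = 124";
        Console.WriteLine("Strict matches:  " + Strict.Match(line).Success);
        Match match = Relaxed.Match(line);
        Console.WriteLine("Relaxed matches: " + match.Success);
        // Each declaration is still one capture of the "variable" group.
        foreach (Capture declaration in match.Groups["variable"].Captures)
            Console.WriteLine("  " + declaration.Value);
    }
}
```

The second regex already ends in an optional comma with no whitespace requirement, so it should not need to change. The spaces around the = sign are still mandatory in this sketch.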
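Hope this helps; if you need more explanation of how it works, just let me know. Just know that this is a bad idea. You will run into cases the regexes can't handle. If I were in your position, I'd consider writing a simple parser, which will give you more flexibility in the long run. Good luck.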
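To give an idea of what the "simple parser" route could look like, here is a minimal sketch of a hand-written scanner that tracks whether it is inside a string literal. It only strips the trailing comment and splits the declaration list; ConstLineScanner and its method names are hypothetical, and it assumes a single physical line with no Rem comments, line continuations, or date literals containing commas:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of the regex solution above.
public static class ConstLineScanner
{
    // Splits a line at the first apostrophe that is outside a string literal.
    // A doubled quote ("") inside a VBA string toggles the flag twice,
    // so it needs no special handling. Comment is empty when there is none.
    public static (string Code, string Comment) SplitComment(string line)
    {
        bool inString = false;
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == '"')
                inString = !inString;
            else if (c == '\'' && !inString)
                return (line.Substring(0, i).TrimEnd(), line.Substring(i + 1));
        }
        return (line.TrimEnd(), string.Empty);
    }

    // Splits a declaration list on commas that are outside string literals.
    public static List<string> SplitDeclarations(string declarations)
    {
        var parts = new List<string>();
        var current = new StringBuilder();
        bool inString = false;
        foreach (char c in declarations)
        {
            if (c == '"')
                inString = !inString;
            if (c == ',' && !inString)
            {
                parts.Add(current.ToString().Trim());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        parts.Add(current.ToString().Trim());
        return parts;
    }

    public static void Main()
    {
        string line = "Private Const foo As String = \"it's, ok\", bar = 456 'trailing note";
        var (code, comment) = SplitComment(line);
        // Skip the optional modifier and the Const keyword (assumes "Const " is present).
        int start = code.IndexOf("Const ", StringComparison.Ordinal) + "Const ".Length;
        foreach (string declaration in SplitDeclarations(code.Substring(start)))
            Console.WriteLine("Declaration: " + declaration);
        Console.WriteLine("Comment: " + comment);
    }
}
```

Each piece returned by SplitDeclarations could then be handed to a much smaller per-declaration regex, which no longer has to worry about apostrophes or commas inside strings.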