What is the best way to parse this wiki markup in C#?

Question

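I need to take the data I am reading from a wiki markup page and store it as a table structure. I am trying to figure out how to properly parse the markup syntax below into some table data structure in C#.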

Here is an example table:

 || Owner || Action || Status || Comments ||
 | Bill | Fix the lobby | In Progress | This is easy |
 | Joe | Fix the bathroom | In Progress | Plumbing \\
 \\
  Electric \\
 \\
 Painting \\
 \\
 \\ | 
 | Scott | Fix the roof | Complete | This is expensive |

Here is the direct source for it:

|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing  \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|

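So, as you can see:

Column headers use "||" as the separator.
Row columns use "|" as the separator.
A row may span multiple lines (as in the second data row above), so I have to keep reading until I reach the same number of "|" separators (columns) that the header row has.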

I tried reading line by line and then concatenating the lines that have "\\" between them, but that seemed a bit hacky.

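I also tried simply reading the text as one complete string, splitting on "||" first, then continuing to read until I reach the same number of "|" separators, and then moving on to the next row. This seems to work, but it feels like there might be a more elegant way using regular expressions or something similar.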
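Since regular expressions came up, here is a minimal sketch of reading only the header row with one. The names `WikiHeaderSketch` and `ReadHeader` are hypothetical, and the pattern assumes that header cells are non-empty and never contain a "|" themselves.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WikiHeaderSketch
{
    // Hypothetical helper: returns the trimmed header names found in the markup.
    // Assumption: every header cell is non-empty and contains no '|'.
    public static string[] ReadHeader(string markup)
    {
        // "||", then the cell text, which must itself be followed by another "||".
        // The lookahead leaves the closing "||" in place so it can open the next cell.
        var matches = Regex.Matches(markup, @"\|\|([^|]+)(?=\|\|)");
        return matches.Cast<Match>()
            .Select(m => m.Groups[1].Value.Trim())
            .ToArray();
    }
}
```

The length of the returned array is the column count, which could then be used to decide how many "|"-separated cells belong to each row.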

Can anyone suggest the right way to parse this data?

Answer 1

Ale*_*lex 7

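Since the edited input format differs substantially from the one posted earlier, I have largely replaced my previous answer. The result is a slightly different solution.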

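Because there are no longer any line breaks after a row, the only way to determine where a row ends is to require every row to have the same number of columns as the table header. That is, unless you want to rely on a possibly fragile whitespace convention that appears only in the single sample string provided (i.e. that the row separator is the only | not preceded by a space). Your question, at least, does not give this as the specification for a row separator.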
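Taken on its own, the column-count rule amounts to cutting a flat list of cells into rows of equal width. The following is only a sketch of that idea under the assumption that the cells have already been extracted; `RowGroupingSketch` and `GroupIntoRows` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RowGroupingSketch
{
    // Hypothetical helper: groups a flat list of cells into rows of
    // `columnCount` cells each, where `columnCount` is the header width.
    public static List<string[]> GroupIntoRows(IReadOnlyList<string> cells, int columnCount)
    {
        if (columnCount <= 0)
            throw new ArgumentOutOfRangeException(nameof(columnCount));

        // A leftover partial row means the input does not match the header.
        if (cells.Count % columnCount != 0)
            throw new ArgumentException("Cell count is not a multiple of the column count.", nameof(cells));

        var rows = new List<string[]>();
        for (int start = 0; start < cells.Count; start += columnCount)
            rows.Add(cells.Skip(start).Take(columnCount).ToArray());
        return rows;
    }
}
```

For example, eight cells with a four-column header give two rows, while seven cells are rejected as malformed.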

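The "parser" below at least provides the error handling and validity checks that can be derived from your format specification and sample string, and it also allows tables without rows. The comments explain what it does in basic steps.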

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public class TableParser
{
    const StringSplitOptions SplitOpts = StringSplitOptions.None;
    const string RowColSep = "|";
    static readonly string[] HeaderColSplit = { "||" };
    static readonly string[] RowColSplit = { RowColSep };
    static readonly string[] MLColSplit = { @"\\" };

    public class TableRow
    {
        // One entry per cell; each cell holds its lines, as split on "\\".
        public List<string[]> Cells;
    }

    public class Table
    {
        public string[] Header;
        public TableRow[] Rows;
    }

    public static Table Parse(string text)
    {
        // Isolate the header columns and rows remainder. A well-formed header such as
        // "|| A ||" splits into "", " A " and whatever follows the closing "||".
        var headerSplit = text.Split(HeaderColSplit, SplitOpts);
        Ensure(headerSplit.Length > 2, "At least 1 header column is required in the input");

        // Need to check whether there are any rows.
        var hasRows = headerSplit.Last().IndexOf(RowColSep) >= 0;

        // Skip the empty piece before the first "||" and the piece after the last "||"
        // (the rows remainder, or an empty piece when the table has no rows).
        var header = headerSplit.Skip(1)
            .Take(headerSplit.Length - 2)
            .Select(c => c.Trim())
            .ToArray();

        if (!hasRows) // If no rows for this table, we are done.
            return new Table() { Header = header, Rows = new TableRow[0] };

        // Get all row columns from the remainder.
        var rowsCols = headerSplit.Last().Split(RowColSplit, SplitOpts);

        // Require same amount of columns for a row as the header.
        // Each row contributes one piece before its opening "|" plus one piece per cell,
        // and one extra piece follows the final "|".
        Ensure((rowsCols.Length % (header.Length + 1)) == 1, 
            "The number of row columns does not match the number of header columns");
        var rows = new TableRow[(rowsCols.Length - 1) / (header.Length + 1)];

        // Fill rows by sequentially taking # header column cells 
        for (int ri = 0, start = 1; ri < rows.Length; ri++, start += header.Length + 1)
        {
            rows[ri] = new TableRow() { 
                Cells = rowsCols.Skip(start).Take(header.Length)
                    .Select(c => c.Split(MLColSplit, SplitOpts).Select(p => p.Trim()).ToArray())
                    .ToList()
            };
        }

        return new Table { Header = header, Rows = rows };
    }

    private static void Ensure(bool check, string errorMsg)
    {
        if (!check)
            throw new InvalidDataException(errorMsg);
    }
}

When used like this:

public static void Main(params string[] args)
{
        var wikiLine = @"|| Owner|| Action || Status || Comments || | Bill\\ | fix the lobby |In Progress | This is eary| | Joe\\ |fix the bathroom\\ | In progress| plumbing  \\Electric \\Painting \\ \\ | | Scott \\ | fix the roof \\ | Complete | this is expensive|";
        var table = TableParser.Parse(wikiLine);

        Console.WriteLine(string.Join(", ", table.Header));
        foreach (var r in table.Rows)
            Console.WriteLine(string.Join(", ", r.Cells.Select(c => string.Join(Environment.NewLine + "\t# ", c))));
}

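It prints the header names on one line, followed by the cells of each row, where "\t# " marks a line break caused by the presence of \\ in the input.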
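Note that the parser above keeps empty fragments, so a cell such as " Bill\\ " yields a trailing empty line. If that is not wanted, a variant of the cell split could drop them. This is only a sketch, and `CellLinesSketch` and `SplitCellLines` are hypothetical names that are not part of the parser above.

```csharp
using System;
using System.Linq;

public static class CellLinesSketch
{
    // Hypothetical helper: splits one raw cell on the wiki line break "\\"
    // and keeps only the trimmed, non-empty lines.
    public static string[] SplitCellLines(string rawCell)
    {
        return rawCell
            .Split(new[] { @"\\" }, StringSplitOptions.None)
            .Select(part => part.Trim())
            .Where(part => part.Length > 0)
            .ToArray();
    }
}
```

With this variant, the "Comments" cell of the second sample row would reduce to the three lines "plumbing", "Electric" and "Painting". The trade-off is that intentional blank lines inside a cell are lost.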
