Joe*_*Joe 1 .net c# regex string split
I have some text that I extracted from a PDF document. It contains a bulleted list with content like this:
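3 BILL REFERRED TO MAIL COMMITTEE
Mr Fitzgibbon (Chief Government Whip), by leave, moved - That the Tax Laws Amendment (2011 Measures No. 7) Bill 2011 be referred to the Main Committee for further consideration. Question - put and passed.
4 CORPORATIONS AMENDMENT (FUTURE OF FINANCIAL ADVICE) BILL 2011
Mr Shorten (Minister for Financial Services and Superannuation), pursuant to notice, presented a Bill for an Act to amend the law in relation to financial advice, and for related purposes. Document - Mr Shorten presented an explanatory memorandum to the bill. Bill read a first time. Mr Shorten moved - That the bill be now read a second time. Debate adjourned (Mr Randall), and the resumption of the debate made an order of the day for the next sitting.
5 TAX LAWS AMENDMENT (2011 MEASURES NO. 8) BILL 2011
Mr Shorten (Minister for Financial Services and Superannuation) presented a Bill for an Act to amend the law relating to taxation, and for related purposes. Document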
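I need to split it up so that each bullet point ends up like this:
[0,0] = Title
[0,1] = Body
[1,0] = Title
[1,1] = Body
I have amended the sample to include some real-world content.
Any help would be greatly appreciated.
I am using C# on the .NET Framework.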
You can use LINQ:
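// Requires: using System; using System.Linq;
var result = input
    // Accept both Windows (\r\n) and Unix (\n) line endings.
    .Split(new[] { "\r\n", "\n" }, StringSplitOptions.None)
    .Where(x => !string.IsNullOrWhiteSpace(x))
    // Trim first so that an indented title line still starts with its digit.
    .Select(x => x.Trim())
    // A line that starts with a digit opens a new group; any other line continues the current one.
    .GroupAdjacent((g, x) => !char.IsDigit(x[0]))
    .Select(g => new
    {
        Title = g.First(),
        Body = string.Join(" ", g.Skip(1))
    })
    .ToArray();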
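Note that the digit test above treats any line beginning with a digit as a title, so a wrapped body line that happens to start with, say, "2011" would open a new group. If the titles in your document always look like the sample (an item number followed by text with no lowercase letters), a stricter check along these lines may help. This is only a sketch under that assumption; `HeadingDetector` and `IsTitle` are hypothetical names, not part of .NET.

```csharp
using System.Text.RegularExpressions;

public static class HeadingDetector
{
    // Assumed title shape: optional leading whitespace, an item number,
    // whitespace, then text that contains no lowercase letters.
    private static readonly Regex TitlePattern =
        new Regex(@"^\s*\d+\s+[^\p{Ll}]+$");

    // Returns true when the line looks like a numbered, upper-case title.
    public static bool IsTitle(string line)
    {
        return TitlePattern.IsMatch(line);
    }
}
```

It would then replace the digit test in the query: `.GroupAdjacent((g, x) => !HeadingDetector.IsTitle(x))`.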
Example:
string input = @"3 BILL REFERRED TO MAIL COMMITTEE
Mr Fitzgibbon (Chief Government Whip), by leave, moved—That the
Tax Laws Amendment (2011 Measures No. 7) Bill 2011 be referred
to the Main Committee for further consideration. Question—put
and passed.
4 CORPORATIONS AMENDMENT (FUTURE OF FINANCIAL ADVICE) BILL 2011
Mr Shorten (Minister for Financial Services and Superannuation),
pursuant to notice, presented a Bill for an Act to amend the law
in relation to financial advice, and for related purposes. Mr
Shorten presented an explanatory memorandum to the bill. Bill
read a first time. Mr Shorten moved—That the bill be now read
a second time. Debate adjourned (Mr Randall), and the resumption
of the debate made an order of the day for the next sitting.
5 TAX LAWS AMENDMENT (2011 MEASURES NO. 8) BILL 2011
Mr Shorten (Minister for Financial Services and Superannuation)
presented a Bill for an Act to amend the law relating to
taxation, and for related purposes.";
Output:
result[0] == { Title = "3 BILL REFERRED ...", Body = "Mr Fitzgibbon ..." }
result[1] == { Title = "4 CORPORATIONS ...", Body = "Mr Shorten ..." }
result[2] == { Title = "5 TAX LAWS ...", Body = "Mr Shorten ..." }
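The query yields objects with `Title` and `Body` properties rather than the `[i,0]` / `[i,1]` layout asked for in the question. If a rectangular array is really needed, one possible conversion is sketched below. It assumes C# 7 or later for value tuples, and `BulletTable` and `ToTable` are hypothetical names chosen for illustration.

```csharp
using System.Collections.Generic;

public static class BulletTable
{
    // Copies (Title, Body) pairs into a rectangular array:
    // column 0 holds the title, column 1 holds the body.
    public static string[,] ToTable(IReadOnlyList<(string Title, string Body)> items)
    {
        var table = new string[items.Count, 2];
        for (int i = 0; i < items.Count; i++)
        {
            table[i, 0] = items[i].Title;
            table[i, 1] = items[i].Body;
        }
        return table;
    }
}
```

With the `result` array from above, the call could look like `var table = BulletTable.ToTable(result.Select(r => (r.Title, r.Body)).ToList());`, after which `table[0, 0]` is the first title and `table[0, 1]` the first body.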
Extension method:
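using System;
using System.Collections.Generic;

// Extension methods must be declared in a static class; the class name here is arbitrary.
public static class EnumerableExtensions
{
    // Groups consecutive elements: x joins the current group g while adjacent(g, x) is true.
    public static IEnumerable<IEnumerable<T>> GroupAdjacent<T>(
        this IEnumerable<T> source, Func<IEnumerable<T>, T, bool> adjacent)
    {
        var g = new List<T>();
        foreach (var x in source)
        {
            // Close the current group when x does not belong to it.
            if (g.Count != 0 && !adjacent(g, x))
            {
                yield return g;
                g = new List<T>();
            }
            g.Add(x);
        }
        // Emit the final group, unless the source was empty.
        if (g.Count != 0)
        {
            yield return g;
        }
    }
}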