Separate the words in a title string that has no spaces between them

Mat*_*nis 31 c# regex

I want to find and separate the words in a title that has no spaces.

Before:

ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)[Test]"Test"'Test'

After:

This Is An Example Title HELLO-WORLD 2019 T.E.S.T. (Test) [Test] "Test" 'Test'


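I'm looking for a regex rule that can do the following.

My idea was to identify each word by the uppercase letter it starts with.

But all-uppercase words must also be preserved, so that they are not spaced out as A L L U P P E R C A S E.

Additional rules:

  • Insert a space where a letter touches a number: Hello2019World → Hello 2019 World
  • Don't space out initials that contain periods, hyphens, or underscores: T.E.S.T.
  • Don't insert spaces inside brackets, parentheses, or quotes: [Test] (Test) "Test" 'Test'
  • Preserve hyphens: Hello-World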

C#

https://rextester.com/GAZJS38767

// Title without spaces
string title = "ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)[Test]\"Test\"'Test'";

// Detect where to space words
string[] split =  Regex.Split(title, "(?<!^)(?=(?<![.\\-'\"([{])[A-Z][\\d+]?)");

// Trim each word of extra spaces before joining
split = (from e in split
         select e.Trim()).ToArray();

// Join into new title
string newtitle = string.Join(" ", split);

// Display
Console.WriteLine(newtitle);

Regex

I'm not getting a space in front of numbers, brackets, parentheses, and quotes.

https://regex101.com/r/9IIYGX/1

(?<!^)(?=(?<![.\-'"([{])(?<![A-Z])[A-Z][\d+?]?)

(?<!^)          // Negative look behind

(?=             // Positive look ahead

(?<![.\-'"([{]) // Ignore if starts with punctuation
(?<![A-Z])      // Ignore if starts with double Uppercase letter
[A-Z]           // Space after each Uppercase letter
[\d+]?          // Space after number

)

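Thanks for the combined effort in the answers. Here is a regex example. I'm applying this to file names, and I have excluded the special characters \/:*?"<>|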

https://rextester.com/FYEVE73725

https://regex101.com/r/xi8L4z/1

Tim*_*sen 18

Here is a regex that seems to work well, at least for your sample input:

(?<=[a-z])(?=[A-Z])|(?<=[0-9])(?=[A-Za-z])|(?<=[A-Za-z])(?=[0-9])|(?<=\W)(?=\W)

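This pattern splits on a boundary that meets one of the following conditions:

  • what precedes is a lowercase letter and what follows is an uppercase letter
  • what precedes is a digit and what follows is a letter (or vice versa)
  • what precedes and what follows are both non-word characters (e.g. quotes, parentheses, etc.)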


string title = "ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)[Test]\"Test\"'Test'";
string[] split =  Regex.Split(title, "(?<=[a-z])(?=[A-Z])|(?<=[0-9])(?=[A-Za-z])|(?<=[A-Za-z])(?=[0-9])|(?<=\\W)(?=\\W)"); 
split = (from e in split select e.Trim()).ToArray();
string newtitle = string.Join(" ", split);

This Is An Example Title HELLO-WORLD 2019 T.E.S.T. (Test) [Test] "Test" 'Test'

Note: you might also want to add this assertion to the regex alternation:

(?<=\W)(?=\w)|(?<=\w)(?=\W)

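We get away without it here because the sample input never needs a split at a word/non-word boundary, but other inputs might. Be aware that it also matches inside HELLO-WORLD, T.E.S.T. and (Test), which the question wants left intact.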
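
To make the effect of that extra alternation concrete, here is a minimal sketch that runs the pattern with and without it. The two patterns are the ones quoted above; the class and method names (`BoundaryDemo`, `Space`) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // The four alternatives from the answer above.
    public const string Basic =
        @"(?<=[a-z])(?=[A-Z])|(?<=[0-9])(?=[A-Za-z])|(?<=[A-Za-z])(?=[0-9])|(?<=\W)(?=\W)";

    // The same pattern plus the optional word/non-word boundary alternatives.
    public const string Extended = Basic + @"|(?<=\W)(?=\w)|(?<=\w)(?=\W)";

    // Split at every zero-width match and rejoin the pieces with single spaces.
    public static string Space(string input, string pattern) =>
        string.Join(" ", Regex.Split(input, pattern));

    public static void Main()
    {
        Console.WriteLine(Space("Hello,World", Basic));     // Hello,World
        Console.WriteLine(Space("Hello,World", Extended));  // Hello , World
        Console.WriteLine(Space("HELLO-WORLD", Basic));     // HELLO-WORLD
        Console.WriteLine(Space("HELLO-WORLD", Extended));  // HELLO - WORLD
    }
}
```

The last line shows the trade-off: the extended pattern separates punctuation from words everywhere, including around the hyphen the question asks to preserve.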


Mic*_*zyn 9

For simplicity, rather than one huge regex, I'd suggest writing this with small, simple patterns (the explanations are in the code comments):

string str = "ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)\"Test\"'Test'[Test]";
// insert a space where a lowercase letter is followed by an uppercase letter
str = Regex.Replace(str, "(?<=[a-z])(?=[A-Z])", " ");
// insert a space where a digit is followed by a letter
str = Regex.Replace(str, @"(?<=\d)(?=[A-Za-z])", " ");
// insert a space where a letter is followed by a digit
str = Regex.Replace(str, @"(?<=[A-Za-z])(?=\d)", " ");
// insert a space before one of the characters ("'[ when it is followed by a letter or digit
str = Regex.Replace(str, @"(?=[(\[""'][a-zA-Z0-9])", " ");
// insert a space after one of the characters ])"' when it closes a word and more text follows
// (the letter-or-digit lookbehind skips opening quotes; (?=\S) avoids doubled spaces)
str = Regex.Replace(str, @"(?<=[a-zA-Z0-9][)\]""'])(?=\S)", " ");

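  • @revo I use standard C# comments :) I think that's more readable. (2 upvotes)
  • You could also write such readable comments by setting the *standard* `x` modifier, which lets you write multi-line, nicely indented comments. By the way, this isn't simpler. It's just split up. (2 upvotes)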
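
As a rough illustration of the `x` modifier mentioned in the comment above: in .NET it corresponds to `RegexOptions.IgnorePatternWhitespace`, which lets a pattern span several lines and carry `#` comments. The sketch below folds only the first three patterns from the answer into one commented regex; the names `CommentedPattern` and `Space` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentedPattern
{
    // Whitespace in the pattern is ignored and '#' starts a comment that runs to end of line.
    private static readonly Regex Boundary = new Regex(@"
          (?<=[a-z])(?=[A-Z])      # lowercase letter followed by uppercase letter
        | (?<=\d)(?=[A-Za-z])      # digit followed by a letter
        | (?<=[A-Za-z])(?=\d)      # letter followed by a digit
        ", RegexOptions.IgnorePatternWhitespace);

    // Every match is zero-width, so replacing it with " " inserts a space at that position.
    public static string Space(string input) => Boundary.Replace(input, " ");

    public static void Main()
    {
        Console.WriteLine(Space("Hello2019World"));   // Hello 2019 World
        Console.WriteLine(Space("ThisIsAnExample"));  // This Is An Example
    }
}
```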

Muk*_*yuu 8

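The first parts are the same as in @revo's answer:

(?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}

In addition, I add the following regex to put a space between digits and letters:

(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])|(?<=[A-Z])(?=\d)|(?<=\d)(?=[A-Z])

To handle input such as OTPIsADevice, I then use lookahead and lookbehind to find an uppercase letter next to a lowercase one:

(((?<!^)[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))

Note that | is the OR operator, which lets all of these regexes be applied in a single pass.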

Regex: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])|(?<=[A-Z])(?=\d)|(?<=\d)(?=[A-Z])|(((?<!^)[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))

Demo

Update

A small improvement:

From: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])|(?<=[A-Z])(?=\d)|(?<=\d)(?=[A-Z])

To: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=\p{L})\d which does the same thing.

(((?<!^)(?<!\p{P})[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))|(?<!^)(?=[[({&])|(?<=[)\]}!&}])

is improvised from the OP's comment, which adds exceptions for some punctuation:

(((?<!^)(?<!['([{])[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))|(?<!^)(?=[[({&])|(?<=[)\\]}!&}])

Final regex: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=\p{L})\d|(((?<!^)(?<!\p{P})[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))|(?<!^)(?=[[({&])|(?<=[)\]}!&}])

Demo
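
This answer gives only the pattern, so here is a hedged sketch of how it might be applied in C#. It assumes the same kind of `Replace` call with `" $0"` (a space followed by the whole match) that revo's answer uses; the class and method names are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FinalRegexDemo
{
    // The "final regex" from the answer, copied verbatim and split over three lines for width.
    private static readonly Regex Pattern = new Regex(
        @"(?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=\p{L})\d" +
        @"|(((?<!^)(?<!\p{P})[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))" +
        @"|(?<!^)(?=[[({&])|(?<=[)\]}!&}])");

    // Prefix every match with a space; $0 stands for the matched text.
    public static string Space(string input) => Pattern.Replace(input, " $0");

    public static void Main()
    {
        string title = "ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)[Test]\"Test\"'Test'";
        Console.WriteLine(Space(title));
        // This Is An Example Title HELLO-WORLD 2019 T.E.S.T. (Test) [Test] "Test" 'Test'
        Console.WriteLine(Space("OTPIsADevice"));  // OTP Is A Device
    }
}
```

The variable-length lookbehind `(?<!^|[A-Z\p{P}])` relies on the .NET regex engine; other flavors may reject it.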


rev*_*evo 7

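You can shorten the regex by interpreting the requirements differently and so reducing them. For example, the first requirement amounts to saying: put a space before a capital letter unless it is preceded by a punctuation mark or another capital letter.

The following regex satisfies almost all of the requirements above and can be extended to include or exclude other cases: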

(?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}

You have to use the Replace() method with " $0" (a space followed by $0) as the replacement string.

See the live demo here

.NET (see it in action):

string input = @"ThisIsAnExample.TitleHELLO-WORLD2019T.E.S.T.(Test)""Test""'Test'[Test]";
Regex regex = new Regex(@"(?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}", RegexOptions.Multiline);
Console.WriteLine(regex.Replace(input, @" $0"));