ram*_*aaa 82 html c# regex string
How can I remove all HTML tags, including &nbsp;, using a regular expression in C#? My string looks like this:
"<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div> </div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"
Rav*_*yal 190
If you can't use an HTML-parser-based solution to filter out the tags, here is a simple regex for it.
string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();
Ideally, you should make another pass with a regex filter that takes care of multiple spaces:
string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");
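As a minimal sketch, the two passes above can be combined and run against a shortened version of the question's string. The class name `StripTagsDemo`, the method name `Strip`, and the sample inputs are assumptions for illustration, not part of the answer. The second call shows a caveat worth knowing: because tags are replaced with an empty string, text from adjacent elements is joined together.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripTagsDemo
{
    // Pass 1 removes tags and the literal "&nbsp;" entity;
    // pass 2 collapses runs of two or more whitespace characters.
    public static string Strip(string inputHTML)
    {
        string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();
        return Regex.Replace(noHTML, @"\s{2,}", " ");
    }

    public static void Main()
    {
        // Shortened, hypothetical version of the string in the question.
        string sample = "<div>hello</div><div><br></div><div>&nbsp;</div><div><br></div>";
        Console.WriteLine(Strip(sample)); // prints "hello"

        // Caveat: no separator is left where a tag used to be.
        Console.WriteLine(Strip("<div>one</div><div>two</div>")); // prints "onetwo"
    }
}
```

If words from neighbouring elements must stay separated, replacing each match with a single space instead of an empty string (and then collapsing whitespace) is one option.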
Don*_*ing 30
I took @Ravi Thapliyal's code and turned it into a method. It's simple and might not clean everything, but so far it does what I need.
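public static string ScrubHtml(string value) {
    // Remove tags and the &nbsp; entity, then trim the ends.
    var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
    // Collapse runs of two or more whitespace characters into one space.
    var step2 = Regex.Replace(step1, @"\s{2,}", " ");
    return step2;
}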
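Since the method may not clean everything, here is one possible extension, offered as a sketch rather than as the answer's code. It replaces each tag with a space so adjacent elements don't run together, and it decodes entities with `System.Net.WebUtility.HtmlDecode` instead of special-casing &nbsp;. The names `HtmlScrubber` and `ScrubHtmlDecoded` are hypothetical.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlScrubber
{
    public static string ScrubHtmlDecoded(string value)
    {
        // Replace each tag with a space so "<div>a</div><div>b</div>" keeps a gap.
        var withoutTags = Regex.Replace(value, @"<[^>]+>", " ");
        // Decode entities such as &nbsp; and &amp;. This runs after tag removal
        // so that a decoded "&lt;" is not mistaken for the start of a tag.
        var decoded = WebUtility.HtmlDecode(withoutTags);
        // Collapse whitespace runs; .NET's \s also matches U+00A0 (decoded &nbsp;).
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        Console.WriteLine(ScrubHtmlDecoded("<div>one</div><div>&nbsp;</div><div>two &amp; three</div>"));
        // prints "one two & three"
    }
}
```

The trade-off is that the result can contain a literal `<` or `&` after decoding, so it should be encoded again before being written back into HTML.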
Dav*_* S. 16
I've been using this function for a while. It removes pretty much any messy HTML you can throw at it and leaves the text intact.
using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Web; // HttpUtility

private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled);

// Add characters that should not be removed to this regex.
private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\\?=|%!() -]", RegexOptions.Compiled);

public static String UnHtml(String html)
{
    html = HttpUtility.UrlDecode(html);
    html = HttpUtility.HtmlDecode(html);

    // Drop comments, scripts and styles together with their contents.
    html = RemoveTag(html, "<!--", "-->");
    html = RemoveTag(html, "<script", "</script>");
    html = RemoveTag(html, "<style", "</style>");

    // Replace matches of these regexes with a space.
    html = _tags_.Replace(html, " ");
    html = _notOkCharacter_.Replace(html, " ");
    html = SingleSpacedTrim(html);

    return html;
}

// Removes every span from startTag through endTag (inclusive), case-insensitively.
private static String RemoveTag(String html, String startTag, String endTag)
{
    Boolean bAgain;
    do
    {
        bAgain = false;
        Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase);
        if (startTagPos < 0)
            continue; // no start tag left: bAgain is false, so the loop ends
        Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase);
        if (endTagPos <= startTagPos)
            continue; // unterminated block: leave it in place and stop
        html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length);
        bAgain = true;
    } while (bAgain);
    return html;
}

// Collapses each run of CR, LF, tab and space into a single space, then trims.
private static String SingleSpacedTrim(String inString)
{
    StringBuilder sb = new StringBuilder();
    Boolean inBlanks = false;
    foreach (Char c in inString)
    {
        switch (c)
        {
            case '\r':
            case '\n':
            case '\t':
            case ' ':
                if (!inBlanks)
                {
                    inBlanks = true;
                    sb.Append(' ');
                }
                continue;
            default:
                inBlanks = false;
                sb.Append(c);
                break;
        }
    }
    return sb.ToString().Trim();
}
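The block removal done by `RemoveTag` (comments, scripts and styles dropped together with their contents) can also be written as one regex. The following is a hedged sketch of that alternative, not the answer's code; the names `BlockRemover` and `RemoveBlocks` are hypothetical. `RegexOptions.Singleline` lets `.` cross line breaks, and the lazy `.*?` stops at the first closing delimiter.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlockRemover
{
    // One alternation per block type: comment, script element, style element.
    private static readonly Regex Blocks = new Regex(
        @"<!--.*?-->|<script\b.*?</script\s*>|<style\b.*?</style\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string RemoveBlocks(string html)
    {
        return Blocks.Replace(html, "");
    }

    public static void Main()
    {
        string html = "a<!-- note -->b<SCRIPT type=\"text/javascript\">var x = 1;\n</SCRIPT>c<style>p { color: red; }</style>d";
        Console.WriteLine(RemoveBlocks(html)); // prints "abcd"
    }
}
```

As with the loop version, a block that is never closed is left untouched, so the later tag-stripping pass still has to deal with it.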
Views: 124817