有没有办法从FCKEditor中删除所有不必要的MS Word格式

Question

有没有办法从FCKEditor中删除所有不必要的MS Word格式

use*_*433 6 javascript c# asp.net fckeditor

我已经安装了fckeditor,当从MS Word粘贴时,它添加了许多不必要的格式.我想保留一些像粗体,斜体,公牛等等的东西.我已经在网上搜索并提出了一些解决方案,即使是我希望保留的大胆和斜体,也能解决所有问题.有没有办法去除不必要的单词格式？

Answer 1

Jac*_*lls 10

以防万一有人想要接受答案的ac#版本:

public string CleanHtml(string html)
    {
        //Cleans all manner of evils from the rich text editors in IE, Firefox, Word, and Excel
        // Only returns acceptable HTML, and converts line breaks to <br />
        // Acceptable HTML includes HTML-encoded entities.

        html = html.Replace("&" + "nbsp;", " ").Trim(); //concat here due to SO formatting
        // Does this have HTML tags?

        if (html.IndexOf("<") >= 0)
        {
            // Make all tags lowercase
            html = Regex.Replace(html, "<[^>]+>", delegate(Match m){
                return m.ToString().ToLower();
            });
            // Filter out anything except allowed tags
            // Problem: this strips attributes, including href from a
            // http://stackoverflow.com/questions/307013/how-do-i-filter-all-html-tags-except-a-certain-whitelist
            string AcceptableTags = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote";
            string WhiteListPattern = "</?(?(?=" + AcceptableTags + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>";
            html = Regex.Replace(html, WhiteListPattern, "", RegexOptions.Compiled);
            // Make all BR/br tags look the same, and trim them of whitespace before/after
            html = Regex.Replace(html, @"\s*<br[^>]*>\s*", "<br />", RegexOptions.Compiled);
        }


         // No CRs
         html = html.Replace("\r", "");
         // Convert remaining LFs to line breaks
         html = html.Replace("\n", "<br />");
         // Trim BRs at the end of any string, and spaces on either side
         return Regex.Replace(html, "(<br />)+$", "", RegexOptions.Compiled).Trim();
    }

Run Code Online (Sandbox Code Playgroud)

Answer 2

ric*_*ent 7

这是我用来从富文本编辑器中擦除传入HTML的解决方案...它是用VB.NET编写的,我没有时间转换为C#,但它非常简单:

 Public Shared Function CleanHtml(ByVal html As String) As String
     '' Cleans all manner of evils from the rich text editors in IE, Firefox, Word, and Excel
     '' Only returns acceptable HTML, and converts line breaks to <br />
     '' Acceptable HTML includes HTML-encoded entities.
     html = html.Replace("&" & "nbsp;", " ").Trim() ' concat here due to SO formatting
     '' Does this have HTML tags?
     If html.IndexOf("<") >= 0 Then
         '' Make all tags lowercase
         html = RegEx.Replace(html, "<[^>]+>", AddressOf LowerTag)
         '' Filter out anything except allowed tags
         '' Problem: this strips attributes, including href from a
         '' http://stackoverflow.com/questions/307013/how-do-i-filter-all-html-tags-except-a-certain-whitelist
         Dim AcceptableTags      As String   = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
         Dim WhiteListPattern    As String   = "</?(?(?=" & AcceptableTags & ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
         html = Regex.Replace(html, WhiteListPattern, "", RegExOptions.Compiled)
         '' Make all BR/br tags look the same, and trim them of whitespace before/after
         html = RegEx.Replace(html, "\s*<br[^>]*>\s*", "<br />", RegExOptions.Compiled)
     End If
     '' No CRs
     html = html.Replace(controlChars.CR, "")
     '' Convert remaining LFs to line breaks
     html = html.Replace(controlChars.LF, "<br />")
     '' Trim BRs at the end of any string, and spaces on either side
     Return RegEx.Replace(html, "(<br />)+$", "", RegExOptions.Compiled).Trim()
 End Function

 Public Shared Function LowerTag(m As Match) As String
   Return m.ToString().ToLower()
 End Function

Run Code Online (Sandbox Code Playgroud)

在你的情况,你要修改"AcceptableTags""批复" HTML标签的列表 - 该代码将仍然去除所有无用的属性(和,不幸的是,有用的像HREF和SRC,希望那些不是活得对你很重要).

当然,这需要访问服务器.如果你不想这样,你需要在工具栏中添加某种"清理"按钮,调用JavaScript来搞乱编辑器的当前文本.不幸的是,"粘贴"不是一个可以被捕获以自动清理标记的事件,并且每次OnChange之后的清理会使一个不可用的编辑器(因为更改标记会改变文本光标位置).

Answer 3

小智 5

尝试了接受的解决方案，但它没有清除单词生成的标签。

但这段代码对我有用

静态字符串 CleanWordHtml(字符串 html) {

StringCollection sc = new StringCollection();
// get rid of unnecessary tag spans (comments and title)
sc.Add(@"<!--(\w|\W)+?-->");
sc.Add(@"<title>(\w|\W)+?</title>");
// Get rid of classes and styles
sc.Add(@"\s?class=\w+");
sc.Add(@"\s+style='[^']+'");
// Get rid of unnecessary tags
sc.Add(
@"<(meta|link|/?o:|/?style|/?div|/?st\d|/?head|/?html|body|/?body|/?span|!\[)[^>]*?>");
// Get rid of empty paragraph tags
sc.Add(@"(<[^>]+>)+&nbsp;(</\w+>)+");
// remove bizarre v: element attached to <img> tag
sc.Add(@"\s+v:\w+=""[^""]+""");
// remove extra lines
sc.Add(@"(\n\r){2,}");
foreach (string s in sc)
{
    html = Regex.Replace(html, s, "", RegexOptions.IgnoreCase);
}
return html; 
}

Run Code Online (Sandbox Code Playgroud)

归档时间：	16 年前
查看次数：	7503 次
最近记录：	7 年，1 月前