How do I match the charset string in HTML with a regular expression?

sil*_*ent 8 html regex

Sample HTML:

<meta http-equiv="Content-type" content="text/html;charset=utf-8" />

I want to use a regex to extract the charset value (here, "utf-8").

(I am using C#.)

小智 16

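My answer provides a more robust version of @Floyd's, and it handles @You's breaking test case as far as possible by using a negative lookahead to reject it. I can think of only one relevant case (a variant of @You's example) that still gives a false positive, and I expect it to be very rare. The expressions should be run with the case-insensitive flag; they were tested with java.util.regex and JRegex.

The captured groups are already trimmed and contain neither quotes nor other markup characters such as "/" or ">". The second expression has 2 capture groups: the first is the content-type value, which may be empty (i.e. when the charset attribute is used), and the second is the charset value, which is always non-empty (unless the charset value was actually left empty for some strange reason).

Regex that matches/groups the charset value, trimmed and without quotes: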

<meta(?!\s*(?:name|value)\s*=)[^>]*?charset\s*=[\s"']*([^\s"'/>]*)
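As a minimal sketch of how the first expression could be applied, here it is in Python's `re` module (an assumption on my part: the answer was tested with java.util.regex and JRegex, and the question targets C#). The pattern is copied verbatim; the helper name `extract_charset` is hypothetical. The expression uses only a negative lookahead and lazy quantifiers, so it should carry over to .NET's `Regex` with `RegexOptions.IgnoreCase`, though that is not verified here.

```python
import re

# Pattern copied verbatim from the answer; IGNORECASE is the
# case-insensitive flag the answer says it must be run with.
CHARSET_RE = re.compile(
    r"""<meta(?!\s*(?:name|value)\s*=)[^>]*?charset\s*=[\s"']*([^\s"'/>]*)""",
    re.IGNORECASE,
)


def extract_charset(html):
    """Return the charset captured from the first matching <meta> tag, or None.

    Hypothetical helper name, used only for illustration.
    """
    match = CHARSET_RE.search(html)
    return match.group(1) if match else None


sample = '<meta http-equiv="Content-type" content="text/html;charset=utf-8" />'
print(extract_charset(sample))                    # quoted content attribute
print(extract_charset("<META CHARSET='UTF-8'>"))  # upper-case HTML5 form
```

The capture group excludes whitespace, quotes, "/" and ">", which is why no separate trimming step is needed.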

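Same as above, but it also matches/groups the content-type (optional) and charset (required) values, trimmed and without quotes. Minor caveat: it does not match a standalone content-type value, i.e. "text/html" with no charset.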

<meta(?!\s*(?:name|value)\s*=)(?:[^>]*?content\s*=[\s"']*)?([^>]*?)[\s"';]*charset\s*=[\s"']*([^\s"'/>]*)
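The two-group expression can be sketched the same way. Again this assumes Python's `re` rather than the engines the answer was tested with, and the helper name `extract_content_type_and_charset` is hypothetical. Group 1 is the content type (possibly empty) and group 2 is the charset.

```python
import re

# Pattern copied verbatim from the answer, run case-insensitively.
CONTENT_CHARSET_RE = re.compile(
    r"""<meta(?!\s*(?:name|value)\s*=)(?:[^>]*?content\s*=[\s"']*)?([^>]*?)[\s"';]*charset\s*=[\s"']*([^\s"'/>]*)""",
    re.IGNORECASE,
)


def extract_content_type_and_charset(html):
    """Return (content_type, charset) for the first matching <meta>, or None.

    Hypothetical helper name, used only for illustration.
    """
    match = CONTENT_CHARSET_RE.search(html)
    return (match.group(1), match.group(2)) if match else None


legacy = '<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1"/>'
html5 = '<meta charset="utf-8">'
no_charset = '<meta http-equiv="Content-Type" content="text/html">'

print(extract_content_type_and_charset(legacy))      # both groups filled
print(extract_content_type_and_charset(html5))       # content type is empty
print(extract_content_type_and_charset(no_charset))  # the caveat: no match at all
```

The last call illustrates the caveat stated above: because the charset part is required, a tag that carries only a content type is not matched.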

Test cases (all pass except the last one)...

<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1"/>
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" />
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'/>
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1' />
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1/>
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1 />
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" >
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'>
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1' >
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1>
<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1 >

<meta http-equiv="Content-Type" content="text/html;charset='iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html;charset=iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html';charset='iso-8859-1'">
<meta http-equiv='Content-Type' content='text/html;charset="iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html;charset=iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html";charset="iso-8859-1"'>

<meta http-equiv="Content-Type" content="text/html;;;charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html;;;charset='iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html;;;charset=iso-8859-1'">
<meta http-equiv="Content-Type" content="'text/html';;;charset='iso-8859-1'">
<meta http-equiv='Content-Type' content='text/html;;;charset=iso-8859-1'>
<meta http-equiv='Content-Type' content='text/html;;;charset="iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html;;;charset=iso-8859-1"'>
<meta http-equiv='Content-Type' content='"text/html";;;charset="iso-8859-1"'>

<meta  http-equiv  =  "  Content-Type  "  content  =  "  '  text/html  '  ;  ;;  '  ;  '  '  ;  '  ;  ' ;;  ;  charset  =  '  iso-8859-1  '  "  >
<meta  content  =  "  '  text/html  '  ;  ;;  '  ;  '  '  ;  '  ;  ' ;;  ;  charset  =  '  iso-8859-1  '  "  http-equiv  =  "  Content-Type  "  >
<meta  http-equiv  =  Content-Type  content  =  text/html;charset=iso-8859-1  >
<meta  content  =  text/html;charset=iso-8859-1  http-equiv  =  Content-Type  >
<meta  http-equiv  =  Content-Type  content  =  text/html  ;  charset  =  iso-8859-1  >
<meta  content  =  text/html  ;  charset  =  iso-8859-1  http-equiv  =  Content-Type  >
<meta  http-equiv  =  Content-Type  content  =  text/html  ;;;  charset  =  iso-8859-1  >
<meta  content  =  text/html  ;;;  charset  =  iso-8859-1  http-equiv  =  Content-Type  >
<meta  http-equiv  =  Content-Type  content  =  text/html  ;  ;  ;  charset  =  iso-8859-1  >
<meta  content  =  text/html  ;  ;  ;  charset  =  iso-8859-1  http-equiv  =  Content-Type  >

<meta charset="utf-8"/>
<meta charset="utf-8" />
<meta charset='utf-8'/>
<meta charset='utf-8' />
<meta charset=utf-8/>
<meta charset=utf-8 />
<meta charset="utf-8">
<meta charset="utf-8" >
<meta charset='utf-8'>
<meta charset='utf-8' >
<meta charset=utf-8>
<meta charset=utf-8 >

<meta  charset  =  "  utf-8  "  >
<meta  charset  =  '  utf-8  '  >
<meta  charset  =  "  utf-8  '  >
<meta  charset  =  '  utf-8  "  >
<meta  charset  =  "  utf-8     >
<meta  charset  =  '  utf-8     >
<meta  charset  =     utf-8  '  >
<meta  charset  =     utf-8  "  >
<meta  charset  =     utf-8     >
<meta  charset  =     utf-8    />

<meta name="title" value="charset=utf-8 — is it really useful (yep)?">
<meta value="charset=utf-8 — is it really useful (yep)?" name="title">
<meta name="title" content="charset=utf-8 — is it really useful (yep)?">
<meta name="charset=utf-8" content="charset=utf-8 — is it really useful (yep)?">

<meta content="charset=utf-8 — is it really useful (nope, not here, but gotta admit pretty robust otherwise)?" name="title">
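A small harness along these lines could be used to re-check the first expression against a handful of the cases above. This is a sketch in Python's `re`, not the setup the answer used. The `CASES` list and `run_cases` are hypothetical names, and the descriptive text inside the title-style tags is shortened.

```python
import re

# First expression from the answer, verbatim, case-insensitive.
CHARSET_RE = re.compile(
    r"""<meta(?!\s*(?:name|value)\s*=)[^>]*?charset\s*=[\s"']*([^\s"'/>]*)""",
    re.IGNORECASE,
)

# (input, wanted) pairs; None means the tag should NOT yield a charset.
# Attribute text in the last three entries is shortened from the list above.
CASES = [
    ("<meta http-equiv=Content-Type content=text/html;charset=iso-8859-1/>", "iso-8859-1"),
    ("<meta  charset  =  \"  utf-8  '  >", "utf-8"),
    ('<meta name="title" value="charset=utf-8 (yep)?">', None),
    ('<meta value="charset=utf-8 (yep)?" name="title">', None),
    # The one case the answer admits it gets wrong: "content" comes first,
    # so the negative lookahead for name/value does not fire.
    ('<meta content="charset=utf-8 (nope)?" name="title">', None),
]


def run_cases(cases):
    """Return a list of (html, wanted, actual) for every case that fails."""
    failures = []
    for html, wanted in cases:
        match = CHARSET_RE.search(html)
        actual = match.group(1) if match else None
        if actual != wanted:
            failures.append((html, wanted, actual))
    return failures


failures = run_cases(CASES)
for html, wanted, actual in failures:
    print(f"FAIL: wanted {wanted!r}, got {actual!r}: {html}")
```

Only the final entry should be reported, which matches the "all pass except the last one" claim for this subset.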


Nul*_*ion 8

This regex:

<meta.*?charset=([^"']+)

should work. Using an XML parser to extract it would be overkill.
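For comparison, a quick sketch of this shorter pattern in Python's `re` (an assumption; the question targets C#) shows where it holds up and where it does not. The helper name `simple_charset` is hypothetical.

```python
import re

# Pattern copied verbatim from this answer; no flags are applied.
SIMPLE_RE = re.compile(r"""<meta.*?charset=([^"']+)""")


def simple_charset(html):
    """Return group 1 of the first match, or None (hypothetical helper)."""
    match = SIMPLE_RE.search(html)
    return match.group(1) if match else None


question_sample = '<meta http-equiv="Content-type" content="text/html;charset=utf-8" />'
print(simple_charset(question_sample))           # the case from the question
print(simple_charset('<meta charset="utf-8">'))  # quote right after "=": no match
print(simple_charset("<meta charset=utf-8>"))    # unquoted: ">" is captured too
```

It handles the tag from the question, but a quoted `charset` attribute does not match and an unquoted one captures the trailing ">", so the result may need extra checks depending on the input.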