Extracting meta keywords from a web page?

Question

Zac*_*own 8 python webpage extract urllib keyword

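I need to extract the meta keywords from a web page using Python. I was thinking this could be done with urllib or urllib2, but I'm not sure. Does anyone have any ideas?

I am using Python 2.6 on Windows XP.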

Answer 1

lxml is faster than BeautifulSoup (I think) and has much better functionality, while remaining relatively easy to use. Example:

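>>> # Python 2: urlopen lives in urllib (urllib.request in Python 3)
>>> from urllib import urlopen
>>> from lxml import etree

>>> f = urlopen("http://www.google.com").read()  # fetch the raw HTML
>>> tree = etree.HTML(f)  # parse it with lxml's lenient HTML parser
>>> m = tree.xpath("//meta")  # every <meta> element in the document

>>> for i in m:
...     print etree.tostring(i)
...
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-2"/>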

Edit: another example.

>>> f = urlopen("http://www.w3schools.com/XPath/xpath_syntax.asp").read()
>>> tree = etree.HTML(f)
>>> # select the <meta name="Keywords"> element and read its content attribute
>>> tree.xpath("//meta[@name='Keywords']")[0].get("content")
"xml,tutorial,html,dhtml,css,xsl,xhtml,javascript,asp,ado,vbscript,dom,sql,colors,soap,php,authoring,programming,training,learning,beginner's guide,primer,lessons,school,howto,reference,examples,samples,source code,tags,demos,tips,links,FAQ,tag list,forms,frames,color table,w3c,cascading style sheets,active server pages,dynamic html,internet,database,development,Web building,Webmaster,html guide"

BTW: XPath is worth learning.

Another edit:

Alternatively, you can use a regexp:

>>> f = urlopen("http://www.w3schools.com/XPath/xpath_syntax.asp").read()
>>> import re
>>> # capture the content attribute of the <meta name="Keywords"> tag
>>> re.search(r'<meta name="Keywords".*?content="([^"]*)"', f).group(1)
"xml,tutorial,html,dhtml,css,xsl,xhtml,javascript,asp,ado,vbscript,dom,sql, ...etc...

...but I find it less readable and more error-prone (though it involves only standard modules and still fits on one line).
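The regexp only matches when the tag is written exactly as `<meta name="Keywords" ... content="...">`, with that attribute order and capitalization. If staying within the standard library matters, a rough alternative is to let the standard HTML parser find the tag. This is only a sketch under some assumptions: it targets Python 3 (`html.parser`; in Python 2 the module is named `HTMLParser`), it expects the page to be already fetched and decoded to a string, and `MetaKeywordsParser` and `extract_meta_keywords` are hypothetical names, not part of any library.

```python
from html.parser import HTMLParser  # Python 3; the Python 2 module is HTMLParser


class MetaKeywordsParser(HTMLParser):
    """Collects the keywords listed in every <meta name="keywords"> tag."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name") or ""
        if name.lower() == "keywords":
            content = attr_map.get("content") or ""
            # Split the comma-separated list and drop empty entries.
            self.keywords.extend(k.strip() for k in content.split(",") if k.strip())


def extract_meta_keywords(html):
    """Hypothetical helper: return the meta keywords found in an HTML string."""
    parser = MetaKeywordsParser()
    parser.feed(html)
    return parser.keywords


# Made-up sample with upper-case names and reversed attribute order.
sample = '<html><head><META content="xml, tutorial,html" NAME="Keywords"></head></html>'
print(extract_meta_keywords(sample))
```

Because the parser normalizes tag and attribute names, the sample is handled even though the one-line regexp would not match it.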

Answer 2

Don*_*ner 7

BeautifulSoup is a great way to parse HTML with Python.

In particular, check out the findAll method: http://www.crummy.com/software/BeautifulSoup/documentation.html
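For the meta keywords case, a findAll-style lookup might look roughly like the sketch below. It assumes the current BeautifulSoup 4 package (`bs4`) is installed, where `find_all` is the newer spelling of `findAll`. The `meta_keywords` helper and the sample HTML are made up for illustration.

```python
import re

from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 (the bs4 package) is installed


def meta_keywords(html):
    """Hypothetical helper: content of each <meta name="keywords"> tag."""
    soup = BeautifulSoup(html, "html.parser")
    # "name" must go through attrs, because find_all's own first
    # parameter (the tag name) is also called "name".
    tags = soup.find_all("meta", attrs={"name": re.compile(r"^keywords$", re.I)})
    return [tag.get("content", "") for tag in tags]


# Made-up sample page.
sample = '<html><head><meta name="Keywords" content="xml,tutorial,html"></head></html>'
print(meta_keywords(sample))
```

The compiled pattern makes the attribute match case-insensitive, so both `Keywords` and `keywords` are found.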
