Cha*_*ens 69 html language-agnostic html-parsing
How do you parse HTML with a variety of languages and parsing libraries?

When answering:

Individual comments will be linked to in answers to questions about how to parse HTML with regular expressions, as a way of demonstrating the right way to do things.

For the sake of consistency, I ask that the example parse an HTML file for the href in anchor tags. To make it easy to search this question, I ask that you follow this format:

Language: [language name]

Library: [library name]

[example code]

Please make the library a link to its documentation. If you want to provide an example other than extracting links, please also include:

Purpose: [what the parse does]
War*_*uck 29
Language: JavaScript

Library: jQuery
$.each($('a[href]'), function(){
    console.debug(this.href);
});
(Uses Firebug's console.debug for output...)

And to load any HTML page:
$.get('http://stackoverflow.com/', function(page){
    $(page).find('a[href]').each(function(){
        console.debug(this.href);
    });
});
This uses a different function; I think it is cleaner when chaining methods.
ale*_*exn 25
Language: C#

Library: HtmlAgilityPack
class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://www.stackoverflow.com");

        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");

        foreach (var node in nodes)
        {
            Console.WriteLine(node.InnerHtml);
        }
    }
}
Pao*_*ino 22
Language: Python

Library: BeautifulSoup
from BeautifulSoup import BeautifulSoup

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

soup = BeautifulSoup(html)
links = soup.findAll('a', href=True)  # find <a> with a defined href attribute
print links
Output:
[<a href="http://foo.com">foo</a>,
<a href="http://bar.com">bar</a>,
<a href="http://baz.com">baz</a>]
It is also possible to do:
for link in links:
    print link['href']
Output:
http://foo.com
http://bar.com
http://baz.com
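The snippet above targets BeautifulSoup 3 on Python 2. As a rough sketch, the same extraction with BeautifulSoup 4 on Python 3 (assuming the `bs4` package is installed) would look like this:

```python
from bs4 import BeautifulSoup  # BeautifulSoup 4: pip install beautifulsoup4

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

# BS4 asks for an explicit parser; html.parser ships with the standard library
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a', href=True)  # findAll was renamed to find_all
print([a['href'] for a in links])
```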
dra*_*tun 20
Language: Perl

Library: pQuery
use strict;
use warnings;
use pQuery;
my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

pQuery( $html )->find( 'a' )->each(
    sub {
        my $at = $_->getAttribute( 'href' );
        print "$at\n" if defined $at;
    }
);
小智 15
Language: shell

Library: lynx (well, it's not a library, but in the shell, every program is kind of a library)
lynx -dump -listonly http://news.google.com/
Pes*_*sto 14
Language: Ruby

Library: Hpricot
#!/usr/bin/ruby
require 'hpricot'
html = '<html><body>'
['foo', 'bar', 'baz'].each {|link| html += "<a href=\"http://#{link}.com\">#{link}</a>" }
html += '</body></html>'
doc = Hpricot(html)
doc.search('//a').each {|elm| puts elm.attributes['href'] }
Cha*_*ens 12
Language: Python

Library: HTMLParser
#!/usr/bin/python

from HTMLParser import HTMLParser

class FindLinks(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            print at['href']

find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)
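On Python 3, the module moved to `html.parser` and `print` became a function; a minimal port of the parser above looks like this:

```python
from html.parser import HTMLParser  # Python 3 location of the module

class FindLinks(HTMLParser):
    # Print the href attribute of every anchor tag encountered
    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            print(at['href'])

find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)
```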
Cha*_*ens 11
Language: Perl

Library: HTML::Parser
#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser;

my $find_links = HTML::Parser->new(
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'a' and exists $attr->{href}) {
                print "$attr->{href}\n";
            }
        },
        "tag, attr"
    ]
);

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

$find_links->parse($html);
小智 9
Language: Perl

Library: HTML::LinkExtor

The beauty of Perl is that you have modules for very specific tasks. Like link extraction.

Whole program:
#!/usr/bin/perl -w
use strict;
use HTML::LinkExtor;
use LWP::Simple;

my $url     = 'http://www.google.com/';
my $content = get( $url );

my $p = HTML::LinkExtor->new( \&process_link, $url, );
$p->parse( $content );

exit;

sub process_link {
    my ( $tag, %attr ) = @_;
    return unless $tag eq 'a';
    return unless defined $attr{ 'href' };
    print "- $attr{'href'}\n";
    return;
}
That's it.
小智 8
Language: Ruby

Library: Nokogiri
#!/usr/bin/env ruby
require 'nokogiri'
require 'open-uri'
document = Nokogiri::HTML(open("http://google.com"))
document.css("html head title").first.content
=> "Google"
document.xpath("//title").first.content
=> "Google"
Language: Common Lisp

Library: Closure Html, Closure Xml, CL-WHO

(shown using the DOM API, without using the XPATH or STP APIs)
(defvar *html*
  (who:with-html-output-to-string (stream)
    (:html
      (:body (loop
               for site in (list "foo" "bar" "baz")
               do (who:htm (:a :href (format nil "http://~A.com/" site))))))))

(defvar *dom*
  (chtml:parse *html* (cxml-dom:make-dom-builder)))

(loop
  for tag across (dom:get-elements-by-tag-name *dom* "a")
  collect (dom:get-attribute tag "href"))
=>
("http://foo.com/" "http://bar.com/" "http://baz.com/")
Language: Clojure

Library: Enlive (a selector-based (à la CSS) templating and transformation system for Clojure)

Selector expression:
(def test-select
  (html/select (html/html-resource (java.io.StringReader. test-html)) [:a]))
Now we can do the following at the REPL (I've added line breaks in test-select):
user> test-select
({:tag :a, :attrs {:href "http://foo.com/"}, :content ["foo"]}
{:tag :a, :attrs {:href "http://bar.com/"}, :content ["bar"]}
{:tag :a, :attrs {:href "http://baz.com/"}, :content ["baz"]})
user> (map #(get-in % [:attrs :href]) test-select)
("http://foo.com/" "http://bar.com/" "http://baz.com/")
You'll need the following to try it out:

Preamble:
(require '[net.cgrand.enlive-html :as html])
Test HTML:
(def test-html
  (apply str (concat ["<html><body>"]
                     (for [link ["foo" "bar" "baz"]]
                       (str "<a href=\"http://" link ".com/\">" link "</a>"))
                     ["</body></html>"])))
Language: Perl

Library: XML::Twig
#!/usr/bin/perl
use strict;
use warnings;
use Encode ':all';
use LWP::Simple;
use XML::Twig;
#my $url = 'http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser';
my $url = 'http://www.google.com';
my $content = get($url);
die "Couldn't fetch!" unless defined $content;
my $twig = XML::Twig->new();
$twig->parse_html($content);
my @hrefs = map {
    $_->att('href');
} $twig->get_xpath('//*[@href]');
print "$_\n" for @hrefs;
Caveat: you can get a wide-character error with pages like this one (changing the URL to the commented-out one will trigger it), but the HTML::Parser solution above does not share that problem.
Language: Java

Libraries: XOM, TagSoup

I've included intentionally malformed and inconsistent XML in this example.
import java.io.IOException;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Element;
import nu.xom.Node;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;

public class HtmlTest {
    public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
        final Parser parser = new Parser();
        parser.setFeature(Parser.namespacesFeature, false);
        final Builder builder = new Builder(parser);
        final Document document = builder.build("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>", null);
        final Element root = document.getRootElement();
        final Nodes links = root.query("//a[@href]");

        for (int linkNumber = 0; linkNumber < links.size(); ++linkNumber) {
            final Node node = links.get(linkNumber);
            System.out.println(((Element) node).getAttributeValue("href"));
        }
    }
}
By default, TagSoup adds an XML namespace referencing XHTML to the document. I've chosen to suppress that in this sample. Using the default behavior would require the call to root.query to include a namespace like so:
root.query("//xhtml:a[@href]", new nu.xom.XPathContext("xhtml", root.getNamespaceURI()))
Views: 27271