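I am developing an application using HTML Parser. The code below does not retrieve the full set of tags in the page: some tags are missed, along with their attributes and text bodies. Please help me understand why this happens, or suggest another approach.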
    URL url = new URL("...");
    PrintWriter pw = new PrintWriter(new FileWriter("HTMLElements.txt"));
    URLConnection connection = url.openConnection();
    InputStream is = connection.getInputStream();
    InputStreamReader isr = new InputStreamReader(is);
    BufferedReader br = new BufferedReader(isr);
    HTMLEditorKit htmlKit = new HTMLEditorKit();
    HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
    HTMLEditorKit.Parser parser = new ParserDelegator();
    HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
    parser.parse(br, callback, true);
    ElementIterator iterator = new ElementIterator(htmlDoc);
    Element element;
    while ((element = iterator.next()) != null)
    {
        AttributeSet attributes = element.getAttributes();
        Enumeration e = attributes.getAttributeNames();
        pw.println("Element Name :" + element.getName());
        while (e.hasMoreElements())
        {
            Object key = e.nextElement();
            Object val = attributes.getAttribute(key);
            int startOffset = element.getStartOffset();
            int endOffset = element.getEndOffset();
            int length = endOffset - startOffset;
            String text = htmlDoc.getText(startOffset, length);
            pw.println("Key :" + key.toString() + " Value :" + val.toString() + "\r\n" + "Text :" + text + "\r\n");
        }
    }
}
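I have done this fairly reliably with HTML Parser (assuming the HTML document does not change its structure). A web service with a stable API is much better, but sometimes we don't have one.
General idea:
First you need to know which tags (div, meta, span, etc.) hold the information you want, and which attributes identify those tags. Example: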
<span class="price"> $7.95</span>
If you are looking for this price, then you are interested in span tags whose class is "price".
HTML Parser can filter by attribute.
filter = new HasAttributeFilter("class", "price");
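When you parse with a filter, you get back a list of Nodes. You can use instanceof on them to check whether they are of the type you are interested in; for span you would do something like: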
if (node instanceof Span) // or any other supported element.
See the list of supported tags here.
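The same filter-then-check idea can also be sketched without the HTML Parser jar, using the JDK's own javax.swing.text.html parser (the one the question uses). This is a swapped-in technique, not HTML Parser's API. AttributeFilterSketch and textOf are hypothetical names, and the sample markup in the usage note uses a div instead of a span because the sketch does not rely on how the JDK parser reports span.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of HTML Parser or the JDK.
class AttributeFilterSketch {

    /** Returns the trimmed text of every `tag` element whose attribute `attrName` equals `attrValue`. */
    static List<String> textOf(String html, HTML.Tag tag, String attrName, String attrValue) {
        List<String> matches = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Non-null while the parser is inside a matching element.
            // Assumption: matching elements are not nested inside each other.
            private StringBuilder current;

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                // Tag type check (the analogue of instanceof), then the attribute filter.
                if (t == tag && hasAttribute(a, attrName, attrValue)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (current != null) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (t == tag && current != null) {
                    matches.add(current.toString().trim());
                    current = null;
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return matches;
    }

    // Compares keys by name so it works whether the key is an HTML.Attribute or a plain String.
    private static boolean hasAttribute(AttributeSet a, String name, String value) {
        Enumeration<?> keys = a.getAttributeNames();
        while (keys.hasMoreElements()) {
            Object key = keys.nextElement();
            if (name.equalsIgnoreCase(key.toString())
                    && value.equals(String.valueOf(a.getAttribute(key)))) {
                return true;
            }
        }
        return false;
    }
}
```

Calling AttributeFilterSketch.textOf(html, HTML.Tag.DIV, "class", "price") on a page containing `<div class="price"> $7.95</div>` should return a list holding the price text. The callback approach sees tags as the parser emits them, so nothing depends on how a document model stores them afterwards.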
An example of using HTML Parser to get the meta tag that holds a website's description:
Sample tag:
<meta name="description" content="Amazon.com: frankenstein: Books"/>
Code:
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;

public class HTMLParserTest {
    public static void main(String... args) {
        Parser parser = new Parser();
        // Target: <meta name="description" content="Some text about the site." />
        HasAttributeFilter filter = new HasAttributeFilter("name", "description");
        try {
            parser.setResource("http://www.youtube.com");
            // Keep only the nodes that carry name="description".
            NodeList list = parser.parse(filter);
            Node node = list.elementAt(0);
            // Confirm the match really is a meta tag before reading its content.
            if (node instanceof MetaTag) {
                MetaTag meta = (MetaTag) node;
                String description = meta.getAttribute("content");
                System.out.println(description);
                // Prints: "YouTube is a place to discover, watch, upload and share videos."
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}
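For the question's original goal (seeing every tag with its attributes), a possible alternative is to record tags straight from the JDK parser's callbacks instead of walking an HTMLDocument afterwards. As far as I know, HTMLDocument stores inline tags such as b or a as attributes of text runs rather than as separate elements, which may be part of why ElementIterator appears to skip them; treat that as an assumption, not a diagnosis. The sketch below is not from HTML Parser: TagDumpSketch, TagRecord, dump, and metaDescription are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.swing.text.AttributeSet;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of HTML Parser or the JDK.
class TagDumpSketch {

    /** One tag as reported by the parser: its name plus its attributes as strings. */
    static final class TagRecord {
        final String name;
        final Map<String, String> attributes;

        TagRecord(String name, Map<String, String> attributes) {
            this.name = name;
            this.attributes = attributes;
        }
    }

    /** Records every start tag and every simple (empty) tag in document order. */
    static List<TagRecord> dump(String html) {
        List<TagRecord> tags = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                tags.add(record(t, a));
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                tags.add(record(t, a));
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tags;
    }

    /** Returns the content of the first meta tag named "description", or null if there is none. */
    static String metaDescription(List<TagRecord> tags) {
        for (TagRecord tag : tags) {
            if (tag.name.equals("meta") && "description".equals(tag.attributes.get("name"))) {
                return tag.attributes.get("content");
            }
        }
        return null;
    }

    // Copies the attribute set, because the record should not depend on the parser's own object.
    private static TagRecord record(HTML.Tag t, AttributeSet a) {
        Map<String, String> copy = new LinkedHashMap<>();
        Enumeration<?> keys = a.getAttributeNames();
        while (keys.hasMoreElements()) {
            Object key = keys.nextElement();
            copy.put(key.toString(), String.valueOf(a.getAttribute(key)));
        }
        return new TagRecord(t.toString(), copy);
    }
}
```

With the page source read into a string, TagDumpSketch.metaDescription(TagDumpSketch.dump(html)) performs the same lookup as the HTML Parser example, and the list returned by dump can be written to a file to inspect which tags the parser actually reported.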