Rob*_*rto 15 iphone parsing html-entities nsxmlparser
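I think I have read every web page related to this problem, but I still cannot find a solution, so here I am.
I have an HTML page that is not under my control, and I need to parse it from my iPhone application. Here is a sample of the page I am talking about: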
<HTML>
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
</HEAD>
<BODY>
<LI class="bye bye" rel="hello 1">
<H5 class="onlytext">
<A name="morning_part">morning</A>
</H5>
<DIV class="mydiv">
<SPAN class="myclass">something about you</SPAN>
<SPAN class="anotherclass">
<A href="http://www.google.it">Bye Bye &egrave; un saluto</A>
</SPAN>
</DIV>
</LI>
</BODY>
</HTML>
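I am using NSXMLParser, and it works well until it reaches the &egrave; HTML entity. It calls foundCharacters: for "Bye Bye " and then calls resolveExternalEntityName:systemID: with the entity name "egrave". In that method I simply return the character "è" converted to NSData. foundCharacters: is then called again and appends "è" to the previous "Bye Bye ", after which the parser raises an NSXMLParserUndeclaredEntityError.
I have no DTD, and I cannot change the HTML file I am parsing. Do you have any ideas about this problem? Thanks in advance to you all, Rob.
Update (12/03/2010). Following Griffo's suggestion, I ended up with something like this: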
data = [self replaceHtmlEntities:data];
NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
[parser setDelegate:self];
[parser parse];
where replaceHtmlEntities:(NSData *) looks something like this:
- (NSData *)replaceHtmlEntities:(NSData *)data {
    // The page declares charset=ISO-8859-1, so decode the bytes as Latin-1.
    NSString *htmlCode = [[NSString alloc] initWithData:data encoding:NSISOLatin1StringEncoding];
    NSMutableString *temp = [NSMutableString stringWithString:htmlCode];
    // Replace each named entity with its literal character.
    // Note: a bare "&" is not well-formed XML, so the first replacement may itself upset the parser.
    [temp replaceOccurrencesOfString:@"&amp;" withString:@"&" options:NSLiteralSearch range:NSMakeRange(0, [temp length])];
    [temp replaceOccurrencesOfString:@"&nbsp;" withString:@" " options:NSLiteralSearch range:NSMakeRange(0, [temp length])];
    ...
    [temp replaceOccurrencesOfString:@"&Agrave;" withString:@"À" options:NSLiteralSearch range:NSMakeRange(0, [temp length])];
    // Re-encode with the same encoding before handing the data to NSXMLParser.
    NSData *finalData = [temp dataUsingEncoding:NSISOLatin1StringEncoding];
    return finalData;
}
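But I am still looking for the best way to solve this problem. I will try TouchXML over the next few days, but I still think there should be a way to do this with the NSXMLParser API, so if you know how, feel free to write it here :)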
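The chain of replaceOccurrencesOfString: calls above rescans the whole string once per entity. As a rough sketch only (not the approach used in the question), the same idea can be done in one pass with a lookup table in plain C, which compiles unchanged inside an Objective-C project. The names kEntities, lookup_entity and expand_html_entities are hypothetical, the table holds just a few entries, and the input is assumed to be ASCII-compatible text that has already been converted to UTF-8 (the sample page is ISO-8859-1, so it would need transcoding first). The predefined XML entities such as &amp; and &lt; are deliberately left alone so the result stays parseable.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table: entity name -> UTF-8 bytes of the character.
   Only a few entries are shown; the full list is in xhtml-lat1.ent. */
struct entity { const char *name; const char *utf8; };

static const struct entity kEntities[] = {
    { "nbsp",   "\xC2\xA0" },  /* U+00A0 */
    { "Agrave", "\xC3\x80" },  /* U+00C0 */
    { "agrave", "\xC3\xA0" },  /* U+00E0 */
    { "egrave", "\xC3\xA8" },  /* U+00E8 */
};

/* Return the replacement for a name of the given length, or NULL. */
static const char *lookup_entity(const char *name, size_t len) {
    size_t count = sizeof kEntities / sizeof kEntities[0];
    for (size_t i = 0; i < count; i++) {
        if (strlen(kEntities[i].name) == len &&
            strncmp(kEntities[i].name, name, len) == 0) {
            return kEntities[i].utf8;
        }
    }
    return NULL;
}

/* Expand known named entities in a single pass.
   Returns a newly allocated string (caller frees), or NULL on allocation failure.
   Unknown names and the predefined XML entities are copied through unchanged. */
char *expand_html_entities(const char *in) {
    size_t n = strlen(in);
    /* Every replacement in the table is shorter than its "&name;" form,
       so the output can never be longer than the input. */
    char *out = malloc(n + 1);
    if (out == NULL) {
        return NULL;
    }
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        if (in[i] == '&') {
            const char *semi = memchr(in + i + 1, ';', n - i - 1);
            if (semi != NULL) {
                size_t name_len = (size_t)(semi - (in + i + 1));
                const char *rep = lookup_entity(in + i + 1, name_len);
                if (rep != NULL) {
                    size_t rep_len = strlen(rep);
                    memcpy(out + o, rep, rep_len);
                    o += rep_len;
                    i += name_len + 2;  /* skip '&', the name and ';' */
                    continue;
                }
            }
        }
        out[o++] = in[i++];
    }
    assert(o <= n);  /* holds as long as no replacement outgrows its entity */
    out[o] = '\0';
    return out;
}
```

Calling expand_html_entities on "Bye Bye &egrave; un saluto" should give the same text with the two-byte UTF-8 sequence for è in place of the entity. The resulting bytes could then be wrapped in an NSData and passed to NSXMLParser as in the update above.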
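After exploring several alternatives, it appears that NSXMLParser does not support entities other than the standard entities &lt;, &gt;, &apos;, &quot; and &amp;.
The code below fails with an NSXMLParserUndeclaredEntityError.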
// Create a dictionary to hold the entities and NSString equivalents
// A complete list of entities and unicode values is described in the HTML DTD
// which is available for download http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
NSDictionary *entityMap = [NSDictionary dictionaryWithObjectsAndKeys:
    [NSString stringWithFormat:@"%C", 0x00E8], @"egrave",
    [NSString stringWithFormat:@"%C", 0x00E0], @"agrave",
    ...
    ,nil];

NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
[parser setDelegate:self];
[parser setShouldResolveExternalEntities:YES];
[parser parse];

// NSXMLParser delegate method
- (NSData *)parser:(NSXMLParser *)parser resolveExternalEntityName:(NSString *)entityName systemID:(NSString *)systemID {
    return [[entityMap objectForKey:entityName] dataUsingEncoding: NSUTF8StringEncoding];
}
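Declaring the entities by prepending ENTITY declarations to the HTML document lets the parse succeed, but the expanded entities are not passed back to parser:foundCharacters: and the è and à characters are dropped.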
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
[
<!ENTITY agrave "&#224;">
<!ENTITY egrave "&#232;">
]>
In another experiment, I created a fully valid XML document with an internal DTD:
<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE author [
<!ELEMENT author (#PCDATA)>
<!ENTITY js "Jo Smith">
]>
<author>&lt; &js; &gt;</author>
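I implemented the parser:foundInternalEntityDeclarationWithName:value: delegate method, and it is clear that the parser is receiving the entity data, but parser:foundCharacters: is only called for the predefined entities.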
2010-03-20 12:53:59.871 xmlParsing[1012:207] Parser Did Start Document
2010-03-20 12:53:59.873 xmlParsing[1012:207] Parser foundElementDeclarationWithName: author model:
2010-03-20 12:53:59.873 xmlParsing[1012:207] Parser foundInternalEntityDeclarationWithName: js value: Jo Smith
2010-03-20 12:53:59.874 xmlParsing[1012:207] didStartElement: author type: (null)
2010-03-20 12:53:59.875 xmlParsing[1012:207] parser foundCharacters Before:
2010-03-20 12:53:59.875 xmlParsing[1012:207] parser foundCharacters After: <
2010-03-20 12:53:59.876 xmlParsing[1012:207] parser foundCharacters Before: <
2010-03-20 12:53:59.876 xmlParsing[1012:207] parser foundCharacters After: <
2010-03-20 12:53:59.877 xmlParsing[1012:207] parser foundCharacters Before: <
2010-03-20 12:53:59.878 xmlParsing[1012:207] parser foundCharacters After: <
2010-03-20 12:53:59.879 xmlParsing[1012:207] parser foundCharacters Before: <
2010-03-20 12:53:59.879 xmlParsing[1012:207] parser foundCharacters After: < >
2010-03-20 12:53:59.880 xmlParsing[1012:207] didEndElement: author with content: < >
2010-03-20 12:53:59.880 xmlParsing[1012:207] Parser Did End Document
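I found a link to a tutorial on using libxml's SAX interface. The xmlSAXHandler used by NSXMLParser allows a getEntity callback to be defined. After getEntity is called, the expansion of the entity is passed to the characters callback.
NSXMLParser is missing functionality here. What should happen is that either NSXMLParser or its delegate stores the entity definitions and provides them to the xmlSAXHandler getEntity callback. This is clearly not happening. I will file a bug report.
In the meantime, the earlier answer of performing string replacement is perfectly acceptable if your documents are small. Check out the SAX tutorial mentioned above, along with Apple's XMLPerformance sample app, to see whether implementing the parser with libxml is worthwhile.
This was fun.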