我需要将HTML转换为纯文本.我对格式化的唯一要求是在纯文本中保留新行.新行不仅应显示在<br>其他标签的情况下,例如<tr/>,也应显示</p>新行.
用于测试的示例HTML页面是:
请注意,这些只是随机网址.
我已经尝试了在这个StackOverflow问题的答案中提到的各种库(JSoup,Javax.swing,Apache utils)来将HTML转换为纯文本.
使用JSoup的示例:
public class JSoupTest {
@Test
public void SimpleParse() {
try {
Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
System.out.print(doc.text());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Run Code Online (Sandbox Code Playgroud)
HTMLEditorKit示例:
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class Html2Text extends HTMLEditorKit.ParserCallback {
StringBuffer s;
public Html2Text() {}
public void parse(Reader in) throws IOException {
s = new StringBuffer();
ParserDelegator delegator = new ParserDelegator();
// the …Run Code Online (Sandbox Code Playgroud) 这是我的代码
String html = "<font>fsdfs<font>dfsdf</font>dasdasd</font>";
Document doc = Jsoup.parse(html);
Elements elements = doc.select("font");
for(Element element : elements)
{
element.replaceWith(new Element(Tag.valueOf("span"),"").html(element.html()));
}
System.out.println(doc.html());
Run Code Online (Sandbox Code Playgroud)
我想替换字体标记并放置span标记.在此,它将替换第一个字体标记,但不替换第二个标记
我想提取的HTML页面(S),其放置在文本p和li标签,这样我就可以开始来标记页面构造为每个页面反向索引(ES)为了回答搜索查询.
我如何p使用jsoup 获取标签
Elements e = doc.select("");
Run Code Online (Sandbox Code Playgroud)
可能是该参数中要写的字符串是什么?
我创建了一个包含两个<script type="application/javascript">/* .. */</script>元素的JSoup文档.
问题:当我打电话.html()或.toString()JSoup将逃脱我的JavaScript.
if (foo && bar)
Run Code Online (Sandbox Code Playgroud)
得到
if (foo && bar)
Run Code Online (Sandbox Code Playgroud)
是否可以配置JSoup以<script>在转义时忽略元素?
这(基本上)是我如何创建我的jsoup文档.
final Document doc = Document.createShell("");
final Element head = doc.head();
head.appendElement("meta").attr("charset", "utf-8");
final String myJS = ...;
head.appendElement("script").attr("type", "application/javascript").text(myJS);
Run Code Online (Sandbox Code Playgroud)
我目前的解决方法是用String.replaceon 替换占位符.html().但这有点像黑客.
head.appendElement("script").attr("type", "application/javascript").text("$MYJS$");
String s = doc.html();
s = s.replace("$MYJS$", myJS);
Run Code Online (Sandbox Code Playgroud) 以下Jsoup语句有效:
Elements divs = document.select("div[class=mncls sbucls]");
Run Code Online (Sandbox Code Playgroud)
但相当的声明:
Elements divs = document.select("div.mncls sbucls");
Run Code Online (Sandbox Code Playgroud)
不行.
为什么?
Jsoup是否有带空格的类名有问题?
我有一个Div标签如下
<div id="eventTTL" style="text-transform: uppercase; font-weight: 900;" eventTTL="4583476000">5 days 07:14:41</div>
Run Code Online (Sandbox Code Playgroud)
我如何获得eventTTL的价值?我想显示eventTTL的值,即:)"4583476000".
当xml标记有冒号时抛出异常,
例外:
org.jsoup.select.Selector $ SelectorParseException:无法解析查询'w:r':':r'处的意外标记
XML:
<w:r>
<w:rPr>
<w:rStyle w:val="jid"/>
</w:rPr>
<w:t>AN</w:t>
</w:r>
Run Code Online (Sandbox Code Playgroud)
Java代码:
org.jsoup.nodes.Document doc = Jsoup.parse(documentXmlString);
Run Code Online (Sandbox Code Playgroud)
这里documentXmlString具有上面指定的xml
我尝试了其他答案中指定的几种解决方案,例如尝试使用不同的用户代理(Chrome,safari等),并使用HTTPClient和BufferedReader直接获取HTML,但它们都不起作用.如何使Android输出类似于Web输出?这是我正在寻找的网络输出; (查看完整输出的https://finance.yahoo.com/quote/AAPL/financials?p=AAPL的页面源- 这基本上包含名为"Quarterly"的AJAX选项卡,其中包含一个表.我需要获取该数据,但Android HTML源代码没有它,但网络源代码确实如此.)
root.App.main = {"context":{"dispatcher":{"stores":{"PageStore":{"currentPageName":"quote","currentRenderTargetId":"default","pagesConfigRaw":{"base":{"quote":{"layout":{"bundleName":"yahoodotcom-layout.TwoColumnLayout","name":"TwoColumnLayout","config":{"enableHeaderCollapse":true,"Header":{"isFixed":true,"uhContainerClasses":"Bgi($uhGrayGradient)","navContainerClasses":"Bgi($navrailGrayGradient) Bxsh($navrailShadow) Pos(r) hasScrolled_Bxsh(headerShadow) Panel-open_Bxsh(headerShadow)","navTransitionClasses":"HideNavrail_Translate3d(0,-46px,0) Panel-open_Translate3d(0,-46px,0)","secondaryNavContainerClasses":"hasScrolled_Bdbw(0px) Bxsh($navrailShadow)","height":135},"fetchNewAttribution":true},"meta":{"property":{"twitter:site":"@YahooFinance"}}},"meta":{"property":{"twitter:site":"@YahooFinance","fb:pages":"90376669494"}},"regions":{"SecondaryNav":[{"bundleName":"react-finance","name":"SecondaryNav","config":{"ui":{"enableRelativeUrl":true}},"props":{"key":"SecondaryNav-0-SecondaryNav","id":"SecondaryNav-0-SecondaryNav"},"isPageComposite":true}],"Overlay":[{"bundleName":"react-lightbox","name":"Lightbox","props":{"key":"Overlay-0-Lightbox","id":"Overlay-0-Lightbox"},"isPageComposite":true},{"bundleName":"td-app-finance","name":"Null","props":{"key":"Overlay-1-Null","id":"Overlay-1-Null"},"isPageComposite":true},{"bundleName":"td-app-finance","name":"Null","props":{"key":"Overlay-2-Null","id":"Overlay-2-Null"},"isPageComposite":true}],"Lead":[{"bundleName":"react-finance","name":"FinanceHeader","props":{"className":"Bxz(bb) H(100%) Pos(r) Maw($newGridWidth) Miw($minGridWidth) Miw(a)!--tab768 Miw(a)!--tab1024 Mstart(a) Mend(a) Px(20px) My(10px)","showAds":true,"adsConfig":{"positions":["FB2A","FB2B","FB2C","FB2D"]},"key":"Lead-0-FinanceHeader","id":"Lead-0-FinanceHeader"},"isPageComposite":true},{"bundleName":"tdv2-applet-featurebar","name":"FeatureBar","config":{"ui":{"container_classnames":"W(100%) Bxz(bb) Bdrs(2px) Mb(10px) Maw($maxModuleWidth) Miw($minGridWidth) Miw(a)!--tab768 Miw(a)!--tab1024 Mx(a)","prerender":{"enabled":true,"renderTargetId":"modal"}},"site":"finance"},"props":{"key":"Lead-1-FeatureBar","id":"Lead-1-FeatureBar"},"isPageComposite":true},{"bundleName":"QuotePage","name":"QuoteHeader","props":{"key":"Lead-2-QuoteHeader","id":"Lead-2-QuoteHeader"},"isPageComposite":true},{"bundleName":"QuotePage","name":"QuoteNav","props":{"key":"Lead-3-QuoteNav","id":"Lead-3-QuoteNav"},"isPageComposite":true}],"Col1":[{"bundleName":"td-ads","name":"Ad","props":{"pos":"LDRB","style":{"marginBottom":"8px","paddingTop":"0px","marginLeft":"auto","marginRight":"auto","textAlign":"center","lineHeight":"0px","position":"relative","zIndex":"5"},"key":"Col1-0-Ad","id":"Col1-0-Ad"},"isPageComposite":true},{"bundleName":"Quote.financials","name":"Financials","props":{"key":"Col1-1-Financials","id":"Col1-1-Financials"},"isPageComposite":true},{"bundleName":"react-finance","name":"AdUnitWithTdAds","props":{"className":"ad-foot","positions":["FOOT"],"key":"Col1-2-AdUnitWithTdAds","id":"Col1-2-AdUnitWithTdAds"},"isPageComposite":true},{"bundleName":"react-finance","name":"AdUnitWithTdAds","props":{"className":"ad-fsrvy","positions":["FSRVY"],"key":"Col1-3-AdUnitWithTdAds","id":"Col1-3-AdUnitWithTdAds"},"isPageComposite":true}],"Col2":[{"bundleName":"td-app-finance","name":"ExtPromoButton","props":{"className":"btn Bds(s) Bdc($c-fuji-grey-c) Bdrs(4px) Bgc($white) Bdw(1px) Bgc($ExtButtonHov):h C($white):h C($ExtButtonHov) Cur(p) Fz(s) Fw(b) H(44px) Lh(40px) Mb(20px) Ta(c) Td(n) W(100%)","sec":"ext-promo-all-mkt-submit","titleId":"EXTENSION_PROMO_TITLE","url":"https:\u002F\u002Fchrome.google.com\u002Fwebstore\u002Fdetail\u002Fdoojmkhhplhicnghmafjbhncmgjiohma","enabled":true,"key":"Col2-0-ExtPromoButton","id":"Col2-0-ExtPromoButton"},"isPageComposite":true},{"bundleName":"QuotePage","name":"QuoteModule","props":{"type":"eventPromo","key":"Col2-1-QuoteModule","id":"Col2-1-QuoteModule"},"isPageComposite":true},{"bundleName":"td-ads","name":"ComboAd","props":{"adparseStyle":{"marginBottom":"20px"},"finishedStyle":{"marginBottom":"20px"},"children":[{"bundleName":"td-ads","name":"Ad","props":{"pos":"LREC"}},{"bundleName":"td-ads","name":"Ad","props":{"pos":"MON"}}],"serverHeight":true,"key":"Col2-2-ComboAd","id":"Col2-2-ComboAd"},"isPageComposite":true},{"bundleName":"QuotePage","name":"QuoteModule","props":{"type":"similarCompanies","key":"Col2-3-QuoteModule","id":"Col2-3-QuoteModule"},"initMode":{"deferRender":true},"isPageComposite":true},{"bundleName":"QuotePage","name":"QuoteModule","props":{"type":"earningsChart","key":"Col2-4-QuoteModule","id":"Col2-4-QuoteModule"},"initMode":{"deferRender":true},"isPageComposite":true},{"bundleName":"QuotePage","name":"QuoteModule","props":{"type":"financialsChart","key":"Col2-5-QuoteModule","id":"Col2-5-QuoteModule"},"initMode":{"deferRender":true},"isPageComposite":true},{"bundleName":"react-finance",..."}}}};
Run Code Online (Sandbox Code Playgroud)
这是我得到的Android输出;
(root.App.main = {"context":{"dispatcher":{"stores":{"PageStore":{"currentPageName":"quote","currentRenderTargetId":"default","pagesConfigRaw":{"base":{"quote":{"layout":{"bundleName":"yahoodotcom-layout.TwoColumnLayout","name":"TwoColumnLayout","config":{"enableHeaderCollapse":true,"Header":{"isFixed":true,"uhContainerClasses":"Bgi($uhGrayGradient)","navContainerClasses":"Bgi($navrailGrayGradient) Bxsh($navrailShadow) Pos(r) hasScrolled_Bxsh(headerShadow) Panel-open_Bxsh(headerShadow)","navTransitionClasses":"HideNavrail_Translate3d(0,-46px,0) Panel-open_Translate3d(0,-46px,0)","secondaryNavContainerClasses":"hasScrolled_Bdbw(0px) Bxsh($navrailShadow)","height":135},"fetchNewAttribution":true},"meta":{"property":{"twitter:site":"@YahooFinance"}}},"meta":{"property":{"twitter:site":"@YahooFinance","fb:pages":"90376669494"}},"regions":{"SecondaryNav":[{"bundleName":"react-finance","name":"SecondaryNav","config":{"ui":{"enableRelativeUrl":true}},"props":{"key":"SecondaryNav-0-SecondaryNav","id":"SecondaryNav-0-SecondaryNav"},"isPageComposite":true}],"Overlay":[{"bundleName":"react-lightbox","name":"Lightbox","props":{"key":"Overlay-0-Lightbox","id":"Overlay-0-Lightbox"},"isPageComposite":true},{"bundleName":"td-app-finance","name":"Null","props":{"key":"Overlay-1-Null","id":"Overlay-1-Null"},"isPageComposite":true},{"bundleName":"td-app-finance","name":"Null","props":{"key":"Overlay-2-Null","id":"Overlay-2-Null"},"isPageComposite":true}],"Lead":[{"bundleName":"react-finance","name":"FinanceHeader","props":{"className":"Bxz(bb) H(100%) Pos(r) Maw($newGridWidth) Miw($minGridWidth) Miw(a)!--tab768 Miw(a)!--tab1024 Mstart(a) Mend(a) Px(20px) My(10px)","showAds":true,"adsConfig":{"positions":["FB2A","FB2B","FB2C","FB2D"]},"key":"Lead-0-FinanceHeader","id":"Lead-0-FinanceHeader"},"isPageComposite":true},{"bundleName":"tdv2-applet-featurebar","name":"FeatureBar","config":{"ui":{"container_classnames":"W(100%) Bxz(bb) Bdrs(2px) Mb(10px) Maw($maxModuleWidth) Miw($minGridWidth) Miw(a)!--tab768 …Run Code Online (Sandbox Code Playgroud) jsoup ×10
java ×9
html ×3
html-parsing ×3
parsing ×2
ajax ×1
android ×1
crawler4j ×1
javascript ×1
plaintext ×1
web-crawler ×1
xml-parsing ×1