标签: jsoup

在Java中将HTML转换为纯文本

我需要将HTML转换为纯文本.我对格式化的唯一要求是在纯文本中保留新行.新行不仅应显示在<br>其他标签的情况下,例如<tr/>,也应显示</p>新行.

用于测试的示例HTML页面是:

请注意,这些只是随机网址.

我已经尝试了在这个StackOverflow问题的答案中提到的各种库(JSoup,Javax.swing,Apache utils)来将HTML转换为纯文本.

使用JSoup的示例:

public class JSoupTest {

 @Test
 public void SimpleParse() {
  try {
   Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
   System.out.print(doc.text());

  } catch (IOException e) {
   // TODO Auto-generated catch block
   e.printStackTrace();
  }
 }
}
Run Code Online (Sandbox Code Playgroud)

HTMLEditorKit示例:

import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Html2Text extends HTMLEditorKit.ParserCallback {
 StringBuffer s;

 public Html2Text() {}

 public void parse(Reader in) throws IOException {
   s = new StringBuffer();
   ParserDelegator delegator = new ParserDelegator();
   // the …
Run Code Online (Sandbox Code Playgroud)

java parsing plaintext htmleditorkit jsoup

10
推荐指数
2
解决办法
4万
查看次数

使用jsoup替换HTML标记

这是我的代码

String html = "<font>fsdfs<font>dfsdf</font>dasdasd</font>";
Document doc = Jsoup.parse(html);
Elements elements = doc.select("font");
for(Element element : elements)
{
element.replaceWith(new Element(Tag.valueOf("span"),"").html(element.html()));
}


System.out.println(doc.html());
Run Code Online (Sandbox Code Playgroud)

我想替换字体标记并放置span标记.在此,它将替换第一个字体标记,但不替换第二个标记

java jsoup

10
推荐指数
1
解决办法
1万
查看次数

如何在<p>标签之间提取文本

我想提取的HTML页面(S),其放置在文本pli标签,这样我就可以开始来标记页面构造为每个页面反向索引(ES)为了回答搜索查询.

我如何p使用jsoup 获取标签

Elements e = doc.select(""); 
Run Code Online (Sandbox Code Playgroud)

可能是该参数中要写的字符串是什么?

html java parsing jsoup

10
推荐指数
1
解决办法
3万
查看次数

如何创建包含JS的JSoup文档?

我创建了一个包含两个<script type="application/javascript">/* .. */</script>元素的JSoup文档.

问题:当我打电话.html().toString()JSoup将逃脱我的JavaScript.

if (foo && bar) 
Run Code Online (Sandbox Code Playgroud)

得到

if (foo &amp;&amp; bar)
Run Code Online (Sandbox Code Playgroud)

是否可以配置JSoup以<script>在转义时忽略元素?


这(基本上)是我如何创建我的jsoup文档.

final Document doc = Document.createShell("");
final Element head = doc.head();
head.appendElement("meta").attr("charset", "utf-8");
final String myJS = ...;
head.appendElement("script").attr("type", "application/javascript").text(myJS);
Run Code Online (Sandbox Code Playgroud)

我目前的解决方法是用String.replaceon 替换占位符.html().但这有点像黑客.

head.appendElement("script").attr("type", "application/javascript").text("$MYJS$");
String s = doc.html();
s = s.replace("$MYJS$", myJS);
Run Code Online (Sandbox Code Playgroud)

javascript java jsoup

10
推荐指数
1
解决办法
1314
查看次数

如何解析HTML中的文本

如何使用java使用jsoup解析网页中的文本?

java jsoup

9
推荐指数
1
解决办法
2万
查看次数

为什么"div [class = mncls sbucls]"工作而"div.mncls sbucls"不工作?

以下Jsoup语句有效:

 Elements divs = document.select("div[class=mncls sbucls]");
Run Code Online (Sandbox Code Playgroud)

但相当的声明:

 Elements divs = document.select("div.mncls sbucls");
Run Code Online (Sandbox Code Playgroud)

不行.

为什么?

Jsoup是否有带空格的类名有问题?

html html-parsing jsoup

9
推荐指数
1
解决办法
2677
查看次数

通过jSoup从Div标签获取属性值

我有一个Div标签如下

<div id="eventTTL" style="text-transform: uppercase; font-weight: 900;" eventTTL="4583476000">5 days 07:14:41</div>
Run Code Online (Sandbox Code Playgroud)

我如何获得eventTTL的价值?我想显示eventTTL的值,即:)"4583476000".

java html-parsing jsoup

9
推荐指数
1
解决办法
1万
查看次数

Jsoup:xml标记中冒号时的SelectorParseException

当xml标记有冒号时抛出异常,

例外:

org.jsoup.select.Selector $ SelectorParseException:无法解析查询'w:r':':r'处的意外标记

XML:

<w:r>
 <w:rPr>
   <w:rStyle w:val="jid"/>
 </w:rPr>
 <w:t>AN</w:t>
</w:r>
Run Code Online (Sandbox Code Playgroud)

Java代码:

    org.jsoup.nodes.Document doc = Jsoup.parse(documentXmlString);
Run Code Online (Sandbox Code Playgroud)

这里documentXmlString具有上面指定的xml

java xml-parsing jsoup

9
推荐指数
2
解决办法
3541
查看次数

Crawler4j与Jsoup一起用于Java中的页面爬行和解析

我想获取页面的内容并提取其中的特定部分.据我所知,至少有两种解决方案可以完成这样的任务:Crawler4jJsoup.

它们都能够检索页面的内容并提取它的子部分.我唯一不明白的是它们之间的区别是什么?有一个类似的问题,标记为已回答:

Crawler4j是一个爬虫,Jsoup是一个解析器.

但我刚刚检查过,除了解析功能外,Jsoup 1.8.3还能够抓取页面,而Crawler4j不仅可以抓取页面而且可以解析其内容.

那么,请你澄清Crawler4j和Jsoup之间的区别吗?

java web-crawler html-parsing jsoup crawler4j

9
推荐指数
1
解决办法
3817
查看次数

如何从网页的(一个标签内)的HTML页面源中提取数据?

我尝试了其他答案中指定的几种解决方案,例如尝试使用不同的用户代理(Chrome,safari等),并使用HTTPClient和BufferedReader直接获取HTML,但它们都不起作用.如何使Android输出类似于Web输出?这是我正在寻找的网络输出; (查看完整输出的https://finance.yahoo.com/quote/AAPL/financials?p=AAPL的页面源- 这基本上包含名为"Quarterly"的AJAX选项卡,其中包含一个表.我需要获取该数据,但Android HTML源代码没有它,但网络源代码确实如此.)

root.App.main = {"context":{"dispatcher":{"stores":{"PageStore":{"currentPageName":"quote","currentRenderTargetId":"default","pagesConfigRaw":{"base":{"quote":{"layout":{"bundleName":"yahoodotcom-layout.TwoColumnLayout","name":"TwoColumnLayout","config":{"enableHeaderCollapse":true,"Header":{"isFixed":true,"uhContainerClasses":"Bgi($uhGrayGradient)","navContainerClasses":"Bgi($navrailGrayGradient) Bxsh($navrailShadow) Pos(r) hasScrolled_Bxsh(headerShadow) Panel-open_Bxsh(headerShadow)","navTransitionClasses":"HideNavrail_Translate3d(0,-46px,0) Panel-open_Translate3d(0,-46px,0)","secondaryNavContainerClasses":"hasScrolled_Bdbw(0px) Bxsh($navrailShadow)","height":135},"fetchNewAttribution":true},"meta":{"property":{"twitter:site":"@YahooFinance"}}},"meta":{"property":{"twitter:site":"@YahooFinance","fb:pages":"90376669494"}},"regions":{"SecondaryNav":[{"bundleName":"react-finance","name":"SecondaryNav","config":{"ui":{"enableRelativeUrl":true}},"props":{"key":"SecondaryNav-0-SecondaryNav","id":"SecondaryNav-0-SecondaryNav"},"isPageComposite":true}],"Overlay":[{"bundleName":"react-lightbox","name":"Lightbox","props":{"key":"Overlay-0-Lightbox","id":"Overlay-0-Lightbox"},"isPageComposite":true},{"bundleName":"td-app-finance","name":"Null","props":{"key":"Overlay-1-Null","id":"Overlay-1-Null"},"isPageComposite":true},{"bundleName":"td-app-finance","name":"Null","props":{"key":"Overlay-2-Null","id":"Overlay-2-Null"},"isPageComposite":true}],"Lead":[{"bundleName":"react-finance","name":"FinanceHeader","props":{"className":"Bxz(bb) H(100%) Pos(r) Maw($newGridWidth) Miw($minGridWidth) Miw(a)!--tab768 Miw(a)!--tab1024 Mstart(a) Mend(a) Px(20px) My(10px)","showAds":true,"adsConfig":{"positions":["FB2A","FB2B","FB2C","FB2D"]},"key":"Lead-0-FinanceHeader","id":"Lead-0-FinanceHeader"},"isPageComposite":true},{"bundleName":"tdv2-applet-featurebar","name":"FeatureBar","config":{"ui":{"container_classnames":"W(100%) Bxz(bb) Bdrs(2px) Mb(10px) Maw($maxModuleWidth) Miw($minGridWidth) Miw(a)!--tab768 Miw(a)!--tab1024 Mx(a)","prerender":{"enabled":true,"renderTargetId":"modal"}},"site":"finance"},"props":{"key":"Lead-1-FeatureBar","id":"Lead-1-FeatureBar"},"isPageComposite":true},{"bundleName":"QuotePage","name":"QuoteHeader","props":{"key":"Lead-2-QuoteHeader","id":"Lead-2-QuoteHeader"},"isPageComposite":true},{"bundleName":"QuotePage","name":"QuoteNav","props":{"key":"Lead-3-QuoteNav","id":"Lead-3-QuoteNav"},"isPageComposite":true}],"Col1":[{"bundleName":"td-ads","name":"Ad","props":{"pos":"LDRB","style":{"marginBottom":"8px","paddingTop":"0px","marginLeft":"auto","marginRight":"auto","textAlign":"center","lineHeight":"0px","position":"relative","zIndex":"5"},"key":"Col1-0-Ad","id":"Col1-0-Ad"},"isPageComposite":true},{"bundleName":"Quote.financials","name":"Financials","props":{"key":"Col1-1-Financials","id":"Col1-1-Financials"},"isPageComposite":true},{"bundleName":"react-finance","name":"AdUnitWithTdAds","props":{"className":"ad-foot","positions":["FOOT"],"key":"Col1-2-AdUnitWithTdAds","id":"Col1-2-AdUnitWithTdAds"},"isPageComposite":true},{"bundleName":"react-finance","name":"AdUnitWithTdAds","props":{"className":"ad-fsrvy","positions":["FSRVY"],"key":"Col1-3-AdUnitWithTdAds","id":"Col1-3-AdUnitWithTdAds"},"isPageComposite":true}],"Col2":[{"bundleName":"td-app-finance","name":"ExtPromoButton","props":{"className":"btn Bds(s) Bdc($c-fuji-grey-c) Bdrs(4px) Bgc($white) Bdw(1px) Bgc($ExtButtonHov):h C($white):h C($ExtButtonHov) Cur(p) Fz(s) Fw(b) H(44px) Lh(40px) Mb(20px) Ta(c) Td(n) W(100%)","sec":"ext-promo-all-mkt-submit","titleId":"EXTENSION_PROMO_TITLE","url":"https:\u002F\u002Fchrome.google.com\u002Fwebstore\u002Fdetail\u002Fdoojmkhhplhicnghmafjbhncmgjiohma","enabled":true,"key":"Col2-0-ExtPromoButton","id":"Col2-0-ExtPromoButton"},"isPageComposite":true},{"bundleName":"QuotePage","name":"QuoteModule","props":{"type":"eventPromo","key":"Col2-1-QuoteModule","id":"Col2-1-QuoteModule"},"isPageComposite":true},{"bundleName":"td-ads","name":"ComboAd","props":{"adparseStyle":{"marginBottom":"20px"},"finishedStyle":{"marginBottom":"20px"},"children":[{"bundleName":"td-ads","name":"Ad","props":{"pos":"LREC"}},{"bundleName":"td-ads","name":"Ad","props":{"pos":"MON"}}],"serverHeight":true,"key":"Col2-2-ComboAd","id":"Col2-2-ComboAd"},"isPageComposite":true},{"bundleName":"QuotePage","name":"QuoteModule","props":{"type":"similarCompanies","key":"Col2-3-QuoteModule","id":"Col2-3-QuoteModule"},"initMode":{"deferRender":true},"isPageComposite":true},{"bundleName":"QuotePage","name":"QuoteModule","props":{"type":"earningsChart","key":"Col2-4-QuoteModule","id":"Col2-4-QuoteModule"},"initMode":{"deferRender":true},"isPageComposite":true},{"bundleName":"QuotePage","name":"QuoteModule","props":{"type":"financialsChart","key":"Col2-5-QuoteModule","id":"Col2-5-QuoteModule"},"initMode":{"deferRender":true},"isPageComposite":true},{"bundleName":"react-finance",..."}}}};
Run Code Online (Sandbox Code Playgroud)

这是我得到的Android输出;

(root.App.main = {"context":{"dispatcher":{"stores":{"PageStore":{"currentPageName":"quote","currentRenderTargetId":"default","pagesConfigRaw":{"base":{"quote":{"layout":{"bundleName":"yahoodotcom-layout.TwoColumnLayout","name":"TwoColumnLayout","config":{"enableHeaderCollapse":true,"Header":{"isFixed":true,"uhContainerClasses":"Bgi($uhGrayGradient)","navContainerClasses":"Bgi($navrailGrayGradient) Bxsh($navrailShadow) Pos(r) hasScrolled_Bxsh(headerShadow) Panel-open_Bxsh(headerShadow)","navTransitionClasses":"HideNavrail_Translate3d(0,-46px,0) Panel-open_Translate3d(0,-46px,0)","secondaryNavContainerClasses":"hasScrolled_Bdbw(0px) Bxsh($navrailShadow)","height":135},"fetchNewAttribution":true},"meta":{"property":{"twitter:site":"@YahooFinance"}}},"meta":{"property":{"twitter:site":"@YahooFinance","fb:pages":"90376669494"}},"regions":{"SecondaryNav":[{"bundleName":"react-finance","name":"SecondaryNav","config":{"ui":{"enableRelativeUrl":true}},"props":{"key":"SecondaryNav-0-SecondaryNav","id":"SecondaryNav-0-SecondaryNav"},"isPageComposite":true}],"Overlay":[{"bundleName":"react-lightbox","name":"Lightbox","props":{"key":"Overlay-0-Lightbox","id":"Overlay-0-Lightbox"},"isPageComposite":true},{"bundleName":"td-app-finance","name":"Null","props":{"key":"Overlay-1-Null","id":"Overlay-1-Null"},"isPageComposite":true},{"bundleName":"td-app-finance","name":"Null","props":{"key":"Overlay-2-Null","id":"Overlay-2-Null"},"isPageComposite":true}],"Lead":[{"bundleName":"react-finance","name":"FinanceHeader","props":{"className":"Bxz(bb) H(100%) Pos(r) Maw($newGridWidth) Miw($minGridWidth) Miw(a)!--tab768 Miw(a)!--tab1024 Mstart(a) Mend(a) Px(20px) My(10px)","showAds":true,"adsConfig":{"positions":["FB2A","FB2B","FB2C","FB2D"]},"key":"Lead-0-FinanceHeader","id":"Lead-0-FinanceHeader"},"isPageComposite":true},{"bundleName":"tdv2-applet-featurebar","name":"FeatureBar","config":{"ui":{"container_classnames":"W(100%) Bxz(bb) Bdrs(2px) Mb(10px) Maw($maxModuleWidth) Miw($minGridWidth) Miw(a)!--tab768 …
Run Code Online (Sandbox Code Playgroud)

html java ajax android jsoup

9
推荐指数
1
解决办法
817
查看次数