<td width="10"></td>
<td width="65"><img src="/images/sparks/NIFTY.png" /></td>
<td width="65">5,390.85</td>
<td width="65">5,428.15</td>
<td width="65">5,376.15</td>
<td width="65">5,413.85</td>
This is the HTML source, and I have to extract the values 5390.85, 5428.15, 5376.15 and 5413.85 from it. I want to do this with jsoup, but I'm relatively new to jsoup (I started using it today). How do I do that?
URL url = new URL("http://www.nseindia.com/content/equities/niftysparks.htm");
Document doc = Jsoup.parse(url,3*1000);
String text = doc.body().text();
I have already fetched the site's content with jsoup, but how do I extract the values I need? Thanks in advance.
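One possible approach, sketched under the assumption that the wanted cells are the only `<td>` elements whose text looks like a number with a decimal part: select every `td` and keep the ones that match a numeric pattern. The class name `NiftyCells` and the helper `numericCells` are made up for illustration, and the sample markup wraps the cells from the question in a `<table><tr>` so the HTML parser keeps them.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class NiftyCells {
    // Hypothetical helper: collects the text of every <td> that looks like
    // a price such as "5,390.85" and strips the thousands separator.
    static List<String> numericCells(String html) {
        Document doc = Jsoup.parse(html);
        List<String> values = new ArrayList<>();
        for (Element td : doc.select("td")) {
            String text = td.text().trim();
            if (text.matches("[\\d,]+\\.\\d+")) {
                values.add(text.replace(",", ""));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Sample markup built from the cells shown in the question.
        String html = "<table><tr>"
                + "<td width=\"10\"></td>"
                + "<td width=\"65\"><img src=\"/images/sparks/NIFTY.png\" /></td>"
                + "<td width=\"65\">5,390.85</td>"
                + "<td width=\"65\">5,428.15</td>"
                + "<td width=\"65\">5,376.15</td>"
                + "<td width=\"65\">5,413.85</td>"
                + "</tr></table>";
        System.out.println(numericCells(html));
    }
}
```

On the real page a narrower selector (for example one scoped to the row that contains the NIFTY image) would probably be safer than matching every numeric cell.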
I have a jsoup document that looks like this:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
How do I convert this doc to a string?
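A minimal sketch of the two usual readings of "to a string": `outerHtml()` for the serialized markup and `text()` for the visible text only. The class and method names here are illustrative, and the document is parsed from a literal instead of fetched so the example runs offline.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class DocToString {
    // Serialized HTML of the whole document (doc.toString() gives the same result).
    static String markup(Document doc) {
        return doc.outerHtml();
    }

    // Only the human-readable text, with all tags removed.
    static String visibleText(Document doc) {
        return doc.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<html><body><p>Hello</p></body></html>");
        System.out.println(markup(doc));
        System.out.println(visibleText(doc));
    }
}
```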
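I'm using JSoup to authenticate and then connect to a website. Some of the URLs return JSON responses (because part of the site is built with AJAX). Can JSoup handle JSON responses?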
Connection.Response doc = Jsoup.connect("...")
.data(...)
.cookie(...)
.header(...)
.method(Method.POST)
.execute();
String result = doc.body();
In my case, the body is "".
Is there a JSoup-like library for JSON?
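For what it's worth, jsoup only parses HTML and XML, but `Connection.Response.body()` returns the raw response text, which can then be handed to a separate JSON library. By default jsoup rejects content types it does not treat as HTML or XML, so a JSON endpoint needs `ignoreContentType(true)`. The sketch below only builds the request; the class name, helper name, and URL are placeholders, and this may or may not explain the empty body in the question.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.io.IOException;

final class JsonRequest {
    // Hypothetical helper: prepares a POST whose response may be JSON.
    // ignoreContentType(true) stops jsoup from rejecting non-HTML responses.
    static Connection jsonPost(String url) {
        return Jsoup.connect(url)
                .ignoreContentType(true)
                .method(Connection.Method.POST);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: JsonRequest <url>");
            return;
        }
        // This performs a real network call; body() is the raw response text.
        Connection.Response res = jsonPost(args[0]).execute();
        System.out.println(res.body());
    }
}
```

Any `data(...)`, `cookie(...)` and `header(...)` calls from the original request would be chained onto the returned `Connection` before `execute()`.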
It works fine over HTTP, but when I try an HTTPS source it throws the following exception:
10-12 13:22:11.169: WARN/System.err(332): javax.net.ssl.SSLHandshakeException: java.security.cert.CertPathValidatorException: Trust anchor for certification path not found.
10-12 13:22:11.179: WARN/System.err(332): at org.apache.harmony.xnet.provider.jsse.OpenSSLSocketImpl.startHandshake(OpenSSLSocketImpl.java:477)
10-12 13:22:11.179: WARN/System.err(332): at org.apache.harmony.xnet.provider.jsse.OpenSSLSocketImpl.startHandshake(OpenSSLSocketImpl.java:328)
10-12 13:22:11.179: WARN/System.err(332): at org.apache.harmony.luni.internal.net.www.protocol.http.HttpConnection.setupSecureSocket(HttpConnection.java:185)
10-12 13:22:11.179: WARN/System.err(332): at org.apache.harmony.luni.internal.net.www.protocol.https.HttpsURLConnectionImpl$HttpsEngine.makeSslConnection(HttpsURLConnectionImpl.java:433)
10-12 13:22:11.189: WARN/System.err(332): at org.apache.harmony.luni.internal.net.www.protocol.https.HttpsURLConnectionImpl$HttpsEngine.makeConnection(HttpsURLConnectionImpl.java:378)
10-12 13:22:11.189: WARN/System.err(332): at org.apache.harmony.luni.internal.net.www.protocol.http.HttpURLConnectionImpl.connect(HttpURLConnectionImpl.java:205)
10-12 13:22:11.189: WARN/System.err(332): at org.apache.harmony.luni.internal.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:152)
10-12 13:22:11.189: WARN/System.err(332): at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:377)
10-12 13:22:11.189: WARN/System.err(332): at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:364)
10-12 13:22:11.189: WARN/System.err(332): at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:143)
Here is the relevant code:
try {
doc = Jsoup.connect("https url here").get();
} catch (IOException e) {
Log.e("sys","coudnt get the html"); …
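I'm building a small Android app for a class that finds cancer-related events on the American Cancer Society's website. I've been using JSoup to fetch basic information about the events, and I'm trying to use the select() method to pull specific information from the site. However, my current approach grabs far more HTML nodes than I want, and I can't figure out why. The table I'm trying to grab looks like this:
EDIT: I realized that the element with id="pnlResults" doesn't end with that table; it ends about three tables later, and all of those tables contain information I want to fetch. Here is the table: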
<div id="pnlResults">
<h2><span id="lblEventName">American Cancer Society 44th Annual Walter Hagen Golf Tournament</span></h2>
<!-- General Information Box -->
<div class="text-box boxed wide">
<h3 class="head" style="width:97%;">
General Information
</h3>
<div class="content">
<p>
<label>Event Times:</label><span id="lblStartDate">Monday, July 30, 2012</span><span id="lblEndDate"></span><br />
<label> </label><span id="lblStartTime">10:00 AM</span> - <span id="lblEndTime">9:00 PM</span>
</p>
<p>
<label>Time Zone:</label><span id="lblTimeZone">Eastern</span>
</p>
<p>
<label>Description:</label><span id="lblDesc" class="fieldData long">The American Cancer Society Walter Hagen Golf Tournament highlights the Society’s role in supporting research and patient care here in Rochester. …
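Since every value in the markup above sits in a `<span>` with its own id, one option is to select by id inside `#pnlResults` instead of selecting whole tables. This is only a sketch: `EventFields` and `field` are made-up names, and the sample HTML is a shortened copy of the snippet from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class EventFields {
    // Hypothetical helper: text of the element with the given id inside
    // #pnlResults, or "" when nothing matches.
    static String field(Document doc, String id) {
        return doc.select("#pnlResults #" + id).text();
    }

    public static void main(String[] args) {
        String html = "<div id=\"pnlResults\">"
                + "<h2><span id=\"lblEventName\">American Cancer Society 44th Annual Walter Hagen Golf Tournament</span></h2>"
                + "<p><span id=\"lblStartDate\">Monday, July 30, 2012</span></p>"
                + "<p><span id=\"lblStartTime\">10:00 AM</span> - <span id=\"lblEndTime\">9:00 PM</span></p>"
                + "</div>";
        Document doc = Jsoup.parse(html);
        System.out.println(field(doc, "lblEventName"));
        System.out.println(field(doc, "lblStartDate"));
        System.out.println(field(doc, "lblStartTime") + " - " + field(doc, "lblEndTime"));
    }
}
```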
I'm using JSoup to parse content from http://www.latijnengrieks.com/vertaling.php?id=5368. This is a third-party site that does not declare the correct encoding. I load the data with the following code:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Loader {
    public static void main(String[] args) {
        String url = "http://www.latijnengrieks.com/vertaling.php?id=5368";
        Document doc;
        try {
            doc = Jsoup.connect(url).timeout(5000).get();
            Element content = doc.select("div.kader").first();
            Element contenttableElement = content.getElementsByClass("kopje").first().parent().parent();
            String contenttext = content.html();
            String tabletext = contenttableElement.html();
            // Mark <br> tags with a placeholder so line breaks survive text(),
            // then turn the placeholder into real newlines.
            contenttext = Jsoup.parse(contenttext.replaceAll("(?i)<br[^>]*>", "br2n")).text();
            contenttext = contenttext.replace("br2n", "\n");
            tabletext = Jsoup.parse(tabletext.replaceAll("(?i)<br[^>]*>", "br2n")).text();
            tabletext = tabletext.replace("br2n", "\n");
            // Skip the leading table text so only the translation body remains.
            String text = contenttext.substring(tabletext.length(), contenttext.length());
            System.out.println(text);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This gives the following output:
Aeneas dwaalt rond in Troje …
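When a page does not declare its encoding, jsoup can be told which charset to use through `Jsoup.parse(InputStream, String, String)`. The sketch below assumes, purely for illustration, that the bytes are ISO-8859-1; the real charset of the site would have to be checked. `CharsetParse` and `parseAs` are made-up names, and the bytes are built locally so the example runs offline.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class CharsetParse {
    // Hypothetical helper: decodes raw HTML bytes with an explicit charset
    // instead of letting jsoup guess.
    static Document parseAs(byte[] rawHtml, String charsetName, String baseUri) {
        try {
            return Jsoup.parse(new ByteArrayInputStream(rawHtml), charsetName, baseUri);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // For a live page the bytes could come from
        // Jsoup.connect(url).execute().bodyAsBytes().
        byte[] raw = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parseAs(raw, "ISO-8859-1", "http://example.com/");
        System.out.println(doc.text());
    }
}
```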
I'm trying to get all the information contained in the div with class bg_block_info, but I'm getting the information from another div, <div class="bg_block_info pad_20">. What am I doing wrong?
Document doc = Jsoup.connect("http://www.maib.md").get();
Elements myin = doc.getElementsByClass("bg_block_info");
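Class matching works per class name, so `getElementsByClass("bg_block_info")` also returns elements whose class list merely contains that name, such as `bg_block_info pad_20`. One way to keep only the divs whose `class` attribute is exactly that value is an attribute selector. The helper name below is illustrative, and the markup is a reduced stand-in for the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

final class ExactClass {
    // Hypothetical helper: divs whose class attribute equals className exactly.
    static Elements exactClass(Document doc, String className) {
        return doc.select("div[class=" + className + "]");
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
                "<div class=\"bg_block_info\">A</div>"
                + "<div class=\"bg_block_info pad_20\">B</div>");
        // Matches both divs, because both carry the class name.
        System.out.println(doc.getElementsByClass("bg_block_info").size());
        // Matches only the div whose class attribute is exactly "bg_block_info".
        System.out.println(exactClass(doc, "bg_block_info").text());
    }
}
```

A selector such as `div.bg_block_info:not(.pad_20)` is an alternative when only one extra class needs to be excluded.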
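EDIT: My ProGuard version is 4.7.
Today I tried to include jsoup (version 1.7.1) in my Android app, but it has given me a lot of trouble. The app force-closes whenever I export the signed apk with proguard turned on; when I disable proguard and export the apk, it runs perfectly. Please help me. How can I fix this error? The stack trace is below: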
java.lang.RuntimeException: An error occured while executing doInBackground()
at android.os.AsyncTask$3.done(AsyncTask.java:278)
at java.util.concurrent.FutureTask$Sync.innerSetException(FutureTask.java:273)
at java.util.concurrent.FutureTask.setException(FutureTask.java:124)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:307)
at java.util.concurrent.FutureTask.run(FutureTask.java:137)
at android.os.AsyncTask$SerialExecutor$1.run(AsyncTask.java:208)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1076)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:569)
at java.lang.Thread.run(Thread.java:856)
Caused by: java.lang.ExceptionInInitializerError
at org.jsoup.nodes.Document$OutputSettings.<init>(Unknown Source)
at org.jsoup.nodes.Document.<init>(Unknown Source)
at org.jsoup.parser.TreeBuilder.void initialiseParse(java.lang.String,java.lang.String,org.jsoup.parser.ParseErrorList)(Unknown Source)
at org.jsoup.parser.TreeBuilder.org.jsoup.nodes.Document parse(java.lang.String,java.lang.String,org.jsoup.parser.ParseErrorList)(Unknown Source)
boolean process(org.jsoup.parser.Token)
at org.jsoup.parser.HtmlTreeBuilder.org.jsoup.nodes.Document parse(java.lang.String,java.lang.String,org.jsoup.parser.ParseErrorList)(Unknown Source)
boolean process(org.jsoup.parser.Token)
boolean process(org.jsoup.parser.Token,org.jsoup.parser.HtmlTreeBuilderState)
void transition(org.jsoup.parser.HtmlTreeBuilderState)
org.jsoup.parser.HtmlTreeBuilderState state()
void …
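I'm using the Jsoup library to read a URL. The page contains text inside several <script> tags. Can I get the text inside each <script> tag? Note that I'm not asking to parse the JavaScript, since I already know JSoup doesn't do that. The page's actual source has text inside the script tags, and I need it.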
doc = Jsoup.connect("http://www.example.com").timeout(10000).get();
Element div = doc.select("script").first();
for (Element element : div.children()) {
System.out.println(element.toString());
}
Here is one of the script tags from the source:
<script type="text/javascript">
(function() {
...
})();
</script>
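A script element has no child elements, so iterating over `children()` of the first `<script>` prints nothing. jsoup keeps a script's contents as data, which `Element.data()` returns. The sketch below loops over every script tag; the class and method names are illustrative, and the HTML is a small stand-in for the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class ScriptText {
    // Hypothetical helper: raw contents of every <script> element, in document order.
    static List<String> scriptBodies(String html) {
        Document doc = Jsoup.parse(html);
        List<String> bodies = new ArrayList<>();
        for (Element script : doc.select("script")) {
            // data() returns the script source; text() would be empty here.
            bodies.add(script.data());
        }
        return bodies;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<script type=\"text/javascript\">var a = 1;</script>"
                + "</head><body>"
                + "<script>var b = 2;</script>"
                + "</body></html>";
        for (String body : scriptBodies(html)) {
            System.out.println(body);
        }
    }
}
```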
String body = "<br>";
Document document = Jsoup.parseBodyFragment(body);
document.outputSettings().escapeMode(EscapeMode.xhtml);
String str = document.body().html();
System.out.println(str);
Expected: <br />
Actual: <br>
Can Jsoup convert HTML to XHTML?
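As far as I can tell, the escape mode only changes how entities are written, not how empty tags are closed. In jsoup releases that provide `OutputSettings.syntax(...)`, switching the output syntax to XML makes void elements self-closing. The following is a sketch under that version assumption, with made-up class and method names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;

final class XhtmlOutput {
    // Hypothetical helper: parses an HTML body fragment and serializes it
    // with XML syntax, so <br> is written as a self-closing tag.
    static String toXhtmlBody(String bodyHtml) {
        Document document = Jsoup.parseBodyFragment(bodyHtml);
        document.outputSettings()
                .syntax(Document.OutputSettings.Syntax.xml)
                .escapeMode(Entities.EscapeMode.xhtml);
        return document.body().html();
    }

    public static void main(String[] args) {
        System.out.println(toXhtmlBody("<br>"));
    }
}
```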