Fah*_*din 5 html java parsing jsoup
In short, this is what I want to do (I want to use jsoup):
So, for the first point, here is what I have so far:
String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";
Document document = Jsoup.connect(url).get();
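Here I would like to understand what format `document` is in. Has it already been parsed from HTML, or from whatever type the web page is?
Then, for the second point, here is what I have so far: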
Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Elements elements = document.getElementsMatchingOwnText(p);
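Here I am trying to match a date regex to find dates in the page and store them in a string for later use (point 3). I am fairly sure I am not far off, but I need help here.
I have already completed point 4.
So, could anyone help me understand this and point me in the right direction: how can I achieve the 4 points above?
Thanks in advance!
Update: here is what I want: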
import java.io.IOException;
import java.util.regex.Pattern;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void main(String[] args) {
    try {
        // Send a USER AGENT so the server treats the request as coming from a browser, not a bot
        final String USER_AGENT =
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
        // The only URL I want to parse
        String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";
        // Create a jsoup Connection to the URL with the USER AGENT
        Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
        // Retrieve the parsed document
        Document htmlDocument = connection.get();
        /* Up to this point, I have a parsed document of the page, which is in plain-text format, right?
         * If not, in which type or format is it stored in the variable 'htmlDocument'?
         */
        /* Now, if 'htmlDocument' holds the text of the web page,
         * why do I need elements to find dates? Dates can be ordinary text in a web page,
         * so how am I going to find an element tag for them?
         * As an example, if I wanted to collect text from <p> paragraph tags,
         * I would use this:
         */
        // I am not sure whether this is correct
        //***************************************************/
        Elements paragraph = htmlDocument.getElementsByTag("p");
        for (Element src : paragraph) {
            // text() returns the paragraph's text; attr("abs:p") looked up an attribute named "p", which <p> does not have
            System.out.println("text: " + src.text());
        }
        //***************************************************//
        /* But I do not want to look up any elements to gather the dates on the page.
         * I just want to search the whole text of the document for dates.
         * So I need a date regex that will be passed as input to a search method;
         * this search should run on the text of the page, since we have the parsed document in 'htmlDocument'.
         */
        // At the end we will use only one date from our search results and format it in a standard form
        /*
         * That is it.
         */
        /*
         * I was trying something like this
         */
        //final Elements elements = htmlDocument.getElementsMatchingOwnText("\\d{4}-\\d{2}-\\d{2}");
        Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Elements elements = htmlDocument.getElementsMatchingOwnText(p);
        for (Element e : elements) {
            System.out.println("element = [" + e + "]");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Here is one possible solution I found:
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/**
 * Created by ruben.alfarodiaz on 21/12/2016.
 */
@RunWith(JUnit4.class)
public class StackTest {

    @Test
    public void findDates() {
        final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
        try {
            String url = "http://stackoverflow.com/questions/51224/regular-expression-to-match-valid-dates";
            Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
            Document htmlDocument = connection.get();
            // This pattern finds all dates in dd/mm/yyyy form; to cover other formats, create N more patterns
            Pattern pattern = Pattern.compile("(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)");
            // Find every element whose text (including the text of its descendants) matches the pattern
            Elements elements = htmlDocument.getElementsMatchingText(pattern);
            // Filter the matched elements down to the leaf elements only
            List<Element> finalElements = elements.stream()
                    .filter(elem -> isLastElem(elem, pattern))
                    .collect(Collectors.toList());
            finalElements.forEach(elem ->
                    System.out.println("Node: " + elem.html())
            );
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    // Decides whether the element is a leaf match, i.e. no descendant element also matches the pattern
    private boolean isLastElem(Element elem, Pattern pattern) {
        return elem.getElementsMatchingText(pattern).size() <= 1;
    }
}
Add as many patterns as you need, since I think finding a single pattern that matches every possible date format would be complicated.
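As a rough illustration of the "one pattern per format" idea, here is a minimal sketch that uses only the JDK. It assumes the page text is already available as a plain `String` (for example from jsoup's `text()` method on the document), and it pairs each regex with a `DateTimeFormatter` so that matches can be normalized into `LocalDate` values. The class name `MultiPatternDateFinder`, the method `findDates`, and the choice of two formats are assumptions made for this example, not part of the original answer.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the name and the supported formats are illustrative assumptions.
final class MultiPatternDateFinder {

    // Each regex is paired with the formatter that parses what it matches.
    // Add one entry per extra date format you need to support.
    private static final Map<Pattern, DateTimeFormatter> FORMATS = new LinkedHashMap<>();

    static {
        // dd/mm/yyyy, the format searched for in the answer
        FORMATS.put(
                Pattern.compile("\\b(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)\\b"),
                DateTimeFormatter.ofPattern("d/M/uuuu").withResolverStyle(ResolverStyle.STRICT));
        // yyyy-mm-dd, the format searched for in the question
        FORMATS.put(
                Pattern.compile("\\b\\d{4}-[01]\\d-[0-3]\\d\\b"),
                DateTimeFormatter.ISO_LOCAL_DATE);
    }

    // Returns every valid date found in the text, grouped by format in insertion order.
    static List<LocalDate> findDates(String text) {
        List<LocalDate> dates = new ArrayList<>();
        for (Map.Entry<Pattern, DateTimeFormatter> entry : FORMATS.entrySet()) {
            Matcher matcher = entry.getKey().matcher(text);
            while (matcher.find()) {
                try {
                    dates.add(LocalDate.parse(matcher.group(), entry.getValue()));
                } catch (DateTimeParseException ignored) {
                    // The regex matched something that is not a real date, e.g. 2017-19-39.
                }
            }
        }
        return dates;
    }

    public static void main(String[] args) {
        String text = "Posted on 20/11/2017, edited 2017-12-01.";
        // LocalDate prints in ISO form (yyyy-MM-dd), which gives one standard output format.
        for (LocalDate date : findDates(text)) {
            System.out.println(date);
        }
    }
}
```

Results are grouped by pattern rather than by position in the text, so sort the list if order matters. Parsing each match also filters out strings that merely look like dates, which the regexes alone cannot do.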
Edit: Most importantly, the library gives you a hierarchy of elements, so you need to iterate over them to find the final leaves. For example:
<html>
  <body>
    <div>
      20/11/2017
    </div>
  </body>
</html>
If we search for the dd/mm/yyyy pattern, the library returns 3 elements (html, body, and div), but we are only interested in the div.
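To make the ancestor problem concrete, here is a small sketch of the same leaf-filtering idea. It swaps jsoup for the JDK's built-in XML DOM parser so that it runs without extra dependencies, which means it only accepts well-formed markup such as the snippet above. The class name `LeafMatchSketch` and the method `leafMatches` are hypothetical names for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: uses the JDK DOM parser instead of jsoup, so input must be well-formed XML.
final class LeafMatchSketch {

    static final Pattern DATE =
            Pattern.compile("(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)");

    // Returns the tag names of the deepest elements whose text contains a date.
    static List<String> leafMatches(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> result = new ArrayList<>();
            collect(doc.getDocumentElement(), result);
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Returns true if this element's text (descendants included) matches the pattern.
    private static boolean collect(Element element, List<String> result) {
        // getTextContent() includes descendant text, which is why every ancestor matches too.
        if (!DATE.matcher(element.getTextContent()).find()) {
            return false;
        }
        boolean childMatched = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                childMatched |= collect((Element) child, result);
            }
        }
        // Keep the element only when no child element also matches, i.e. it is a leaf match.
        if (!childMatched) {
            result.add(element.getTagName());
        }
        return true;
    }

    public static void main(String[] args) {
        String html = "<html><body><div>20/11/2017</div></body></html>";
        System.out.println(leafMatches(html)); // prints [div]
    }
}
```

In jsoup itself, `getElementsMatchingOwnText(pattern)`, which the question already uses, tests only an element's own text rather than the text of its descendants, so it may avoid the ancestor matches without a separate filtering step.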