How to get the source code of a web page from Java

brt*_*rtb 10 java web-crawler web-content web

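I just want to retrieve the source code of any web page from Java. I have found many solutions so far, but none of them works for all of the following links:

My main problem is that some code does retrieve the page source, but parts of it are missing. For example, the code below does not work for the first link.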

StringBuilder builder = new StringBuilder();
InputStream is = fURL.openStream(); // fURL can be one of the links above
BufferedReader buffer = new BufferedReader(new InputStreamReader(is, "iso-8859-9"));

// read() returns one decoded character at a time, or -1 at end of stream
int charRead;
while ((charRead = buffer.read()) != -1) {
    builder.append((char) charRead);
}
buffer.close();
System.out.println(builder.toString());

nar*_*yan 25

Try the following code, which adds a User-Agent request property:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class SocketConnection
{
    public static String getURLSource(String url) throws IOException
    {
        URL urlObject = new URL(url);
        URLConnection urlConnection = urlObject.openConnection();
        urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");

        return toString(urlConnection.getInputStream());
    }

    private static String toString(InputStream inputStream) throws IOException
    {
        try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8")))
        {
            String inputLine;
            StringBuilder stringBuilder = new StringBuilder();
            while ((inputLine = bufferedReader.readLine()) != null)
            {
                // readLine() strips the line terminator, so add it back
                stringBuilder.append(inputLine).append('\n');
            }

            return stringBuilder.toString();
        }
    }
}

  • System.out.println(getUrlSource("http://cumhuriyet.com.tr/?hn=298710")); works fine (2 upvotes)
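
The question hard-codes "iso-8859-9" and the answer hard-codes "UTF-8", so either one may decode some pages incorrectly. One possible variation, offered only as a rough sketch, is to read the charset from the Content-Type response header and fall back to UTF-8 when it is missing. The class name PageSourceSketch and the helper charsetFromContentType are made up for this illustration, and the shortened User-Agent value is an assumption. Pages that declare their encoding only in an HTML meta tag are not handled here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class PageSourceSketch
{
    // Hypothetical helper: pulls the charset parameter out of a Content-Type
    // header value such as "text/html; charset=ISO-8859-9". Returns the
    // fallback when the header or parameter is absent or names an unknown charset.
    public static Charset charsetFromContentType(String contentType, Charset fallback)
    {
        if (contentType == null)
        {
            return fallback;
        }
        for (String part : contentType.split(";"))
        {
            String trimmed = part.trim();
            if (trimmed.toLowerCase(Locale.ROOT).startsWith("charset="))
            {
                String name = trimmed.substring("charset=".length()).replace("\"", "").trim();
                try
                {
                    return Charset.forName(name);
                }
                catch (IllegalArgumentException e)
                {
                    // Covers both illegal and unsupported charset names
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static String getURLSource(String url) throws IOException
    {
        URLConnection connection = new URL(url).openConnection();
        // Assumed User-Agent value; some servers reject the default Java agent
        connection.setRequestProperty("User-Agent", "Mozilla/5.0");

        // getContentType() connects if necessary and returns the header value (may be null)
        Charset charset = charsetFromContentType(connection.getContentType(), StandardCharsets.UTF_8);

        try (BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream(), charset)))
        {
            StringBuilder stringBuilder = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null)
            {
                // readLine() drops the line terminator, so restore it
                stringBuilder.append(line).append('\n');
            }
            return stringBuilder.toString();
        }
    }
}
```

It would be called the same way as the answer's method, e.g. System.out.println(PageSourceSketch.getURLSource("http://cumhuriyet.com.tr/?hn=298710")); whether that particular server sends a charset in its header is not something I have checked.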