Reading a web page's source code with Java

Ahm*_*Ali 2 java html-content-extraction

I am trying to read the source code of a web page. My Java code is:

import java.net.*;
import java.io.*;
import java.util.*;
import javax.swing.JOptionPane;

class Testing{
public static void Connect() throws Exception{


  URL url = new URL("http://excite.com/education");
  URLConnection spoof = url.openConnection();


  spoof.setRequestProperty( "User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)" );
  BufferedReader in = new BufferedReader(new InputStreamReader(spoof.getInputStream()));
  String strLine = "";


  while ((strLine = in.readLine()) != null){


   System.out.println(strLine);
  }

  System.out.println("End of page.");
 }

 public static void main(String[] args){

  try{

   Connect();
  }catch(Exception e){
   // any exception is silently ignored here
  }
 }
}

When I compile and run this code, it produces the following output:

? I?%&/m?{J?J??t?$?@??????iG#)?*??eVe]f@????{???{???;?N'????\fdl??J??!?? ??~|?"~?$}?>???????4?????7N?????+??M?N???J?tZfM??G?j?? ??R??!?9??>JgE??Ge[????????W???????8?????? ?|8? ??????? ??ho????0????|?:--?|?L?U?????m?zt?n3??l\?w??O^f?G[?CG< ?y6K??gM?rg???y?E?y????h~????X???l=??Z?/????(?^O?UU6???? ?&?6_? @yC}?p?y???lAH????zF#?V?6_??}??)?v=J+?$????G?Y?L?b???wS"?7?y^????Z?m???Y:????J<N_?Y=???U?f???,???y?Q2(J?P!??i????1&F0&?n???x?T??h?Qzw?+????n?)?h??K??2????8g???????A0 ???1I?%????Q?Z????{????????w????x????N???<d?S????%a|4?j??z???k?Bak??k-?c?z?g??z???l>????s^,??5??/B?{????]]????Ý?????y{?_l?8g?k???b ???"+|??(??M??^[ J?P??_?..???????x?Z?$?????????E>????u???E~????{????f?e1? ?QZ,?????f??e?3J?b?^??4??????> ??y??;??<?{?l??ZfW S@ {?]? ?1??Q ?????n[ ?,t??????~?n?S?u#SL??n?^?????????EC??q?/?y???FE?tpm??????e&??oB???z9eY????????P??IK??????????w?N??;?;J?????;?/??5???M???rZ??q??]??C?d???F?nd???}???A5???M?5?.?:??/?_D???3????'?c?Z7??}??(OI),?i????{?<?w???????DZ?e????'q???eY]=???kj??????????\qhrRn???l?o-??.???k??_???oD8??GA?P?r??|$???Pv~Y?:?[q??sH?? <??C????^N?[ v(??S??l?c?C????3???E5&5?V?L?T??????oQr???/???#[f?5?5"????[???t?vm?\??.0?nh????a?WYM ^T?|\,????L?u ????B???C?r?????????????'?%?{??)?);?fV?]??g,?>?C ?c2? ??p?4??}H???P??(?%j"?}?&?:?Oh\5I?l????{?/?]?LB?l??2??我"??=??Y?|?>??n???????}?????~?[??' ??O ???? :/?)?Wz?3? ?lo?.5?k?&??????>??'o?????<???G?g???>->?xQM?????%<?|????u?.??3 ???[?[r????;???]4E??6[????]????1???*?8}??n?w??????? ?????|????}|qo|?~u????w|?i?i???Z?`z??????Q}?u??!??? w ?O???R9?)?~??g~?w6??{?wd?o??/Z?uUS???l??I^???>??[? U1?o?_??J??}??@?@?U?/??/????i?7|CZT?(?2b~????c?W?c5'??? ?EeF???0??T??{??W?2????/???O???YJj????K/???>??:'_l?

Every other URL returns the correct source code. Only URLs under this directory, i.e. "excite.com/education", cause the problem.

Can anyone please help?

Thanks in advance.

Sam*_*yan 5

This works for me.

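import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.zip.GZIPInputStream;

private static String getWebPabeSource(String sURL) throws IOException {
        URL url = new URL(sURL);
        URLConnection urlCon = url.openConnection();

        // Decompress the body when the server reports a gzip Content-Encoding.
        String encoding = urlCon.getHeaderField("Content-Encoding");
        InputStream body = urlCon.getInputStream();
        if ("gzip".equalsIgnoreCase(encoding)) {
            body = new GZIPInputStream(body);
        }

        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(body))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                // readLine() strips the line terminator, so add one back.
                sb.append(inputLine).append('\n');
            }
        }

        return sb.toString();
}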
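
The unreadable output in the question is consistent with a gzip-compressed response body being decoded as plain text, which is what the Content-Encoding check above guards against. Here is a minimal offline sketch of that effect, with no network access needed. The names GzipBodyDemo, gzip and readBody are hypothetical, and the UTF-8 charset is an assumption, not something the original thread specifies.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBodyDemo {

    // Compresses text roughly as a server sending "Content-Encoding: gzip" would.
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    // Reads a body as text, decompressing first when the encoding is gzip.
    // Charset is assumed to be UTF-8 for this sketch.
    public static String readBody(InputStream raw, String contentEncoding)
            throws IOException {
        InputStream body = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(raw)
                : raw;
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(body, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("<html><body>Hello</body></html>");

        // Encoding ignored: the compressed bytes are decoded as text.
        System.out.println(readBody(new ByteArrayInputStream(compressed), null));

        // Encoding honored: the original markup comes back.
        System.out.println(readBody(new ByteArrayInputStream(compressed), "gzip"));
    }
}
```

The first println prints unreadable characters similar in kind to the output in the question. The second prints the original markup.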