Posts by Jef*_*eff

HTMLUnit: lots of "Obsolete content" and "unable to create object" warnings on getPage(), then failure with "Exception invoking setOuterHTML" on getByXPath()

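I'm trying out HTMLUnit to automate downloading data from a webapp. However, getPage() produces a large number of warnings (most seem to concern linked scripts that I don't think I even need), followed by a fatal com.gargoylesoftware.htmlunit.ScriptException: Exception invoking setOuterHTML when I run getByXPath to extract the data I'm looking for. From the errors I get, I can't for the life of me figure out what is going on. Does anyone have any ideas?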

Here is my code:

import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class ScrapperApp {

    private static void go() throws Exception {
        HtmlPage nextPage;
        String url = "http://media.ethics.ga.gov/search/Campaign/Campaign_Name.aspx?NameID=5751&FilerID=C2009000085&Type=candidate";

        final WebClient webclient = new WebClient();
        final HtmlPage page = webclient.getPage(url);

        System.out.println("PULLING LINKS:");

        List<HtmlAnchor> articles = (List<HtmlAnchor>) page.getByXPath("//div[@class='hform1']/a[@class='lblentrylink']");

        /*for(int x=0; x<articles.size(); x++) {
            nextPage = articles.get(x).click();
            System.out.println(nextPage.getBody());
        }*/
    }

    public static void main(String[] args) throws Exception {
        go();
        System.out.println("COMPLETE");
    }

}

Here is my console output:

Jul 2, 2013 6:19:51 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete …

java htmlunit

13 votes, 1 answer, 20k views

Why am I getting "AttributeError: 'module' object has no attribute 'replace'" on string.replace()?

The line causing the error is

totalR = totalR + (float(string.replace(contri[0][5],",","")) + float(string.replace(contri[0][6],",","")))

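contri[0][5] and contri[0][6] are strings containing numbers formatted like 1,000.00. I remove the commas before converting the strings to floats so that they can be added to totalR, which is a float (created as totalR = 0.0). I also tried using Decimal, but the error occurs there too. I did "import string". The program fails with the error: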

File "mine.py", line 43, in fillDonorData
totalR = totalR + (float(string.replace(contri[0][5],",","")) + float(string.replace(contri[0][6],",","")))
AttributeError: 'module' object has no attribute 'replace'
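One likely cause, assuming the script runs under Python 3: the module-level string.replace() function only exists in Python 2, while the replace() method on str objects works in both. Below is a minimal sketch of the comma-stripping step using that method. The helper name parse_amount and the sample row are hypothetical stand-ins for contri[0], not part of the original program.

```python
def parse_amount(text):
    """Convert a string such as '1,000.00' to a float by dropping the commas."""
    # str.replace is a method on the string itself; no 'import string' is needed.
    return float(text.replace(",", ""))


# Hypothetical row standing in for contri[0]; only indices 5 and 6 are used.
row = ["", "", "", "", "", "1,000.00", "2,500.50"]

total_r = 0.0
total_r = total_r + (parse_amount(row[5]) + parse_amount(row[6]))
print(total_r)
```

If exact decimal arithmetic matters for money amounts, the same cleaned string could be passed to decimal.Decimal instead of float.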

python string floating-point

8 votes, 2 answers, 10k views

subprocess.check_output fails with error 127

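I'm trying to call an external program from a Python application, but it shows no output and fails with error 127. Executing the command from the command line works fine. (And I am in the correct working directory.)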

def buildContris (self, startUrl, reportArray):
    urls = []

    for row in reportArray:
        try:
            url = subprocess.check_output(["casperjs", "casper.js", startUrl, row[0]], shell=True)
            print (url)
            urls.append(url)
            break
        except subprocess.CalledProcessError as e:
            print ("Error: " + str(e.returncode) + " Output:" + e.output.decode())          

    return urls

Every pass through the loop prints the following error. (I also checked e.cmd. It is correct, but long, so I've omitted it from this example.)

Error: 127 Output: 

Solution:

The following code works:

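# communicate() reads both pipes and waits for the process to exit, so a
# separate wait() is not needed (calling wait() first can deadlock if the
# child fills a pipe buffer).
app = subprocess.Popen(["./casperjs/bin/casperjs", "casper.js", startUrl, row[0]], stdout=subprocess.PIPE, stderr=subprocess.PIPE, env = {"PATH" : "/usr/local/bin/:/usr/bin"}, universal_newlines=True)
out, errs = app.communicate()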
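Some background that may explain the difference: exit status 127 is what a POSIX shell conventionally reports when it cannot find the command. With shell=True and a list, only the first item is used as the command string on POSIX, and the remaining items become arguments to the shell itself rather than to the program. The sketch below shows the list form without a shell. It uses the running Python interpreter as a stand-in for casperjs so that it can be run anywhere; the helper name run_and_capture and the placeholder arguments are assumptions for illustration.

```python
import subprocess
import sys


def run_and_capture(argv):
    """Run argv (a list) without a shell; return (returncode, stdout, stderr) as text."""
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # communicate() drains both pipes and waits for the process to exit.
    out, errs = proc.communicate()
    return proc.returncode, out, errs


# Stand-in for the casperjs call: a child Python process prints the arguments
# it received, showing that every list item arrives as a separate argument.
code, out, errs = run_and_capture(
    [sys.executable, "-c", "import sys; print(sys.argv[1:])", "startUrl", "row0"]
)
print(code, out.strip())
```

Passing an explicit path to the executable, or an env with a suitable PATH as in the solution, avoids depending on whatever PATH the parent process happens to have.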

python subprocess

5 votes, 1 answer, 9690 views

Can't get HTMLUnit to follow links on a page that uses the __doPostBack() function

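I'm trying to scrape data from an ASP page where clicking a link calls the __doPostBack function. When I click() a link with HTMLUnit, it returns the page I started from. What do I need to do to complete the postback and get the next page?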

Code:

import java.util.List;

import com.gargoylesoftware.htmlunit.ScriptResult;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class ScrapperApp {

    private static void go() throws Exception {
        /* turn off annoying htmlunit warnings */
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);

        HtmlPage nextPage;
        ScriptResult onClick; 

        String url = "http://media.ethics.ga.gov/search/Campaign/Campaign_Name.aspx?NameID=5751&FilerID=C2009000085&Type=candidate";

        final WebClient webclient = new WebClient(BrowserVersion.CHROME_16);
        final HtmlPage page = webclient.getPage(url);

        System.out.println("PULLING LINKS:");

        List<HtmlAnchor> articles = (List<HtmlAnchor>) page.getByXPath("//table[@id='ctl00_ContentPlaceHolder1_Name_Reports1_TabContainer1_TabPanel1_dgReports']/tbody/tr/td/a[@class='lblentrylink']");

        for(int x=0; x<articles.size(); x++) {
            System.out.println("Clicking "+x+": "+articles.get(x).asText()); 
            nextPage = articles.get(x).click();
            System.out.println(nextPage.getUrl());
        }
    }

    public static void main(String[] args) throws Exception …

java asp.net postback htmlunit

4 votes, 1 answer, 1613 views