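I'm trying out HTMLUnit to automate downloading data from a webapp. However, I get a whole bunch of warnings on getPage() (most seem to deal with linked scripts that I don't think I even need), followed by a fatal com.gargoylesoftware.htmlunit.ScriptException: Exception invoking setOuterHTML when I try to run getByXPath to extract the data I'm looking for. From the errors I get, I can't for the life of me figure out what is going on. Do you have any ideas?
Here is my code: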
import java.util.List;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class ScrapperApp {
    private static void go() throws Exception {
        HtmlPage nextPage;
        String url = "http://media.ethics.ga.gov/search/Campaign/Campaign_Name.aspx?NameID=5751&FilerID=C2009000085&Type=candidate";
        final WebClient webclient = new WebClient();
        final HtmlPage page = webclient.getPage(url);
        System.out.println("PULLING LINKS:");
        List<HtmlAnchor> articles = (List<HtmlAnchor>) page.getByXPath("//div[@class='hform1']/a[@class='lblentrylink']");
        /*for(int x=0; x<articles.size(); x++) {
            nextPage = articles.get(x).click();
            System.out.println(nextPage.getBody());
        }*/
    }
    public static void main(String[] args) throws Exception {
        go();
        System.out.println("COMPLETE");
    }
}
Here is my console output:
Jul 2, 2013 6:19:51 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete …
The line causing the error is:
totalR = totalR + (float(string.replace(contri[0][5],",","")) + float(string.replace(contri[0][6],",","")))
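contri[0][5] and contri[0][6] are strings containing numbers formatted like 1,000.00. I strip the commas before converting the strings to floats so that they can be added to totalR, which is a float (created as totalR = 0.0). I also tried using Decimal, but the error occurs there too. I did "import string". The program fails with the error: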
File "mine.py", line 43, in fillDonorData
totalR = totalR + (float(string.replace(contri[0][5],",","")) + float(string.replace(contri[0][6],",","")))
AttributeError: 'module' object has no attribute 'replace'
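A minimal sketch of the comma-stripping conversion described above, assuming the script runs under Python 3, where the string module no longer provides a replace function and the str.replace method is used instead. (A local file named string.py shadowing the standard module would produce a similar error; that possibility is not checked here.) The helper name parse_amount and the sample row are hypothetical stand-ins, not taken from the original program.

```python
def parse_amount(text):
    """Convert a string such as '1,000.00' to a float by dropping the commas."""
    # str.replace is a method on the string itself, so no 'import string' is needed.
    return float(text.replace(",", ""))


# Hypothetical stand-in for one row of the contri data in the question.
contri = [["", "", "", "", "", "1,000.00", "2,500.50"]]

total_r = 0.0
total_r = total_r + (parse_amount(contri[0][5]) + parse_amount(contri[0][6]))
print(total_r)  # 3500.5
```

The same text.replace(",", "") call can be passed to Decimal instead of float if exact decimal arithmetic is preferred.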
I'm trying to call an external program from a Python application, but it shows no output and fails with error 127. Running the command from the command line works fine. (And I am in the correct working directory.)
def buildContris(self, startUrl, reportArray):
    urls = []
    for row in reportArray:
        try:
            url = subprocess.check_output(["casperjs", "casper.js", startUrl, row[0]], shell=True)
            print(url)
            urls.append(url)
            break
        except subprocess.CalledProcessError as e:
            print("Error: " + str(e.returncode) + " Output:" + e.output.decode())
    return urls
Each loop iteration prints the following error (I also checked e.cmd; it is correct but very long, so it is omitted from this example):
Error: 127 Output:
Solution:
The following code works:
app = subprocess.Popen(["./casperjs/bin/casperjs", "casper.js", startUrl, row[0]], stdout=subprocess.PIPE, stderr=subprocess.PIPE, env = {"PATH" : "/usr/local/bin/:/usr/bin"}, universal_newlines=True)
app.wait()
out, errs = app.communicate()
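One plausible reading of the failure, offered as an assumption rather than a confirmed diagnosis: exit status 127 is what a POSIX shell reports when it cannot find a command, and when a list is combined with shell=True only the first element is treated as the command string, so the remaining items never reach the program. The sketch below passes the argument list without a shell and prepends a directory to PATH. The helper name run_and_capture and its extra_path parameter are hypothetical, and the Python interpreter stands in for casperjs so the example runs anywhere.

```python
import os
import subprocess
import sys


def run_and_capture(argv, extra_path=None):
    """Run argv (a list, no shell) and return (returncode, stdout, stderr)."""
    env = dict(os.environ)  # copy so the parent's environment is not modified
    if extra_path:
        # Prepend rather than replace, so other tools on PATH stay reachable.
        env["PATH"] = extra_path + os.pathsep + env.get("PATH", "")
    completed = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output to str
        env=env,
    )
    return completed.returncode, completed.stdout, completed.stderr


# Stand-in command: the current Python interpreter instead of casperjs.
code, out, errs = run_and_capture([sys.executable, "-c", "print('ok')"])
print(code, out.strip())
```

subprocess.run waits for the process and drains both pipes itself, so no separate wait() call is needed before reading the output.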
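I'm trying to scrape data from an ASP page whose links call a __doPostBack function when clicked. When I click() a link with HTMLUnit, it returns the page I started from. What do I need to do to complete the postback and get the next page?
Code: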
import java.util.List;
import com.gargoylesoftware.htmlunit.ScriptResult;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class ScrapperApp {
    private static void go() throws Exception {
        /* turn off annoying htmlunit warnings */
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
        HtmlPage nextPage;
        ScriptResult onClick;
        String url = "http://media.ethics.ga.gov/search/Campaign/Campaign_Name.aspx?NameID=5751&FilerID=C2009000085&Type=candidate";
        final WebClient webclient = new WebClient(BrowserVersion.CHROME_16);
        final HtmlPage page = webclient.getPage(url);
        System.out.println("PULLING LINKS:");
        List<HtmlAnchor> articles = (List<HtmlAnchor>) page.getByXPath("//table[@id='ctl00_ContentPlaceHolder1_Name_Reports1_TabContainer1_TabPanel1_dgReports']/tbody/tr/td/a[@class='lblentrylink']");
        for(int x=0; x<articles.size(); x++) {
            System.out.println("Clicking "+x+": "+articles.get(x).asText());
            nextPage = articles.get(x).click();
            System.out.println(nextPage.getUrl());
        }
    }
    public static void main(String[] args) throws Exception …
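For background on what the click has to accomplish: the __doPostBack(eventTarget, eventArgument) function that ASP.NET Web Forms emits copies its two arguments into the hidden form fields __EVENTTARGET and __EVENTARGUMENT and then submits the page's form. The sketch below is in Python rather than Java and does not use HTMLUnit; it only shows how those two values can be read from a link's href. The function name postback_fields and the sample href are hypothetical.

```python
import re

# Matches javascript:__doPostBack('target','argument') and captures both values.
_POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")


def postback_fields(href):
    """Return the hidden-field values a __doPostBack link would submit, or None."""
    match = _POSTBACK_RE.search(href)
    if match is None:
        return None
    return {
        "__EVENTTARGET": match.group(1),
        "__EVENTARGUMENT": match.group(2),
    }


# Hypothetical href of the kind ASP.NET renders for a link button.
sample_href = "javascript:__doPostBack('ctl00$ContentPlaceHolder1$lnkReport','')"
print(postback_fields(sample_href))
```

A full postback also has to send the form's other hidden fields, such as __VIEWSTATE, which are not handled in this sketch.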