add*_*ons 15 python web-crawler web-scraping
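I'm a Python novice and have never written a web scraper or crawler. I have written Python code that connects to an API and extracts the data I want, but for some of the extracted data I also want the gender of the author. I found this website, http://bookblog.net/gender/genie.php
but the downside is that it has no API. How can I write a Python script that submits data to the form on that page and extracts the returned result? Any guidance would be a great help.
This is the DOM of the form: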
<form action="analysis.php" method="POST">
<textarea cols="75" rows="13" name="text"></textarea>
<div class="copyright">(NOTE: The genie works best on texts of more than 500 words.)</div>
<p>
<b>Genre:</b>
<input type="radio" value="fiction" name="genre">
fiction
<input type="radio" value="nonfiction" name="genre">
nonfiction
<input type="radio" value="blog" name="genre">
blog entry
</p>
<p>
</form>
The DOM of the result page:
<p>
<b>The Gender Genie thinks the author of this passage is:</b>
male!
</p>
Aco*_*orn 26
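There's no need to use mechanize; just send the correct form data in a POST request.
Also, using regular expressions to parse HTML is a bad idea. You're better off using an HTML parser such as lxml.html.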
import requests
import lxml.html as lh


def gender_genie(text, genre):
    # Post straight to the form's action URL instead of driving a browser.
    url = 'http://bookblog.net/gender/analysis.php'
    caption = 'The Gender Genie thinks the author of this passage is:'
    form_data = {
        'text': text,
        'genre': genre,
        'submit': 'submit',
    }
    response = requests.post(url, data=form_data)
    tree = lh.document_fromstring(response.content)
    # The verdict is the text that follows the <b> caption element.
    return tree.xpath("//b[text()=$caption]", caption=caption)[0].tail.strip()


if __name__ == '__main__':
    print(gender_genie('I have a beard!', 'blog'))
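If requests is not installed, the same POST can be sketched with the Python 3 standard library alone. The URL and the `text`, `genre`, and `submit` field names are taken from the form and the code above; `build_genie_request` is a hypothetical helper name, and this sketch only builds the request without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Form action URL, as used in the answer above.
ANALYSIS_URL = 'http://bookblog.net/gender/analysis.php'


def build_genie_request(text, genre):
    """Build (but do not send) the POST request the form would submit.

    Hypothetical helper for illustration only.
    """
    form_data = {'text': text, 'genre': genre, 'submit': 'submit'}
    # urlencode produces the application/x-www-form-urlencoded body.
    body = urlencode(form_data).encode('ascii')
    # Supplying data makes urllib issue a POST instead of a GET.
    return Request(ANALYSIS_URL, data=body)


if __name__ == '__main__':
    request = build_genie_request('I have a beard!', 'blog')
    print(request.get_method())
    print(request.data.decode('ascii'))
```

Passing the returned object to `urllib.request.urlopen` would actually submit it; the response body can then be handed to an HTML parser as in the answer above.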
bra*_*zzi 17
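You can use mechanize to submit the form and retrieve the content, and the re module to pull out what you want. For example, the script below does this for the text of your own question: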
import re
from mechanize import Browser
text = """
My python level is Novice. I have never written a web scraper
or crawler. I have written a python code to connect to an api and
extract the data that I want. But for some the extracted data I want to
get the gender of the author. I found this web site
http://bookblog.net/gender/genie.php but downside is there isn't an api
available. I was wondering how to write a python to submit data to the
form in the page and extract the return data. It would be a great help
if I could get some guidance on this."""
browser = Browser()
browser.open("http://bookblog.net/gender/genie.php")
browser.select_form(nr=0)
browser['text'] = text
browser['genre'] = ['nonfiction']
response = browser.submit()
content = response.read()
result = re.findall(
    r'<b>The Gender Genie thinks the author of this passage is:</b> (\w*)!',
    content)
print(result[0])
What does it do? It creates a mechanize.Browser and goes to the given URL:
browser = Browser()
browser.open("http://bookblog.net/gender/genie.php")
Then it selects the form (there is only one form to fill in, so it is the first one):
browser.select_form(nr=0)
Next, it sets the form's fields...
browser['text'] = text
browser['genre'] = ['nonfiction']
...and submits it:
response = browser.submit()
Now we read the result:
content = response.read()
We know the result has the following form:
<b>The Gender Genie thinks the author of this passage is:</b> male!
So we write a regular expression that matches it and use re.findall():
result = re.findall(
    r'<b>The Gender Genie thinks the author of this passage is:</b> (\w*)!',
    content)
Now the result is available for you to use:
print(result[0])
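The extraction step can be tried offline against the result markup quoted in the question. This is only a sketch: `extract_gender` is a hypothetical helper name, and using `\s*` instead of a single space is an assumption, made so that a line break between `</b>` and the verdict (as in the question's DOM listing) still matches.

```python
import re

# Same caption as in the answer. \s* (rather than one space) is an assumption
# to tolerate a line break between </b> and the verdict word.
PATTERN = re.compile(
    r'<b>The Gender Genie thinks the author of this passage is:</b>\s*(\w+)!')


def extract_gender(html):
    """Return the verdict word from the result page, or None if it is absent.

    Hypothetical helper for illustration only.
    """
    match = PATTERN.search(html)
    return match.group(1) if match else None


if __name__ == '__main__':
    # Result markup as shown in the question.
    sample = """<p>
<b>The Gender Genie thinks the author of this passage is:</b>
male!
</p>"""
    print(extract_gender(sample))
    print(extract_gender('<p>no verdict here</p>'))
```

Returning None instead of indexing `[0]` avoids an IndexError when the page layout changes. If the response body arrives as a byte string, decode it before matching.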