I have the following HTML structure, and I am trying to build a robust way to extract the second Color Digest element, since there will be many of these markers in the DOM.
<table>
  <tbody>
    <tr bgcolor="#AAAAAA">
    <tr>
    <tr>
    <tr>
    <tr>
      <td>Color Digest </td>
      <td>AgArAQICGQMVBBwTIRQHIwg0GUMURAZTBWQJcwV0AoEDAQ </td>
    </tr>
    <tr>
      <td>Color Digest </td>
      <td>2,43,2,25,21,28,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,20,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
    </tr>
  </tbody>
</table>
I am trying to extract the second "Color Digest" td element, the one holding the decoded values.
I wrote the XPath below, but it does not give me the second td element.
//td[text() = ' Color Digest ']/following-sibling::td[2]
When I change td[2] to td[1], I get both elements.
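That behavior makes sense once you notice that the positional predicate applies along each matched cell's own sibling axis: every "Color Digest" td has exactly one following td, so following-sibling::td[2] matches nothing anywhere. One way out, sketched here with lxml (page_source standing in for the HTML above), is to parenthesize the expression so the index applies to the combined result set instead:

from lxml import html

# page_source is assumed to hold the table shown above.
tree = html.fromstring(page_source)

# The parentheses make [2] index the full result set ("the second value
# cell anywhere") rather than "the second sibling of each matched cell".
cells = tree.xpath("(//td[text() = ' Color Digest ']/following-sibling::td[1])[2]")
if cells:
    print cells[0].text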
I want to extract:
the src of the image tag, and the text of the anchor tag inside the div class tag. I managed to extract the img src successfully, but I cannot extract the text from the anchor tag.
<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&ie=UTF8&qid=1343628292&sr=1-1&keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>
Here is the link to the whole HTML page.
Here is my code:
for div in soup.findAll('div', attrs={'class': 'image'}):
    print "\n"
    for data in div.findNextSibling('div', attrs={'class': 'data'}):
        for a in data.findAll('a', attrs={'class': 'title'}):
            print a.text
    for img in div.findAll('img'):
        print img['src']
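A hedged guess at what goes wrong: findNextSibling() returns a single element, so the for data in ... loop actually iterates over that element's children, and data is then often a bare text node with no findAll(). A sketch of the same logic with the sibling taken directly (class names copied from the code above):

for div in soup.findAll('div', attrs={'class': 'image'}):
    for img in div.findAll('img'):
        print img['src']
    data = div.findNextSibling('div', attrs={'class': 'data'})
    if data is not None:  # guard against an image with no data block
        for a in data.findAll('a', attrs={'class': 'title'}):
            print a.text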
What I want to do is extract the image src (the link) and the title inside div class=data, for example:
<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&ie=UTF8&qid=1343628292&sr=1-1&keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> …

First of all, I think it is worth mentioning that I know there are a lot of similar questions, but none of them worked for me...
I am new to Python, HTML, and web scraping. I am trying to scrape user information from a website that requires logging in first. For my tests I am using scraping my email settings on GitHub as the example. The home page is "https://github.com/login" and the target page is "https://github.com/settings/emails".
Here is the list of methods I have tried:
##################################### Method 1
import mechanize
import cookielib
from BeautifulSoup import BeautifulSoup
import html2text
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Chrome')]
# The site we will navigate into, handling it's session
br.open('https://github.com/login')
for f in br.forms():
    print f
br.select_form(nr=0)
# User credentials
br.form['login'] = 'myusername'
br.form['password'] = 'mypwd'
# Login
br.submit()
br.open('github.com/settings/emails').read()
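# (Editor's note, hedged: 'github.com/settings/emails' has no scheme, and
# mechanize, like urllib2, rejects URLs without one, so this line alone can
# make Method 1 look broken even if the login itself succeeded; using the
# full 'https://github.com/settings/emails' rules that out.)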
################ Method 2
import urllib, urllib2, cookielib …

I have written a lot of scrapers, but I am not sure how to deal with infinite scrollbars. These days most websites, e.g. Facebook and Pinterest, have infinite scrollbars.
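What usually works is to ignore the scrollbar itself: each scroll loads a new chunk through a background XHR request, and watching the browser's network tab while scrolling reveals the endpoint, which can then be paged through directly. A minimal sketch, with an entirely hypothetical endpoint, parameter name, and response shape:

import json
import urllib2

# Hypothetical JSON endpoint spotted in the network tab; the real URL,
# parameter names and response shape depend entirely on the site.
url = 'http://example.com/feed?page=%d'

page = 1
while True:
    data = json.loads(urllib2.urlopen(url % page).read())
    if not data['items']:   # assumed shape: {"items": [...]}
        break               # an empty page means we scrolled past the end
    for item in data['items']:
        print item
    page += 1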
My website is multilingual, and I have a Facebook Like button. I would like the posts to appear in the different languages.
According to the Facebook documentation, if I use the meta tags og:locale and og:locale:alternate, the scraper fetches my site's information with the parameter "locale" and the header "X-Facebook-Locale", but it does not send them (https://developers.facebook.com/docs/beta/opengraph/internationalization/). So the posts always end up in en_US.
Has anyone had the same problem?
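For reference, a minimal sketch of the server side those two signals are supposed to drive, assuming they arrive as the documentation describes; the WSGI framing and the fallback to en_US are illustrative only:

from urlparse import parse_qs

def app(environ, start_response):
    # Prefer the "locale" query parameter, then the X-Facebook-Locale
    # header, then fall back to a default locale.
    qs = parse_qs(environ.get('QUERY_STRING', ''))
    locale = (qs.get('locale', [None])[0]
              or environ.get('HTTP_X_FACEBOOK_LOCALE')
              or 'en_US')
    body = '<meta property="og:locale" content="%s"/>' % locale
    start_response('200 OK', [('Content-Type', 'text/html')])
    return [body]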
I have about 1,500 PDFs, each consisting of only one page and exhibiting the same structure (for an example, see http://files.newsnetz.ch/extern/interactive/downloads/BAG_15m_kzh_2012_de.pdf).
What I am looking for is a way to iterate over all of these files (locally, if possible) and extract the actual contents of the table (as CSV, stored into a SQLite DB, whatever).
I would love to do this in Node.js, but could not find any suitable library for parsing this stuff. Do you know of one?
If it is not possible in Node.js, I could also write it in Python, if a better way is available there.
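On the Python side, one low-tech approach that often works for one-page, uniformly structured PDFs is shelling out to poppler's pdftotext with -layout, which keeps the column alignment, and then splitting each line on whitespace. A rough sketch, assuming pdftotext is installed and the files sit in a pdfs/ directory; the naive split will need tuning to the real table:

import csv
import glob
import subprocess

for path in glob.glob('pdfs/*.pdf'):
    # '-' sends the extracted text to stdout; -layout preserves columns.
    text = subprocess.check_output(['pdftotext', '-layout', path, '-'])
    rows = [line.split() for line in text.splitlines() if line.strip()]
    with open(path.replace('.pdf', '.csv'), 'wb') as out:
        csv.writer(out).writerows(rows)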
So, I have read through authenticated sessions in Scrapy and I am stuck. I am 99% sure my parsing code is correct; I just do not believe the login is redirecting and succeeding.
I am also having a problem with check_login_response() not knowing which page it is actually checking, although checking for "Sign Out" would make sense.

====== UPDATE ======
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from linkedpy.items import LinkedPyItem
class LinkedPySpider(InitSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/csearch/results?type=companies&keywords=&pplSearchOrigin=GLHD&pageKey=member-home&search=Search#facets=pplSearchOrigin%3DFCTD%26keywords%3D%26search%3DSubmit%26facet_CS%3DC%26facet_I%3D80%26openFacets%3DJO%252CN%252CCS%252CNFR%252CF%252CCCR%252CI"]

    def init_request(self):
        #"""This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        #"""Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'session_key': 'user@email.com', 'session_password': 'somepassword'},
            callback=self.check_login_response)

    def check_login_response(self, response):
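        # (Editor's note, hedged: the response passed in here is whatever
        # page the FormRequest above lands on after the post-login redirect.
        # The usual test is to look for a logged-in-only marker such as
        # "Sign Out" in response.body and, on success, call
        # self.initialized() so that InitSpider starts the actual crawl.)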
#"""Check …Run Code Online (Sandbox Code Playgroud) 我有一个刮刮一个站点(用python编写).在抓取网站时,那些即将用CSV写入的打印行.Scraper是用Python编写的,现在我想通过PHP代码执行它.我的问题是
How can I print each line as it is printed by the Python code?
I have used the exec function, but it is of no use to me, as it only gives the output after the whole program has run. So:
Is it possible to print the Python output while it is still executing, when run from PHP?
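Two halves usually have to cooperate here. On the PHP side, popen() plus fgets() reads the child process's output line by line, instead of collecting it all at the end the way exec() does. On the Python side, stdout is block-buffered when it is not a terminal, so each line has to be flushed (or the script run with python -u). A sketch of the Python half, with made-up rows standing in for the scraper's output:

import sys
import time

rows = ['name,price', 'camera,199', 'lens,349']  # made-up CSV lines

for row in rows:
    print row
    sys.stdout.flush()   # without this, the PHP side sees nothing until exit
    time.sleep(1)        # simulate slow scraping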
I am creating HTML meta tags dynamically using the function below (GWT); it takes about one second to appear in the DOM. It works fine everywhere except with Facebook: when I share a link from my site, the scraper gets the meta tags in the HTML as null. How can I fix this?
/**
 * Include the HTML attributes: title, description and keywords (meta tags)
 */
private void createHTMLheader(MyClass thing) {
    String title = thing.getTitle();
    String description = thing.getDescription();
    Document.get().setTitle(title);
    MetaElement metaDesc = Document.get().createMetaElement();
    metaDesc.setName("description");
    metaDesc.setContent(description);
    NodeList<Element> nodes = Document.get().getElementsByTagName("head");
    nodes.getItem(0).appendChild(metaDesc);
}
This is the resulting HEAD in the DOM; the title "aaaa" and the meta description have been loaded dynamically (thanks to CBroe for the hint). The "view source" feature does not show these dynamic tags (they are only visible in the developer tools DOM view).
<head>
  <title>aaaa</title>
  <meta content="text/html; charset=utf-8" http-equiv="content-type">
  <meta name="description" content="My description">
  <script language="javascript" type="text/javascript" src="dialective/dialective.nocache.js"></script><script defer="defer">dialective.onInjectionDone('dialective')</script>
</head>
The original HTML has no TITLE or META-DESCRIPTION tag.
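The likely cause is that Facebook's scraper does not execute JavaScript, so it only ever sees that original HTML, which carries no title or description. The standard workaround is to emit the tags server-side, at least for the crawler. A sketch of the idea in Python (any server-side stack works the same way); the user-agent check targets Facebook's documented crawler name, facebookexternalhit, and everything else here is made up:

def meta_for_facebook(app):
    # Illustrative WSGI middleware: hand Facebook's crawler a static page
    # with the real meta tags, and everyone else the normal GWT page.
    def wrapper(environ, start_response):
        if 'facebookexternalhit' in environ.get('HTTP_USER_AGENT', ''):
            start_response('200 OK', [('Content-Type', 'text/html')])
            return ['<html><head><title>aaaa</title>'
                    '<meta name="description" content="My description"/>'
                    '</head><body></body></html>']
        return app(environ, start_response)
    return wrapper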