I am new to Scrapy and I want to extract the content of every ad from this site, so I tried the following:
from scrapy.spiders import Spider
from craigslist_sample.items import CraigslistSampleItem
from scrapy.selector import Selector


class MySpider(Spider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/npo"]

    def parse(self, response):
        links = response.selector.xpath(".//*[@id='sortable-results']//ul//li//p")
        for link in links:
            content = link.xpath(".//*[@id='titletextonly']").extract()
            title = link.xpath("a/@href").extract()
            print(title, content)
items.py:
# Define here the models for your scraped items
from scrapy.item import Item, Field


class CraigslistSampleItem(Item):
    title = Field()
    link = Field()
However, when I run the crawler, I get nothing:
$ scrapy crawl --nolog craig
[]
[]
[]
[]
[]
[]
[]
[] …

I have this pandas dataframe, which is actually an Excel spreadsheet:
Unnamed: 0 Date Num Company Link ID
0 NaN 1990-11-15 131231 apple... http://www.example.com/201611141492/xellia... 290834
1 NaN 1990-10-22 1231 microsoft http://www.example.com/news/arnsno... NaN
2 NaN 2011-10-20 123 apple http://www.example.com/ator... 209384
3 NaN 2013-10-27 123 apple... http://example.com/sections/th-shots/2016/... 098
4 NaN 1990-10-26 123 google http://www.example.net/business/Drugmak... 098098
5 NaN 1990-10-18 1231 google... http://example.com/news/va-rece... NaN
6 NaN 2011-04-26 546 amazon... http://www.example.com/news/home/20160425... 9809
I want to drop every row whose ID column is NaN and rebuild the imaginary index column:
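A minimal sketch of one way to do this with `dropna` and `reset_index`, assuming the spreadsheet has been loaded into a variable named `df` (the variable name and the tiny stand-in frame below are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the spreadsheet; only the ID column matters here.
df = pd.DataFrame({
    "Company": ["apple...", "microsoft", "apple", "google..."],
    "ID": ["290834", np.nan, "209384", np.nan],
})

# Drop every row whose ID is NaN, then rebuild the 0..N-1 index.
df = df.dropna(subset=["ID"]).reset_index(drop=True)
print(df)
```

`reset_index(drop=True)` discards the old row labels instead of keeping them as a new column, which gives the freshly renumbered index shown in the desired output.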
Unnamed: 0 Date Num Company Link ID
0 NaN 1990-11-15 131231 apple... http://www.example.com/201611141492/xellia... 290834 …

I have a pandas dataframe with 120 columns. The columns look like this:
0_x 1_x 2_x 3_x 4_x 5_x 6_x 7_x 8_x 0_y ... 65 ... 120
How can I rename them all in one go? I read the documentation and found that the way to rename columns in pandas is:
df.columns = ['col1', 'col2', 'col3']
The problem is that hand-writing a list of 120+ names would be unwieldy. What alternatives are there? Suppose I want to name all the columns col1 to colN.
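Rather than typing the list out, it can be generated from the current column count. A hedged sketch, assuming the frame is named `df` (the three-column stand-in below is just for illustration; the same expression works for 120 columns):

```python
import pandas as pd

# Stand-in frame with awkward auto-generated column names.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["0_x", "1_x", "2_x"])

# Build col1..colN from however many columns the frame has.
df.columns = [f"col{i}" for i in range(1, len(df.columns) + 1)]
print(list(df.columns))
```

The assignment to `df.columns` only requires that the generated list has the same length as the number of columns, so the comprehension scales to any width without any names being written by hand.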