Posts by V.A*_*Anh

Removing unicode characters from a Python string

I have a string in Python like this:

u'\u200cHealth & Fitness'

How can I remove the

\u200c

part from the string?
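One straightforward approach is str.replace, shown here in Python 3 syntax (in Python 2.7 the same calls work on a unicode object):

```python
import unicodedata

s = '\u200cHealth & Fitness'

# Remove the one known character with str.replace
cleaned = s.replace('\u200c', '')
print(cleaned)  # Health & Fitness

# Or strip all invisible "format" characters (Unicode category Cf),
# which covers U+200C ZERO WIDTH NON-JOINER and its relatives
cleaned2 = ''.join(ch for ch in s if unicodedata.category(ch) != 'Cf')
print(cleaned2)  # Health & Fitness
```

The second form is useful when the string may contain other zero-width characters besides \u200c.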

python unicode python-2.7

13
Votes
4
Answers
30k
Views

Unable to retrieve Chinese text when scraping

I created a script to scrape the site 1688.com. The problem is that the site is in Chinese, so whenever I try to retrieve the text it gives me a bunch of unicode escape sequences, and when I export to a CSV file nothing ends up in the file. My code:

# -*- coding: utf-8 -*-
import csv
from urllib import urlopen
from bs4 import BeautifulSoup as BS

csv_content = open('content.csv', 'w+')
writer_content = csv.writer(csv_content)

url = urlopen('https://fuzhuang.1688.com/nvzhuang?spm=a260k.635.1998214976.1.7eqUGT')
html = BS(url, 'lxml')
container = html.find('ul', {'class' : 'ch-box fd-clr'})
offers = container.find_all('div', {'class' : 'ch-offer-body'})
lst = []

for offer in offers:
    offer_box = offer.find('div', {'component-name' : '@alife/ocms-component-1688-pc-ch-offer-pic'})
    images = offer_box.find('img')['src']
    title = offer.find('div', {'class' : 'ocms-component-1688-pc-ch-offer-title-0-1-11'}).text
    price = offer.find('div', {'class' : 'ocms-component-1688-pc-ch-offer-price-0-1-14'}).text
    lst.append(price)

for item in lst:
    writer_content.writerow([item])

print …
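The empty file is usually an encoding problem rather than a scraping problem: in Python 2, csv.writer expects byte strings, so unicode cells must be encoded before writing. A minimal sketch in Python 3 syntax, where opening the file with an explicit encoding is enough (the row values here are made-up stand-ins for the scraped prices):

```python
import csv

# Hypothetical scraped values standing in for lst
rows = [['运动装', '¥59.00'], ['女装', '¥128.00']]

# Python 3: open with an explicit UTF-8 encoding and newline=''
with open('content.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)

# Python 2 equivalent: encode each unicode cell to UTF-8 bytes first, e.g.
#     writer.writerow([cell.encode('utf-8') for cell in row])
```

Opening the file inside a with block also guarantees it is flushed and closed, which the original open('content.csv', 'w+') never does explicitly — another common cause of an apparently empty file.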

python beautifulsoup web-scraping

6
Votes
1
Answer
166
Views

Access denied when scraping

I want to create a script that keeps visiting https://www.size.co.uk/featured/footwear/ and scrapes the content, but somehow when I run the script access is denied. Here is the code:

from urllib import urlopen
from bs4 import BeautifulSoup as BS
url = urlopen('https://www.size.co.uk/')
print BS(url, 'lxml')

The output is

<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>

You don't have permission to access "http://www.size.co.uk/" on this server.
<p>
Reference #18.6202655f.1498945327.11002828
</p></body>
</html>

The code works fine when I try it on other websites, and when I use Selenium nothing happens either, but I would still like to know how to get around this error without using Selenium. And when I do use Selenium on a different site such as http://www.footpatrol.co.uk/shop, I get the same Access Denied error. Here is the footpatrol code:

from selenium import webdriver
from bs4 import BeautifulSoup as BS

# Raw string so the backslashes in the Windows path are not read as escapes
driver = webdriver.PhantomJS(r'C:\Users\V\Desktop\PY\web_scrape\phantomjs.exe')
driver.get('http://www.footpatrol.com')
pageSource = driver.page_source
soup = BS(pageSource, 'lxml')
print soup

The output is:

<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>

You don't have permission to …
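Sites like these typically reject requests that arrive with urllib's default user agent ("Python-urllib/x.y"). A common workaround is to send a browser-like User-Agent header; a sketch using Python 3's urllib.request (in Python 2.7 the same Request/urlopen classes live in urllib2 — the urlopen call is left commented out so the sketch does not depend on the network):

```python
from urllib.request import Request, urlopen

# Present a browser-like User-Agent instead of urllib's default,
# which many sites block with exactly this Access Denied page
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/91.0 Safari/537.36'}
req = Request('https://www.size.co.uk/', headers=headers)
# html = urlopen(req).read()   # then parse with BeautifulSoup as before
print(req.get_header('User-agent'))
```

This is not guaranteed to work everywhere — some sites also check cookies or use JavaScript challenges — but a spoofed User-Agent is the usual first step before reaching for Selenium.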

python beautifulsoup web-scraping

2
Votes
1
Answer
10k
Views

Finding the URL after clicking a link

How can I find the current URL with Selenium after clicking an element? I have this site: http://www.runningintheusa.com/Classic/View.aspx?RaceID=5622

I have this code (assume all the relevant libraries have been imported):

def get_detail(x):
    dic = {}
    driver = webdriver.PhantomJS(path)
    driver.get(x)
    driver.find_element_by_id('ctl00_ctl00_MainContent_hypPrimaryURL').click()
    return driver.current_url
print get_detail('http://www.runningintheusa.com/Classic/View.aspx?RaceID=5622')

I ran the code and it only returns the original URL, i.e. http://www.runningintheusa.com/Classic/View.aspx?RaceID=5622

How can I find the URL after clicking the "Race Website" link on the page http://flagstaffbigs.org/dave-mckay-run.htm?
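If the link opens in a new window, driver.current_url keeps reporting the original page until you switch to the new window via driver.window_handles. But since the target URL is already stored in the link's href attribute, one option is to skip the click entirely and read it out of the page source. A sketch using BeautifulSoup on a minimal stand-in snippet (on the real page the input would be driver.page_source, and the href value here is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in for the real page source; the id matches the element
# clicked in the question's code
html = '''<a id="ctl00_ctl00_MainContent_hypPrimaryURL"
             href="http://flagstaffbigs.org/dave-mckay-run.htm">Race Website</a>'''

soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a', {'id': 'ctl00_ctl00_MainContent_hypPrimaryURL'})
print(link['href'])  # http://flagstaffbigs.org/dave-mckay-run.htm
```

The window-switching variant would be roughly: click, then driver.switch_to.window(driver.window_handles[-1]), then read driver.current_url.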

python selenium web-scraping

1
Vote
1
Answer
3440
Views

How do I create a nested dictionary iteratively?

I want to create a dictionary from a given list, with elements nested as shown below. For example, given:

lst = range(1, 11)

how can I create a function that builds this nested dictionary from the list:

dic = {1: {2: {3: {4: {5: {6: {7: {8: {9: 10}}}}}}}}}
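One iterative approach is to fold the list from the right: start with the last element as the innermost value, then wrap it in one more dict per remaining element (a sketch; it assumes the list has at least two elements):

```python
def nest(lst):
    # The last element is the innermost value; each earlier element,
    # taken right to left, becomes a key wrapped around the accumulator.
    acc = lst[-1]
    for key in reversed(lst[:-1]):
        acc = {key: acc}
    return acc

print(nest(range(1, 11)))
# {1: {2: {3: {4: {5: {6: {7: {8: {9: 10}}}}}}}}}
```

Walking left to right also works: keep a reference to the current innermost dict and assign `cur[key] = {}` at each step, but the right-to-left fold avoids tracking that extra reference.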

python dictionary

1
Vote
1
Answer
79
Views