Posts by V.A*_*Anh

Removing unicode characters from a Python string

I have a string in Python like this:

u'\u200cHealth & Fitness'

How can I remove the

\u200c

part from the string?
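One straightforward approach is str.replace, shown here in Python 3 syntax (in Python 2.7 the same calls work on a unicode object):

```python
import unicodedata

s = '\u200cHealth & Fitness'

# Remove the one known character with str.replace
cleaned = s.replace('\u200c', '')
print(cleaned)  # Health & Fitness

# Or strip all invisible "format" characters (Unicode category Cf),
# which covers U+200C ZERO WIDTH NON-JOINER and its relatives
cleaned2 = ''.join(ch for ch in s if unicodedata.category(ch) != 'Cf')
print(cleaned2)  # Health & Fitness
```

The second form is useful when the string may contain other zero-width characters besides \u200c.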

python unicode python-2.7

13
Votes
4
Answers
30k
Views

Unable to retrieve Chinese text when scraping

I created a script to scrape the site 1688.com. The problem is that the site is in Chinese, so whenever I try to retrieve the text it gives me a bunch of unicode escape sequences, and when I export to a CSV file nothing ends up in the file. My code:

# -*- coding: utf-8 -*-
import csv
from urllib import urlopen
from bs4 import BeautifulSoup as BS

csv_content = open('content.csv', 'w+')
writer_content = csv.writer(csv_content)

url = urlopen('https://fuzhuang.1688.com/nvzhuang?spm=a260k.635.1998214976.1.7eqUGT')
html = BS(url, 'lxml')
container = html.find('ul', {'class' : 'ch-box fd-clr'})
offers = container.find_all('div', {'class' : 'ch-offer-body'})
lst = []

for offer in offers:
    offer_box = offer.find('div', {'component-name' : '@alife/ocms-component-1688-pc-ch-offer-pic'})
    images = offer_box.find('img')['src']
    title = offer.find('div', {'class' : 'ocms-component-1688-pc-ch-offer-title-0-1-11'}).text
    price = offer.find('div', {'class' : 'ocms-component-1688-pc-ch-offer-price-0-1-14'}).text
    lst.append(price)

for item in lst:
    writer_content.writerow([item])

print …
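The empty file is usually an encoding problem rather than a scraping problem: in Python 2, csv.writer expects byte strings, so unicode cells must be encoded before writing. A minimal sketch in Python 3 syntax, where opening the file with an explicit encoding is enough (the row values here are made-up stand-ins for the scraped prices):

```python
import csv

# Hypothetical scraped values standing in for lst
rows = [['运动装', '¥59.00'], ['女装', '¥128.00']]

# Python 3: open with an explicit UTF-8 encoding and newline=''
with open('content.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)

# Python 2 equivalent: encode each unicode cell to UTF-8 bytes first, e.g.
#     writer.writerow([cell.encode('utf-8') for cell in row])
```

Opening the file inside a with block also guarantees it is flushed and closed, which the original open('content.csv', 'w+') never does explicitly — another common cause of an apparently empty file.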

python beautifulsoup web-scraping

6
Votes
1
Answer
166
Views

Access denied when scraping

I want to create a script that keeps visiting https://www.size.co.uk/featured/footwear/ and scrapes the content, but somehow when I run the script access is denied. Here is the code:

from urllib import urlopen
from bs4 import BeautifulSoup as BS
url = urlopen('https://www.size.co.uk/')
print BS(url, 'lxml')

The output is

<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>

You don't have permission to access "http://www.size.co.uk/" on this server.
<p>
Reference #18.6202655f.1498945327.11002828
</p></body>
</html>

The code works fine when I try it on other websites, and when I use Selenium nothing happens either, but I would still like to know how to get around this error without using Selenium. And when I do use Selenium on a different site such as http://www.footpatrol.co.uk/shop, I get the same Access Denied error. Here is the footpatrol code:

from selenium import webdriver
from bs4 import BeautifulSoup as BS

# Raw string so the backslashes in the Windows path are not read as escapes
driver = webdriver.PhantomJS(r'C:\Users\V\Desktop\PY\web_scrape\phantomjs.exe')
driver.get('http://www.footpatrol.com')
pageSource = driver.page_source
soup = BS(pageSource, 'lxml')
print soup

The output is:

<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>

You don't have permission to …
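Sites like these typically reject requests that arrive with urllib's default user agent ("Python-urllib/x.y"). A common workaround is to send a browser-like User-Agent header; a sketch using Python 3's urllib.request (in Python 2.7 the same Request/urlopen classes live in urllib2 — the urlopen call is left commented out so the sketch does not depend on the network):

```python
from urllib.request import Request, urlopen

# Present a browser-like User-Agent instead of urllib's default,
# which many sites block with exactly this Access Denied page
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/91.0 Safari/537.36'}
req = Request('https://www.size.co.uk/', headers=headers)
# html = urlopen(req).read()   # then parse with BeautifulSoup as before
print(req.get_header('User-agent'))
```

This is not guaranteed to work everywhere — some sites also check cookies or use JavaScript challenges — but a spoofed User-Agent is the usual first step before reaching for Selenium.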

python beautifulsoup web-scraping

2
Votes
1
Answer
10k
Views

Finding the URL after clicking a link

How can I find the current URL with Selenium after clicking an element? I have this site: http://www.runningintheusa.com/Classic/View.aspx?RaceID=5622

I have this code (assume all the relevant libraries have been imported):

def get_detail(x):
    dic = {}
    driver = webdriver.PhantomJS(path)
    driver.get(x)
    driver.find_element_by_id('ctl00_ctl00_MainContent_hypPrimaryURL').click()
    return driver.current_url
print get_detail('http://www.runningintheusa.com/Classic/View.aspx?RaceID=5622')

I ran the code and it only returns the original URL, i.e. http://www.runningintheusa.com/Classic/View.aspx?RaceID=5622

How can I find the URL after clicking the "Race Website" link on the page http://flagstaffbigs.org/dave-mckay-run.htm?
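If the link opens in a new window, driver.current_url keeps reporting the original page until you switch to the new window via driver.window_handles. But since the target URL is already stored in the link's href attribute, one option is to skip the click entirely and read it out of the page source. A sketch using BeautifulSoup on a minimal stand-in snippet (on the real page the input would be driver.page_source, and the href value here is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in for the real page source; the id matches the element
# clicked in the question's code
html = '''<a id="ctl00_ctl00_MainContent_hypPrimaryURL"
             href="http://flagstaffbigs.org/dave-mckay-run.htm">Race Website</a>'''

soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a', {'id': 'ctl00_ctl00_MainContent_hypPrimaryURL'})
print(link['href'])  # http://flagstaffbigs.org/dave-mckay-run.htm
```

The window-switching variant would be roughly: click, then driver.switch_to.window(driver.window_handles[-1]), then read driver.current_url.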

python selenium web-scraping

1
Vote
1
Answer
3440
Views

How do I create a nested dictionary iteratively?

I want to create a dictionary from a given list, with elements nested as shown below. For example, given:

lst = range(1, 11)

how can I create a function that builds this nested dictionary from the list:

dic = {1: {2: {3: {4: {5: {6: {7: {8: {9: 10}}}}}}}}}
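One iterative approach is to fold the list from the right: start with the last element as the innermost value, then wrap it in one more dict per remaining element (a sketch; it assumes the list has at least two elements):

```python
def nest(lst):
    # The last element is the innermost value; each earlier element,
    # taken right to left, becomes a key wrapped around the accumulator.
    acc = lst[-1]
    for key in reversed(lst[:-1]):
        acc = {key: acc}
    return acc

print(nest(range(1, 11)))
# {1: {2: {3: {4: {5: {6: {7: {8: {9: 10}}}}}}}}}
```

Walking left to right also works: keep a reference to the current innermost dict and assign `cur[key] = {}` at each step, but the right-to-left fold avoids tracking that extra reference.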

python dictionary

1
Vote
1
Answer
79
Views