Aks*_*wal 5 python selenium beautifulsoup web-scraping python-3.x
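I have a simple project that scrapes reviews from a travel website and stores them in an Excel file. The reviews can be in Spanish, Japanese, or any other language, and they sometimes contain special symbols such as "❤❤".
I need to store all of the data (the special symbols can be dropped if they cannot be written).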
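If dropping the special symbols turns out to be acceptable, one possible approach is to filter characters by Unicode category. The sketch below is an assumption, not part of the original post: `strip_symbols` is a hypothetical helper that removes characters in category `So` ("Symbol, other"), which covers "❤" and most emoji, while leaving Spanish and Japanese letters untouched.

```python
import unicodedata


def strip_symbols(text):
    """Hypothetical helper: drop characters whose Unicode category is 'So'."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "So")


sample = "Hermoso ❤❤ とても美しい"
cleaned = strip_symbols(sample)  # the hearts are removed, the letters are kept
```

This only removes pictographic symbols; it does not make the text writable in a legacy encoding such as cp1252, so the file still has to be opened with a Unicode encoding.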
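I am able to scrape the data I want and print it in the console (Japanese text included), but the problem comes when storing it in a csv file: it fails with the error message shown below.
I tried opening the file with utf-8 encoding (as suggested in the comments below), but then the data is saved as strange symbols that make no sense, and I could not find an answer to this problem. Any suggestions?
I am using Python 3.5.3.
My Python code: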
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import re

file = "TajMahalSpanish.csv"
f = open(file, "w")
headers = "rating, title, review\n"
f.write(headers)

pages = 119
pageNumber = 2

option = webdriver.ChromeOptions()
option.add_argument("--incognito")
browser = webdriver.Chrome(executable_path='C:\Program Files\JetBrains\PyCharm Community Edition 2017.1.5\chrome webdriver\chromedriver', chrome_options=option)
browser.get("https://www.tripadvisor.in/Attraction_Review-g297683-d317329-Reviews-Taj_Mahal-Agra_Agra_District_Uttar_Pradesh.html")
time.sleep(10)
browser.find_element_by_xpath('//*[@id="taplc_location_review_filter_controls_0_form"]/div[4]/ul/li[5]/a').click()
time.sleep(5)
browser.find_element_by_xpath('//*[@id="BODY_BLOCK_JQUERY_REFLOW"]/span/div[1]/div/form/ul/li[2]/label').click()
time.sleep(5)

while (pages):
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")
    containers = soup.find_all("div",{"class":"innerBubble"})
    showMore = soup.find("span", {"onclick": "widgetEvCall('handlers.clickExpand',event,this);"})
    if showMore:
        browser.find_element_by_xpath("//span[@onclick=\"widgetEvCall('handlers.clickExpand',event,this);\"]").click()
        time.sleep(3)
        html = browser.page_source
        soup = BeautifulSoup(html, "html.parser")
        containers = soup.find_all("div", {"class": "innerBubble"})
        showMore = False
    for container in containers:
        bubble = container.div.div.span["class"][1]
        title = container.div.find("div", {"class": "quote"}).a.span.text
        review = container.find("p", {"class": "partial_entry"}).text
        f.write(bubble + "," + title.replace(",", "|").replace("\n", "...") + "," + review.replace(",", "|").replace("\n", "...") + "\n")
        print(bubble)
        print(title)
        print(review)
    browser.find_element_by_xpath("//div[@class='ppr_rup ppr_priv_location_reviews_list']//div[@class='pageNumbers']/span[@data-page-number='" + str(pageNumber) + "']").click()
    time.sleep(5)
    pages -= 1
    pageNumber += 1

f.close()
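The `f.write` line in the code above replaces commas and newlines by hand so that they do not break the row layout. A minimal sketch of an alternative, assuming the standard `csv` module is acceptable here: `csv.writer` quotes such fields itself, so the review text can be stored unchanged. The file name and the sample row are made up for illustration.

```python
import csv
import os
import tempfile

# Illustrative row only: a title containing a comma and a review
# containing both a newline and Japanese text.
rows = [("bubble_50", "Hermoso, sin duda", "Línea uno\nlínea dos, とても美しい")]

path = os.path.join(tempfile.mkdtemp(), "reviews_sketch.csv")

# newline="" lets the csv module control line endings;
# encoding="utf-8" avoids the platform default encoding.
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["rating", "title", "review"])
    writer.writerows(rows)

# Reading the file back recovers the fields exactly as written.
with open(path, encoding="utf-8", newline="") as f:
    read_back = list(csv.reader(f))
```

With this approach the `replace(",", "|")` and `replace("\n", "...")` calls are no longer needed, because fields containing commas or line breaks are wrapped in quotes.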
I get the following error:
Traceback (most recent call last):
  File "C:/Users/Akshit/Documents/pycharmProjects/spanish.py", line 45, in <module>
    f.write(bubble + "," + title.replace(",", "|").replace("\n", "...") + "," + review.replace(",", "|").replace("\n", "...") + "\n")
  File "C:\Users\Akshit\AppData\Local\Programs\Python\Python35\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 10-18: character maps to <undefined>

Process finished with exit code 1
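The traceback points at `cp1252.py`: when `open()` is called without an `encoding` argument, Python uses the platform's preferred encoding (`locale.getpreferredencoding(False)`), which on this Windows machine is cp1252, and cp1252 has no mapping for Japanese characters. The sketch below reproduces that failure on a sample string and shows that an explicit `encoding="utf-8"` round-trips the same text. The sample string and the temporary file path are illustrative assumptions.

```python
import os
import tempfile

text = "とても美しい"  # illustrative Japanese sample

# cp1252 cannot represent these characters, which is what the traceback reports.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Writing with an explicit Unicode encoding succeeds.
path = os.path.join(tempfile.mkdtemp(), "encoding_sketch.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text + "\n")

# Reading back with the same encoding returns the original text.
with open(path, encoding="utf-8") as f:
    round_trip = f.read().strip()
```

Applied to the script above, this would mean opening the output with `open(file, "w", encoding="utf-8")`. A file written this way is valid UTF-8; whether a given viewer displays it correctly depends on the encoding that viewer assumes.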
UPDATE
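I am trying a workaround for this issue. In the end I need to translate the Japanese reviews into English for my research anyway, so I could use one of the Google APIs to translate each string in code before writing it to the csv file.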
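UPDATE

Found a solution, as suggested by @MaartenFabré in the comments.

Basically, as far as I understand it, the problem is that Excel has trouble reading a csv file with utf-8 encoding, so when I opened the csv file (created by Python) directly in Excel, all the data appeared corrupted.

The solution is:

Thanks again to @MaartenFabré for the help!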
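One common way to make Excel open such a file correctly (an assumption here, not necessarily the fix referred to above) is to write it with the `utf-8-sig` codec. That codec prepends a byte order mark, which Excel generally uses to recognise the file as UTF-8 when it is opened directly. The sketch below writes a made-up row and checks that the mark is present.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "excel_sketch.csv")

# "utf-8-sig" writes the bytes EF BB BF at the start of the file,
# followed by ordinary UTF-8 data.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerow(["50", "素晴らしい", "Muy bonito ❤"])

# Inspect the raw bytes to confirm the byte order mark was written.
with open(path, "rb") as f:
    raw = f.read()

has_bom = raw.startswith(b"\xef\xbb\xbf")
body = raw[3:].decode("utf-8")
```

The alternative is to leave the file as plain UTF-8 and pick the encoding explicitly through Excel's text import feature instead of double-clicking the file.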