python unicode encoding decoding web-scraping
I'm trying to scrape a website using BeautifulSoup. I'm mostly successful, but I have two problems.
After getting the data from the website, I print it to the screen and also write it to a CSV file. The site has a price field that contains the rupee symbol before the actual amount (example structure of the price field: ₹ 10000). When I print the amount to the console, it prints fine without any issue. But when I try to write it to the Excel sheet, I get the error `UnicodeEncodeError: 'charmap' codec can't encode character '\u20b9' in position 28`. I am printing the other fields to the console and to Excel without any problem; the issue only shows up with two fields, one with the currency symbol and another with a * symbol.
I am running a loop to fetch all the pages of a particular search from the site. The search results run to about 344 pages, but the loop stops at around page 43, with only "HTTP Error 500" as the error message.
```python
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as Soup

filename = "data.csv"
f = open(filename, "w")
headers = "phone_name, phone_price, phone_rating, number_of_ratings, memory, display, camera, battery, processor, Warrenty, security, OS\n"
f.write(headers)

for i in range(2):  # Number of pages minus one
    my_url = 'https://www.flipkart.com/search?as=off&as-show=on&otracker=start&page={}&q=cell+phones&viewType=list'.format(i + 1)
    print(my_url)

    uClient = uReq(my_url)
    page_html = uClient.read()
    page_soup = Soup(page_html, "html.parser")
    containers = page_soup.findAll("a", {"class": "_1UoZlX"})

    for container in containers:
        phone_name = container.find("div", {"class": "_3wU53n"}).text

        try:
            phone_price = container.find("div", {"class": "_1vC4OE _2rQ-NK"}).text
        except:
            phone_price = 'No Data'
```

Thanks a lot for any help!
When writing a .csv file for Excel, the `utf-8-sig` encoding should be used so that any Unicode character is supported correctly. If plain `utf8` is used, Excel will assume the localized ANSI encoding on Windows and display the characters incorrectly.
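To see why `utf-8-sig` matters: the only difference from plain UTF-8 is a three-byte marker (the byte-order mark) at the start of the file, which Excel uses to detect UTF-8 instead of falling back to the local ANSI code page. A minimal sketch (the file name `prices.csv` and the sample row are invented for illustration):

```python
import csv

# Write one row containing a rupee price with the utf-8-sig encoding.
with open("prices.csv", "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerow(["Asus Zenfone", "\u20b99,999"])  # \u20b9 is the rupee sign

# utf-8-sig prepends the UTF-8 byte-order mark (EF BB BF) to the file;
# Excel sees it and decodes the rest of the file as UTF-8.
with open("prices.csv", "rb") as f:
    assert f.read(3) == b"\xef\xbb\xbf"
```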
```python
#!python3
import csv
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as Soup

filename = "data.csv"
with open(filename, 'w', newline='', encoding='utf-8-sig') as f:
    w = csv.writer(f)
    headers = 'phone_name phone_price phone_rating number_of_ratings memory display camera battery processor Warrenty security OS'
    w.writerow(headers.split())

    for i in range(2):  # Number of pages minus one
        my_url = 'https://www.flipkart.com/search?as=off&as-show=on&otracker=start&page={}&q=cell+phones&viewType=list'.format(i + 1)
        print(my_url)
        uClient = uReq(my_url)
        page_html = uClient.read()
        page_soup = Soup(page_html, "html.parser")
        containers = page_soup.findAll("a", {"class": "_1UoZlX"})

        for container in containers:
            phone_name = container.find("div", {"class": "_3wU53n"}).text

            try:
                phone_price = container.find("div", {"class": "_1vC4OE _2rQ-NK"}).text
            except:
                phone_price = 'No Data'

            w.writerow([phone_name, phone_price])
```

Output:
```
phone_name,phone_price,phone_rating,number_of_ratings,memory,display,camera,battery,processor,Warrenty,security,OS
"Asus Zenfone 3 Laser (Gold, 32 GB)","₹9,999"
"Intex Aqua Style III (Champagne/Champ, 16 GB)","₹3,999"
"iVooMi i1s (Platinum Gold, 32 GB)","₹7,499"
"Xolo ERA 3X (Posh Black, 16 GB)","₹6,999"
"iVooMi Me1 (Sunshine Gold, 8 GB)","₹3,599"
"Panasonic Eluga A4 (Mocha Gold, 32 GB)","₹9,790"
Samsung Metro 313 Dual Sim,"₹2,025"
"Samsung Galaxy J3 Pro (Gold, 16 GB)","₹6,990"
Samsung Guru Music 2,"₹1,625"
"Panasonic Eluga A4 (Marine Blue, 32 GB)","₹9,640"
"Asus Zenfone 4 Selfie (Black, 32 GB)","₹9,999"
Swipe Elite 3- 4G with VoLTE,"₹3,999"
"Asus Zenfone Max (Black, 16 GB)","₹7,486"
Swipe Elite 3- 4G with VoLTE,"₹3,999"
"Swipe Elite Power (Space Grey, 16 GB)","₹5,499"
"Celkon Diamond Mega (Grey, 16 GB)","₹5,499"
"Asus Zenfone Max (Black, 32 GB)","₹7,999"
"Swipe Elite Power (Champagne Gold, 16 GB)","₹5,499"
"Asus Zenfone 4 Selfie (Gold, 32 GB)","₹9,999"
"Karbonn Aura (Champagne, 8 GB)","₹3,199"
"Infinix Note 4 (Ice Blue, 32 GB)","₹8,999"
"Infinix Note 4 (Milan Black, 32 GB)","₹8,999"
"Moto G5s Plus (Blush Gold, 64 GB)","₹15,990"
"Moto G5s Plus (Lunar Grey, 64 GB)","₹15,940"
```

Excel:
(screenshot of the CSV opened in Excel)
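The encoding fix above covers the first problem. For the second one (the loop dying around page 43 with HTTP Error 500), retrying the request a few times is one option, on the assumption the 500s are transient server hiccups rather than blocking. A sketch (`fetch_with_retry` is a made-up helper, not part of urllib; the `opener` parameter exists so the retry logic can be exercised without a network):

```python
import time
from urllib.request import urlopen as uReq
from urllib.error import HTTPError

def fetch_with_retry(url, opener=uReq, retries=3, delay=1.0):
    # Retry transient server errors (HTTP 5xx) a few times before giving up,
    # so one bad response does not stop the whole pagination loop.
    for attempt in range(retries):
        try:
            return opener(url).read()
        except HTTPError as e:
            if e.code >= 500 and attempt < retries - 1:
                time.sleep(delay)  # brief pause before the next attempt
            else:
                raise  # 4xx errors, or retries exhausted: re-raise
```

In the loop above, `page_html = uClient.read()` would become `page_html = fetch_with_retry(my_url)`; pages that still fail after all retries raise as before.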