Scraping a page containing multiple tables with BeautifulSoup

Gia*_*ini 3 python beautifulsoup web-scraping

I have been trying to scrape a particular page for a few days now, to no avail. I am a newbie at both scraping and Python.

What I am really after is the last, large table on the page, but there is no id to rely on, so I tried to scrape all the tables.

I came up with this code:

import requests
from bs4 import BeautifulSoup

url = "https://www.freecell.net/f/c/personal.html?uname=Giampaolo44&submit=Go"

r = requests.get(url)
r.raise_for_status()
html_content = r.text

soup = BeautifulSoup(html_content, "html.parser")

tables = soup.find_all("table")

for table in tables:
    row_data = []
    for row in table.find_all('tr'):
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        row_data.append(cols)
    print(row_data)

With the above I get a lot of garbage in the print output (*); this has been my standard output for two days.

(*) i.e.:

['12/155:27\xa0pm8x4\xa05309-6Streak4:07Won12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', '5:27\xa0pm8x4\xa05309-6Streak4:07Won12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', '8x4\xa05309-6Streak4:07Won12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', 'Streak4:07Won12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', '4:07Won12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', 'Won12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', '12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', '5:23\xa0pm8x4\xa013396-6Streak2:49Won', '8x4\xa013396-6Streak2:49Won', 'Streak2:49Won', '2:49Won', 'Won'], ['12/155:23\xa0pm8x4\xa013396-6Streak2:49Won', '5:23\xa0pm8x4\xa013396-6Streak2:49Won', '8x4\xa013396-6Streak2:49Won', 'Streak2:49Won', '2:49Won', 'Won']]

QHa*_*arr 5

If you only want the last table, you can index into the list of table tags:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

url = 'https://www.freecell.net/f/c/personal.html?uname=Giampaolo44&submit=Go'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36'}

r = requests.get(url, headers=headers)
soup = bs(r.content, 'lxml')

# the last <table> on the page is the one you want
table = soup.select('table')[-1]
rows = table.find_all('tr')

output = []
for row in rows:
    cols = row.find_all('td')
    cols = [item.text.strip() for item in cols]
    output.append([item for item in cols if item])  # keep non-empty cells only

df = pd.DataFrame(output, columns=['Date', 'Time', 'Game', 'Mode', 'Elapsed', 'Won/Lost'])
df = df.iloc[1:]  # drop the first (header) row
print(df)
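A simpler alternative worth trying (a sketch, not verified against this page): pandas.read_html parses every <table> it finds in the HTML in one call and returns a list of DataFrames, so you can take the last one directly instead of looping over rows yourself. It needs lxml (or html5lib) installed, and the column layout it infers may differ from the hand-built DataFrame above; the user-agent string below is just a placeholder.

import io
import requests
import pandas as pd

url = 'https://www.freecell.net/f/c/personal.html?uname=Giampaolo44&submit=Go'
headers = {'User-Agent': 'Mozilla/5.0'}  # any reasonable browser-like UA string

r = requests.get(url, headers=headers)
r.raise_for_status()

# read_html returns one DataFrame per <table> in the document
tables = pd.read_html(io.StringIO(r.text))

df = tables[-1]  # the last (big) table on the page
print(df.head())

# If the parsed table ends up with the same six columns as above, you could
# relabel them with:
# df.columns = ['Date', 'Time', 'Game', 'Mode', 'Elapsed', 'Won/Lost']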