Phi*_*Phi 1 python selenium beautifulsoup web-scraping python-3.x
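I am a Python beginner. In this question, data was extracted from Forex Factory. At the time, the solution worked by following their logic and finding the table with soup.find('table', class_="calendar__table"). However, the site's structure has since changed: the HTML table has been removed and converted to some JavaScript format, so that solution no longer finds anything.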
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.forexfactory.com/calendar.php?day=nov18.2016')
soup = BeautifulSoup(r.text, 'lxml')
calendar_table = soup.find('table', class_="calendar__table")
print(calendar_table)
# for row in calendar_table.find_all('tr', class_=['calendar__row calendar_row','newday']):
#     row_data = [td.get_text(strip=True) for td in row.find_all('td')]
#     print(row_data)
Since I am a beginner, I don't know how to approach this. How can I scrape the data? Any hints would help. Thank you very much for reading my post.
Since you tagged this question with selenium, this answer relies on Selenium. For convenience, I am using webdriver-manager.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
try:
    driver.get("http://www.forexfactory.com/calendar.php?day=nov18.2016")
    # Get the table
    table = driver.find_element(By.CLASS_NAME, "calendar__table")
    # Iterate over each table row
    for row in table.find_elements(By.TAG_NAME, "tr"):
        # Collect the text of each cell and filter out empty cells
        row_data = list(filter(None, [td.text for td in row.find_elements(By.TAG_NAME, "td")]))
        if not row_data:
            continue
        print(row_data)
except Exception as e:
    print(e)
finally:
    driver.quit()
This currently prints:
['Fri\nNov 18', '2:00am', 'EUR', 'German PPI m/m', '0.7%', '0.3%', '-0.2%']
['3:30am', 'EUR', 'ECB President Draghi Speaks']
['4:00am', 'EUR', 'Current Account', '25.3B', '31.3B', '29.1B']
['4:10am', 'GBP', 'MPC Member Broadbent Speaks']
['5:30am', 'CHF', 'Gov Board Member Maechler Speaks']
['EUR', 'German Buba President Weidmann Speaks']
['USD', 'FOMC Member Bullard Speaks']
['8:30am', 'CAD', 'Core CPI m/m', '0.2%', '0.3%', '0.2%']
['CAD', 'CPI m/m', '0.2%', '0.2%', '0.1%']
['9:30am', 'USD', 'FOMC Member Dudley Speaks']
['USD', 'FOMC Member George Speaks']
['10:00am', 'USD', 'CB Leading Index m/m', '0.1%', '0.1%', '0.2%']
['9:45pm', 'USD', 'FOMC Member Powell Speaks']
The printed data only shows that the data can be extracted; you will need to adapt and format it to suit your needs.
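As one possible way to do that formatting, here is a minimal sketch that turns the printed lists into dictionaries. The function name `normalize_rows` and the key names are hypothetical, not part of the answer above. It assumes that the currency cell is always a three-letter code or "All", and that a row without a time cell shares the time of the row before it. Because empty cells were filtered out, the trailing values are kept as a plain list rather than mapped to named columns.

```python
import re

# Assumption: the currency column holds a three-letter code or "All".
CURRENCY = re.compile(r"^(?:[A-Z]{3}|All)$")


def normalize_rows(rows):
    """Turn scraped cell lists into dicts, carrying date and time forward."""
    records = []
    current_date = None
    current_time = None
    for row in rows:
        # Find the currency cell; everything before it is date and/or time.
        idx = next((i for i, cell in enumerate(row) if CURRENCY.match(cell)), None)
        if idx is None or idx + 1 >= len(row):
            continue  # no currency or no event name: not an event row
        for cell in row[:idx]:
            if "\n" in cell:
                # A date cell such as "Fri\nNov 18" starts a new day.
                current_date = cell.replace("\n", " ")
                current_time = None
            else:
                current_time = cell
        records.append({
            "date": current_date,
            "time": current_time,
            "currency": row[idx],
            "event": row[idx + 1],
            "figures": row[idx + 2:],  # whatever numeric cells were present
        })
    return records
```

Calling `normalize_rows` on the lists collected in the loop above gives one dictionary per event, which is easier to write to CSV or load into a DataFrame than the ragged lists.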
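They have since put some cloud protection in place, so BeautifulSoup can no longer collect the data. For this we have to use Selenium.
Working code example: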
import random
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


def create_driver():
    user_agent_list = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 11.5; rv:90.0) Gecko/20100101 Firefox/90.0',
        'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_5_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
        'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
    ]
    # Pick a random user agent for this session
    user_agent = random.choice(user_agent_list)
    browser_options = webdriver.ChromeOptions()
    browser_options.add_argument("--no-sandbox")
    browser_options.add_argument("--headless")
    browser_options.add_argument("start-maximized")
    browser_options.add_argument("window-size=1900,1080")
    browser_options.add_argument("disable-gpu")
    browser_options.add_argument("--disable-software-rasterizer")
    browser_options.add_argument("--disable-dev-shm-usage")
    browser_options.add_argument(f'user-agent={user_agent}')
    # Selenium 4 passes chromedriver arguments through a Service object
    service = Service(service_args=["--verbose", "--log-path=test.log"])
    driver = webdriver.Chrome(service=service, options=browser_options)
    return driver


def parse_data(driver, url):
    driver.get(url)
    data_table = driver.find_element(By.CLASS_NAME, "calendar__table")
    value_list = []
    for row in data_table.find_elements(By.TAG_NAME, "tr"):
        # Keep only the non-empty cells of each row
        row_data = list(filter(None, [td.text for td in row.find_elements(By.TAG_NAME, "td")]))
        if row_data:
            value_list.append(row_data)
    return value_list


driver = create_driver()
url = 'https://www.forexfactory.com/calendar?day=aug26.2021'
try:
    value_list = parse_data(driver=driver, url=url)
finally:
    # Always close the browser, even if scraping fails
    driver.quit()

for value in value_list:
    # The first row of a day starts with a cell like "Thu\nAug 26"
    if '\n' in value[0]:
        date_str = value.pop(0).replace('\n', ' - ')
        print(f'Date: {date_str}')
    print(value)
Output:
Date: Thu - Aug 26
['2:00am', 'EUR', 'German GfK Consumer Climate', '-1.2', '-0.5', '-0.4']
['4:00am', 'EUR', 'M3 Money Supply y/y', '7.6%', '7.6%', '8.3%']
['EUR', 'Private Loans y/y', '4.2%', '4.1%', '4.0%']
['7:30am', 'EUR', 'ECB Monetary Policy Meeting Accounts']
['8:30am', 'USD', 'Prelim GDP q/q', '6.6%', '6.7%', '6.5%']
['USD', 'Unemployment Claims', '353K', '345K', '349K']
['USD', 'Prelim GDP Price Index q/q', '6.1%', '6.0%', '6.0%']
['10:30am', 'USD', 'Natural Gas Storage', '29B', '40B', '46B']
['Day 1', 'All', 'Jackson Hole Symposium']
['5:00pm', 'USD', 'President Biden Speaks']
['7:30pm', 'JPY', 'Tokyo Core CPI y/y', '0.0%', '-0.1%', '0.1%']
['9:30pm', 'AUD', 'Retail Sales m/m', '-2.7%', '-2.6%', '-1.8%']
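To scrape other days, the `day=aug26.2021` query parameter in the URL can be generated from a date. This is only a sketch: `calendar_url` is a hypothetical helper, and the format is inferred from the two URLs in this thread (lowercase month abbreviation, day, a dot, four-digit year). Whether single-digit days need zero padding is not shown here, so the helper leaves them unpadded as an assumption.

```python
from datetime import date

# Month abbreviations are spelled out to avoid locale-dependent strftime output.
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def calendar_url(day):
    """Build a calendar URL in the form used above (assumed format)."""
    # Assumption: the day of the month is not zero-padded.
    return (
        "https://www.forexfactory.com/calendar"
        f"?day={MONTHS[day.month - 1]}{day.day}.{day.year}"
    )


print(calendar_url(date(2021, 8, 26)))
# prints https://www.forexfactory.com/calendar?day=aug26.2021
```

The returned string can be passed as the `url` argument of `parse_data`.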