Posts by kha*_*baa

Create a new column based on a condition on a pandas list column

I have a dataframe with a column of lists:

col_1            
[A, A, A, B, C]
[D, B, C]
[C]
[A, A, A]
NaN

I want to create a new column that returns 1 if the list starts with three A's, and 0 otherwise:

col_1              new_col           
[A, A, A, B, C]    1
[D, B, C]          0
[C]                0
[A, A, A]          1
NaN                0

I tried this, but it didn't work:

df['new_col'] = df.loc[df.col_1[0:3] == [A, A, A]]
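
For reference, a minimal sketch of one way to do this, assuming the list elements are strings and that NaN rows should get 0 (the dataframe below just mirrors the sample data above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': [['A', 'A', 'A', 'B', 'C'],
                             ['D', 'B', 'C'],
                             ['C'],
                             ['A', 'A', 'A'],
                             np.nan]})

# A NaN cell is a float, not a list, so guard with isinstance before slicing,
# then compare the first three elements against ['A', 'A', 'A'].
df['new_col'] = df['col_1'].apply(
    lambda x: 1 if isinstance(x, list) and x[:3] == ['A', 'A', 'A'] else 0)
print(df)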

python pandas

5 votes · 1 answer · 312 views

Scraping headlines from a news site with infinite loading

I want to scrape the headlines from this site: https://www.marketwatch.com/latest-news?mod=top_nav

I need to load older news, so I have to click the blue "View more" button.

I wrote this code, but it didn't work:

from bs4 import BeautifulSoup
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
u = 'https://www.marketwatch.com/latest-news?mod=top_nav' #US Business


driver = webdriver.Chrome(executable_path=r"C:/chromedriver.exe")
driver.maximize_window()
driver.get(u)
time.sleep(10)
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CLASS_NAME,'close-btn'))).click()
time.sleep(10)

driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
for i in range(3):
        element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'component.component--module.more-headlines div.group.group--buttons.cover > a.btn.btn--secondary.js--more-headlines')))
        driver.execute_script("arguments[0].scrollIntoView();", element)
        element.click()
        time.sleep(5)
        driver.execute_script("arguments[0].scrollIntoView();", element)

        print(f'click {i} done')
soup = BeautifulSoup(driver.page_source, 'html.parser')

driver.quit()

It returns this error:

raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
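
For reference, a minimal sketch of one way this kind of "load more" loop is often handled: dismiss the pop-up only if it actually appears, then click the much shorter selector a.js--more-headlines (the last class in the long selector above) via JavaScript so an overlay cannot intercept the click. The driver path and the number of clicks are copied from the question, and the site's markup may have changed since:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

u = 'https://www.marketwatch.com/latest-news?mod=top_nav'
driver = webdriver.Chrome(executable_path=r"C:/chromedriver.exe")  # driver path from the question
driver.maximize_window()
driver.get(u)

# The pop-up is not always shown, so a missing close button should not abort the run.
try:
    WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CLASS_NAME, 'close-btn'))).click()
except Exception:
    pass

for i in range(3):
    # 'a.js--more-headlines' is taken from the class names used in the question's selector.
    more = WebDriverWait(driver, 20).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, 'a.js--more-headlines')))
    driver.execute_script("arguments[0].scrollIntoView();", more)
    driver.execute_script("arguments[0].click();", more)  # JS click avoids overlay interception
    time.sleep(5)
    print(f'click {i} done')

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()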

python web-scraping selenium-webdriver

5 votes · 1 answer · 146 views

Scraping page links with selenium always returns a limited number of links

I want to scrape all the match links from this page, https://m.aiscore.com/basketball/20210610, but I only get a limited number of matches. I tried this code:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless") 
driver = webdriver.Chrome(executable_path=r"C:/chromedriver.exe", options=options)

url = 'https://m.aiscore.com/basketball/20210610'
driver.get(url)

driver.maximize_window()
driver.implicitly_wait(60) 

driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")    

soup = BeautifulSoup(driver.page_source, 'html.parser')

links = [i['href'] for i in soup.select('.w100.flex a')]
links_length = len(links) #always return 16
driver.quit()

When I run the code I always get only 16 match links, but the page has 35 matches. I need to get all of the match links on the page.
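
For reference, a minimal sketch of one workaround, assuming the missing matches are rendered lazily as the page scrolls: scroll down in small steps and collect the links at each step rather than parsing the page once at the end. The '.w100.flex a' selector and the driver path are taken from the question; the step size and pauses are guesses:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(executable_path=r"C:/chromedriver.exe", options=options)  # driver path from the question

driver.get('https://m.aiscore.com/basketball/20210610')
time.sleep(5)

links = set()
pos = 0
height = driver.execute_script("return document.body.scrollHeight")
while pos < height:
    # Grab whatever is rendered right now; rows that scroll out of view may be
    # dropped from the DOM, so accumulate on every step instead of once at the end.
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    links.update(a['href'] for a in soup.select('.w100.flex a'))
    pos += 500
    driver.execute_script(f"window.scrollTo(0, {pos});")
    time.sleep(1)
    height = driver.execute_script("return document.body.scrollHeight")

print(len(links))  # ideally counts every match link on the page
driver.quit()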

python beautifulsoup web-scraping selenium-webdriver

4 votes · 1 answer · 223 views