Posts by editor bar*_*rny

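Scraping web articles from WSJ using BeautifulSoup in Python 3.7?

I am trying to scrape articles from the Wall Street Journal using BeautifulSoup in Python. However, the code I am running executes without any errors (exit code 0) but produces no results. I don't understand what is happening. Why doesn't this code give the expected results?

I even paid for a subscription.

I know something is wrong, but I can't find the problem.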

import time

import requests
from bs4 import BeautifulSoup

url = 'https://www.wsj.com/search/term.html?KEYWORDS=cybersecurity&min-date=2018/04/01&max-date=2019/03/31' \
  '&isAdvanced=true&daysback=90d&andor=AND&sort=date-desc&source=wsjarticle,wsjpro&page={}'

pages = 32
for page in range(1, pages+1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text,"lxml")
    # If this selector matches nothing, the loop body never runs and nothing is printed.
    for item in soup.select(".items.hedSumm li > a"):
        _href = item.get("href")

        # Try the link as given first, then fall back to treating it as site-relative.
        try:
            resp = requests.get(_href)
        except Exception as e:
            try:
                resp = requests.get("https://www.wsj.com" + _href)
            except Exception as e:
                continue

        sauce = BeautifulSoup(resp.text,"lxml")
        date = sauce.select("time.timestamp.article__timestamp.flexbox__flex--1")
        date = date[0].text
        tag = sauce.select("li.article-breadCrumb span").text …

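The two-step fallback in the loop above (request the href as given, then retry with the site prefix) could probably be replaced by resolving every link against the site root first. The following is only a minimal sketch using the standard library; the article paths in the usage lines are made up for illustration and are not taken from WSJ.

```python
from urllib.parse import urljoin

BASE = "https://www.wsj.com"  # site root, as used in the question's fallback


def absolute_url(href, base=BASE):
    """Return an absolute URL whether href is site-relative or already absolute."""
    return urljoin(base, href)


# Hypothetical hrefs, for illustration only.
relative = absolute_url("/articles/example-story")
already_absolute = absolute_url("https://www.wsj.com/articles/another-story")
```

This does not explain the empty output by itself. If `soup.select(".items.hedSumm li > a")` matches nothing in the HTML that `requests` receives, the inner loop is skipped silently, so printing `res.status_code` and the number of matched links is a reasonable first check.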

python beautifulsoup web-scraping

Piy*_*iya

2021 02-05

2
Score

1
Answers

1671
Views

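AttributeError: 'str' object has no attribute 'descendants' when using BeautifulSoup

@ayivima has a good answer, but I should add that the site itself ultimately cannot be scraped properly with BeautifulSoup, because it relies heavily on JavaScript.

So I am completely new to Python, and I just want to print the title of a web page. I am mostly using code I found on Google: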

from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601"
page = requests.get(url)
data = page.text
soup = BeautifulSoup
soup.find_all('h1')

print(text)


I keep getting the error:

AttributeError: 'str' object has no attribute 'descendants'


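Honestly, I don't really know what this means, and the only other answer I could find is from: AttributeError: 'str' object has no attribute 'descendants', which I don't think applies to me?

Am I doing something wrong in my code? (Probably a lot, but I mean mainly for this error.)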
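The error most likely comes from the line `soup = BeautifulSoup`, which binds the class itself instead of a parsed document. `soup.find_all('h1')` then runs with the string 'h1' standing in for the document object. The sketch below reproduces that mechanism with a small stand-in class. `TinySoup` is hypothetical and is not the bs4 API. With bs4 itself, the matching fix would be `soup = BeautifulSoup(data, "html.parser")`, followed by printing the result of `soup.find_all('h1')`.

```python
class TinySoup:
    """Hypothetical stand-in for a parsed document; not the bs4 API."""

    def __init__(self, tags):
        self.descendants = list(tags)

    def find_all(self, name=None):
        # Reads an attribute that only exists on an instance.
        return [tag for tag in self.descendants if name is None or tag == name]


# Calling the method on the class: "h1" is passed as `self`,
# so the attribute lookup happens on a plain string and fails.
try:
    TinySoup.find_all("h1")
except AttributeError as exc:
    error_message = str(exc)

# Calling it on an instance works as intended.
soup = TinySoup(["h1", "p", "h1"])
headings = soup.find_all("h1")
```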

python beautifulsoup python-3.x

fac*_*asd

2021 06-07

2
Score

1
Answers

9693
Views

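How does Prometheus scrape a Kafka topic?

I am a network specialist trying to build my first Kafka --> Prometheus --> Grafana pipeline. My Kafka broker has a topic that is populated by an external producer. That works great. But I can't figure out how to configure my Prometheus server to scrape data from that topic as a consumer.

I should also say that my Kafka node runs on my host Ubuntu machine (not in a Docker container). When I run Kafka, I also run an instance of the JMX exporter. Here is how I start Kafka on the Ubuntu command line: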

KAFKA_OPTS="$KAFKA_OPTS -javaagent:/home/me/kafka_2.11-2.1.1/jmx_prometheus_javaagent-0.6.jar=7071:/home/Me/kafka_2.11-2.1.1/kafka-0-8-2.yml" \
  ./bin/kafka-server-start.sh config/server.properties &


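OK. My Prometheus (also a host process, not the Docker container version) can successfully pull a lot of metrics from my Kafka. So I just need to figure out how to get Prometheus to read the messages in my topic. I wonder whether those messages are already visible? My topic is called "vflow.sflow", and when I look at the "scrapable" metrics available on Kafka (TCP 7071), I do see these metrics: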

From http://localhost:7071/metrics:

kafka_cluster_partition_replicascount{partition="0",topic="vflow.sflow",} 1.0
kafka_cluster_partition_insyncreplicascount{partition="0",topic="vflow.sflow",} 1.0
kafka_log_logendoffset{partition="0",topic="vflow.sflow",} 1.5357405E7
kafka_cluster_partition_laststableoffsetlag{partition="0",topic="vflow.sflow",} 0.0
kafka_log_numlogsegments{partition="0",topic="vflow.sflow",} 11.0
kafka_cluster_partition_underminisr{partition="0",topic="vflow.sflow",} 0.0
kafka_cluster_partition_underreplicated{partition="0",topic="vflow.sflow",} 0.0
kafka_log_size{partition="0",topic="vflow.sflow",} 1.147821017E10
kafka_log_logstartoffset{partition="0",topic="vflow.sflow",} 0.0


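"Partition 0", "log size", "log end offset"... all of that looks promising... I guess?

But please keep in mind that I am completely new to the Kafka/JMX/Prometheus ecosystem. Question: do the metrics above describe my "vflow.sflow" topic? Can I use them to configure Prometheus to actually read the messages in the topic?

If so, can anyone recommend a good tutorial for this? I have been playing with my Prometheus YAML config file, but all I have managed to do, when I do that, is make Prometheus …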
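For what it is worth, the samples listed above do describe the "vflow.sflow" topic, but only its log on the broker (offsets, size, replica counts), not the content of the messages. Prometheus scrapes numeric samples from HTTP endpoints such as the one on port 7071; it does not act as a Kafka consumer. Turning message contents into metrics would presumably need a separate consumer process that exposes its own metrics endpoint. The sketch below only parses three of the lines quoted above to show what they carry. The regular expression is a simplification that assumes this exact line shape.

```python
import re

# Three of the samples quoted from http://localhost:7071/metrics in the question.
SAMPLE = """\
kafka_log_logendoffset{partition="0",topic="vflow.sflow",} 1.5357405E7
kafka_log_logstartoffset{partition="0",topic="vflow.sflow",} 0.0
kafka_log_size{partition="0",topic="vflow.sflow",} 1.147821017E10
"""

LINE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')


def parse_metrics(text):
    """Map (metric name, topic, partition) to the sample value."""
    samples = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        labels = dict(re.findall(r'(\w+)="([^"]*)"', match["labels"]))
        key = (match["name"], labels.get("topic"), labels.get("partition"))
        samples[key] = float(match["value"])
    return samples


samples = parse_metrics(SAMPLE)
end_offset = samples[("kafka_log_logendoffset", "vflow.sflow", "0")]
start_offset = samples[("kafka_log_logstartoffset", "vflow.sflow", "0")]
offset_span = int(end_offset - start_offset)  # offsets currently held in the partition log
log_bytes = int(samples[("kafka_log_size", "vflow.sflow", "0")])
```

These values say how much data the partition holds, which is useful for dashboards and alerting, but none of them contains a message payload.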

apache-kafka kafka-consumer-api prometheus jmx-exporter

Pet*_*ete

2021 06-03

2
Score

1
Answers

4254
Views

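Prometheus Exporter - direct instrumentation vs. custom collector

I am currently writing a Prometheus exporter for a telemetry network application.

I have read the Writing Exporters documentation, and while I understand the use case for implementing a custom collector to avoid race conditions, I am not sure whether my use case fits direct instrumentation.

Basically, the network metrics are streamed by the network devices via gRPC, so my exporter just receives them and does not have to effectively scrape them.

I used direct instrumentation with the following code:

I declare my metrics using the promauto package to keep the code compact: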

package metrics

import (
    "github.com/lucabrasi83/prom-high-obs/proto/telemetry"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Gauge with one time series per device, keyed by the "node" label.
    cpu5Sec = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "cisco_iosxe_iosd_cpu_busy_5_sec_percentage",
            Help: "The IOSd daemon CPU busy percentage over the last 5 seconds",
        },
        []string{"node"},
    )
)


Below is how I simply set the metric value from the message decoded from the gRPC protocol buffer:

cpu5Sec.WithLabelValues(msg.GetNodeIdStr()).Set(float64(val))


Finally, here is my main loop, which basically handles the telemetry gRPC stream for the metrics I am interested in:

for {

        req, err := stream.Recv()
        if err == io.EOF {
            return nil
        }
        if err != nil {
            logging.PeppaMonLog(
                "error",
                fmt.Sprintf("Error while reading client %v stream: …

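The pattern described above (a stream handler sets the latest value per node, and a scrape reads whatever is current) can be sketched in a few lines. This is an illustration in Python rather than Go, and `LatestGaugeStore` is a hypothetical stand-in, not an API of any Prometheus client library. It only shows that values pushed by devices can be kept under a lock and rendered at scrape time.

```python
import threading


class LatestGaugeStore:
    """Hypothetical stand-in for a labelled gauge; not a Prometheus client API."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = {}
        self._lock = threading.Lock()

    def set(self, node, value):
        # Called from the stream handler each time a device pushes a sample.
        with self._lock:
            self._values[node] = float(value)

    def render(self):
        # Called at scrape time: snapshot under the lock, then format.
        with self._lock:
            snapshot = sorted(self._values.items())
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} gauge",
        ]
        lines += [f'{self.name}{{node="{node}"}} {value}' for node, value in snapshot]
        return "\n".join(lines) + "\n"


store = LatestGaugeStore(
    "cisco_iosxe_iosd_cpu_busy_5_sec_percentage",
    "The IOSd daemon CPU busy percentage over the last 5 seconds",
)
store.set("router-1", 12)  # node names here are made up for the example
store.set("router-1", 15)  # a newer sample replaces the older one
store.set("router-2", 3)
exposition = store.render()
```

One trade-off worth noting with this approach: a node that stops streaming keeps its last value until something removes it, which is one of the situations where a custom collector is usually considered.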

go prometheus

Luc*_*asi

2021 06-02

2
Score

1
Answers

1379
Views

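TypeError: request() got an unexpected keyword argument 'header' when I use header - 403 error without header

I am trying to scrape information from this website but keep getting status code 403, so I tried using a header but got TypeError: request() got an unexpected keyword argument 'header'

Code: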

import requests

head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0'}
url = "https://www.accuweather.com/en/bd/dhaka/28143/current-weather/28143"
pageObj = requests.get(url, header = head)
print("Status code: " + str(pageObj.status_code)) # *for testing purpose*
Error:

Traceback (most recent call last):
  File "F:/Python/PyCharm Community Edition 2019.2.3/Workshop/WEB_SCRAPING/test2.py", line 6, in <module>
    pageObj = requests.get(url, header = head)
  File "F:\Python\PyCharm Community Edition 2019.2.3\Workshop\WEB_SCRAPING\venv\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "F:\Python\PyCharm …
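The TypeError comes from the keyword name: `requests.get` accepts `headers` (plural), so the call would read `requests.get(url, headers=head)`. The sketch below shows the same idea with the standard library's `urllib.request` instead of `requests`, only so that it can be checked without any network access. It builds the request and confirms the User-Agent is attached, but does not send anything. Whether a browser-like User-Agent is enough to avoid the 403 on this site is an assumption that would need testing.

```python
import urllib.request

head = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) "
                  "Gecko/20100101 Firefox/71.0"
}
url = "https://www.accuweather.com/en/bd/dhaka/28143/current-weather/28143"

# Note the plural keyword: headers=, not header=.
request = urllib.request.Request(url, headers=head)

# urllib stores header names capitalized, hence "User-agent" in the lookup.
sent_user_agent = request.get_header("User-agent")
```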

html python http-status-code-403 python-requests

Red*_*nob

2021 05-19

2
Score

1
Answers

20k
Views

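Multithreaded download of Yahoo stock history with Python yfinance

I am trying to download historical data for a list of tickers and export each one to a CSV file. I can get it to work as a for loop, but that is very slow when the ticker list is in the thousands. I am trying to multithread the process, but I keep getting many different errors. Sometimes it downloads only 1 file, sometimes 2 or 3, and occasionally even 6, but never more than that. I am guessing it has something to do with having a 6-core, 12-thread processor, but I really don't know.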

import csv
import os
import yfinance as yf
import pandas as pd
from threading import Thread

ticker_list = []
with open('tickers.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    name = None
    for row in reader:
        if row[0]:
            ticker_list.append(row[0])

start_date = '2019-03-03'
end_date = '2020-03-04'
data = pd.DataFrame()

def y_hist(i):
    ticker = ticker_list[i]
    data = yf.download(ticker, start=start_date, end=end_date, group_by="ticker")
    data.to_csv('yhist/' + ticker …
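The threading part of the question is truncated, so the cause of the errors cannot be read from the snippet. One common way to structure this kind of job is a thread pool in which every worker keeps its data in local variables and writes its own file. The following is a minimal sketch under that assumption. `fetch_history` is a hypothetical stand-in for the `yf.download(...)` call so that the example runs offline, and the column names are invented.

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fetch_history(ticker):
    """Hypothetical stand-in for yf.download(...): returns rows for one ticker."""
    return [("2019-03-04", ticker, 1.0), ("2019-03-05", ticker, 2.0)]


def save_history(ticker, out_dir):
    """Download one ticker and write it to its own CSV file."""
    rows = fetch_history(ticker)  # local to this call, not a shared global
    path = Path(out_dir) / f"{ticker}.csv"
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["date", "ticker", "close"])
        writer.writerows(rows)
    return path


def download_all(tickers, out_dir, max_workers=8):
    """Run save_history for every ticker on a bounded pool of threads."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces all results, so an exception in any worker surfaces here.
        return list(pool.map(lambda ticker: save_history(ticker, out_dir), tickers))
```

A call such as `download_all(ticker_list, "yhist")` would then replace the manual `Thread` bookkeeping. The pool size is independent of the number of CPU cores, since the work is mostly waiting on the network.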

python multithreading yfinance

rwa*_*rwa

2021 05-03

2
Score

1
Answers

2153
Views

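How can I scrape football results from flashscore using Python

Web scraping Python

I am new to scraping. I want to scrape the 2018-19 Premier League results (fixtures, results, dates), but I am struggling to navigate the site. All I get is empty lists / [None]. If you have a solution you can share, that would be a great help.

This is what I tried.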

import pandas as pd
import requests as uReq
from bs4 import BeautifulSoup

url = uReq.get('https://www.flashscore.com/football/england/premier-league-2018-2019/results/')
soup = BeautifulSoup(url.text, 'html.parser')
divs = soup.find_all('div', attrs={'id': 'live-table'})
Home = []
for div in divs:
    anchor = div.find(class_='event__participant event__participant--home')
    Home.append(anchor)
print(Home)

beautifulsoup web-scraping python-3.x python-requests

ory*_*nnn

2021 02-10

2
Score

1
Answers

3364
Views

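A tool for extracting an XPath query from a specified/selected node

Usually, one uses an XPath query to obtain a certain value or node. In my case, I am doing some web scraping with Google Spreadsheets, using the importXML function to update some values automatically. Two examples are given below: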

=importxml("http://www.creditagricoledtvm.com.br/";"(//td[@class='xl7825385'])[9]")
=importxml("http://www.bloomberg.com/quote/ELIPCAM:BZ";"(//span)[32]")
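The problem is that the pages I am scraping change from time to time, and I know very little about XML/XPath, so it takes a lot of trial and error to reach a node. I would like to know whether there is any tool I could use to point at an element (in the page or in its code) and get an appropriate query.

For example, in the second case, I noticed that the information I wanted was in a span node (hence (//span)), so I printed all of them in the spreadsheet and used the row count to find the index [32]. That took very long to load, so it is quite inconvenient. Also, I cannot even remember how I worked out the //td[@class='xl7825385'] query. Hence my question about whether there is a more practical way to point at a page element.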
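To make the idea of "pointing at a node and getting its query" concrete, here is a small sketch that derives a positional XPath for a chosen element. It uses Python's standard `xml.etree.ElementTree` and assumes well-formed markup, which real web pages often are not. The sample document and the helper name `xpath_of` are invented for illustration.

```python
import xml.etree.ElementTree as ET


def xpath_of(root, target):
    """Build an absolute, positional XPath from root down to target."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parents[node]
        same_tag = [child for child in parent if child.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")  # XPath counts from 1
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


# Invented sample page, for illustration only.
page = ET.fromstring(
    "<html><body><div><span>a</span><span>b</span></div><span>c</span></body></html>"
)
spans = page.findall(".//span")  # document order: a, b, c
path_to_b = xpath_of(page, spans[1])
path_to_c = xpath_of(page, spans[2])
```

A path built this way is tied to the page layout, so it breaks when the page changes, much like the `(//span)[32]` query. Anchoring on a stable attribute, as in the `//td[@class=...]` example, tends to survive layout changes better when such an attribute exists.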

xpath google-sheets web-scraping

Mef*_*ico

2021 02-18

2
Score

1
Answers

246
Views

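Scroll to the end of an infinitely loading page using Selenium in Python

I am using Selenium to scrape follower names from Twitter, and the page is infinite: whenever I scroll down, I can see new followers. Somehow I want to get to the bottom of the page so that I can scrape all the followers.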

while number != 5:
    driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    number = number + 1
    time.sleep(5)
usernames = driver.find_elements_by_class_name(
    "css-4rbku5.css-18t94o4.css-1dbjc4n.r-1loqt21.r-1wbh5a2.r-dnmrzs.r-1ny4l3l")
for username in usernames:
    print(username.get_attribute("href"))
Right now the code scrolls 5 times. I have set a static value, but I don't know how many scrolls are needed to reach the bottom of the page.
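A common alternative to a fixed scroll count is to keep scrolling until the page height stops growing. The sketch below relies only on `driver.execute_script`, which is the same call used in the question. The function name, the pause length, and the safety cap `max_rounds` are choices made for this example.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=100):
    """Scroll until document.body.scrollHeight stops changing.

    Returns the number of scroll rounds performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds + 1  # nothing new was loaded, assume the bottom
        last_height = new_height
    return max_rounds  # safety cap reached before the height stabilized
```

If the page loads slowly, a pause that is too short can make the height look stable before new items arrive, so the pause may need tuning. Some pages also drop earlier items from the DOM while scrolling, in which case collecting the usernames inside the loop would be safer than collecting them once at the end.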

selenium selenium-chromedriver

Author

2021 02-24

2
Score

1
Answers

4271
Views

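Web scraping with Python and the newspaper3k lib returns no data

I have installed the newspaper3k lib on my Mac with sudo pip3 install newspaper3k. I use Python 3. I want to return the data supported by the Article object, i.e. url, date, title, text, summary and keywords, but I do not get any data: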

import newspaper
from newspaper import Article

#creating website for scraping
cnn_paper = newspaper.build('https://www.euronews.com/', memoize_articles=False)
#I have tried for https://www.euronews.com/, https://edition.cnn.com/, https://www.bbc.com/

for article in cnn_paper.articles:
    article_url = article.url #works
    news_article = Article(article_url)#works
    print("OBJECT:", news_article, '\n')#works
    print("URL:", article_url, '\n')#works
    print("DATE:", news_article.publish_date, '\n')#does not work
    print("TITLE:", news_article.title, '\n')#does not work
    print("TEXT:", news_article.text, '\n')#does not work
    print("SUMMARY:", news_article.summary, '\n')#does not work
    print("KEYWORDS:", news_article.keywords, '\n')#does not work
    print() …
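In the snippet above, the `Article` objects are created but never fetched, which would explain the empty fields: in newspaper3k, `download()` and `parse()` fill in the title, text and date, and `nlp()` fills in the summary and keywords. The helper below is a small sketch of that call order. `load_article` is a name invented here, and it accepts any object with those methods so that it can be tried without network access. Note that `nlp()` may additionally need NLTK tokenizer data installed.

```python
def load_article(article, with_nlp=True):
    """Fetch and parse an article, then return its fields as a dict.

    `article` is expected to behave like newspaper.Article(url).
    """
    article.download()  # fetch the HTML
    article.parse()     # extract title, text, publish_date
    if with_nlp:
        article.nlp()   # compute summary and keywords
    return {
        "url": article.url,
        "date": article.publish_date,
        "title": article.title,
        "text": article.text,
        "summary": article.summary,
        "keywords": article.keywords,
    }
```

Inside the loop from the question, this would be used as `info = load_article(Article(article_url))`, ideally wrapped in a try/except so that one article failing to download does not stop the run.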

python web-scraping python-newspaper newspaper3k

tag*_*aga

2021 02-09

2
Score

1
Answers

4236
Views

Tag statistics

python ×5

web-scraping ×4

beautifulsoup ×3

prometheus ×2

python-3.x ×2

python-requests ×2

apache-kafka ×1

go ×1

google-sheets ×1

html ×1

http-status-code-403 ×1

jmx-exporter ×1

kafka-consumer-api ×1

multithreading ×1

newspaper3k ×1

python-newspaper ×1

selenium ×1

selenium-chromedriver ×1

xpath ×1

yfinance ×1
