Scraping data from a website that uses Power BI - retrieving data from Power BI on a website

Question


am.*_*rez 16 python selenium web-scraping powerbi

I want to scrape data from this page (and pages like it): https://cereals.ahdb.org.uk/market-data-centre/historical-data/feed-ingredients.aspx

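This page uses Power BI. Unfortunately, it is hard to find a way to scrape Power BI, because everyone wants to scrape data into Power BI rather than out of it. The closest answer I found was this question, but it is not relevant.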

First I used Apache Tika, and soon realized that the table data is loaded after the page itself has loaded. I need the rendered version of the page.

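So I switched to Selenium. I wanted to start with Select All (sending the Ctrl+A key combination), but it did not work. It may be blocked by the page's event handlers (I also tried removing all events with the developer tools, but Ctrl+A still did not work).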

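I also tried reading the HTML content, but Power BI places its div elements on the screen with position:absolute, and working out where each div sits in the table (its row and column) is laborious.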

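Since Power BI uses JSON, I tried to read the data from there. However, it is so complex that I could not work out the rules. It seems to store the keywords in one place and refer to them by index in the table.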
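The pattern guessed at above (values stored once in a lookup list, with table cells holding only indexes into it) is commonly called dictionary encoding. The sketch below only illustrates that idea. The names `dictionaries`, `column_dictionary` and `encoded_rows`, as well as the sample values, are made up for the example and are not Power BI's actual JSON schema.

```python
# Hypothetical lookup lists: each distinct text value is stored once.
dictionaries = {
    "commodity": ["Wheat", "Barley"],
    "month": ["2018-01", "2018-02"],
}

# Hypothetical column layout: which lookup list each column uses.
# None means the cell already holds a literal value.
column_dictionary = ["commodity", "month", None]

# Hypothetical encoded table: the first two cells of each row are indexes.
encoded_rows = [
    [0, 0, 151.5],
    [1, 0, 140.0],
    [0, 1, 153.25],
]


def decode_rows(rows, columns, lookups):
    """Replace every index cell with the value it points to."""
    decoded = []
    for row in rows:
        decoded_row = []
        for cell, lookup_name in zip(row, columns):
            if lookup_name is None:
                decoded_row.append(cell)  # literal value, keep as-is
            else:
                decoded_row.append(lookups[lookup_name][cell])
        decoded.append(decoded_row)
    return decoded


decoded_rows = decode_rows(encoded_rows, column_dictionary, dictionaries)
```

If the real payload follows a similar scheme, the work would mostly be locating the lookup lists and the per-column mapping in the JSON; the decoding itself stays this simple.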

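Note: I realized that not all of the data is loaded, or even displayed, at the same time. A div with the class scroll-bar-part-bar acts as the scroll bar, and moving it loads/shows other parts of the data.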
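Because only a slice of the table is rendered at any moment, one possible approach is to scrape the visible rows after each scroll step and stitch the slices together. The sketch below covers only the stitching, and it assumes that each scrape yields a list of row tuples and that consecutive scrapes overlap by a few rows. The function name `merge_scrolled_batches` and the sample rows are hypothetical.

```python
def merge_scrolled_batches(batches):
    """Concatenate row batches, dropping rows repeated where batches overlap."""
    merged = []
    for batch in batches:
        # Find the longest tail of `merged` that equals the head of `batch`.
        overlap = 0
        for size in range(min(len(merged), len(batch)), 0, -1):
            if merged[-size:] == batch[:size]:
                overlap = size
                break
        merged.extend(batch[overlap:])
    return merged


# Made-up rows, as they might be seen at three scroll positions.
batches = [
    [("a", 1), ("b", 2), ("c", 3)],
    [("b", 2), ("c", 3), ("d", 4)],
    [("d", 4), ("e", 5)],
]
all_rows = merge_scrolled_batches(batches)
```

Matching on the overlapping run, rather than discarding every repeated row, keeps rows that legitimately occur more than once in the table.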

The code I used to read the data is below. As mentioned, the data comes back in a different order from the one rendered in the browser:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.binary_location = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe"
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service("C:/Drivers/chromedriver.exe"), options=options)

# Open the Power BI report that the page embeds
driver.get("https://app.powerbi.com/view?r=eyJrIjoiYjVjM2MyNjItZDE1Mi00OWI1LWE5YWYtODY4M2FhYjU4ZDU1IiwidCI6ImExMmNlNTRiLTNkM2QtNDM0Ni05NWVmLWZmMTNjYTVkZDQ3ZCJ9")
# Locate the container of the table visual, then collect all of its descendants
parent = driver.find_element(By.XPATH, '//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div')
children = parent.find_elements(By.XPATH, './/*')
# Read each element's title attribute
values = [child.get_attribute('title') for child in children]


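I would appreciate a solution to any of the problems above. What interests me most, though, is the convention Power BI uses to store its data in JSON format.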

Answer 1

am.*_*rez 7

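Leaving the scrolling and the JSON aside, I managed to read the data. The key is to read all of the elements inside the parent (already done in the question):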

from selenium.webdriver.common.by import By

parent = driver.find_element(By.XPATH, '//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div')
children = parent.find_elements(By.XPATH, './/*')


Then sort them by their position on the screen:

import numpy as np

x = [child.location['x'] for child in children]
y = [child.location['y'] for child in children]
# lexsort uses the last key as the primary one: sort by y, then by x
index = np.lexsort((x, y))
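To see what this ordering does, the same sort can be reproduced with the standard library on a handful of made-up coordinates (the values below are illustrative, not taken from the page). `np.lexsort((x, y))` treats its last key, `y`, as the primary key, so it is equivalent to sorting the indexes by the tuple `(y, x)`:

```python
# Made-up positions of four cells, listed in the scrambled order the DOM might return.
x = [120, 10, 10, 120]
y = [40, 40, 20, 20]

# Sort indexes top-to-bottom (y), then left-to-right (x),
# matching the ordering produced by np.lexsort((x, y)).
order = sorted(range(len(x)), key=lambda i: (y[i], x[i]))
print(order)  # [2, 3, 1, 0]
```

The result lists the cells in reading order: the two cells on the upper line first, each line running left to right.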


To split what we have read into separate rows, this code may help:

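rows = []
row = []
last_line = y[index[0]]
for i in index:
    if last_line == y[i]:
        # Same y coordinate: the element belongs to the current row
        row.append(children[i].get_attribute('title'))
    else:
        # The y coordinate changed: close the current row and start a new one
        rows.append(row)
        row = [children[i].get_attribute('title')]
        last_line = y[i]
# Do not forget the last row
rows.append(row)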
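The loop above starts a new row whenever the y coordinate changes at all. If cells on the same visual line turn out to differ by a pixel or so, a tolerance-based variant may be more forgiving. The following is a minimal sketch on plain `(x, y, text)` tuples rather than Selenium elements; the function name `group_into_rows`, the 2-pixel default and the sample cells are assumptions made for illustration.

```python
def group_into_rows(cells, tolerance=2):
    """Group (x, y, text) cells into rows, treating nearby y values as one line."""
    rows = []
    current = []
    row_y = None
    # Walk the cells from top to bottom.
    for x, y, text in sorted(cells, key=lambda cell: cell[1]):
        if row_y is not None and abs(y - row_y) > tolerance:
            # Too far below the current line: close it and start a new one.
            rows.append(current)
            current = []
            row_y = None
        if row_y is None:
            row_y = y  # the first cell of a line fixes its reference y
        current.append((x, text))
    if current:
        rows.append(current)
    # Within each line, order the cells from left to right and keep only the text.
    return [[text for _, text in sorted(row, key=lambda c: c[0])] for row in rows]


# Made-up cells in scrambled order, with one pixel of vertical jitter on the first line.
cells = [(120, 21, "151.5"), (10, 20, "Wheat"), (10, 40, "Barley"), (121, 40, "140.0")]
table = group_into_rows(cells)
```

With Selenium, the tuples could be built from each element's `location` and `title`, as in the snippets above.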

