Python requests.get() returns broken source code instead of the expected source code?

Question

Fat*_*ror 5 python python-3.x python-requests

I made a request to the Wikipedia page below. Specifically, I need to scrape the "results matrix" from https://en.wikipedia.org/wiki/2017%E2%80%9318_La_Liga#Results

selectedSeasonPage = requests.get('https://en.wikipedia.org/wiki/2017–18_La_Liga', features='html5lib')

Running pprint.pprint(selectedSeasonPage.text) and jumping to the source of the matrix shows that it is incomplete.

Snippet of the HTML returned by requests.get():

<table class="wikitable plainrowheaders" style="text-align:center;font-size:100%;">
.
.
<th scope="row" style="text-align:right;"><a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a></th>
<td style="font-weight: normal;background-color:transparent;">— </td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:transparent;"></td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">2–1</td>

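When the HTML returned by requests.get() is viewed in a browser, the matrix is incomplete there as well. An image was attached to the original post for reference.

Snippet from view-source, which is the desired output: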

<table class="wikitable plainrowheaders" style="text-align:center;font-size:100%;">
.
.
<a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a></th>
<td style="font-weight: normal;background-color:transparent;">&#8212;</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">3–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">0–2</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">2–1</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#BBF3FF;">1–0</td>
<td style="white-space:nowrap;font-weight: normal;background-color:#FFBBBB;">1–2</td>

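I am posting sample HTML for reference because the entire output is too long to post. I can post more specific parts if needed.

My question is: how can I get the entire source of the matrix without losing any values?

From what I understand of previous questions, requests cannot return the expected output if parts of the page are rendered by JavaScript. But this page appears to be plain HTML and CSS (at least the part I need). I cannot use Selenium because I need to scrape many pages. A solution using requests or an equivalent would be appreciated.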
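One way to test the "plain HTML" assumption is to look at the raw response body before any parser or pretty-printer touches it. The sketch below counts table cells whose text looks like a score. The helper name count_filled_score_cells and the sample string are made up for illustration; they are not part of the original post.

```python
import re


def count_filled_score_cells(raw_html: str) -> int:
    """Count <td> cells whose text looks like a score such as 3–1."""
    # Wikipedia writes scores with an en dash (U+2013); a plain hyphen
    # is accepted as well in case the markup differs.
    pattern = re.compile(r"<td[^>]*>\s*\d+[–-]\d+\s*</td>")
    return len(pattern.findall(raw_html))


# Hypothetical sample: two filled cells and one empty cell.
sample = (
    '<td style="white-space:nowrap;">3–1</td>'
    '<td style="white-space:nowrap;"></td>'
    '<td style="white-space:nowrap;">0–2</td>'
)
print(count_filled_score_cells(sample))
```

Calling the helper on the .text of the response gives a rough count for the whole page. If the scores are present in the raw text but missing after a later step, that later step (parser, printing, copy and paste) is the more likely culprit than JavaScript rendering.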

The requests version is 2.19.1. The Python version is 3.7.0.

What am I missing? I am new to this, and any help is appreciated.

Answer 1

小智 1

Almost your exact code, without the "features" parameter in the get call:

import requests
selectedSeasonPage = requests.get('https://en.wikipedia.org/wiki/2017–18_La_Liga')
print(selectedSeasonPage.text)

gives me:

<th scope="row" style="text-align:right;"><a href="/wiki/Deportivo_Alav%C3%A9s" title="Deportivo Alavés">Alavés</a>
</th>
<td style="font-weight:normal;background:transparent;">&#8212;</td>
<td style="white-space:nowrap;font-weight:normal;background:#BBF3FF;">3–1</td>
<td style="white-space:nowrap;font-weight:normal;background:#FBB;">0–1</td>
<td style="white-space:nowrap;font-weight:normal;background:#FBB;">0–2</td>
<td style="white-space:nowrap;font-weight:normal;background:#BBF3FF;">2–1</td>
<td style="white-space:nowrap;font-weight:normal;background:#BBF3FF;">1–0</td>
<td style="white-space:nowrap;font-weight:normal;background:#FBB;">1–2</td>
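For context, "features" looks like the parser argument of the BeautifulSoup constructor rather than something requests.get() accepts, so it probably belongs in a later parsing step. As a rough sketch of turning the returned HTML into rows of cell text, here is a version that uses only the standard library's html.parser instead of BeautifulSoup. The names RowCellParser, parse_rows and fetch_rows, and the shortened sample markup, are my own and only for illustration.

```python
from html.parser import HTMLParser


class RowCellParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed


def parse_rows(html):
    """Return every table row in the markup as a list of cell strings."""
    parser = RowCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def fetch_rows(url):
    """Download a page with requests and parse its table rows."""
    # Imported here so parse_rows can be tried without requests installed.
    import requests
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return parse_rows(response.text)


# Shortened, hypothetical version of the row shown above.
sample = """
<table><tr>
<th scope="row"><a href="/wiki/Deportivo_Alav%C3%A9s">Alavés</a></th>
<td>&#8212;</td>
<td>3–1</td>
<td>0–1</td>
</tr></table>
"""
print(parse_rows(sample))
```

An empty cell comes back as an empty string, so a row with missing scores is easy to spot. fetch_rows is not called here because it needs network access; it returns the rows of every table on the page, so the results matrix still has to be picked out of the list.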

Archived: 7 years, 3 months ago
Views: 87
Last activity: 7 years, 1 month ago