How do I get all elements of each href in HTML as rows and add them to a pandas DataFrame?

Alv*_*nez 5 python dataframe web-scraping pandas python-requests

I am trying to get the different values inside each href element from the following website: https://www.bmv.com.mx/es/mercados/capitales


There should be 1 row for each distinct href element in the HTML file, matching every field of the provided headers.


This is one part of the HTML that I am trying to scrape:

    <tbody>
      <tr role="row" class="odd">
        <td class="sorting_1"><a href="/es/mercados/cotizacion/1959">AC</a></td>
        <td><span class="series">*</span></td><td>03:20</td><td><span class="color-2">191.04</span></td>
        <td>191.32</td><td>194.51</td><td>193.92</td><td>191.01</td><td>380,544</td>
        <td>73,122,008.42</td><td>2,793</td><td>-3.19</td><td>-1.64</td>
      </tr>
      <tr role="row" class="even">
        <td class="sorting_1"><a href="/es/mercados/cotizacion/203">ACCELSA</a></td>
        <td><span class="series">B</span></td><td>03:20</td><td><span class="">22.5</span></td>
        <td>0</td><td>22.5</td><td>0</td><td>0</td><td>3</td>
        <td>67.20</td><td>1</td><td>0</td><td>0</td>
      </tr>
      <tr role="row" class="odd">
        <td class="sorting_1"><a href="/es/mercados/cotizacion/6096">ACTINVR</a></td>
        <td><span class="series">B</span></td><td>03:20</td><td><span class="">15.13</span></td>
        <td>0</td><td>15.13</td><td>0</td><td>0</td><td>13</td>
        <td>196.69</td><td>4</td><td>0</td><td>0</td>
      </tr>
      <tr role="row" class="even">
        <td class="sorting_1"><a href="/es/mercados/cotizacion/339083">AGUA</a></td>
        <td><span class="series">*</span></td><td>03:20</td><td><span class="color-1">29</span></td>
        <td>28.98</td><td>28.09</td><td>29</td><td>28</td><td>296,871</td>
        <td>8,491,144.74</td><td>2,104</td><td>0.89</td><td>3.17</td>
      </tr>
      <tr role="row" class="odd">
        <td class="sorting_1"><a href="/es/mercados/cotizacion/30">ALFA</a></td>
        <td><span class="series">A</span></td><td>03:20</td><td><span class="color-2">13.48</span></td>
        <td>13.46</td><td>13.53</td><td>13.62</td><td>13.32</td><td>2,706,398</td>
        <td>36,494,913.42</td><td>7,206</td><td>-0.07</td><td>-0.52</td>
      </tr>
      <tr role="row" class="even">
        <td class="sorting_1"><a href="/es/mercados/cotizacion/7684">ALPEK</a></td>
        <td><span class="series">A</span></td><td>03:20</td><td><span class="color-2">10.65</span></td>
        <td>10.64</td><td>10.98</td><td>10.88</td><td>10.53</td><td>1,284,847</td>
        <td>13,729,368.46</td><td>6,025</td><td>-0.34</td><td>-3.10</td>
      </tr>
      <tr role="row" class="odd">
        <td class="sorting_1"><a href="/es/mercados/cotizacion/1729">ALSEA</a></td>
        <td><span class="series">*</span></td><td>03:20</td><td><span class="color-2">65.08</span></td>
        <td>64.94</td><td>65.44</td><td>66.78</td><td>64.66</td><td>588,826</td>
        <td>38,519,244.51</td><td>4,442</td><td>-0.5</td><td>-0.76</td>
      </tr>
      <tr role="row" class="even">
        <td class="sorting_1"><a href="/es/mercados/cotizacion/424518">ALTERNA</a></td>
        <td><span class="series">B</span></td><td>03:20</td><td><span class="">1.5</span></td>
        <td>0</td><td>1.5</td><td>0</td><td>0</td><td>2</td>
        <td>3</td><td>1</td><td>0</td><td>0</td>
      </tr>
      <tr role="row" class="odd">
        <td class="sorting_1"><a href="/es/mercados/cotizacion/1862">AMX</a></td>
        <td><span class="series">B</span></td><td>03:20</td><td><span class="color-2">14.56</span></td>
        <td>14.58</td><td>14.69</td><td>14.68</td><td>14.5</td><td>86,023,759</td>
        <td>1,254,412,623.59</td><td>41,913</td><td>-0.11</td><td>-0.75</td>
      </tr>
      <tr role="row" class="even">
        <td class="sorting_1"><a href="/es/mercados/cotizacion/6507">ANGELD</a></td>
        <td><span class="series">10</span></td><td>03:20</td><td><span class="color-2">21.09</span></td>
        <td>21.1</td><td>21.44</td><td>21.23</td><td>21.09</td><td>51,005</td>
        <td>1,076,281.67</td><td>22</td><td>-0.34</td><td>-1.59</td>
      </tr>
    </tbody>

My current code results in an empty dataframe:

    # create empty pandas dataframe
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup


    # get response code from webhost
    page = requests.get('https://www.bmv.com.mx/es/mercados/capitales')
    soup = BeautifulSoup(page.text, 'lxml')
    # print(soup.p.text)
    # yet it doesn't bring the expected rows!

    print('Read html!')

    # get headers
    tbody = soup.find("thead")
    tr = tbody.find_all("tr")

    headers = [t.get_text().strip().replace('\n', ',').split(',') for t in tr][0]
    # print(headers)

    df = pd.DataFrame(columns=headers)

    # fetch rows into pandas dataframe
    # You can find children with multiple tags by passing a list of strings
    rows = soup.find_all('tr', {"role": "row"})
    # rows

    for row in rows:
        cells = row.findChildren('td')
        for cell in cells:
            value = cell.string
            # print("The value in this cell is %s" % value)

            # append row in dataframe

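I would like to know whether it is possible to get a pandas dataframe whose fields are the ones described in the header list and whose rows are the elements belonging to each href.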


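For a better perspective, the expected output should equal the table at the bottom of the provided website. Its first row has the following schema: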

    EMISORA  SERIE  HORA  ÚLTIMO  PPP     ANTERIOR  MÁXIMO  MÍNIMO  VOLUMEN  IMPORTE        OPS.   VAR PUNTOS  VAR %
    AC       *      3:20  191.04  191.32  194.51    193.92  191.01  380,544  73,122,008.42  2,793  -3.19       -1.64

Is it possible to create such a dataset?


Hed*_*Hog 4

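As already mentioned, the table is loaded and rendered dynamically via JavaScript, which you cannot handle with requests, because it only fetches the static response and does not behave like a browser.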


@thetaco gave a solution that mimics browser behaviour using selenium, but you can also achieve your goal with requests by going to the source the data comes from:

  1. Get the request URL by inspecting the network traffic with your browser's developer tools. In this example it is: https://www.bmv.com.mx/es/Grupo_BMV/BmvJsonGeneric?idSitioPagina=4

  2. Extract the string from the response (it is not valid JSON):

    requests.get('https://www.bmv.com.mx/es/Grupo_BMV/BmvJsonGeneric?idSitioPagina=4').text.split(';(', 1)[-1].split(')')[0]

  3. Convert the string to JSON with json.loads() and turn it into a dataframe with pandas.json_normalize(). Your data sits under the path ['response']['resultado']['A'].

  4. The column names may differ slightly, because they are built from the JSON keys, but they can easily be mapped.

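The response contains everything, including the contents of the other groups (ACCIONES, CKD'S, FIBRAS, TÍTULOS OPCIONALES), which can be extracted as well. Their abbreviations (A, CKDS, F, TO) can be used for selection in the same way.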
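
Selecting several groups at once could look roughly like the sketch below. It works on a hypothetical, heavily shortened payload (the DEMO record and the helper names flatten and collect_groups are made up for illustration), and it flattens nested keys into dotted names in the same spirit as pandas.json_normalize():

```python
def flatten(record, parent=""):
    """Flatten nested dicts into one level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def collect_groups(resultado, groups=("A", "CKDS", "F", "TO")):
    """Return one flat row per record, labelled with the group it came from."""
    rows = []
    for group in groups:
        # .get() tolerates groups that are missing from the payload
        for record in resultado.get(group, []):
            row = flatten(record)
            row["grupo"] = group
            rows.append(row)
    return rows


# Hypothetical sample standing in for ['response']['resultado'];
# the real response carries many more keys and records.
resultado = {
    "A": [{"cveCorta": "AC", "cveSerie": "*",
           "datosEstadistica": {"hora": "03:20", "ppp": 191.32}}],
    "F": [{"cveCorta": "DEMO", "cveSerie": "14",
           "datosEstadistica": {"hora": "03:20", "ppp": 10.0}}],
}

rows = collect_groups(resultado)
```

The resulting list of dicts can be passed to pandas.DataFrame(rows) to obtain a single frame covering all selected groups.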

Example (all available information on ACCIONES from the XHR request):

    import json

    import pandas as pd
    import requests

    url = 'https://www.bmv.com.mx/es/Grupo_BMV/BmvJsonGeneric?idSitioPagina=4'

    # the body is not valid JSON: keep only what sits between ';(' and the first ')'
    raw = requests.get(url).text.split(';(', 1)[-1].split(')')[0]

    df = (
        pd.json_normalize(json.loads(raw)['response']['resultado']['A'])
        .dropna(axis=1, how='all')  # drop columns that are empty in every row
    )
| | idEmision | idTpvalor | cveSerie | cveCorta | idEmisora | datosEstadistica.hora | datosEstadistica.maximo | datosEstadistica.minimo | datosEstadistica.importeAcomulado | datosEstadistica.noOperaciones | datosEstadistica.variacionPuntos | datosEstadistica.variacionPorcentual | datosEstadistica.precioUltimoHecho | datosEstadistica.ppp | datosEstadistica.precioAnterior | datosEstadistica.volumenOperado | datosEstadistica.anioEjercicio | datosEstadistica.insumosPu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1959 | 1 | * | AC | 6081 | 03:20 | 192.98 | 189.01 | 9.54831e+07 | 3333 | -2.59 | -1.35 | 189.3 | 189.32 | 191.91 | 502297 | 0 | 0 |
| 1 | 203 | 1 | B | ACCELSA | 5015 | 03:20 | 0 | 0 | 22.4 | 1 | 0 | 0 | 22.5 | 0 | 22.5 | 1 | 0 | 0 |
| ... |
| 103 | 404833 | 1B | 19 | VMEX | 34347 | 03:20 | 45.29 | 45.29 | 11007.9 | 8 | 0.14 | 0.31 | 45.29 | 0 | 45.15 | 243 | 0 | 0 |
| 104 | 327336 | 1 | A | VOLAR | 30023 | 03:20 | 12.76 | 12.42 | 1.5744e+07 | 5006 | 0.24 | 1.93 | 12.67 | 12.68 | 12.44 | 1246397 | 0 | 0 |
| 105 | 5 | 1 | * | WALMEX | 5214 | 03:20 | 70.37 | 67.83 | 1.21326e+09 | 19593 | -2.02 | -2.86 | 68.7 | 68.72 | 70.74 | 17639588 | 0 | 0 |
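
The split-based extraction stops at the first ')' it meets, so it would cut the payload short if any text value ever contained a closing parenthesis. A slightly more defensive variant is sketched below. It assumes only that the body wraps a single JSON object in some non-JSON prefix and suffix; the sample string is a hypothetical stand-in for the real response, and extract_payload is a made-up helper name:

```python
import json


def extract_payload(text):
    """Return the JSON object embedded in a wrapped response body."""
    start = text.find("{")   # first opening brace
    end = text.rfind("}")    # last closing brace
    if start == -1 or end < start:
        raise ValueError("no JSON object found in response text")
    return json.loads(text[start:end + 1])


# Hypothetical sample; the wrapper shape is an assumption inferred from
# the ';(' and ')' delimiters used in the split-based one-liner.
sample = 'for(;;);({"response": {"resultado": {"A": [{"cveCorta": "AC", "nota": "demo (x)"}]}}})'

payload = extract_payload(sample)
records = payload["response"]["resultado"]["A"]
```

With the live endpoint, the text attribute of the requests response would take the place of sample.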

To get closer to your expected result, you can post-process the dataframe as needed:

    import re

    # exclude all columns referencing an id
    df = df.loc[:, ~df.columns.str.startswith('id')]
    # adjust the column names: split camelCase, keep the part after the last dot,
    # drop the literal 'cve' prefix (str.removeprefix needs Python 3.9+)
    df.columns = [
        re.sub(r"(?<=\w)([A-Z])", r" \1", c).split('.')[-1].removeprefix('cve').strip().upper()
        for c in df.columns
    ]
    df

| SERIE | CORTA | HORA | MAXIMO | MINIMO | IMPORTE ACOMULADO | NO OPERACIONES | VARIACION PUNTOS | VARIACION PORCENTUAL | PRECIO ULTIMO HECHO | PPP | PRECIO ANTERIOR | VOLUMEN OPERADO | ANIO EJERCICIO | INSUMOS PU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| * | AC | 03:20 | 191.17 | 187.8 | 1.14863e+08 | 4175 | 0.64 | 0.34 | 189.65 | 189.96 | 189.32 | 604632 | 0 | 0 |
| B | ACTINVR | 03:20 | 15.03 | 15.03 | 36614.4 | 14 | 0 | 0 | 15.03 | 0 | 15.03 | 2436 | 0 | 0 |
| ... |
| A | VOLAR | 03:20 | 12.97 | 12.51 | 1.48613e+07 | 2832 | 0.07 | 0.55 | 12.83 | 12.75 | 12.68 | 1162684 | 0 | 0 |
| * | WALMEX | 03:20 | 69.03 | 67.66 | 7.2698e+08 | 22462 | -0.71 | -1.03 | 68 | 68.01 | 68.72 | 10672270 | 0 | 0 |
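
One detail worth keeping in mind when shortening the JSON-based names: str.lstrip('cve') treats its argument as a set of characters rather than as a prefix, so a name such as variacionPuntos would lose its leading v. A minimal sketch of a name cleaner that removes only the literal prefix follows; clean_column is a hypothetical helper, not part of pandas:

```python
import re


def clean_column(name, prefix="cve"):
    """Turn a flattened JSON key into an upper-case label."""
    last = name.split(".")[-1]            # keep the part after the last dot
    if last.startswith(prefix):           # remove the literal prefix only
        last = last[len(prefix):]
    # insert a space at every lower-case to upper-case boundary
    spaced = re.sub(r"(?<=[a-z])([A-Z])", r" \1", last)
    return spaced.strip().upper()


# lstrip removes any leading 'c', 'v' or 'e' characters, prefix or not
stripped = "variacionPuntos".lstrip("cve")   # 'ariacionPuntos'

labels = [clean_column(c) for c in (
    "cveSerie",
    "datosEstadistica.variacionPuntos",
    "datosEstadistica.volumenOperado",
)]
```

Applied to a dataframe, this could be used as df.columns = [clean_column(c) for c in df.columns].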

Or simply map the columns to get the exact column names:

    # apply this to the dataframe that still carries the JSON-based column names
    map_dict = {'cveSerie':'SERIE', 'cveCorta':'EMISORA', 'datosEstadistica.hora':'HORA', 'datosEstadistica.maximo':'MÁXIMO',
           'datosEstadistica.minimo':'MÍNIMO', 'datosEstadistica.importeAcomulado':'IMPORTE', 'datosEstadistica.noOperaciones':'OPS.', 'datosEstadistica.variacionPuntos':'VAR PUNTOS',
           'datosEstadistica.variacionPorcentual':'VAR %', 'datosEstadistica.precioUltimoHecho':'ÚLTIMO', 'datosEstadistica.ppp':'PPP',
           'datosEstadistica.precioAnterior':'ANTERIOR', 'datosEstadistica.volumenOperado':'VOLUMEN'}
    df.loc[:, [c for c in df.columns if c in map_dict]].rename(columns=map_dict)
| SERIE | EMISORA | HORA | MÁXIMO | MÍNIMO | IMPORTE | OPS. | VAR PUNTOS | VAR % | ÚLTIMO | PPP | ANTERIOR | VOLUMEN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| * | AC | 03:20 | 191.17 | 187.8 | 1.14863e+08 | 4175 | 0.64 | 0.34 | 189.65 | 189.96 | 189.32 | 604632 |
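
For completeness: when the rendered markup is already at hand, for instance as the page source of a browser-driven session such as the selenium approach mentioned above, the rows of a tbody fragment like the one in the question can be collected with the standard library alone. This is only a sketch; RowParser is a hypothetical class name, and the fragment is shortened to the first four cells of two rows from the question:

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collect the stripped text of every <td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


fragment = """
<tbody>
<tr role="row" class="odd"><td class="sorting_1"><a href="/es/mercados/cotizacion/1959">AC
</a></td><td><span class="series">*</span></td><td>03:20</td><td><span class="color-2">191.04</span></td></tr>
<tr role="row" class="even"><td class="sorting_1"><a href="/es/mercados/cotizacion/203">ACCELSA</a></td>
<td><span class="series">B</span></td><td>03:20</td><td><span class="">22.5</span></td></tr>
</tbody>
"""

parser = RowParser()
parser.feed(fragment)
rows = parser.rows
```

Each inner list holds the cell texts of one row, so pandas.DataFrame(rows, columns=headers) would build the frame, provided headers has one entry per cell.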

...
