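I have recently been working on a Python project that scrapes a few websites for proxies. The problem is that when I scrape one well-known proxy site, Beautiful Soup does not do what I expect when I ask it for the IPs in the proxy table. I look up the cell that holds each proxy's IP, and when I call Beautiful Soup's .get_text() method on that element, I get output like this: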
...
.UbZT{display:none}
.f5fa{display:inline}
.Glj2{display:none}
.cUce{display:inline}
.zjUZ{display:none}
.GzLS{display:inline}
98120169.117.186373161218218.83839393101138154165203242
...
Here is the element I am trying to parse (the td tag that contains the IP):
<td><span><style>
.lLXJ{display:none}
.qRCB{display:inline}
.qC69{display:none}
.V0zO{display:inline}
</style><span style="display: inline">190</span><span class="V0zO">.</span><span
style="display:none">2</span><div style="display:none">20</div><span
style="display:none">51</span><span style="display:none">56</span><div
style="display:none">56</div><span style="display:none">61</span><span
class="lLXJ">61</span><div style="display:none">61</div><span
class="qC69">110</span><div
style="display:none">110</div><span style="display:none">135</span><div
style="display:none">135</div><span class="V0zO">221</span><span
style="display:none">234</span><div style="display:none">234</div><span class="147">.
</span><span style="display: inline">29</span><div style="display:none">44</div><span
style="display:none">228</span><span></span><span class="qC69">248</span>.<span
style="display:none">7</span><span></span><span style="display:none">44</span><span
class="qC69">44</span><span class="qC69">80</span><span></span><span
style="display:none">85</span><span class="lLXJ">85</span><div
style="display:none">85</div><span class="qC69">100</span><div
style="display:none">100</div><span></span><span class="qC69">130</span><div
style="display:none">130</div><div style="display:none">168</div>212<span
style="display:none">230</span><span class="qC69">230</span><div
style="display:none">230</div></span></td>
The text this element actually displays is just the proxy's IP.
Here is a snippet of my code:
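import requests
# Assumed import: the snippet refers to the BeautifulSoup class as Soup
from bs4 import BeautifulSoup as Soup

# Hide My Ass
pages = ['https://www.hidemyass.com/proxy-list']
for page in pages:
    # Name the parser explicitly so the result does not depend on which parsers are installed
    hidemyass = Soup(requests.get(page).text, 'html.parser')
    rows = hidemyass.find_all(lambda …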
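The unexpected output comes from .get_text() joining every string inside the cell, including the decoy numbers that the page hides with CSS (and, depending on the Beautiful Soup version, the style rules themselves). The following is a minimal sketch of one possible way to keep only the displayed pieces. It assumes decoys are hidden either by an inline `display:none` style or by a class that the embedded `<style>` block maps to `display:none`. The helper names `hidden_classes` and `visible_text` are hypothetical, and `SAMPLE` is a shortened version of the element shown above.

```python
import re

from bs4 import BeautifulSoup, NavigableString, Tag


def hidden_classes(style_text):
    """Return the class names that the CSS text maps to display:none."""
    pattern = r"\.([\w-]+)\s*\{\s*display\s*:\s*none\s*;?\s*\}"
    return set(re.findall(pattern, style_text))


def visible_text(cell):
    """Join only the strings of `cell` that are not hidden by CSS."""
    # Collect hidden classes from every <style> block inside the cell.
    hidden = set()
    for style_tag in cell.find_all("style"):
        hidden |= hidden_classes(style_tag.string or "")

    def walk(node):
        parts = []
        for child in node.children:
            if isinstance(child, NavigableString):
                parts.append(str(child))
            elif isinstance(child, Tag):
                if child.name == "style":
                    continue  # CSS rules are not part of the displayed text
                inline = child.get("style", "").replace(" ", "").lower()
                classes = child.get("class", [])
                if "display:none" in inline or hidden.intersection(classes):
                    continue  # decoy element, skip its whole subtree
                parts.append(walk(child))
        return "".join(parts)

    # Drop stray whitespace such as the newline inside some spans.
    return "".join(walk(cell).split())


# Shortened, hypothetical sample modelled on the td element in the question.
SAMPLE = (
    '<td><span><style>\n.lLXJ{display:none}\n.V0zO{display:inline}\n</style>'
    '<span style="display: inline">190</span><span class="V0zO">.</span>'
    '<span style="display:none">2</span><span class="lLXJ">61</span>'
    '<span class="V0zO">221</span><span class="147">.\n</span>'
    '<span style="display: inline">29</span><div style="display:none">44</div>'
    '.<span></span>212<span style="display:none">230</span></span></td>'
)

cell = BeautifulSoup(SAMPLE, "html.parser").td
print(visible_text(cell))  # 190.221.29.212
```

In the scraping loop, `visible_text` could be called on each IP cell in place of `.get_text()`. If the site hides decoys by some other mechanism (for example an external stylesheet or JavaScript), this sketch would not catch them.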