Mat*_*Law 0 python beautifulsoup
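I'm trying to use BeautifulSoup to scrape data from a web page and (eventually) output it to a CSV. As a first step, I tried to get the text of the relevant table. I managed to do that, but when I re-ran the code it no longer gave me the same output: when the for loop runs, it doesn't save all 12372 records, only the last one.
An abbreviated version of my code is: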
from bs4 import BeautifulSoup
BirthsSoup = BeautifulSoup(browser.page_source, features="html.parser")
print(BirthsSoup.prettify())
# this confirms that the soup has captured the page as I want it to
birthsTable = BirthsSoup.select('#t2 td')
# selects all the elements in the table I want
birthsLen = len(birthsTable)
# birthsLen: 12372
for i in range(birthsLen):
    print(birthsTable[i].prettify())
# this confirms that the beautifulsoup tag object correctly captured all of the table
for i in range(birthsLen):
    birthsText = birthsTable[i].getText()
# this was supposed to compile the text for every element in the table
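But the for loop only saves the text of the last (i.e. the 12372nd) element in the table. Do I need to do something else so that it saves every element as it loops through? I think my earlier (desired) output contained the text of each element on its own line.
This is my first time using Python, so apologies if I've made an obvious mistake.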
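You are overwriting your birthsText string on every iteration, so by the end only the last value is left. To fix this, create a list and append each element's text to it: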
birthsLen = len(birthsTable)
birthsText = []
for i in range(birthsLen):
    birthsText.append(birthsTable[i].getText())
Or, more concisely:
birthsText = [line.getText() for line in birthsTable]
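To see why only the last value survives, here is a minimal sketch that uses plain strings in place of the BeautifulSoup tags. The `cells` list is a made-up stand-in for the text each `getText()` call would return, not data from the actual page. Rebinding a name inside a loop discards the previous value, while appending to a list keeps them all:

```python
# Hypothetical stand-in for the text of each <td> element.
cells = ["1901", "Smith", "1902", "Jones"]

# Rebinding the same name on every pass: only the last value is left.
for cell in cells:
    last_only = cell

# Appending to a list: every value is kept, in order.
all_text = []
for cell in cells:
    all_text.append(cell)

print(last_only)            # Jones
print("\n".join(all_text))  # one cell per line
```

Joining the list with "\n" should give output similar to what the question describes as the desired result, with each element's text on its own line.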
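Since the stated end goal is a CSV, one possible next step is sketched below using only the standard library. It assumes the flat list of cell text can be cut into rows of a fixed width; the sample values, the column count `n_cols`, and the file name `births.csv` mentioned in the comment are all assumptions for illustration, since the real table layout isn't shown in the question.

```python
import csv
import io

# Hypothetical flat list of cell text, as birthsText would hold it.
births_text = ["1901", "Smith", "Leeds", "1902", "Jones", "York"]
n_cols = 3  # assumed number of columns in the table

# Slice the flat list into rows of n_cols cells each.
rows = [births_text[i:i + n_cols] for i in range(0, len(births_text), n_cols)]

# Write to an in-memory buffer here; for a real file (hypothetical name),
# use open("births.csv", "w", newline="") instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

If the table has a varying number of cells per row, selecting the `tr` elements first and reading the cells of each row separately would likely be more robust than slicing by a fixed width.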