Optimizing BeautifulSoup (Python) code

Question

dev*_*per 5 python optimization beautifulsoup

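I have code that uses the BeautifulSoup library for parsing, but it is very slow. The code is written in such a way that threads cannot be used. Can anyone help me with this?

I use BeautifulSoup for parsing and then save the results to a database. If I comment out the save statement, it still takes a long time, so the database is not the problem.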

def parse(self,text):                
    soup = BeautifulSoup(text)
    arr = soup.findAll('tbody')                

    for i in range(0,len(arr)-1):
        data=Data()
        soup2 = BeautifulSoup(str(arr[i]))
        arr2 = soup2.findAll('td')

        c=0
        for j in arr2:                                       
            if str(j).find("<a href=") > 0:
                data.sourceURL = self.getAttributeValue(str(j),'<a href="')
            else:  
                if c == 2:
                    data.Hits=j.renderContents()

            #and few others...

            c = c+1

            data.save()

Any suggestions?

Note: I already asked this question here, but it was closed due to incomplete information.

Answer 1

int*_*jay 6

soup2 = BeautifulSoup(str(arr[i]))
arr2 = soup2.findAll('td')

Don't do this. Just call arr2 = arr[i].findAll('td') instead.

This will also be slow:
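As a minimal sketch of why the second parse is unnecessary: each element returned by the first search is already a searchable tag. The example below assumes the modern bs4 package (where find_all is the current spelling of findAll) and uses made-up sample markup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, for illustration only.
html = """
<table>
  <tbody><tr><td>a</td><td>b</td></tr></tbody>
  <tbody><tr><td>c</td></tr></tbody>
</table>
"""

# Parse the document once.
soup = BeautifulSoup(html, "html.parser")

# Each tbody is a Tag, so it can be searched directly;
# no str() conversion and no second BeautifulSoup() call is needed.
cells_per_tbody = [
    [td.get_text() for td in tbody.find_all("td")]
    for tbody in soup.find_all("tbody")
]

print(cells_per_tbody)
```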

if str(j).find("<a href=") > 0:
    data.sourceURL = self.getAttributeValue(str(j),'<a href="')

Assuming getAttributeValue gives you the href attribute, use this instead:

a = j.find('a', href=True)       #find first <a> with href attribute
if a:
    data.sourceURL = a['href']
else:
    #....

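In general, if all you want is to parse the document and extract values, you don't need to convert BeautifulSoup objects back to strings. Because find and findAll return searchable objects, you can keep searching by calling find, findAll, and similar methods on the results.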
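Putting both suggestions together, the loop from the question could look roughly like the sketch below. It is not the asker's actual code: it assumes the modern bs4 package, the function name parse_rows and the sample markup are made up, and a plain dict stands in for the Data model so the example runs without a database. Unlike the original, it visits every tbody (the original range stopped one short) and reads the cell text with get_text() rather than renderContents().

```python
from bs4 import BeautifulSoup


def parse_rows(text):
    """Return one dict per <tbody>, parsing the document only once."""
    soup = BeautifulSoup(text, "html.parser")
    rows = []
    for tbody in soup.find_all("tbody"):
        # A dict stands in for the question's Data object (assumption).
        row = {"sourceURL": None, "Hits": None}
        for index, td in enumerate(tbody.find_all("td")):
            # Search the cell directly instead of scanning str(td).
            link = td.find("a", href=True)
            if link is not None:
                row["sourceURL"] = link["href"]
            elif index == 2:
                row["Hits"] = td.get_text(strip=True)
        rows.append(row)
    return rows


# Hypothetical sample markup, for illustration only.
sample = """
<table>
  <tbody>
    <tr><td><a href="http://example.com/page">page</a></td><td>x</td><td>42</td></tr>
  </tbody>
</table>
"""

result = parse_rows(sample)
print(result)
```

In the real code, the place where the dict is appended is where data.save() would go, once per tbody.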
