Get the text of all p elements in a div using BeautifulSoup

Question

Ren*_*ene 2 beautifulsoup web-scraping python-2.7

I'm trying to get the text (the content without markup) of all p elements in a given div:

import requests
from bs4 import BeautifulSoup

def getArticle(url):
    url = 'http://www.bbc.com/news/business-34421804'
    result = requests.get(url)
    c = result.content
    soup = BeautifulSoup(c)

    article = []
    article = soup.find("div", {"class":"story-body__inner"}).findAll('p')
    for element in article:
        article = ''.join(element.findAll(text = True))
    return article

The problem is that this returns only the contents of the last paragraph. But if I just use print, the code works perfectly:

    for element in article:
        print ''.join(element.findAll(text = True))
    return

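I want to call this function elsewhere, so I need it to return the text, not just print it. I searched Stack Overflow and googled a lot, but found no answer, and I don't understand what the problem might be. I'm using Python 2.7.9 and bs4. Thanks in advance!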

Answer 1

Vik*_*jha 7

The following code should work:

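import requests
from bs4 import BeautifulSoup

def getArticle(url):
    # Hard-coded as in the question; note that this overrides the url argument.
    url = 'http://www.bbc.com/news/business-34421804'
    result = requests.get(url)
    c = result.content
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(c, 'html.parser')

    # Accumulate the text in its own variable instead of reusing "article".
    article_text = ''
    article = soup.find("div", {"class":"story-body__inner"}).findAll('p')
    for element in article:
        # Append each paragraph's text rather than overwriting the result.
        article_text += '\n' + ''.join(element.findAll(text = True))
    return article_text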

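Your code has a couple of problems:

The same variable name, "article", is used to store both the elements and the text.
The variable to be returned is only assigned a value on each iteration, never appended to, so only the last value remains in it.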
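
As an alternative sketch of the same fix, the paragraph texts can be collected into a list and joined once at the end, which avoids both problems above. This is not the code from the answer: the function name get_article_text and the inline SAMPLE_HTML are hypothetical, introduced only so the example runs without a network request, and it uses get_text() and find_all() in place of findAll(text = True). It also joins with newlines between paragraphs, so unlike the answer's version the result has no leading newline.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<div class="story-body__inner">
  <p>First <b>paragraph</b>.</p>
  <p>Second paragraph.</p>
</div>
<p>Outside the div.</p>
"""

def get_article_text(html):
    """Return the text of every <p> inside the target div, one per line."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", {"class": "story-body__inner"})
    if body is None:
        # The div is missing, so there is nothing to extract.
        return ""
    # Build a list of paragraph texts; nothing is overwritten in the loop.
    paragraphs = [p.get_text() for p in body.find_all("p")]
    return "\n".join(paragraphs)

article_text = get_article_text(SAMPLE_HTML)
print(article_text)
```

With a real page, the html argument would be result.content from requests.get(url), as in the code above. Returning the list itself instead of the joined string may be more convenient if the caller needs the paragraphs separately.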
