How to parse HTML from an email body - Python

Question

skm*_*kme 4 html python email beautifulsoup email-parsing

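I'm trying to parse incoming emails with Python. The emails I receive are part plain text and part HTML. I want to extract the HTML part and find a table inside it.

I tried using BeautifulSoup. But with the following code, bs only picks up the first "" part rather than the whole HTML part: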

import imaplib
from bs4 import BeautifulSoup as bs  # assumed import; the original post omits it

# connect to the Gmail IMAP server (user and pwd are defined elsewhere)
m = imaplib.IMAP4_SSL("imap.gmail.com")
m.login(user, pwd)
# use m.list() to get all the mailboxes, "INBOX" to get only the inbox
m.select("INBOX")
# you could filter using the IMAP rules here (check http://www.example-code.com/csharp/imap-search-critera.asp)
resp, items = m.search(None, '(UNSEEN)')
items = items[0].split()  # get the mail ids

for emailid in items:
    # fetch the mail content (body only, still MIME-encoded)
    resp, data = m.fetch(emailid, '(UID BODY[TEXT])')
    text = str(data[0][1])
    soup = bs(text)


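How can I run 'bs' over the entire HTML part? Or is there another way to parse an HTML table out of an email body?

'bs' seems like the best fit for me, because I want to find a specific HTML body that contains a particular keyword, and a 'bs' search can retrieve the whole table and let me iterate over it.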

Answer 1

skm*_*kme 5

Apparently, I was using the wrong parser.

Once I switched to the "lxml" parser, it worked fine.

The following line needs to change:

soup = bs(text, "lxml")

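As a complement to the parser change, here is a minimal sketch of another way to reach the HTML part. It is not from the original answer. It assumes the full raw message is available as bytes and uses the standard-library `email` package to pick out and decode the `text/html` part before any HTML parsing. The names `extract_html` and `build_sample_message` are hypothetical helpers, and the sample message exists only to make the sketch runnable.

```python
import email
from email import policy
from email.message import EmailMessage


def extract_html(raw_bytes):
    """Return the decoded text/html body of a raw message, or None if absent."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # get_body() walks multipart containers and returns the preferred body part
    part = msg.get_body(preferencelist=("html",))
    if part is None:
        return None
    # get_content() undoes base64/quoted-printable and decodes the charset
    return part.get_content()


def build_sample_message():
    """Hypothetical message with a plain-text part and an HTML alternative."""
    msg = EmailMessage()
    msg["Subject"] = "Report"
    msg["From"] = "sender@example.com"
    msg["To"] = "receiver@example.com"
    msg.set_content("Plain-text fallback")
    msg.add_alternative(
        "<html><body><table><tr><td>keyword</td></tr></table></body></html>",
        subtype="html",
    )
    return msg.as_bytes()


html = extract_html(build_sample_message())
```

With IMAP, the raw bytes would presumably come from fetching the whole message, for example `m.fetch(emailid, '(RFC822)')` and then `data[0][1]`, rather than from `BODY[TEXT]`. The returned string can then be handed to `bs(html, "lxml")` as in the answer above.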
