Parsing email with Python

Man*_*ron 13 python email parsing mime

I am writing a Python script to process emails handed over by Procmail. As suggested in this question, I am using the following Procmail configuration:

:0:
|$HOME/process_mail.py

My process_mail.py script receives the email via stdin, like this:

From hostname Tue Jun 15 21:43:30 2010
Received: (qmail 8580 invoked from network); 15 Jun 2010 21:43:22 -0400
Received: from mail-fx0-f44.google.com (209.85.161.44)
by ip-73-187-35-131.ip.secureserver.net with SMTP; 15 Jun 2010 21:43:22 -0400
Received: by fxm19 with SMTP id 19so170709fxm.3
for <username@domain.com>; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.103.84.1 with SMTP id m1mr2774225mul.26.1276652853684; Tue, 15
Jun 2010 18:47:33 -0700 (PDT)
Received: by 10.123.143.4 with HTTP; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
Date: Tue, 15 Jun 2010 20:47:33 -0500
Message-ID: <AANLkTikFsIjJ3KYW1HJWcAqQlGXNiXE2YMzrj39I0tdB@mail.gmail.com>
Subject: TEST 12
From: Full Name <username@sender.com>
To: username@domain.com
Content-Type: text/plain; charset=ISO-8859-1

ONE
TWO
THREE

I am trying to parse the message this way:

>>> import email
>>> msg = email.message_from_string(full_message)

I want to get message fields such as "From", "To" and "Subject". However, the message object does not contain any of these fields.

What am I doing wrong?

Ale*_*lli 10

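You have to make sure the header lines are not accidentally broken (as they are above, although it is hard to tell whether that is just a copy-and-paste issue). With the complete message, for example: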

Received: (qmail 8580 invoked from network); 15 Jun 2010 21:43:22 -0400
Received: from mail-fx0-f44.google.com (209.85.161.44) by ip-73-187-35-131.ip.secureserver.net with SMTP; 15 Jun 2010 21:43:22 -0400
Received: by fxm19 with SMTP id 19so170709fxm.3 for <username@domain.com>; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.103.84.1 with SMTP id m1mr2774225mul.26.1276652853684; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
Received: by 10.123.143.4 with HTTP; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
Date: Tue, 15 Jun 2010 20:47:33 -0500
Message-ID: <AANLkTikFsIjJ3KYW1HJWcAqQlGXNiXE2YMzrj39I0tdB@mail.gmail.com>
Subject: TEST 12
From: Full Name <username@sender.com>
To: username@domain.com
Content-Type: text/plain; charset=ISO-8859-1

ONE
TWO
THREE

then

msg = email.message_from_string(msgtxt)
print(msg['Subject'])

prints TEST 12, as desired.
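For a script fed by Procmail, the same idea can be sketched with `email.message_from_file`, which reads from any text stream. This is only a minimal illustration, not the asker's actual script: the helper name `extract_fields` and the shortened sample message are assumptions, and an `io.StringIO` object stands in for `sys.stdin` so the snippet runs on its own.

```python
import email
import io
from email.utils import parseaddr


def extract_fields(stream):
    """Parse one message from a text stream and return a few header values.

    `extract_fields` is a hypothetical helper name used for illustration.
    """
    msg = email.message_from_file(stream)
    # parseaddr splits "Full Name <user@host>" into (display name, address).
    display_name, address = parseaddr(msg["From"] or "")
    return {
        "from_name": display_name,
        "from_address": address,
        "to": msg["To"],
        "subject": msg["Subject"],
    }


# Shortened stand-in for the message Procmail pipes in. A real
# process_mail.py would call extract_fields(sys.stdin) instead.
sample = io.StringIO(
    "From hostname Tue Jun 15 21:43:30 2010\n"
    "Subject: TEST 12\n"
    "From: Full Name <username@sender.com>\n"
    "To: username@domain.com\n"
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "\n"
    "ONE\nTWO\nTHREE\n"
)

fields = extract_fields(sample)
print(fields)
```

The leading mbox-style "From hostname ..." line is not what hides the headers: the parser treats it as the envelope line and carries on with the real header fields.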


Mic*_*zek 5

It looks like your line wrapping does not put whitespace at the start of the continuation lines, which is illegal according to RFC 2822 §2.2.3:

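Each header field is logically a single line of characters comprising
the field name, the colon, and the field body. For convenience
however, and to deal with the 998/78 character limitations per line,
the field body portion of a header field can be split into a multiple
line representation; this is called "folding". The general rule is
that wherever this standard allows for folding white space (not
simply WSP characters), a CRLF may be inserted before any WSP. For
example, the header field: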

    Subject: This is a test

can be represented as:

    Subject: This
     is a test
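As a rough illustration of how Python's `email` package handles a correctly folded header, the sketch below parses the RFC's Subject example twice. The variable names are illustrative assumptions. With the legacy default policy (`compat32`) the fold is kept inside the returned value, while `email.policy.default` (Python 3.3 and later) returns the unfolded text.

```python
import email
from email import policy

# The RFC example: the second line starts with a space, so it is a valid fold.
folded = "Subject: This\n is a test\n\nbody\n"

# Legacy policy: the line break of the fold is preserved in the value.
legacy = email.message_from_string(folded)
# Modern policy: the header value is unfolded on access.
modern = email.message_from_string(folded, policy=policy.default)

print(repr(legacy["Subject"]))
print(repr(str(modern["Subject"])))
```

Either way the parser recognises a single Subject field, because the continuation line begins with whitespace.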

Your message should look like this:

From hostname Tue Jun 15 21:43:30 2010
Received: (qmail 8580 invoked from network); 15 Jun 2010 21:43:22 -0400
Received: from mail-fx0-f44.google.com (209.85.161.44)
    by ip-73-187-35-131.ip.secureserver.net with SMTP; 15 Jun 2010 21:43:22 -0400
Received: by fxm19 with SMTP id 19so170709fxm.3
    for <username@domain.com>; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.103.84.1 with SMTP id m1mr2774225mul.26.1276652853684; Tue, 15
    Jun 2010 18:47:33 -0700 (PDT)
Received: by 10.123.143.4 with HTTP; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
Date: Tue, 15 Jun 2010 20:47:33 -0500
Message-ID: <AANLkTikFsIjJ3KYW1HJWcAqQlGXNiXE2YMzrj39I0tdB@mail.gmail.com>
Subject: TEST 12
From: Full Name <username@sender.com>
To: username@domain.com
Content-Type: text/plain; charset=ISO-8859-1

ONE
TWO
THREE
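The difference can be reproduced with a cut-down message. This is a hedged sketch: the host names `mail.example.com` and `mx.example.net` are placeholders, not taken from the question. When the continuation line starts in column 0, the parser stops reading headers at that line and treats everything after it, including Subject, as body text.

```python
import email

# Continuation line starts in column 0, so it is not a valid fold.
# Host names below are hypothetical placeholders.
broken = (
    "Received: from mail.example.com (192.0.2.1)\n"
    "by mx.example.net with SMTP; 15 Jun 2010 21:43:22 -0400\n"
    "Subject: TEST 12\n"
    "\n"
    "ONE\n"
)
# Same message with the continuation line indented by four spaces.
folded = broken.replace("\nby ", "\n    by ")

broken_msg = email.message_from_string(broken)
folded_msg = email.message_from_string(folded)

print(broken_msg["Subject"])  # None: header parsing stopped at the "by ..." line
print(folded_msg["Subject"])  # TEST 12

# In the broken case the remaining header lines end up in the body.
print(broken_msg.get_payload())
# On Python 3 the parser also records what went wrong.
print([type(d).__name__ for d in broken_msg.defects])
```

If the breakage happens before the message reaches the script, the fix belongs wherever the lines are being rewrapped; the parser itself only needs the leading whitespace to be intact.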