清理电子邮件链以进行文本分析python

Mat*_* W. 5 python text

我有一些文字:

text = """From: 'Mark Twain' <mark.twain@gmail.com>
To: 'Edgar Allen Poe' <eap@gmail.com>
Subject: RE:Hello!

Ed,

I just read the Tell Tale Heart. You\'ve got problems man.

Sincerely,
Marky Mark

From: 'Edgar Allen Poe' <eap@gmail.com>
To: 'Mark Twain' <mark.twain@gmail.com>
Subject: RE: Hello!

Mark,

The world is crushing my soul, and so are you.

Regards,
Edgar"""
Run Code Online (Sandbox Code Playgroud)

看起来像这样:

"From: 'Mark Twain' <mark.twain@gmail.com>\nTo: 'Edgar Allen Poe' <eap@gmail.com>\nSubject: RE:Hello!\n\nEd,\n\nI just read the Tell Tale Heart. You've got problems man.\n\nSincerely,\nMarky Mark\n\nFrom: 'Edgar Allen Poe' <eap@gmail.com>\nTo: 'Mark Twain' <mark.twain@gmail.com>\nSubject: RE: Hello!\n\nMark,\n\nThe world is crushing my soul, and so are you.\n\nRegards,\nEdgar"
Run Code Online (Sandbox Code Playgroud)

我正在尝试解析其中的消息。最终,我想要一个列表或字典,其中包含 From 和 To,然后是用于进行一些分析的消息正文。

我尝试通过将所有内容调低,然后进行字符串拆分来解析它。

text = text.lower()
text = text.translate(string.punctuation)
text_list = text.split('+')
text_list = [x for x in text_list if len(x) != 0]
Run Code Online (Sandbox Code Playgroud)

有一个更好的方法吗?

And*_*ely 4

您可以使用re来拆分消息(外部站点上对此正则表达式的解释)。结果是带有键'from''to''subject'的字典列表'message'

text = """From: 'Mark Twain' <mark.twain@gmail.com>
To: 'Edgar Allen Poe' <eap@gmail.com>
Subject: RE:Hello!

Ed,

I just read the Tell Tale Heart. You\'ve got problems man.

Sincerely,
Marky Mark

From: 'Edgar Allen Poe' <eap@gmail.com>
To: 'Mark Twain' <mark.twain@gmail.com>
Subject: RE: Hello!

Mark,

The world is crushing my soul, and so are you.

Regards,
Edgar"""

import re
from pprint import pprint

groups = re.findall(r'^From:(.*?)To:(.*?)Subject:(.*?)$(.*?)(?=^From:|\Z)', text, flags=re.DOTALL|re.M)
emails = []
for g in groups:
    d = {}
    d['from'] = g[0].strip()
    d['to'] = g[1].strip()
    d['subject'] = g[2].strip()
    d['message'] = g[3].strip()
    emails.append(d)

pprint(emails)
Run Code Online (Sandbox Code Playgroud)

印刷:

[{'from': "'Mark Twain' <mark.twain@gmail.com>",
  'message': 'Ed,\n'
             '\n'
             "I just read the Tell Tale Heart. You've got problems man.\n"
             '\n'
             'Sincerely,\n'
             'Marky Mark',
  'subject': 'RE:Hello!',
  'to': "'Edgar Allen Poe' <eap@gmail.com>"},
 {'from': "'Edgar Allen Poe' <eap@gmail.com>",
  'message': 'Mark,\n'
             '\n'
             'The world is crushing my soul, and so are you.\n'
             '\n'
             'Regards,\n'
             'Edgar',
  'subject': 'RE: Hello!',
  'to': "'Mark Twain' <mark.twain@gmail.com>"}]
Run Code Online (Sandbox Code Playgroud)