I have some text:
text = """From: 'Mark Twain' <mark.twain@gmail.com>
To: 'Edgar Allen Poe' <eap@gmail.com>
Subject: RE:Hello!

Ed,

I just read the Tell Tale Heart. You\'ve got problems man.

Sincerely,
Marky Mark

From: 'Edgar Allen Poe' <eap@gmail.com>
To: 'Mark Twain' <mark.twain@gmail.com>
Subject: RE: Hello!

Mark,

The world is crushing my soul, and so are you.

Regards,
Edgar"""
It looks like this:
"From: 'Mark Twain' <mark.twain@gmail.com>\nTo: 'Edgar Allen Poe' <eap@gmail.com>\nSubject: RE:Hello!\n\nEd,\n\nI just read the Tell Tale Heart. You've got problems man.\n\nSincerely,\nMarky Mark\n\nFrom: 'Edgar Allen Poe' <eap@gmail.com>\nTo: 'Mark Twain' <mark.twain@gmail.com>\nSubject: RE: Hello!\n\nMark,\n\nThe world is crushing my soul, and so are you.\n\nRegards,\nEdgar"
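I'm trying to parse out the individual messages. Ultimately, I want a list or dictionary containing the From and To fields, plus the message body, so I can run some analysis on it.
I tried parsing it by lowercasing everything and then splitting the string: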
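import string

text = text.lower()
# Note: str.translate expects a translation table. Given string.punctuation, it strips
# no punctuation; it maps '\n' (code point 10) to '+', the character at index 10.
text = text.translate(string.punctuation)
text_list = text.split('+')
text_list = [x for x in text_list if len(x) != 0]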
Is there a better way?
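You can use the re module to split the messages (an explanation of this regex is available on an external site). The result is a list of dictionaries with the keys 'from', 'to', 'subject' and 'message':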
text = """From: 'Mark Twain' <mark.twain@gmail.com>
To: 'Edgar Allen Poe' <eap@gmail.com>
Subject: RE:Hello!

Ed,

I just read the Tell Tale Heart. You\'ve got problems man.

Sincerely,
Marky Mark

From: 'Edgar Allen Poe' <eap@gmail.com>
To: 'Mark Twain' <mark.twain@gmail.com>
Subject: RE: Hello!

Mark,

The world is crushing my soul, and so are you.

Regards,
Edgar"""
import re
from pprint import pprint

# Capture sender, recipient, subject line and body. Each body runs lazily up to
# the next line that starts with "From:" or the end of the text.
groups = re.findall(r'^From:(.*?)To:(.*?)Subject:(.*?)$(.*?)(?=^From:|\Z)', text, flags=re.DOTALL|re.M)

emails = []
for g in groups:
    d = {}
    d['from'] = g[0].strip()
    d['to'] = g[1].strip()
    d['subject'] = g[2].strip()
    d['message'] = g[3].strip()
    emails.append(d)

pprint(emails)
Prints:
[{'from': "'Mark Twain' <mark.twain@gmail.com>",
  'message': 'Ed,\n'
             '\n'
             "I just read the Tell Tale Heart. You've got problems man.\n"
             '\n'
             'Sincerely,\n'
             'Marky Mark',
  'subject': 'RE:Hello!',
  'to': "'Edgar Allen Poe' <eap@gmail.com>"},
 {'from': "'Edgar Allen Poe' <eap@gmail.com>",
  'message': 'Mark,\n'
             '\n'
             'The world is crushing my soul, and so are you.\n'
             '\n'
             'Regards,\n'
             'Edgar',
  'subject': 'RE: Hello!',
  'to': "'Mark Twain' <mark.twain@gmail.com>"}]
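Since the regex explanation lives on an external site, here is a rough annotated version of the same pattern, written with re.VERBOSE and named groups. The group names (sender, recipient, subject, message) and the shortened sample text are my own choices for illustration, not part of the answer above.

```python
import re

# Same pattern as above, spread out with re.VERBOSE so each part can carry a comment.
EMAIL_RE = re.compile(r"""
    ^From:(?P<sender>.*?)        # sender: everything up to the next To:
    To:(?P<recipient>.*?)        # recipient: everything up to Subject:
    Subject:(?P<subject>.*?)$    # subject: the rest of the Subject line
    (?P<message>.*?)             # body: lazily consume the following text...
    (?=^From:|\Z)                # ...until the next From: line or the end of the text
""", re.DOTALL | re.MULTILINE | re.VERBOSE)

# Shortened sample with the same layout as the text in the question.
text = (
    "From: 'Mark Twain' <mark.twain@gmail.com>\n"
    "To: 'Edgar Allen Poe' <eap@gmail.com>\n"
    "Subject: RE:Hello!\n"
    "\n"
    "Ed,\n"
    "\n"
    "You've got problems man.\n"
    "\n"
    "From: 'Edgar Allen Poe' <eap@gmail.com>\n"
    "To: 'Mark Twain' <mark.twain@gmail.com>\n"
    "Subject: RE: Hello!\n"
    "\n"
    "Mark,\n"
    "\n"
    "The world is crushing my soul, and so are you.\n"
)

# One dict per message, with surrounding whitespace stripped from every field.
emails = [
    {key: value.strip() for key, value in match.groupdict().items()}
    for match in EMAIL_RE.finditer(text)
]
```

Note that re.DOTALL lets .*? run across line breaks, and re.MULTILINE makes ^ and $ match at every line boundary. A body line that itself starts with "From:" would be treated as the start of a new message.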
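As a possible alternative, and only a sketch: because each message has a header block followed by a blank line and a body, the standard library's email module can do the header parsing once the text is cut into one chunk per message. This assumes every message starts with a line beginning with "From:", that a blank line separates headers from body (as in the repr in the question), and Python 3.7+ for the zero-width re.split. The sample text is shortened for illustration.

```python
import re
from email import message_from_string

# Shortened sample with the same layout as the text in the question.
text = (
    "From: 'Mark Twain' <mark.twain@gmail.com>\n"
    "To: 'Edgar Allen Poe' <eap@gmail.com>\n"
    "Subject: RE:Hello!\n"
    "\n"
    "Ed,\n"
    "\n"
    "You've got problems man.\n"
    "\n"
    "Sincerely,\n"
    "Marky Mark\n"
    "\n"
    "From: 'Edgar Allen Poe' <eap@gmail.com>\n"
    "To: 'Mark Twain' <mark.twain@gmail.com>\n"
    "Subject: RE: Hello!\n"
    "\n"
    "Mark,\n"
    "\n"
    "The world is crushing my soul, and so are you.\n"
    "\n"
    "Regards,\n"
    "Edgar"
)

# Split just before every line that starts with "From:" (zero-width match),
# then drop the empty piece produced by the match at position 0.
chunks = [c for c in re.split(r'(?m)^(?=From:)', text) if c.strip()]

emails = []
for chunk in chunks:
    msg = message_from_string(chunk)  # parses the headers; the rest is the payload
    emails.append({
        'from': msg['From'],
        'to': msg['To'],
        'subject': msg['Subject'],
        'message': msg.get_payload().strip(),
    })
```

Header lookup on the parsed message is case-insensitive, so msg['from'] would work as well. As with the regex, a body line that starts with "From:" would be mistaken for the start of a new message.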