Tim*_*Tim 14 java email duplicate-data
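I am parsing emails. When I see a reply to an email, I want to remove the quoted text so that I can append the new text to the previous email (even if that one is itself a reply).
Typically, you would see:
First email (start of the conversation)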
This is the first email
Second email (reply to the first)
This is the second email
Tim said:
This is the first email
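The output for this would be just "This is the second email". Different email clients quote text differently, so a solution that recovers most of the new email text would also be acceptable.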
smu*_*has 12
I use the following regular expressions to match the lead-in to quoted text (the last one is the one that matters):
// Requires: import java.util.regex.Pattern;
/** general spacers for time and date */
private static final String spacers = "[\\s,/\\.\\-]";
/** matches times */
private static final String timePattern = "(?:[0-2])?[0-9]:[0-5][0-9](?::[0-5][0-9])?(?:(?:\\s)?[AP]M)?";
/** matches day of the week */
private static final String dayPattern = "(?:(?:Mon(?:day)?)|(?:Tue(?:sday)?)|(?:Wed(?:nesday)?)|(?:Thu(?:rsday)?)|(?:Fri(?:day)?)|(?:Sat(?:urday)?)|(?:Sun(?:day)?))";
/** matches day of the month (number and st, nd, rd, th) */
private static final String dayOfMonthPattern = "[0-3]?[0-9]" + spacers + "*(?:(?:th)|(?:st)|(?:nd)|(?:rd))?";
/** matches months (numeric and text) */
private static final String monthPattern = "(?:(?:Jan(?:uary)?)|(?:Feb(?:ruary)?)|(?:Mar(?:ch)?)|(?:Apr(?:il)?)|(?:May)|(?:Jun(?:e)?)|(?:Jul(?:y)?)" +
"|(?:Aug(?:ust)?)|(?:Sep(?:tember)?)|(?:Oct(?:ober)?)|(?:Nov(?:ember)?)|(?:Dec(?:ember)?)|(?:[0-1]?[0-9]))";
/** matches years (only 1000's and 2000's, because we are matching emails) */
private static final String yearPattern = "(?:[1-2]?[0-9])[0-9][0-9]";
/** matches a full date */
private static final String datePattern = "(?:" + dayPattern + spacers + "+)?(?:(?:" + dayOfMonthPattern + spacers + "+" + monthPattern + ")|" +
"(?:" + monthPattern + spacers + "+" + dayOfMonthPattern + "))" +
spacers + "+" + yearPattern;
/** matches a date and time combo (in either order) */
private static final String dateTimePattern = "(?:" + datePattern + "[\\s,]*(?:(?:at)|(?:@))?\\s*" + timePattern + ")|" +
"(?:" + timePattern + "[\\s,]*(?:on)?\\s*"+ datePattern + ")";
/** matches a leading line such as
* ----Original Message----
* or simply
* ------------------------
*/
private static final String leadInLine = "-+\\s*(?:Original(?:\\sMessage)?)?\\s*-+\n";
/** matches a header line indicating the date */
private static final String dateLine = "(?:(?:date)|(?:sent)|(?:time)):\\s*"+ dateTimePattern + ".*\n";
/** matches a subject or address line */
private static final String subjectOrAddressLine = "(?:(?:from)|(?:subject)|(?:b?cc)|(?:to)):.*\n";
/** matches gmail style quoted text beginning, i.e.
* On Mon Jun 7, 2010 at 8:50 PM, Simon wrote:
*/
private static final String gmailQuotedTextBeginning = "(On\\s+" + dateTimePattern + ".*wrote:\n)";
/** matches the start of a quoted section of an email */
private static final Pattern QUOTED_TEXT_BEGINNING = Pattern.compile("(?i)(?:(?:" + leadInLine + ")?" +
"(?:(?:" +subjectOrAddressLine + ")|(?:" + dateLine + ")){2,6})|(?:" +
gmailQuotedTextBeginning + ")"
);
I know this is overkill in some ways (and it may be slow!), but it works quite well. If you find anything it fails to match, let me know so I can improve it!
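To show how a compiled lead-in pattern like this might be applied, here is a minimal, self-contained sketch. It does not use the full `QUOTED_TEXT_BEGINNING` pattern. `QUOTE_START` is a deliberately simplified stand-in, and the class and method names are hypothetical. It assumes everything from the first matching lead-in line onward is quoted text, and it requires the word "Original" in the separator line so that plain dashed lines are not treated as a lead-in.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedTextStripper {
    // Simplified stand-in for QUOTED_TEXT_BEGINNING (an assumption for illustration):
    // matches an "-----Original Message-----" separator line or a
    // Gmail-style "On <anything> wrote:" attribution line.
    private static final Pattern QUOTE_START = Pattern.compile(
            "(?im)^(?:-+\\s*Original(?:\\sMessage)?\\s*-+|On\\s.*wrote:)\\s*$");

    /** Returns the text before the first quote lead-in, or the whole body if none is found. */
    static String stripQuotedText(String body) {
        Matcher m = QUOTE_START.matcher(body);
        return m.find() ? body.substring(0, m.start()).trim() : body.trim();
    }

    public static void main(String[] args) {
        String reply = "This is the second email\n\n"
                + "On Mon Jun 7, 2010 at 8:50 PM, Simon wrote:\n"
                + "> This is the first email\n";
        System.out.println(stripQuotedText(reply)); // prints "This is the second email"
    }
}
```

The same `find()` and `substring(0, m.start())` approach should work with the full pattern above. Note that an attribution such as "Tim said:" from the question is not covered by this simplified pattern.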
小智 6
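Check out the Google patent: http://www.google.com/patents/US7222299
In summary, they hash portions of the text (probably something like sentences) and then look for matching hashes in the previous messages. It is very fast, and they probably also use this as an input to their threading algorithm. What a great idea!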
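The hashing idea can be sketched roughly as follows. This is not the algorithm from the patent, only a minimal illustration under stated assumptions: the unit of comparison is a single line rather than a sentence, a `HashSet` of normalized lines stands in for the hash index, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class HashedQuoteRemover {
    /** Drops leading quote markers ('>' and whitespace), collapses whitespace, lowercases. */
    static String normalize(String line) {
        return line.replaceFirst("^[>\\s]+", "")
                   .replaceAll("\\s+", " ")
                   .trim()
                   .toLowerCase(Locale.ROOT);
    }

    /** Keeps only the lines of reply whose normalized form does not occur in previous. */
    static String removeSeenLines(String previous, String reply) {
        // Index the earlier message; HashSet lookups are hash-based and O(1) on average.
        Set<String> seen = new HashSet<>();
        for (String line : previous.split("\\r?\\n")) {
            String key = normalize(line);
            if (!key.isEmpty()) {
                seen.add(key);
            }
        }
        // Keep every reply line that was not already present in the earlier message.
        List<String> kept = new ArrayList<>();
        for (String line : reply.split("\\r?\\n")) {
            if (!seen.contains(normalize(line))) {
                kept.add(line);
            }
        }
        return String.join("\n", kept).trim();
    }
}
```

For the emails in the question, `removeSeenLines("This is the first email", "This is the second email\nTim said:\n> This is the first email")` drops the quoted line but keeps the "Tim said:" attribution, so an attribution-line check would still be needed on top of this.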
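When the previous emails are stored on disk or are otherwise available, you can check all messages sent by a specific recipient to determine which part is the reply text.
You can also try to determine the quote character by checking the first character of the last few lines. Usually, the last few lines all start with the same character.
When the last two lines start with different characters, you can try the first lines instead, because sometimes the answer is appended at the end of the text.
Once the quote character has been detected, you can remove the trailing lines that start with it, until you reach an empty line or a line that starts with a different character.
Untested, and more like pseudocode: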
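// emailBody is assumed to hold the raw message text
String[] lines = emailBody.split("\\r?\\n");
final int YOUR_DECISION = 2; // You can decide
// Need more than two lines, and a non-empty last line to take the character from
if (lines.length > 2 && !lines[lines.length - 1].isEmpty()) {
    char startingChar = lines[lines.length - 1].charAt(0);
    int foundCounter = 0;
    // Count the earlier lines that start with the same character as the last line
    for (int i = lines.length - 2; i >= 0; --i) {
        String line = lines[i];
        // Empty lines have no first character, so skip them
        if (!line.isEmpty() && startingChar == line.charAt(0)) {
            ++foundCounter;
        }
    }
    if (foundCounter > YOUR_DECISION) {
        // Placeholder: implement the actual removal yourself
        deleteLastLinesHere(startingChar, foundCounter);
    }
}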
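The removal step described above (drop trailing lines that start with the detected character until an empty line or a different first character is reached) could look roughly like the sketch below. The class name, method name, and `MIN_QUOTED_LINES` threshold are hypothetical. Rejecting letters and digits as quote characters is an added assumption, so that ordinary trailing sentences are not mistaken for a quote.

```java
import java.util.Arrays;

final class TrailingQuoteTrimmer {
    /** Minimum number of trailing lines that must share a first character (assumed threshold). */
    private static final int MIN_QUOTED_LINES = 2;

    static String trimTrailingQuote(String body) {
        String trimmed = body.trim();
        String[] lines = trimmed.split("\\r?\\n");
        String last = lines[lines.length - 1];
        if (last.isEmpty()) {
            return ""; // the body was blank
        }
        char quoteChar = last.charAt(0);
        // Assumption: letters and digits are ordinary text, not quote markers.
        if (Character.isLetterOrDigit(quoteChar)) {
            return trimmed;
        }
        // Walk upward while lines are non-empty and start with the quote character.
        int end = lines.length;
        while (end > 0 && !lines[end - 1].isEmpty() && lines[end - 1].charAt(0) == quoteChar) {
            end--;
        }
        if (lines.length - end < MIN_QUOTED_LINES) {
            return trimmed; // too few matching lines to call it a quote
        }
        return String.join("\n", Arrays.copyOfRange(lines, 0, end)).trim();
    }
}
```

This only handles quotes at the end of the message. Inline or top-quoted replies would need the first-lines check mentioned above.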