I am trying to process e-mails into my application, and everything seems to work fine until I get an e-mail from a user whose mail server enforces word wrapping of the message text. I know that the word wrap is part of an RFC specification, so I'm just looking for the best way to handle it and get a nicely displayed message.
Original E-mail:
Here is my main issue. When I email a message, the text is broken up rather oddly. It almost looks as though the message itself is broken. I'm not sure why this is the case though because my original email looks nothing like that.
Here is what the received e-mail looks like (marked with CRLF to show where mail server is inserting them):
Here is my main issue. When I email a message, the text is broken up rather CRLF
oddly. It almost looks as though the message itself is broken. I'm not sure CRLF
why this is the case though because my original email looks nothing like CRLF
that.
My processing code runs through the following and would then insert the result into the database.
$dirty_string = nl2br($dirty_string);
$config = HTMLPurifier_Config::createDefault();
$config->set('AutoFormat.RemoveEmpty', 'true');
$config->set('AutoFormat.RemoveEmpty.RemoveNbsp', 'true');
$config->set('HTML.Allowed', 'a[href],br,p');
$purifier = new HTMLPurifier($config);
$clean_string = $purifier->purify($dirty_string);
The following is the result that gets displayed. If the div on my page is not wide enough for the line, the browser automatically word-wraps it, but the line break from nl2br() causes the next line to be short.
Here is my main issue. When I email a message, the text is
broken up rather
oddly. It almost looks as though the message itself is
broken. I'm not sure
why this is the case though because my original email looks
nothing like
that.
I thought that maybe I could just change double CRLFs to new paragraphs and strip every single CRLF, concatenating the lines into one long line that the browser's word wrap would display correctly. But if someone posts the following bullet list in an e-mail, that would break the list.
This is my List CRLF
- Item 1 CRLF
- Item 2 CRLF
etc...
Any help would be greatly appreciated.
Mail parsing is probably the quintessential example of a problem that appears simple, but is actually filled with oddball edge cases that break simple parsers. However, it's also not exactly a new problem, so there are plenty of existing solutions that work fine. Some options:
Maybe you've already written a great parser that just needs this one little change to be perfect, but more likely you'll save yourself much time and heartache by using the already existing tools to do the job.
How about this: for any line where the following line contains words and does not begin with a whitespace character (such as the indentation in a list), check whether the line is between 65 and 80 characters long. If it is, remove the trailing CRLF (and add a space if the end of the line doesn't contain space or punctuation). This will get most of your word wrap cases and leave most of your lists alone.
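A rough sketch of that heuristic in PHP, in case it helps. The function name `unwrapSoftBreaks` and its parameters are made up for illustration; the 65 and 80 bounds are the ones suggested above. One deliberate deviation: it inserts a space whenever the line does not already end in whitespace, even after punctuation, because otherwise "broken." and "I'm" would be glued together.

```php
<?php
// Hypothetical helper: joins lines that look like server-inserted wraps.
// $min and $max are the 65..80 length window from the heuristic above.
function unwrapSoftBreaks(string $text, int $min = 65, int $max = 80): string
{
    // Split on CRLF, CR, or LF so no stray "\r" is left on a line.
    $lines = preg_split('/\r\n|\r|\n/', $text);
    $last = count($lines) - 1;
    $out = '';

    foreach ($lines as $i => $line) {
        $out .= $line;
        if ($i === $last) {
            break; // nothing follows the final line
        }

        $next = $lines[$i + 1];
        $len = strlen($line);
        $nextHasWords = trim($next) !== '';
        $nextIsIndented = $next !== '' && ctype_space($next[0]);

        if ($nextHasWords && !$nextIsIndented && $len >= $min && $len <= $max) {
            // Probably a wrap added in transit: replace the break with one space.
            $out .= preg_match('/\s$/', $line) ? '' : ' ';
        } else {
            // Short lines, blank lines, and indented lines keep their break.
            $out .= "\n";
        }
    }

    return $out;
}
```

You would call it as `$text = unwrapSoftBreaks($plaintext);` before handing the result to nl2br(). A genuinely long line that the author ended on purpose inside the 65 to 80 window will still be joined, so treat the bounds as something to tune against real mail.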
You could try using TinyMCE editor to view the e-mail message. It will format it correctly. I've used TinyMCE a few times to input data and save it to the database, and each time it correctly displayed it after I retrieved the data no matter how weird the formatting was.
How about a hack like this: remove CRLF characters at any positions that are multiples of 78 (plus, say, up to 5 characters, to account for the fact that the mail server won't just cut a line mid-word).
So you would look for CRLF characters in these positions:
78 or 79 or 80 or 81 or 82 or 83, AND
156 or 157 or 158 or 159 or 160 or 161, AND so on.
This is of course assuming that the longest words are 5 characters in length. You should tweak this based on the emails you need to parse.
Here's a function that does the job pretty well:
function PlaintextEmailBrokenLineCombine($lineSet, $startIndex = 0) {
    $result = '';
    $lineCount = count($lineSet);
    for ($i = $startIndex; $i < $lineCount; $i++) {
        $thisLine = $lineSet[$i];
        $nextLine = ($i < $lineCount - 1 ? $lineSet[$i + 1] : '');
        // First word of the next line; the whole line if it contains no space.
        $spacePos = strpos($nextLine, ' ');
        $nextLineFirstWord = ($spacePos === false ? $nextLine : substr($nextLine, 0, $spacePos));
        $lineSeparator = "\n"; // we assume until we detect invocation of the 78char rule
        if (strlen($thisLine) + strlen($nextLineFirstWord) + 1 > 75) {
            // A line break was PROBABLY put in here where a space once was, so switch back:
            $lineSeparator = ' ';
        }
        $result .= $thisLine . ($i == $lineCount - 1 ? '' : $lineSeparator); // no separator for the last line
    }
    return $result;
}
It's a little esoteric because it expects an array of lines from the plain text email. Here's the usage:
$Parser = new MimeMailParser();
$Parser->setText($rawEmailText);
$plaintext = $Parser->getMessageBody('text'); // or however you get it, many ways
$lineSet = preg_split('/\R/', $plaintext); // split on CRLF, CR, or LF so no stray "\r" stays on a line
$niceText = PlaintextEmailBrokenLineCombine($lineSet);
$niceText is what you want: it's a pretty accurate way of getting the text you want with those pesky server-added line breaks gone, and replaced with the original spaces.
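To see how the first-word rule behaves on the e-mail from the question, here is a small standalone check. `joinWrappedLines` is a hypothetical, condensed restatement of the rule so the snippet runs on its own, and the sample message is an assumption: it simply puts the wrapped paragraph and the bullet list from the question into one e-mail.

```php
<?php
// Condensed restatement of the first-word rule (hypothetical name).
// 75 is the same threshold used in the function above.
function joinWrappedLines(array $lines, int $limit = 75): string
{
    $result = '';
    $last = count($lines) - 1;

    foreach ($lines as $i => $line) {
        $result .= $line;
        if ($i === $last) {
            break; // no separator after the last line
        }
        // First word of the next line (the whole line if it has no space).
        $firstWord = explode(' ', $lines[$i + 1], 2)[0];
        // If that word would not have fit on this line, the break was
        // probably inserted in transit, so restore the space.
        $wouldOverflow = strlen($line) + 1 + strlen($firstWord) > $limit;
        $result .= $wouldOverflow ? ' ' : "\n";
    }

    return $result;
}

// Sample message assembled from the question's two examples.
$email = "Here is my main issue. When I email a message, the text is broken up rather\r\n"
       . "oddly. It almost looks as though the message itself is broken. I'm not sure\r\n"
       . "why this is the case though because my original email looks nothing like\r\n"
       . "that.\r\n"
       . "\r\n"
       . "This is my List\r\n"
       . "- Item 1\r\n"
       . "- Item 2";

echo joinWrappedLines(preg_split('/\R/', $email));
```

The wrapped paragraph comes back as a single line ending in "like that.", while the blank line and the three list lines keep their breaks because they are far too short to trigger the rule. Note that the last wrap, before the one-word line "that.", is only detected if the whole next line is treated as its first word.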