Processing e-mails that are word wrapped (Content-Type: text/plain)


6

I am trying to process e-mails into my application, and everything works fine until I get an e-mail from a user whose mail server enforces word wrapping of the message text. I know that the word wrap is part of an RFC specification, so I'm just looking for the best way to handle it and get a nicely displayed message.

Original E-mail:

Here is my main issue. When I email a message, the text is broken up rather oddly. It almost looks as though the message itself is broken. I'm not sure why this is the case though because my original email looks nothing like that.

Here is what the received e-mail looks like (marked with CRLF to show where the mail server inserts them):

Here is my main issue. When I email a message, the text is broken up rather CRLF
oddly. It almost looks as though the message itself is broken. I'm not sure CRLF
why this is the case though because my original email looks nothing like CRLF
that.

My processing code runs through the following and would then insert the result into the database.

$dirty_string = nl2br($dirty_string);
$config = HTMLPurifier_Config::createDefault();
// Boolean directives are set with real booleans rather than the string 'true'.
$config->set('AutoFormat.RemoveEmpty', true);
$config->set('AutoFormat.RemoveEmpty.RemoveNbsp', true);
$config->set('HTML.Allowed', 'a[href],br,p');
$purifier = new HTMLPurifier($config);
$clean_string = $purifier->purify($dirty_string);

The following is the result that gets displayed. If the div on my page is not wide enough for a line, the browser automatically word wraps it, but the line breaks from nl2br() then cause the next line to be short.

Here is my main issue. When I email a message, the text is
broken up rather
oddly. It almost looks as though the message itself is
broken. I'm not sure
why this is the case though because my original email looks
nothing like
that.

I thought that maybe I could change double CRLFs to new paragraphs and strip every single CRLF, concatenating the lines into one long line that the browser's word wrap would display correctly. But if someone posts the following bullet list in an e-mail, that would break the list.

This is my List CRLF
- Item 1 CRLF
- Item 2 CRLF
etc...
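
For illustration, the idea above can be sketched roughly as follows. This is only a minimal sketch, not code from my application: naiveUnwrap is a hypothetical helper name, and it assumes PHP 7.4 or later for the arrow function.

```php
<?php
// Hypothetical helper: the naive "paragraphs on blank lines, join everything
// else" approach described above.
function naiveUnwrap(string $text): string
{
    // Normalise line endings so CRLF, CR and LF are all handled alike.
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    // One or more blank lines separate paragraphs.
    $paragraphs = preg_split('/\n{2,}/', trim($text));
    // Inside a paragraph, every single line break becomes a space.
    $joined = array_map(
        fn(string $p): string => str_replace("\n", ' ', $p),
        $paragraphs
    );
    return implode("\n\n", $joined);
}

// The bullet list from above collapses into one line:
$list = "This is my List\r\n- Item 1\r\n- Item 2\r\netc...";
echo naiveUnwrap($list), "\n";
// prints: This is my List - Item 1 - Item 2 etc...
```

Wrapped prose comes out as one line per paragraph, which is what I want, but the list items are merged as well, which is exactly the problem.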

Any help would be greatly appreciated.

2012-04-04 16:49
by Matt D.
Does the RFC specification you mentioned also give you the width of each line? - ErJab 2012-04-11 06:43
Yeah, http://www.ietf.org/rfc/rfc2822.txt. The spec states that lines SHOULD be no more than 78 characters. It's not that easy, though, because you have to factor in the fact that the mail server won't just cut a line mid-word - Matt D. 2012-04-11 10:36


1

Mail parsing is probably the quintessential example of a problem that appears simple, but is actually filled with oddball edge cases that break simple parsers. However, it's also not exactly a new problem, so there are plenty of existing solutions that work fine. Some options:

Maybe you've already written a great parser that just needs this one little change to be perfect, but more likely you'll save yourself much time and heartache by using the already existing tools to do the job.

2012-04-09 18:44
by blahdiblah
Have you personally used any of these or would you recommend one over the other? None of them appear overly complicated - Matt D. 2012-04-10 12:51
@MatthewDevine I don't have particular recommendations among these, my email dealings have been largely not in PHP - blahdiblah 2012-04-10 15:43
I've used MailParse with good results. Can highly recommend it - Christian Riesen 2012-04-12 12:19
For purpose of this question, skip Plancake: it does a nice enough job of parsing email bodies, but the problematic line breaks remain completely intact - John Larson 2012-10-18 12:46
Yikes, PHP Mime Mail Parser also fails to remove the problematic line breaks when parsing plain text! And since it wraps MailParse, I reckon that also is of no help to the problem at hand. Perhaps some heuristic hack is the way to go? I'm now considering parsing HTML bodies and splitting on <br> to get the real lines out; PHP Mime Mail Parser and Plancake both appear to get those right - John Larson 2012-10-18 13:08


0

How about this: for any line whose following line contains words and does not begin with a whitespace character (such as the indentation in a list), check whether the line is between 65 and 80 characters long. If it is, remove the trailing CRLF (and add a space if the end of the line doesn't contain a space or punctuation). This will catch most of your word-wrap cases and leave most of your lists alone.
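
A rough sketch of that heuristic, under some assumptions: unwrapByLength is a hypothetical name, the 65 and 80 bounds are the ones suggested above, and "contains words" is approximated with a simple regular expression. One deliberate simplification: it adds a space unless the line already ends in whitespace, so lines ending in punctuation also get a space.

```php
<?php
// Hypothetical helper sketching the length-based heuristic. Only the 65..80
// window comes from the answer; the rest is an assumption.
function unwrapByLength(string $text, int $min = 65, int $max = 80): string
{
    // Split on any line-ending style so no stray CR is left on a line.
    $lines = preg_split('/\r\n|\r|\n/', $text);
    $last = count($lines) - 1;
    $out = '';
    foreach ($lines as $i => $line) {
        $out .= $line;
        if ($i === $last) {
            break; // nothing follows the final line
        }
        $next = $lines[$i + 1];
        $len = strlen($line);
        // The next line must start with a non-space and contain a word character.
        $nextIsPlainText = preg_match('/^\S/', $next) === 1
            && preg_match('/\w/', $next) === 1;
        if ($nextIsPlainText && $len >= $min && $len <= $max) {
            // Probably a server-inserted wrap: restore the space it replaced,
            // unless the line already ends in whitespace.
            $out .= preg_match('/\s$/', $line) === 1 ? '' : ' ';
        } else {
            // Short line, blank line or indented line: keep the break.
            $out .= "\n";
        }
    }
    return $out;
}

// A short heading followed by list items is left alone:
echo unwrapByLength("This is my List\r\n- Item 1\r\n- Item 2"), "\n";
```

The window is the knob to tune: widening it catches more wrapped lines but also joins more lines that were meant to stay separate.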

2012-04-06 22:04
by Aerik
If you've got a better idea, I'd love to hear it - Aerik 2012-04-07 02:53
This type of solution would be more of a last ditch effort. A solution like this would force having to continuously make tweaks. I know you could never get it 100% but still - Matt D. 2012-04-10 12:41
I don't see where anyone really nailed it, so I stand by my original answer. I believe there is no really clean solution - Aerik 2012-06-19 23:49


0

You could try using TinyMCE editor to view the e-mail message. It will format it correctly. I've used TinyMCE a few times to input data and save it to the database, and each time it correctly displayed it after I retrieved the data no matter how weird the formatting was.

2012-04-07 18:43
by Stanislav Palatnik
Doesn't TinyMCE require user interaction? This processing is all going to be automated - Matt D. 2012-04-10 12:39
The reason I presented TinyMCE is that it would do all the formatting for you, without you needing to process anything at all (well, aside from escaping HTML for basic XSS protection). I don't understand your comment, because you said someone will "view" the e-mail. Isn't that user interaction? - Stanislav Palatnik 2012-04-10 17:14
The message would also get sent to the user within an e-mail as well - Matt D. 2012-04-11 10:37


0

How about a hack like this: remove CRLF characters at any positions that are multiples of 78 (plus, say, 5 characters to account for the fact that the mail server won't just cut a line mid-word).

So you would look for CRLF characters in these positions:

  • 78 or 79 or 80 or 81 or 82 or 83 AND
  • 156 or 157 or 158 or 159 or 160 or 161 AND
  • so on.

This is of course assuming that the longest words are 5 characters in length. You should tweak this based on the emails you need to parse.
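
A literal reading of this hack could look like the sketch below. It is only an illustration under assumptions: unwrapByPosition is a hypothetical name, positions are taken to be zero-based byte offsets from the start of the message body, and the width and slack values are just the 78 and 5 suggested above.

```php
<?php
// Hypothetical helper: replace a CRLF with a space when its offset in the
// body falls in a window of $slack characters after a multiple of $width.
function unwrapByPosition(string $text, int $width = 78, int $slack = 5): string
{
    // Collect the byte offset of every CRLF in the original text.
    preg_match_all('/\r\n/', $text, $matches, PREG_OFFSET_CAPTURE);
    $out = '';
    $cursor = 0;
    foreach ($matches[0] as $hit) {
        $offset = $hit[1];
        // Copy the text between the previous break and this one.
        $out .= substr($text, $cursor, $offset - $cursor);
        // Offsets 78..83, 156..161, and so on count as server-inserted breaks.
        $isWrap = $offset >= $width && $offset % $width <= $slack;
        $out .= $isWrap ? ' ' : "\r\n";
        $cursor = $offset + 2; // skip past the two bytes of the CRLF
    }
    return $out . substr($text, $cursor);
}

echo unwrapByPosition(str_repeat('a', 78) . "\r\nbbb"), "\n";
```

Because the offsets are counted from the start of the body, every earlier line that is shorter or longer than expected shifts all later breaks, so the window would need tuning against real messages.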

2012-04-11 19:49
by ErJab


0

Here's a function that does the job pretty well:

function PlaintextEmailBrokenLineCombine($lineSet, $startIndex = 0) {
    $result = '';
    $lineCount = count($lineSet);
    for ($i = $startIndex; $i < $lineCount; $i++) {
        // Drop the CR left behind when CRLF text is split on "\n".
        $thisLine = rtrim($lineSet[$i], "\r");
        $nextLine = ($i < $lineCount - 1 ? rtrim($lineSet[$i + 1], "\r") : '');
        // First word of the next line; a line without spaces is a single word.
        $spacePos = strpos($nextLine, ' ');
        $nextLineFirstWord = ($spacePos === false ? $nextLine : substr($nextLine, 0, $spacePos));

        $lineSeparator = "\n"; // we assume until we detect invocation of the 78char rule
        // A blank next line is a paragraph break, so it is never joined.
        if ($nextLine !== '' && strlen($thisLine) + strlen($nextLineFirstWord) + 1 > 75) {
            // A line break was PROBABLY put in here where a space once was, so switch back:
            $lineSeparator = ' ';
        }
        $result .= $thisLine . ($i == $lineCount - 1 ? '' : $lineSeparator); // no separator for the last line
    }
    return $result;
}

It's a little esoteric because it expects an array of lines from the plain text email. Here's the usage:

$Parser = new MimeMailParser();
$Parser->setText($rawEmailText); 
$plaintext = $Parser->getMessageBody('text'); // or however you get it, many ways
$lineSet = explode("\n", $plaintext);
$niceText = PlaintextEmailBrokenLineCombine($lineSet);

$niceText is what you want: it's a pretty accurate way of getting the text you want with those pesky server-added line breaks gone, and replaced with the original spaces.

2012-10-18 15:24
by John Larson