Insert space after semi-colon, unless it's part of an HTML entity

I'm trying to insert a space after each semi-colon, unless the semi-colon is part of an HTML entity. The examples here are short, but my strings can be quite long, with several semi-colons (or none).

Coca&#8209;Cola =>     Coca&#8209;Cola  (&#8209; is a non-breaking hyphen)
Beverage;Food;Music => Beverage; Food; Music

I found the following regular expression that does the trick for short strings:

<?php
$a[] = 'Coca&#8209;Cola';
$a[] = 'Beverage;Food;Music';
$regexp = '/(?:&#?\w+;|[^;])+/';
foreach ($a as $str) {
    echo ltrim(preg_replace($regexp, ' $0', $str)).'<br>';
}
?>

However, if the string is somewhat large, the preg_replace above actually crashes my Apache server (The connection to the server was reset while the page was loading.) Add the following to the sample code above:

$a[] = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. '.
   'In blandit metus arcu. Fusce eu orci nulla, in interdum risus. '.
   'Maecenas ut velit turpis, eu pretium libero. Integer molestie '.
   'faucibus magna sagittis posuere. Morbi volutpat luctus turpis, '.
   'in pretium augue pellentesque quis. Cras tempor, sem suscipit '.
   'dapibus lacinia, dolor sapien ultrices est, eget laoreet nibh '.
   'ligula at massa. Cum sociis natoque penatibus et magnis dis '.
   'parturient montes, nascetur ridiculus mus. Phasellus nulla '.
   'dolor, placerat non sem. Proin tempor tempus erat, facilisis '.
   'euismod lectus pharetra vel. Etiam faucibus, lectus a '.
   'scelerisque dignissim, odio turpis commodo massa, vitae '.
   'tincidunt ante sapien non neque. Proin eleifend, lacus et '.
   'luctus pellentesque;odio felis.';

The code above (with the large string) crashes Apache but works if I run PHP on the command line.

Elsewhere in my program I use preg_replace on much larger strings without problem, so I'm guessing something in the regular expression overwhelms PHP/Apache.

So, is there a way to 'fix' the regex so it works on Apache with large strings or is there another, safer, way to do this?

I'm using PHP 5.2.17 with Apache 2.0.64 on Windows XP SP3, if it's any help. (Unfortunately, upgrading either PHP or Apache is not an option for now.)

2012-04-04 20:40
by Goozak

I would suggest this match expression:

\b(?<!&)(?<!&#)\w+;

...which matches a series of characters (letters, numbers, and underscore) which is not preceded by an ampersand (or an ampersand followed by a hash symbol) but which is followed by a semicolon.

it breaks down to mean:

\b          # assert that this is a word boundary
(?<!        # look behind and assert that you cannot match
 &          # an ampersand
)           # end lookbehind
(?<!        # look behind and assert that you cannot match
 &#         # an ampersand followed by a hash symbol
)           # end lookbehind
\w+         # match one or more word characters
;           # match a semicolon

replace with the string '$0 '
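
As a minimal sketch of how the expression and the replacement string fit together, the snippet below runs them over the two sample strings from the question. The helper name add_space_after_semicolons is hypothetical and only there for illustration; it uses a plain named function, so it should also run on old PHP versions such as 5.2.

```php
<?php
// Pattern from this answer; '#' is escaped so it stays literal even under the x modifier.
$pattern = '/\b(?<!&)(?<!&\#)\w+;/';

// Hypothetical helper name, used only for this illustration.
function add_space_after_semicolons($text, $pattern)
{
    // '$0 ' re-emits the whole match (word plus semicolon) followed by one space.
    return preg_replace($pattern, '$0 ', $text);
}

// Sample strings taken from the question.
$entity = add_space_after_semicolons('Coca&#8209;Cola', $pattern);
$list   = add_space_after_semicolons('Beverage;Food;Music', $pattern);

echo $entity, "\n"; // the entity is left untouched
echo $list, "\n";   // a space is added after each list separator
```

The entity's semicolon is skipped because the only word that directly precedes it (8209) starts right after &#, which the second lookbehind rejects.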

let me know if this doesn't work for you

Of course, you could also use [a-zA-Z0-9] instead of \w to avoid matching an underscore, but I don't think that would ever give you any trouble.

Also, you might need to escape the hash symbol as well (because it starts a comment when the pattern is used with the x modifier), like so:

\b(?<!&)(?<!&\#)\w+;

EDIT Not sure, but I'm guessing that putting the word boundary at the beginning is going to make it a bit more efficient (and thus less likely to crash your server), so I changed that in the expressions and the break-down...

EDIT 2 ... and a bit more info on why your expression might be making your server crash: Catastrophic Backtracking -- I think this applies (?) hmmm.... good info nonetheless

FINAL EDIT if you are looking to only add a space after a semicolon if there is not already whitespace after it (i.e. add one in the case of pellentesque;odio but not in the case of pellentesque; odio), then add an additional lookahead at the end, which will prevent extra unnecessary spaces being added:

\b(?<!&)(?<!&\#)\w+;(?!\s)
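
A quick sketch of the difference the lookahead makes, using the two cases mentioned above. The variable names are just for illustration.

```php
<?php
// Same expression as above, with the (?!\s) lookahead appended.
$pattern = '/\b(?<!&)(?<!&\#)\w+;(?!\s)/';

// A space already follows the semicolon, so nothing should change.
$already = preg_replace($pattern, '$0 ', 'pellentesque; odio');

// No whitespace after the semicolon, so one space is inserted.
$missing = preg_replace($pattern, '$0 ', 'pellentesque;odio');

echo $already, "\n";
echo $missing, "\n";
```

Note that a semicolon at the very end of the string still gets a trailing space, because the lookahead only rejects whitespace, not the end of the input.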

2012-04-04 21:12
by Code Jockey

This works great! Don't really mind the extra whitespace (final edit) since browsers usually don't show them, but a nice touch. Why do the solutions for my RegEx headaches always look so simple?... :- - Goozak 2012-04-05 11:47

@Goozak like lots of things, you gotta know all the quirks and capabilities of a tool before you can use it in an elegant way - someone using a hammer might be able to drive a nail in with one hit (I've seen this done, but more often a tap then hit) or they could end up with an unfinished job and a really injured thumb. They might also be able to accomplish the job just well enough without impressing anybody - it all depends on how much effort and practice you want to put into it and to a certain extent who you have to help you learn : - Code Jockey 2012-04-05 17:17

Trying to catch html entities like - (your # catch is for this, right?) – got issue with this # : http://rubular.com/r/yM0shbE9i2 should not catch last thirds, right - Joan 2015-09-17 16:13

And got issue with this anyway: neither http://rubular.com/r/ENM2mnWkqi (yours) nor http://rubular.com/r/e7C1OI2Ult (adaptation for - html entities) can catch the second ;. Normal? (Trying to add space after and before ;, unless it's HTML entities, both &\w; and &#\d; patterns… - Joan 2015-09-17 16:42

@Joan - I think you're looking for this - Code Jockey 2015-09-18 11:09

@Joan - If your question is not answered, ask a new one. I'm not sure whether you're asking about trying to catch "html entities like -" - this question was specifically about ignoring html entities (or rather the semicolons ending them). I'll try to periodically check this chat room if you'd prefer ther - Code Jockey 2015-09-18 11:24

You could use a negative look-behind:

preg_replace('/(?<=[^\d]);([^\s])/', '; \1', $text)

Not tested since I've got no computer at hand, but this or a slight variation of it should work.
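
For anyone who wants to try it, here is a small sketch that runs the expression above against the question's two samples plus two extra inputs. The extra inputs (a named entity and a list of numbers) are my own assumptions and do not come from the question.

```php
<?php
// Expression and replacement exactly as given in this answer.
$pattern     = '/(?<=[^\d]);([^\s])/';
$replacement = '; \1';

// Samples from the question.
$list    = preg_replace($pattern, $replacement, 'Beverage;Food;Music');
$numeric = preg_replace($pattern, $replacement, 'Coca&#8209;Cola');

// Extra inputs, assumed here only to show where the expression stops working.
$named   = preg_replace($pattern, $replacement, 'a&nbsp;b');  // named entity
$numbers = preg_replace($pattern, $replacement, '34;45;-23'); // digits before ';'

echo $list, "\n";
echo $numeric, "\n";
echo $named, "\n";
echo $numbers, "\n";
```

Because the lookbehind only checks for a non-digit before the semicolon, numeric entities are left alone, but a named entity gets a space inserted after it and semicolons that follow a digit are never spaced.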

2012-04-04 20:56
by ckruse

\D and \S are shorthands for [^\d] and [^\s] respectively - Joey 2012-04-04 21:03

I always forget about them. Thanks : - ckruse 2012-04-04 21:05

Catch me if I'm wrong here, but isn't that a positive lookbehind? :-D -- it looks like you're trying to match a semicolon followed by something other than whitespace as long as there is a character that is not a digit before that semicolon ...and I don't think it quite works - First, you're not allowing for numbered entities like the &#8209; used in the question; Second, it would also change ;; into ; ; or ;-) into ; -), thus breaking up all the cute little smilies people like to use :- - Code Jockey 2012-04-04 21:45

Actually, it works for numbered entities, but not for the named ones ( ), which wasn't mentioned in question. And it leaves number lists (34;45;-23) space-less (not part of original question either) - Goozak 2012-04-05 12:26

With a problem like this a callback might help.

(&(?:[A-Za-z_:][\w:.-]*|\#(?:[0-9]+|x[0-9a-fA-F]+)))?;

Expanded

(          # Capture buffer 1
   &                              # Ampersand '&'
   (?: [A-Za-z_:][\w:.-]*         # normal words
     | \#                         # OR, code '#'
       (?: [0-9]+                       # decimal
         | x[0-9a-fA-F]+                # OR, hex 'x'
       )
   )
)?         # End capture buffer 1, optional
;          # Semicolon ';'

Testcase http://ideone.com/xYrpg

<?php

$line = '
  Coca&#8209;Cola
  Beverage;Food;Music
';

$line = preg_replace_callback(
        '/(&(?:[A-Za-z_:][\w:.-]*|\#(?:[0-9]+|x[0-9a-fA-F]+)))?;/',
        create_function(
            '$matches',
            'if ($matches[1])
               return $matches[0];
             return $matches[0]." ";'
        ),
        $line
    );
echo $line;
?>
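
A variant of the same idea, sketched with a plain named function instead of create_function (which was deprecated in PHP 7.2 and removed in PHP 8.0). When the optional group does not take part in the match, PHP leaves $matches[1] unset, so the sketch checks it with isset() before reading it. The callback name space_unless_entity is hypothetical.

```php
<?php
// Hypothetical callback name for this sketch.
function space_unless_entity($matches)
{
    // Group 1 is only present when an entity prefix was matched before the ';'.
    if (isset($matches[1]) && $matches[1] !== '') {
        return $matches[0];       // entity: leave it exactly as it was
    }
    return $matches[0] . ' ';     // plain semicolon: append one space
}

// Same expression as in the test case above.
$pattern = '/(&(?:[A-Za-z_:][\w:.-]*|\#(?:[0-9]+|x[0-9a-fA-F]+)))?;/';

$entity = preg_replace_callback($pattern, 'space_unless_entity', 'Coca&#8209;Cola');
$list   = preg_replace_callback($pattern, 'space_unless_entity', 'Beverage;Food;Music');

echo $entity, "\n";
echo $list, "\n";
```

Passing the callback as a string name keeps the sketch usable on PHP versions that predate closures.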

2012-04-04 22:46
by sln

Seems a bit overkill, and it generates some PHP Notices about Undefined offset. I'm a stickler for notices and warnings... :- - Goozak 2012-04-05 11:52

@Goozak - It's actually underkill for xml, leaves out PE refs, and excludes many many U-chars in Name. Html only should be (&(?:[A-Za-z][\w:.-]*|\#(?:[0-9]+|[xX][0-9a-fA-F]+)))?; Don't know about your PHP warnings, ideone showed no problems. Finally, what makes you think your accepted answer is correct? \w doesn't include all the valid characters an entity can have, let alone positions. Most important, you can't break up the entity expression and put half in what you don't want, and half in what you do! \b(?<!&)(?<!&#)\w+; won't match 'this';. Or your requirements aren't real world - sln 2012-04-05 18:08

And if you want to bind (ignore) the ; to a preceding character, inject a positive lookbehind before it: (&(?:[A-Za-z][\w:.-]*|\#(?:[0-9]+|[xX][0-9a-fA-F]+)))?(?<=\S); and possibly a negative lookahead after it as well: (&(?:[A-Za-z][\w:.-]*|\#(?:[0-9]+|[xX][0-9a-fA-F]+)))?(?<=\S);(?!\s) - sln 2012-04-05 18:33