How can I remove new lines only inside the HTML tags - 【StackMirror】|php|preg-replace

How can I remove new lines only inside the HTML tags with preg_replace?

Example:

<table>

<tr>

<td></td>
</tr>
</table>

Text here. Text here

Text here.

So after the function processes the above code, the result should be:

<table>    <tr>    <td></td>    </tr>    </table>

Text here. Text here

Text here.

2012-04-05 22:18
by BetterMan21

Why do you want to do this? - Felix Kling 2012-04-05 22:21

remove new lines only inside the HTML tags: That's really innovative, why do you want to do this - anubhava 2012-04-05 22:21

You can't. Sorry. http://stackoverflow.com/a/1732454/122083 - Basti 2012-04-05 22:30

This is not actually parsing entire page, just small HTML code - BetterMan21 2012-04-05 22:34

Only the following tags are available: table, ol, li, ul, a, blockquote, tr, td, tbod - BetterMan21 2012-04-05 22:35

How can I remove new lines only inside the HTML tags with preg_replace?

Technically yes, but HTML does not care much about newlines: any run of whitespace characters is treated as a single space. As your example shows, you replace \n with spaces or \t anyway, so the result is equivalent, which means you can just do the following:

$html = preg_replace('~(*BSR_ANYCRLF)(>[^<]*)\R([^<]*<)~', '$1 $2', $html);

See also: "php regex to match outside of html tags" and "How to replace different newline styles in PHP the smartest way?"
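A single preg_replace pass only removes one line break per run of text between a ">" and the next "<", so a run with several line breaks needs either a loop or a callback. Here is a minimal sketch of the callback variant. The function name collapseNewlinesBetweenTags is made up for this example, and the code assumes that literal "<" and ">" only occur as tag delimiters:

```php
<?php
// Hypothetical helper, not part of PHP: collapses every run of line breaks
// found between a ">" and the next "<" into a single space.
function collapseNewlinesBetweenTags(string $html): string
{
    return preg_replace_callback(
        '~>[^<>]*<~',                      // text sitting between two tags
        static function (array $m): string {
            // \R matches \n, \r\n and \r, so mixed newline styles are covered
            return preg_replace('~\R+~', ' ', $m[0]);
        },
        $html
    );
}

$html = "<table>\n\n<tr>\n\n<td></td>\n</tr>\n</table>\n\nText here. Text here\n\nText here.";
echo collapseNewlinesBetweenTags($html);
```

Text before the first tag and after the last tag is left alone, because it is not enclosed by ">" and "<". Text that sits between two separate top-level elements is still collapsed, which is a limitation of any approach that only looks at the surrounding angle brackets.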

A safer way is to use an HTML parser such as DOMDocument and load your fragment as the body. Then replace all newlines within the text nodes that are child nodes of the body's child nodes.
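A rough sketch of the parser idea, under a few assumptions: the fragment is wrapped in a div with the made-up id "nl-root" instead of relying on the implied body, text nodes at any depth below a top-level element are processed (not only direct children), and the input is plain ASCII or otherwise in an encoding loadHTML handles. The function name is hypothetical as well. How whitespace-only text between tags is kept or dropped depends on the libxml version, so the exact output can differ between systems.

```php
<?php
// Hypothetical helper: replaces line breaks only in text nodes that live
// inside some element of the fragment, leaving top-level text untouched.
function collapseNewlinesInElements(string $fragment): string
{
    $doc = new DOMDocument();

    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The wrapper div gives top-level text and elements one known parent.
    $doc->loadHTML('<div id="nl-root">' . $fragment . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Text nodes nested in any element below the wrapper; text that sits
    // directly in the wrapper (outside all tags) is not selected.
    foreach ($xpath->query('//div[@id="nl-root"]/*//text()') as $text) {
        $text->nodeValue = preg_replace('~\R+~', ' ', $text->nodeValue);
    }

    // Serialize the wrapper's children, dropping the wrapper itself.
    $out = '';
    foreach ($xpath->query('//div[@id="nl-root"]')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Called with "<ul>\n<li>one\ntwo</li>\n</ul>\nText\nhere", the line break inside the li element becomes a space, while the line break in the trailing top-level text stays as it is.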

2012-04-05 22:31
by hakre

I get the following error:

Warning: preg_replace() [function.preg-replace]: Compilation failed: (*VERB) not recognized at offset 13 i - BetterMan21 2012-04-05 22:51

@user1316394: Update your PCRE library or remove that verb (*BSR_ANYCRLF), you can add the u modifier instead in case the string is utf-8 encoded - hakre 2012-04-05 23:06

how I can do this with : http://simplehtmldom.sourceforge.ne - BetterMan21 2012-04-05 23:37

@user1316394: No idea, I don't use simplehtmldom for various reasons and I wouldn't suggest you use it either. Take a look at DOMDocument and/or SimpleXML, which are part of PHP - hakre 2012-04-06 06:31

There might be smarter ways to do this, but this will do the job.

$str = "test\n\n test2 <table>\n\n\n test 3</table>\n\n\n test4 test5";

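// One pass removes at most one newline per text run between ">" and "<",
// so repeat until the string stops changing (or preg_replace fails with null).
while (($str2 = preg_replace('/(>[^<]*)\n([^<]*<)/', '$1$2', $str)) !== null) {
    if ($str2 === $str) break;
    $str = $str2;
}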

echo ($str);

It looks for newlines in between the > char and the < char, and removes them.

2012-04-05 22:35
by Alfred Godoy

Hi, what if there are actual characters like > and < in the page? - BetterMan21 2012-04-05 22:48

I have tested the code and it also replaces text between outer HTML elements:

efewf

Test test test test test. Test test test test test.

Test test test test test.

efewf

- BetterMan21 2012-04-05 23:03

code between the first and the second - BetterMan21 2012-04-05 23:04

So you only want to remove new lines if there are no child elements inside the element? - Alfred Godoy 2012-04-05 23:07

@user1316394: That's because that text is between the first and second <ul> tag - which you have asked for - hakre 2012-04-05 23:08

@user1316394: If you want to have control over how deep down in the xml you are, you should use something like DOMDocument. Just using regexp, that would be a (close to) impossible task - Alfred Godoy 2012-04-05 23:16

If there are actual characters like > and < in the page that are not part of an element, your HTML is very bad and your web browser might go nuts. < should be encoded as &lt; and > as &gt; - Alfred Godoy 2012-04-05 23:23
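For reference, PHP can do that encoding for you before the text is placed into the markup. A small sketch using htmlspecialchars; the sample string is made up:

```php
<?php
// Made-up sample text containing characters that are special in HTML.
$text = 'if (a < b && b > c)';

// Encode <, >, & and quotes so they cannot be mistaken for markup.
$safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

echo $safe; // if (a &lt; b &amp;&amp; b &gt; c)
```

Once the text is encoded like this, a regex that keys on ">" and "<" only ever sees real tag delimiters.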