Need some HTML regex help (and yes, I know it's not supposed to be done)

2

So as a preface, I need this regex to move past a bug I'm waiting on server people to fix.

Basically I get JSON back with unescaped " characters in HTML. I need a regex that looks between the <> characters and replaces each " with \".

<div style="padding: 0%; width: 100%;"><span style="font-family:verdana;"><span style="font-size: 72px;">Demo!</span></span></div>

UPDATED INFO: The HTML is inside some JSON that is sent back as a string, which I eventually need to parse into a regular JSON object, and the parsing fails.

The string looks something like this:

"{
"overlay": "overlay1",
"type": "text", 
"text": "<div style="padding: 0%; width: 100%;"><span style="font-family:verdana;"><span style="font-size: 72px;">Demo!</span></span></div>"
}"

This is the regex I have found so far (I know some regex stuff, just not a lot about lookahead or lookbehind):

/(?<=\<)(.*?)(?=\>)/g

But using that only gets me to retrieving this:

<div style="padding: 0%; width: 100%;"><span style="font-family:verdana;"><span style="font-size: 72px;">Demo!</span></span></div>

(Basically just everything inside the <> characters, when I really only want to target the " inside the <>.)

Can anyone recommend a quick temporary fix? Thanks!

2012-04-04 19:18
by Jonathan Romanowski
What language or regex dialect are you using? Is PCRE an option, or does this need to happen in ERE - ghoti 2012-04-04 19:50


3

The following should work (as a temporary fix):

/"(?=[^<]*>)/g

This will match all double quotes where there are no < characters before the next >.
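
As a rough illustration of how that pattern behaves on the payload from the question, here is a minimal sketch. The helper name `escapeQuotesInTags` and the shortened sample payload are made up for this example, and it assumes a plain JavaScript environment with `JSON.parse` available.

```javascript
// Malformed payload modeled on the question: the quotes inside the HTML
// tags are not escaped, so JSON.parse rejects the raw string.
const raw = [
  '{',
  '"overlay": "overlay1",',
  '"type": "text",',
  '"text": "<div style="padding: 0%;"><span style="font-size: 72px;">Demo!</span></div>"',
  '}'
].join('\n');

// Hypothetical helper: escape only the quotes that are followed by a ">"
// with no "<" in between, i.e. quotes that sit inside a tag.
function escapeQuotesInTags(text) {
  return text.replace(/"(?=[^<]*>)/g, '\\"');
}

const fixed = escapeQuotesInTags(raw);
const parsed = JSON.parse(fixed);
console.log(parsed.text);
```

One limitation worth keeping in mind: a JSON value that contains a bare > outside of any tag (for example "a > b") would also have its opening quote escaped, so this is only suitable as a stopgap.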

2012-04-04 19:27
by Andrew Clark
This works! At least for the testing I am doing. Thank you - Jonathan Romanowski 2012-04-04 20:06
@JonathanRomanowski No problem, if my answer worked you can accept it by clicking the outline of the check mark next to the answer - Andrew Clark 2012-04-04 20:15


0

Try replacing the (.*?) in the center part with a ([^<>]*?)
You have to be careful with the dot operator.
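
To show where the dot and the negated character class differ, here is a small sketch. The sample string is invented for illustration, and the lookbehind syntax assumes a JavaScript engine with ES2018 regex support.

```javascript
// Invented sample: a stray "<" appears in the text before a real tag.
const sample = 'x < y <b class="k">';

// Lazy dot: starts right after the stray "<" and runs across the next "<".
const withDot = sample.match(/(?<=<)(.*?)(?=>)/g);

// Negated class: cannot cross "<" or ">", so it only matches the tag interior.
const withClass = sample.match(/(?<=<)([^<>]*?)(?=>)/g);

console.log(withDot);   // [ ' y <b class="k"' ]
console.log(withClass); // [ 'b class="k"' ]
```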

2012-04-04 19:25
by user1308985


0

Try this:

string = string.replace(/\"/g, "\\\"");

//--- EDIT---

// Template literal, so the raw unescaped payload can be written as received
var someString = `{
"overlay": "overlay1",
"type": "text",
"text": "<div style="padding: 0%; width: 100%;"><span style="font-family:verdana;"><span style="font-size: 72px;">Demo!</span></span></div>"
}`;

someString = someString.replace(/"(?=[^<]*>)/g, "\\\"");  //Props @F.J for this RegEx
Obj = $.parseJSON(someString);
console.log(Obj.text);
2012-04-04 19:30
by Relic
Agh, sorry. I guess my question is a little different. The html is actually inside some JSON. I'll have to update the question - Jonathan Romanowski 2012-04-04 19:33
The idea is the same... consider the edit - Relic 2012-04-04 19:36
I agree the idea is the same, but if I do a global replace for all the " it'll escape much more than is needed. I updated the question, sorry for the confusion earlier - Jonathan Romanowski 2012-04-04 19:38
Then you focus it on the 'text' of the object like I did... not only that but your JSON object isn't in proper notation. - Relic 2012-04-04 19:45
Ohh you're waiting till after the response, the JSON is bad... I GET IT. Gimme a sec, I whip ya something up.... unless.... dude, can you console.log(response) so I know what I'm working with... there's no way you get a string that [with the quotes as they are] and being able to do anything about it... it's simply a bad string, which will make your JS fail every time - Relic 2012-04-04 19:46
I'm serious, that is exactly the type of string I get back, and obviously we know it isn't supposed to be like that, which is why this is an interim solution for testing. I'll try the parseJSO - Jonathan Romanowski 2012-04-04 19:53
parseJSON won't do it... you have to have a correctly formatted string, and that's the issue we're dealing with.... I say, in the 'interim' make your own properly formatted string and hard code it for testing. Then wait for your backend guys to get you the JSON service response you need in the correct format when you get there, no sense wasting time on this as there isn't a real good solution except messier RegEx - Relic 2012-04-06 18:31


0

If the JSON is otherwise well-formed and you don't have any attribute-type syntax outside of tags, the following should work for a one-line fix:

var str = '<div style="padding: 0%; width: 100%;"><span style="font-family:verdana;"><span style="font-size: 72px;">Demo!</span></span></div>'
str.replace(/([\w-]+)=\"(.*?)\"/g, '$1=\\\"$2\\\"')
>> "<div style=\"padding: 0%; width: 100%;\"><span style=\"font-family:verdana;\"><span style=\"font-size: 72px;\">Demo!</span></span></div>"

That adds backslashes before the quotes of all HTML attribute values wherever they appear (not necessarily inside of tags). If you need to target it better, do a first search to isolate the tags, then loop through each tag doing this regex replacement.
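
The two-step variant (isolate tags first, then escape attribute quotes inside each one) could be sketched roughly as follows. The function name `escapeAttributeQuotes` and the sample input are hypothetical, and the tag pattern assumes that no attribute value contains a literal < or >.

```javascript
// Hypothetical helper: restrict the attribute replacement to tag interiors.
function escapeAttributeQuotes(text) {
  // Step 1: isolate each tag (from "<" to the next ">").
  return text.replace(/<[^<>]*>/g, function (tag) {
    // Step 2: escape the quotes around attribute values inside that tag only.
    return tag.replace(/([\w-]+)="(.*?)"/g, '$1=\\"$2\\"');
  });
}

// Quotes outside of tags ("hi") are left untouched.
const input = 'He said "hi" <span style="color: red;" title="x">ok</span>';
const output = escapeAttributeQuotes(input);
console.log(output);
```

Using a callback with `replace` avoids a manual loop over the matches while still keeping the attribute regex away from text outside the tags.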

By the way, why do you say "I know it's not supposed to be done"? This is a perfect use for regular expressions!

2012-04-04 19:41
by Richard Connamacher
Actually RegEx is NOT meant to parse HTML... so he's very right when he says it's not supposed to be done. It should be turned into a DOM element and parsed with a document reader actually - Relic 2012-04-04 19:52
It is a good use for regular expressions, but in general regex is not supposed to be used to modify or parse HTML, which is kind of what it is doing here (it's a string in this context, but it is supposed to be HTML) - Jonathan Romanowski 2012-04-04 19:57
He's (You're, Jonathan) using regular expressions to fix a JSON syntax error. Whether it'll be eventually rendered as HTML or not, at this moment it's nothing more than malformed JSON text. So it's perfectly appropriate to use regular expressions to work around it (at least until the source bug is fixed) - Richard Connamacher 2012-04-04 21:01


-1

If this is perl, you could do something like:

$string =~ s/(?<!\\)"/\\"/g;
2012-04-04 19:30
by Glen Solsberry
perl is not a client side language, and the regex parser is a little different - Relic 2012-04-04 19:31
Nowhere does the question say anything about a client side language. Just that he's getting JSON - Glen Solsberry 2012-04-04 19:32
JSON = Javascript Object Notation... now I could be mistaken, but I believe Javascript is client side... (other than node, which this clearly isn't) And he talks about waiting on server side guys for a fix, so this isn't a server side language the OP is dealing with - Relic 2012-04-04 19:35
perl supports json. As do many other server-based languages. He could also be talking to an API on another server. Don't immediately assume that just because he's talking about HTML and JSON that he has to be using Javascript, or even a browser - Glen Solsberry 2012-04-04 19:37
I used the entire context of the question to draw my conclusion, but thanks for the lecture on those other languages, I've never even heard of an API... you mean you can use AJAX with things other than XML? huh... - Relic 2012-04-04 19:38