Remove protocol and result with full base_url from html input string

I was looking to extract the base_url (the host name only, without the protocol, port, or path) from input supplied via an HTML input type=text field and read from $_POST. The input is most likely to contain the full URI, quite possibly with a port assignment followed by a few more path segments.

example: https://lab1.sfo1.transparentpixel.com:554/rtmp/_definst_

I needed up to 3 such results, and those values end up in an array.

So, to test things in a standalone script, I ended up with the following code:

OLD FOR HISTORICAL REVIEW:

<?php
$var1 = "https://lab1.sfo1.transparentpixel.com:1935/rtsp/_definst_";
$var2 = "http://lab1.sfo1.transparentpixel.com:1935/rtmp/_definst_";
$var3 = "lab1.sfo1.transparentpixel.com";

$count = 1;
while ( $count <= 3 )
{
$test[] = 'var'.$count.' = ' . preg_replace(array("#^.*/([^\:]+)\:.*#"), '$1', ${var.$count});
$count++;
}

var_dump($test);
?>

CORRECTED AFTER EDIT:

<?php
$url1 = "https://lab1.sfo1.transparentpixel.com:1935/rtsp/_definst_";
$url2 = "http://lab1.sfo1.transparentpixel.com:1935/rtmp/_definst_";
$url3 = "lab1.sfo1.transparentpixel.com";

$test = array();
$count = 1;
while ( $count <= 3 )
{
    // Variable variable: the name must be built from a quoted string ('url1', 'url2', ...).
    $test[] = 'url'.$count.' = ' . preg_replace('#^.*/([^\:]+)\:.*#', '$1', ${'url'.$count});
    $count++;
}

print_r($test);
?>

My result:

$ php tpixel_url_replace.php 
Array
(
    [0] => url1 = lab1.sfo1.transparentpixel.com
    [1] => url2 = lab1.sfo1.transparentpixel.com
    [2] => url3 = lab1.sfo1.transparentpixel.com
)

While this works as I intended, I'm certainly missing some iterations. Anyone care to elucidate things I may be overlooking? Yes, I know I could have used str_replace but the cost of running preg_ over str_ is minimal in the overall scheme of things.

I'm simply looking for insight as I'm 100% sure I'm not a master of anything regarding reg-ex nor preg_replace.

Input?

2012-04-03 20:02
by msmithng

Are those three urls you've given the only possible types of url? For example, could you also have http://someurl.com or someurl.com:1935/rtmp/_definst_? - Robbie 2012-04-03 20:52

You say that this code "works as I intended", but when I run it, it doesn't work because you've put ${var.$count}, which is wrong (I think). Also, I'm not sure what your question is. Are you trying to loop through a list of urls while adding just the host part into a new array? - Robbie 2012-04-03 20:57

I've corrected the code above Robbie. Thanks! Using a variable variable in this case is correct, but I had copied code in which I had declared the variable as "var" which php apparently doesn't like. :| TIL don't use $var - msmithng 2012-04-05 16:14

... and the list of urls is basically input from an end-user but most likely copy pasta from our dashboard, so I can anticipate that the format will be similar to that which I'm using in the example. But yes, I only want the base_url sans protocol - msmithng 2012-04-05 16:34

I hope I understand your question correctly. Are you having trouble with the regex or the code for looping over the urls? Or both?

I'm going to assume both...

Instead of matching the whole thing and grouping the bit you want to extract, I'd suggest you match just what you want to extract. With that in mind, the regex could look like this:

[^/]+\.[^/:]{2,3}

In English this says:

Match one or more characters that are not a forward slash, up to a dot, then match 2 to 3 more characters that are neither a forward slash nor a colon. Because the first part is greedy, the dot it settles on is the last one in the host name, so the whole host is matched rather than just its first label.

This seems simple, but I think it gets you what you need.

Here is a bit of PHP code that creates an array of urls in various formats, then loops through each one and extracts just the bit I think you want. I've switched to using preg_match instead of preg_replace because I think it makes more sense in this case:

<?php
    $urls = array(
                "https://lab1.sfo1.transparentpixel.co.jp:1935/rtsp/_definst_",
                "http://lab1.sfo1.transparentpixel.com:1935/rtmp/_definst_",
                "http://lab1.sfo1.transparentpixel.com/rtmp/_definst_",
                "lab1.sfo1.transparentpixel.com",
                "someurl.com:1935/rtmp/_definst_",
                "someurl.com/_definst_",
                "http://someurl.co.uk");

    foreach($urls as $url)
    {
        // preg_match returns 1 on a match; $matches[0] then holds the matched host.
        if (preg_match('%[^/]+\.[^/:]{2,3}%m', $url, $matches))
        {
            echo $matches[0], "\n"; // instead of this you could do $test[] = $matches[0];
        }
    }
?>

You'll notice that I'm looping over the array using a foreach loop which means we are not limited to a fixed number of iterations as in your example.

The output of this is:

lab1.sfo1.transparentpixel.co.jp
lab1.sfo1.transparentpixel.com
lab1.sfo1.transparentpixel.com
lab1.sfo1.transparentpixel.com
someurl.com
someurl.com
someurl.co.uk
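
The sample list above only contains inputs the pattern can match. A minimal sketch of what happens with other input, assuming a hypothetical helper named extract_host (the name and the two extra test inputs are illustrative, not from the original answer):

```php
<?php
// Hypothetical helper; the name is an assumption made for this sketch.
// Returns the host-like part of $url, or null when the pattern finds nothing.
function extract_host($url)
{
    // Same pattern and modifier as in the answer above.
    if (preg_match('%[^/]+\.[^/:]{2,3}%m', $url, $matches) === 1) {
        return $matches[0];
    }
    return null;
}

$inputs = array(
    "http://lab1.sfo1.transparentpixel.com:1935/rtmp/_definst_",
    "localhost:1935/rtmp",   // no dot anywhere, so nothing matches
    "http://example.info/a", // the {2,3} limit cuts a four-letter TLD short
);

foreach ($inputs as $input) {
    var_dump(extract_host($input));
}
```

The second input yields NULL and the third yields example.inf, so it may be worth checking the return value of preg_match before reading $matches[0], and deciding whether longer top-level domains can show up in your input.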

2012-04-03 23:03
by Robbie

Robbie, the loop itself is working as I expect (see my edit above). The regex was my concern. Thanks for your reply! With respect to the iterations, I'm only setting up for up to 3 inputs for this particular parameter. So having additional isn't necessary, but I see your point in that regard. Matching the piece I want to extract may work better. I'll give that a shot. Thanks for the extra eyes on the regex! I presume this will work in preg_replace just the same? Is there a difference programmatically? - msmithng 2012-04-05 16:28

@msmithng yeah, there is a difference in how you use it because it only matches the bit you want, so you would use it to extract that bit from the input into a new variable. However, I'm still not sure I get your question, as your code seems to already do what you seem to want. What problem do you have that needs fixing, or are you just asking for comments on the approach you have taken? - Robbie 2012-04-05 18:54

@msmithng sorry, I should also have mentioned that the difference is that if you use my regex with your code, it will return exactly the opposite of what you want (e.g. https://:1935/rtsp/_definst_ for the first url). To be honest, the regex change I suggested was only because I think using preg_match in the code is more readable (in my opinion). If yours works and the method makes sense to you, then go with it. One question about your original post... what did you mean by "I'm certainly missing some iterations"? - Robbie 2012-04-05 19:24
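
To make that difference concrete, here is a small sketch that runs the host-only pattern through both functions on the first url from the question. The variable names are illustrative only:

```php
<?php
$url = "https://lab1.sfo1.transparentpixel.com:1935/rtsp/_definst_";

// preg_replace with the host-only pattern: the host is the part that gets
// replaced. '$1' refers to a capture group the pattern does not have, so it
// expands to an empty string and everything except the host is left behind.
$replaced = preg_replace('%[^/]+\.[^/:]{2,3}%m', '$1', $url);

// preg_match with the same pattern: the host ends up in $matches[0].
preg_match('%[^/]+\.[^/:]{2,3}%m', $url, $matches);
$matched = $matches[0];

echo $replaced, "\n"; // https://:1935/rtsp/_definst_
echo $matched, "\n";  // lab1.sfo1.transparentpixel.com
```

So the host-only pattern belongs with preg_match, while the capture-group pattern from the question belongs with preg_replace.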

Thanks for the additional insight. Very helpful. WRT my notion of missing iterations; I felt, as I went o.O at the regex... that I was going to miss some odd copy pasta from our dashboard (users always do what they do best). I guess I should have phrased my question as 'will I miss something given the expected input'. Also, you're correct in that I was looking for comments to my approach - msmithng 2012-04-06 15:53
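
On the question of whether the original pattern will miss something given the expected input, a minimal sketch that runs the pattern from the question against two of the variants raised earlier in the thread (a url without a port, and one without a protocol):

```php
<?php
// The pattern from the question. It needs a "/" before the host and a ":"
// after it; without both, nothing matches.
$pattern = '#^.*/([^\:]+)\:.*#';

$inputs = array(
    "https://lab1.sfo1.transparentpixel.com:1935/rtsp/_definst_", // protocol and port
    "http://lab1.sfo1.transparentpixel.com/rtmp/_definst_",       // protocol, no port
    "someurl.com:1935/rtmp/_definst_",                            // port, no protocol
);

$results = array();
foreach ($inputs as $input) {
    // When the pattern does not match, preg_replace returns the input unchanged.
    $results[] = preg_replace($pattern, '$1', $input);
}

print_r($results);
```

Only the first input is reduced to lab1.sfo1.transparentpixel.com. The other two come back unchanged, path and all, so input pasted without a port or without a protocol would slip through the original pattern.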