We have clients who build their site on a UserDir URL before their real domain goes live. The UserDir URL is always in the format:
Sometimes Google crawls these UserDir URLs, and the temporary site keeps showing up in search results even after the site is live on http://johndoe.com.
So, once a client is live on http://johndoe.com, how can I prevent Google from crawling the UserDir address?
(Of course, I still need Google to crawl the real domain, because SEO is important to our clients.)
I use the canonical tag for this purpose. Put the canonical tag in the head of the index.html file, like this:
<link rel="canonical" href="http://johndoe.com/" />
Then, when Googlebot finds the page at http://1.2.3.4/~johndoe, it will know that it is a duplicate of http://johndoe.com/, and Google will normally index the canonical URL instead. Googlebot will see the same tag when it crawls the real site, and the self-referential canonical there is not a problem.
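If you want to check that the tag really is present on a page, something like the following could work. It is only a rough sketch using Python's standard-library HTML parser; the names `CanonicalFinder`, `find_canonicals` and `sample_page` are made up for this example, and the sample HTML stands in for whatever index.html the client actually has.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, in any letter case.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr_map.get("href"):
            self.canonicals.append(attr_map["href"])


def find_canonicals(html_text):
    """Return the canonical URLs declared in html_text (hypothetical helper)."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    finder.close()
    return finder.canonicals


# Stand-in for the client's index.html (an assumption for this example).
sample_page = """<html><head>
<title>John Doe</title>
<link rel="canonical" href="http://johndoe.com/" />
</head><body>Hello</body></html>"""

print(find_canonicals(sample_page))  # ['http://johndoe.com/']
```

In practice you would fetch the HTML from both the UserDir address and the real domain and run each through `find_canonicals`; both should report http://johndoe.com/ and nothing else.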