BLOCK DYNAMIC URLS FROM GOOGLEBOT USING YOUR ROBOTS.TXT FILE
I was trying to figure out how to block some dynamic URLs from Googlebot. (Yahoo! Slurp and MSNBot accept the same or very similar syntax for blocking dynamic URLs.) As an example, I have one line in my htaccess file that lets me serve static pages instead of dynamic ones, but I found that Googlebot sometimes still crawls the dynamic versions. This leads to duplicate content, which none of the major search engines condone.
CURRENT RANKINGS
I’m trying to clean up my personals site, which currently ranks well on Yahoo but not on Google. I believe MSN Live uses algorithms similar to Google’s, but that is by no means scientifically proven; I’m just going from my own SEO experience and my clients’ websites. I think I’ve found some answers about ranking well on Google, MSN, and possibly Yahoo, and I’m in the middle of testing right now. I’ve already managed to get a client’s website to rank well on Google for relevant keywords. Anyway, here’s how to block Googlebot from your dynamic pages using your robots.txt file. But first, the htaccess rule I mentioned above:
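The following is a minimal sketch of that kind of rule, assuming Apache with mod_rewrite enabled; the exact regex is an illustration based on the URLs discussed in this post, so adjust it to your own pattern:

RewriteEngine On
# Serve static-looking URLs like /personals-dating-4525.html
# from the dynamic script /index.php?page=view_profile&id=4525
RewriteRule ^personals-dating-([0-9]+)\.html$ /index.php?page=view_profile&id=$1 [L]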
In case you’re wondering, that rule lets me serve static pages like personals-dating-4525.html in place of the dynamic link index.php?page=view_profile&id=4525. However, it has led to a problem: Googlebot can still reach the dynamic URLs, which has “burdened” me with duplicate content. Duplicate content is frowned upon; it creates extra work for Googlebot, which now has to crawl additional pages, and the algorithm can treat it as spam. The moral: duplicate content should be avoided at all costs.
What follows is an excerpt of my robots.txt file:
User-agent: Googlebot
Disallow: /index.php?page=view_profile&id=*
Note the “*” (asterisk) at the end of the second line. It tells Googlebot to match any sequence of characters in place of the asterisk, so the rule covers index.php?page=view_profile&id=4525 and any other id value. In other words, these dynamic pages will not be crawled or indexed. You can verify that the rules in your robots.txt file are working correctly by logging in to Google Webmaster Tools. If you don’t have a Google account, just create one; a Gmail, AdWords, or AdSense account gives you access to Webmaster Tools. If you want higher rankings, you should have one anyway; it’s free and easy to set up. Once logged in, click the Diagnostics tab, then click the robots.txt analysis tool link under the Tools section in the left column.
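Since Yahoo! Slurp and MSNBot accept the same or very similar syntax, you can repeat the rule for those bots too. This is a sketch; double-check that each engine honors the “*” wildcard, since wildcard support in Disallow lines is an extension, not part of the original robots.txt standard:

User-agent: Slurp
Disallow: /index.php?page=view_profile&id=*

User-agent: msnbot
Disallow: /index.php?page=view_profile&id=*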
BTW, your robots.txt file should be in your webroot folder, so that it’s reachable at yourdomain.com/robots.txt. Googlebot fetches it about once a day, and the copy shown in the “robots.txt analyzer” section of your Google Webmaster control panel is updated accordingly.
To test your robots.txt file and validate that your rules are working correctly with Googlebot, simply enter the URL that you want to test in the “Test URLs against this robots.txt file” field. I added a URL matching the blocked pattern to this field.
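Something like this, where www.yoursite.com stands in for your own domain:

http://www.yoursite.com/index.php?page=view_profile&id=4525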
Then I clicked the “Verify” button at the bottom of the page, and the analyzer reported that Googlebot would block this URL under the given rules. I believe this is a better way to keep pages out of Google than the URL Removal tool you may be using, which sits in the left column of the Google Webmaster control panel; I’ve read threads on Google Groups where people ran into problems with it.