  • Switching robots.txt between HTTP and HTTPS mode

    Published November 28th, 2008

    Problem Statement

    Google penalizes Web sites that have duplicate content. If your Web site is reachable over both HTTP and HTTPS (SSL) and both serve the same content, you have a good chance of being downgraded for duplication, says an SEO expert that one of our customers loves to swear by. So in addition to making sure that requests for https://server/pages are automatically redirected with an HTTP 301 (Moved Permanently), we thought serving different robots.txt contents over HTTP and HTTPS would also help. Here is how we switched robots.txt between HTTP and HTTPS access.

    Step 1: Create a mod_rewrite rule for /robots.txt

    Even though we dislike using mod_rewrite due to its performance overhead, we use it from time to time on customer projects where performance is not too critical. We added the following rule to the virtual host configuration file for the server; it can also be added to a .htaccess file:

    RewriteEngine on
    
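    # Rule: When robots.txt is requested in HTTPS (port 443) mode, send robots_ssl.txt instead.
    # The optional leading slash lets the pattern match in both virtual host and .htaccess context.
    RewriteCond %{SERVER_PORT} =443
    RewriteRule ^/?robots\.txt$ /robots_ssl.txt [L]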
    

    Step 2: Create an HTTPS version of robots.txt

    The HTTPS version of robots.txt is a separate file called robots_ssl.txt, which should have the following contents:

    User-agent: Googlebot
    Disallow: /
    

    This file tells Google’s Web crawler not to crawl anything on the site in HTTPS mode. Of course, if you wish to advise all crawlers of the same thing, then change it to:

    User-agent: *
    Disallow: /
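
    If you prefer to generate the file rather than type it by hand, a small shell function can write it. This is only a sketch: the function name write_robots_ssl, the document root argument, and the default of "*" for the user agent are our own illustrative choices, not part of the setup described above.

```shell
#!/bin/sh
# Write an HTTPS-only robots file into a document root.
# Usage: write_robots_ssl DOCROOT [USER_AGENT]
# DOCROOT and the default user agent "*" are illustrative assumptions.
write_robots_ssl() {
    docroot=$1
    agent=${2:-"*"}
    # "Disallow: /" asks the named crawler to stay out of the whole site.
    printf 'User-agent: %s\nDisallow: /\n' "$agent" > "$docroot/robots_ssl.txt"
}
```

    For example, write_robots_ssl /path/to/htdocs Googlebot would produce the Googlebot-only version shown earlier.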
    

    Step 3: Manually test the changes

    To test the setup, access the site as follows:

    • Point your web browser to http://server/robots.txt and check that the original robots.txt contents are shown
    • Point your web browser to https://server/robots.txt and check that the robots_ssl.txt contents are shown

    If the above tests show the appropriate contents, you are done. If not, you might not have set up the mod_rewrite rule correctly; check the rule again. Also, make sure mod_rewrite is both installed and enabled in your Web server configuration.
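
    The same two checks can be scripted with curl instead of a browser. The sketch below is one possible way to do it, under a few assumptions: the function names are made up for illustration, the host name is a placeholder you pass in, and your regular robots.txt does not itself contain a blanket "Disallow: /" line (otherwise the two versions cannot be told apart this way).

```shell
#!/bin/sh
# Succeeds (exit 0) when the robots.txt text on stdin contains a
# blanket "Disallow: /" line; carriage returns are stripped first.
blocks_everything() {
    tr -d '\r' | grep -Eiq '^[[:space:]]*Disallow:[[:space:]]*/[[:space:]]*$'
}

# Fetch robots.txt over both schemes and report what each one serves.
# The host argument is a placeholder; -k skips certificate verification,
# which is only appropriate for test servers.
# Note: a failed download also counts as "no blanket Disallow found".
check_robots() {
    host=$1
    if curl -fsS "http://$host/robots.txt" | blocks_everything; then
        echo "http: blanket Disallow found (unexpected)"
    else
        echo "http: no blanket Disallow found"
    fi
    if curl -fsSk "https://$host/robots.txt" | blocks_everything; then
        echo "https: blanket Disallow found"
    else
        echo "https: no blanket Disallow found (check the rewrite rule)"
    fi
}
```

    Running check_robots server should then report a blanket Disallow for HTTPS only.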

    If you are like us and build mod_rewrite as part of the core Apache httpd process, you can test if it is installed by running:

    $ /path/to/bin/httpd -l | grep mod_rewrite
    

    Example:

    $ /home/apache/bin/httpd -l | grep mod_rewrite
    

    If mod_rewrite is compiled in, the output includes mod_rewrite.c, showing that the module is part of the httpd binary, which is exactly what needs to be there for mod_rewrite to work.
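
    Note that httpd -l only lists modules compiled into the binary. If your Apache loads mod_rewrite as a shared module through LoadModule, it will not show up there; httpd -M lists both static and shared modules. A small helper that accepts either listing might look like the following sketch (the function name is our own, and the httpd path in the usage comment is a placeholder):

```shell
#!/bin/sh
# Succeeds when a module listing on stdin mentions mod_rewrite, in either
# the "httpd -l" form (mod_rewrite.c) or the "httpd -M" form (rewrite_module).
has_mod_rewrite() {
    grep -Eq 'mod_rewrite\.c|rewrite_module'
}

# Example usage (stderr is merged in case the listing is printed there):
#   /path/to/bin/httpd -M 2>&1 | has_mod_rewrite && echo "mod_rewrite available"
```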

