  • Writing a Custom Apache Log File Using a PHP Script in Real Time

    Published December 14th, 2008 by kabir

    Problem Statement

    When you are developing Web applications, Apache logs (the access log, the error log, or both) can be a really useful tool. However, by default Apache logs a lot of entries that get in the way of debugging when you are focused on a specific problem with your Web app. In this article, we will show you how to write a very simple PHP script that customizes what is and is not logged.

    Creating a PHP Script for Processing Apache Log in Real Time

    Instead of creating a PHP-based Apache log parser, which would run after the fact rather than in real time, we will create a simple PHP script that processes Apache log entries in real time as the Apache log module creates them. Look at the PHP script in Listing 1.

    Listing 1: simple_apache_logger.php

    #!/usr/bin/env php
    <?php
    $logDir  = '/var/data/logs';
    $logFile = $logDir . '/access.log';
    $fp      = fopen($logFile,"a+");
    $stdin   = fopen("php://stdin", "r");
    
    // Use unbuffered output
    ob_implicit_flush (true);
    
    while ($line = fgets($stdin))
    {
       fwrite($fp, $line);
    }
    
    fclose($fp);
    fclose($stdin);
    ?>
    

    To run this script, modify your CustomLog entry to be:

    CustomLog "|/path/to/simple_apache_logger.php" common
    

    Make sure that the script is executable by running chmod 750 simple_apache_logger.php.
    What this script does is as follows:

    1. Opens a log file called $logFile in $logDir in append mode, using the $fp file pointer
    2. Opens STDIN (standard input) as a file called $stdin
    3. Tells PHP to flush its output buffer automatically after every output call
    4. In a while loop, reads a line of data from $stdin into a variable called $line
    5. Appends $line to the log file

    When run, this script is loaded once and keeps running for as long as Apache runs, so there is no script start-up cost per log entry. It simply appends the log data given by Apache to a file. This means nothing interesting is being done in this version of the script: the file it writes is identical to the original Apache log file. So to make this interesting, let's update Listing 1 as shown in Listing 2.

    Listing 2: modified while() loop for simple_apache_logger.php

    while ($line = fgets($stdin))
    {
       // Ignore all log requests for common images, cascading style sheets, JavaScript, and Flash files
       if (preg_match("/(\w+)\.(gif|png|jpg|css|js|swf)/", $line))
       {
           continue;
       }
    
       // Everything else is written to the custom log as before
       fwrite($fp, $line);
    }
    

    If you replace the original while() loop, which simply wrote every Apache log entry to a file, with the while() loop in Listing 2, requests for common images, cascading style sheets, JavaScript, and Flash files are no longer logged. You end up with a clean log of page requests instead of a record of every external component (images, JavaScript, CSS) that makes up an HTML page. The reduction in log entries makes it much easier to debug GET parameters or look into SEO-related matters.
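    To see which entries the pattern lets through, here is a minimal Python sketch that applies the same regular expression to a few log lines. The sample entries and the keep_line() helper are made up for illustration; the PHP script above remains the actual logger.

```python
import re

# Same pattern as the preg_match() call in Listing 2.
STATIC_ASSET = re.compile(r"(\w+)\.(gif|png|jpg|css|js|swf)")


def keep_line(line: str) -> bool:
    """Return True if the log line would be written to the custom log."""
    return STATIC_ASSET.search(line) is None


# Hypothetical sample entries in Common Log Format.
sample = [
    '10.0.0.1 - - [14/Dec/2008:10:00:00 -0600] "GET /index.php?id=7 HTTP/1.1" 200 5120',
    '10.0.0.1 - - [14/Dec/2008:10:00:01 -0600] "GET /images/logo.png HTTP/1.1" 200 2048',
    '10.0.0.1 - - [14/Dec/2008:10:00:01 -0600] "GET /css/site.css HTTP/1.1" 200 900',
]

kept = [line for line in sample if keep_line(line)]
for line in kept:
    print(line)
```

    Because the pattern is not anchored to the end of the requested file name, it matches anywhere in the line. A request for a name such as data.json is therefore dropped as well, since it contains ".js".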

    Creating a PHP Script for Rotating Apache Log

    Unfortunately, you cannot pipe multiple programs with CustomLog to do something like:

    CustomLog "|/usr/local/sbin/cronolog /logs/%Y/%m/%d/access.log|/path/to/simple_apache_logger.php" common
    

    So when using a PHP logging tool, you cannot use CronoLog. In such a case, you might need to invent your own log rotation scheme. For example, Listing 3 shows an updated version of simple_apache_logger.php that does exactly that.

    Listing 3: simple_apache_logger.php with daily log rotation

    #!/usr/bin/env php
    <?php
    
    $logDir  = '/var/data/logs/ekblogs';
    $logFile = $logDir . '/access.'. date("d-m-Y") . '.log';
    $fp      = fopen($logFile,"a+");
    $stdin   = fopen("php://stdin", "r");
    
    // Use unbuffered output
    ob_implicit_flush (true);
    
    $lastLogDate = date('Ymd');
    
    while ($line = fgets($stdin))
    {
       // Ignore images, javascripts, css, ico and flash file requests
       if (preg_match("/(\w+)\.(gif|png|jpg|css|js|swf|ico)/", $line))
       {
           continue;
       }
    
       // Following section is for rotating log when day change is detected
       // This will *only rotate* if requests are coming in daily (which is expected)
    
       $today = date('Ymd');
    
       // If today is different than last log date, time to write a new log file
       if ($today > $lastLogDate )
       {
           $lastLogDate = $today;
           fclose($fp);
           $logFile = $logDir . '/access.'. date("d-m-Y") . '.log';
           $fp      = fopen($logFile, "a+");
       }
    
       fwrite($fp, $line);
    }
    
    fclose($fp);
    fclose($stdin);
    ?>
    

    Every time Apache sends a log entry to the STDIN of this script, the script checks whether the current date is the same as the last date it recorded. If the date has changed, it opens a new log file and sets the last date to the current date. This allows it to rotate the logs by day.

    However, this works only when you can reasonably expect your site to get hit every day. The script checks the date only when a new request arrives, so on a day without requests no rotation happens and no log file is created for that day; the switch to a new file is deferred until the next request comes in. This is not too bad, as we only recommend this kind of PHP-script-based logging in development environments.
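    The request-driven nature of the rotation can be sketched in a few lines of Python. This is only a model of the day-change check, not a replacement for the PHP script; the function names and the request dates are hypothetical.

```python
from datetime import date


def log_path(log_dir: str, day: date) -> str:
    """Build the per-day file name in the same d-m-Y style as the PHP script."""
    return f"{log_dir}/access.{day.strftime('%d-%m-%Y')}.log"


def files_written(log_dir: str, request_days: list[date]) -> list[str]:
    """Return the distinct log files touched, in order, for a stream of request dates."""
    paths: list[str] = []
    for day in request_days:
        path = log_path(log_dir, day)
        # A new file is opened only when a request arrives on a different day.
        if not paths or paths[-1] != path:
            paths.append(path)
    return paths


# Two requests on 12 Dec, none on 13 Dec, one on 14 Dec.
days = [date(2008, 12, 12), date(2008, 12, 12), date(2008, 12, 14)]
touched = files_written("/var/data/logs", days)
for path in touched:
    print(path)
```

    In this model no file is produced for 13 December, because nothing triggers the date check on a day without traffic.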

    Using Awk to do Quick and Dirty Analysis of Apache Logs

    Published December 13th, 2008 by kabir

    Problem Statement

    Often, we need to quickly analyze the Apache log files for certain sites without running an extensive log analysis program that takes a long time to run or runs only on a schedule. For quick and dirty probing of Apache logs, you are better off with simple command-line tools. Here we will show you how to perform a few quick and dirty log analyses using standard Linux commands and a scripting language called Awk.

    Finding unique IP addresses for a given day

    To list the unique IP addresses that visited your site on a given day, run the following command from your shell prompt:

    $ grep '[date string]' /path/to/access.log | awk '{print $1}' | sort | uniq

    For example, to list the unique IP addresses that visited the ApacheHacker.com blog yesterday (12/Dec/2008), we can run:

    $ grep '12/Dec/2008' /logs/apachehacker/access.log | awk '{print $1}' | sort | uniq

    Finding which IP address visited how many times for a given day

    To count how many times each IP address visited your site on a given day, run:

    $ grep '[date string]' /path/to/access.log |  \
       awk '{cnt[$1]++;} END{for (ip in cnt){printf("%-15s visited: %04d time(s).\n", ip, cnt[ip])}}'

    For example, to count the visits per IP address to the ApacheHacker.com blog on Dec 12, 2008, we can run:

    $ grep '12/Dec/2008' /logs/apachehacker/access.log | \
      awk '{cnt[$1]++;} END{for (ip in cnt){printf("%-15s visited: %04d time(s).\n", ip, cnt[ip])}}'

    Here is a sample output:

    74.6.8.116      visited: 0001 time(s).
    74.6.18.246     visited: 0001 time(s).
    93.126.3.33     visited: 0016 time(s).
    80.48.192.249   visited: 0044 time(s).
    69.62.207.163   visited: 0058 time(s).
    89.45.49.247    visited: 0022 time(s).
    82.159.52.211   visited: 0015 time(s).
    71.226.202.64   visited: 0017 time(s).
    203.83.248.74   visited: 0002 time(s).
    206.196.125.113 visited: 0001 time(s).
    76.188.138.185  visited: 0016 time(s).
    203.112.77.18   visited: 0006 time(s).
    142.179.135.57  visited: 0016 time(s).
    91.121.201.145  visited: 0016 time(s).
    74.6.18.215     visited: 0002 time(s).
    66.249.73.21    visited: 0001 time(s).
    66.150.96.121   visited: 0027 time(s).
    75.147.236.233  visited: 0002 time(s).
    75.119.230.207  visited: 0020 time(s).
    203.129.155.4   visited: 0022 time(s).
    142.166.3.122   visited: 0002 time(s).
    64.1.215.163    visited: 0003 time(s).
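    For readers more comfortable with Python than Awk, the same per-IP count can be sketched as follows. This is an equivalent of the Awk one-liner under the assumption that the client IP is the first whitespace-separated field; the sample log lines are invented for illustration.

```python
from collections import Counter


def visits_per_ip(lines, date_string):
    """Count entries per client IP (first field) for lines containing date_string."""
    counts = Counter()
    for line in lines:
        if date_string in line:          # same role as the grep step
            fields = line.split()
            if fields:
                counts[fields[0]] += 1   # same role as cnt[$1]++ in Awk
    return counts


# Hypothetical log entries; only the first field and the date matter here.
sample = [
    '74.6.8.116 - - [12/Dec/2008:01:02:03 -0600] "GET / HTTP/1.1" 200 512',
    '93.126.3.33 - - [12/Dec/2008:04:05:06 -0600] "GET /about HTTP/1.1" 200 100',
    '93.126.3.33 - - [12/Dec/2008:04:05:09 -0600] "GET /contact HTTP/1.1" 200 100',
    '93.126.3.33 - - [13/Dec/2008:07:08:09 -0600] "GET / HTTP/1.1" 200 512',
]

counts = visits_per_ip(sample, "12/Dec/2008")
for ip, n in counts.most_common():
    print(f"{ip:<15} visited: {n:04d} time(s).")
```

    Unlike the Awk version, most_common() prints the busiest addresses first, which is often what you want when hunting for abusive clients.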

    Disabling Weak SSL v2 Support in Apache Server

    Published December 4th, 2008 by kabir

    Problem Statement

    By default, Apache 2.x with SSL enabled accepts SSL v2 connections. SSL v2 was introduced by Netscape Communications Corporation with the launch of Netscape Navigator 1.0 in 1994, and it contains several well-known weaknesses. For example, SSLv2 doesn’t provide any protection against man-in-the-middle attacks during the handshake, and it uses the same cryptographic keys for message authentication and for encryption. If you use any third-party scanner service such as McAfee, ScanAlert, etc., you will get a high-level vulnerability flag for SSL v2. Here we will show you how to upgrade Apache’s SSL support.

    Where is SSL Support Going for Popular Web browsers

    By default, Internet Explorer 7 (IE7) disables SSLv2 support and enables the stronger TLSv1 instead, so IE7 will only negotiate HTTPS connections using SSLv3 or TLSv1. Mozilla Firefox is expected to drop support for SSLv2 in future versions. Since nearly all Web browsers now support SSLv3, disabling support for the weaker SSLv2 protocol should have minimal impact. The following browsers support SSLv3:

    • Internet Explorer 5.5 or higher (PC)
    • Internet Explorer 5.0 or higher (Mac)
    • Netscape 2.0 (Domestic) or higher (PC/Mac)
    • Firefox 0.8 or higher (PC/Mac/Linux)
    • Mozilla 1.7 or higher (PC/Mac/Linux)
    • Camino 0.8 or higher (Mac)
    • Safari 1.0 or higher (Mac)
    • Opera 1.7 or higher (PC/Mac)
    • Omniweb 3.0 or higher (Mac)
    • Konqueror 2.0 or higher (Linux)

    According to https://www.pcisecuritystandards.org/pdfs/pcissc_assessors_nl_2008-11.pdf, an Assessor’s update report, “…it is imperative that an ASV identify the use of SSL 2.0 to transmit cardholder data as a failure.”

    Updating Apache Configuration for SSL

    For Apache Server, you need to modify the SSLProtocol and SSLCipherSuite directives in the httpd.conf or ssl.conf file. Here is the configuration you need to add or edit in these files:

    SSLProtocol -ALL +SSLv3 +TLSv1
    SSLCipherSuite ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:-LOW:-SSLv2:-EXP
    

    Once added/edited, restart Apache and confirm that your SSL-enabled site is still working as expected.
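    As a rough illustration of how the + and - prefixes on the SSLProtocol line combine, here is a simplified Python model that applies the tokens from left to right. It is an assumption-laden sketch for reasoning about the directive, not mod_ssl's actual parser, and it only knows the three protocols discussed in this article.

```python
# Protocol names discussed in the article; a real server may know more.
KNOWN = ("SSLv2", "SSLv3", "TLSv1")


def enabled_protocols(args: str) -> set[str]:
    """Apply +name / -name tokens left to right and return the enabled set."""
    enabled: set[str] = set()
    for token in args.split():
        remove = token.startswith("-")
        name = token.lstrip("+-").lower()
        if name == "all":
            selected = set(KNOWN)
        else:
            selected = {p for p in KNOWN if p.lower() == name}
        if remove:
            enabled -= selected
        else:
            enabled |= selected
    return enabled


print(sorted(enabled_protocols("-ALL +SSLv3 +TLSv1")))
```

    In this model, "-ALL +SSLv3 +TLSv1" and "all -SSLv2" end up with the same set: SSLv3 and TLSv1 enabled, SSLv2 disabled.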

    Running Web Sites Under Multiple User Accounts with mod_itk

    Published November 28th, 2008 by kabir

    Problem Statement

    When lots of people are working on the same Apache Web server running multiple virtual hosts, creating an effective and secure file/dir permission scheme is difficult using Linux’s simplistic user/group concepts. Here we will show you how you can run Apache using different Linux user accounts so that each virtual host runs using its owner’s file/dir permissions. This effectively makes each of the virtual hosts more secure, as files accessible (r+w) to one virtual host are not accessible from another.

    Step 1: Installing mod_itk source

    The mod_itk module is available as a source patch for the Apache Web server source distribution. So if you are running Apache from an RPM distribution, you cannot use it. It is for those of us who love to compile Apache from the source distribution. We will assume you have compiled and installed Apache from a source distribution and the source code is kept at /usr/local/src/httpd-[version]. Follow the steps below:

    1. Download the mod_itk patch file from http://mpm-itk.sesse.net/.
    2. Change directory to your Apache source distribution, run patch -p1 < /path/to/[downloaded patch file], and then run autoconf
    3. Edit your config.nice and add: "--with-mpm=itk" \ before the last line. Here is a sample config.nice:
      #! /bin/sh
      #
      # Created by configure
      
      "./configure" \
      "--prefix=/home/apache" \
      "--enable-so" \
      "--with-ssl=/usr" \
      "--enable-ssl" \
      "--enable-deflate" \
      "--disable-cgi" \
      "--enable-rewrite" \
      "--disable-userdir" \
      "--with-mpm=itk" \
      "$@"
      
    4. Compile and install Apache using the edited config.nice as follows: ./config.nice && make && make install

    Step 2: Configuring your virtual host using a specific user account

    Now create a new Linux user and group (or use an existing one) for the virtual host that you want to run under a specific user account. Then follow the steps below:

    1. Change the file/directory ownership of your virtual host’s document root to the chosen Linux user
    2. Edit your virtual host configuration file and add the following lines:
      
      <IfModule mpm_itk_module>
          AssignUserId [username] [groupname]
      </IfModule>
      
    3. Now restart Apache and access your Web site via a Web browser
    4. On the server’s command-line, run: ps auxww | grep httpd and notice that one or more processes are run using the chosen username.

    Speeding Up Your Web Site Using mod_deflate Module for Apache

    Published November 28th, 2008 by kabir

    Problem Statement

    If your Web site serves a lot of textual data vs. image or video, you might be able to take advantage of mod_deflate and speed up your Web site without changing any PHP code or MySQL database schema. OK, sure it works for non-PHP/MySQL sites too. But hey, we are biased as we do PHP/MySQL for a living.
    In this article, we will show you how to use mod_deflate to compress your textual contents to speed up delivery time, which results in bandwidth savings in addition to higher user satisfaction.

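    Step 1: Did you install mod_deflate?

    Make sure that you have mod_deflate already installed as part of your Apache Web server. To check, you can run /path/to/bin/httpd -l | grep mod_deflate and if you get mod_deflate.c back as a return string, you are in good shape. Note that httpd -l lists only the modules compiled into the httpd binary; if mod_deflate was built as a shared module, look for a LoadModule deflate_module line in your configuration instead. Otherwise, you will have to install the mod_deflate module for Apache.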

    Step 2: Creating a deflate.conf configuration file

    Copy the following Apache directives into a file called deflate.conf and put it in your Apache conf directory.

    <IfModule deflate_module>
    
    <Location />
       SetOutputFilter DEFLATE
       SetEnvIfNoCase Request_URI \.(?:gif|jpe?g|png)$ no-gzip dont-vary
       SetEnvIfNoCase Request_URI \.(?:exe|t?gz|zip|bz2|sit|rar)$ no-gzip dont-vary
       SetEnvIfNoCase Request_URI \.pdf$ no-gzip dont-vary
    
       BrowserMatch ^Mozilla/4 gzip-only-text/html
       BrowserMatch ^Mozilla/4\.0[678] no-gzip
       BrowserMatch \bMSIE !no-gzip !gzip-only-text/html
    </Location>
    
    # Testing purposes (remove from production environment)
    DeflateFilterNote Input inbytes
    DeflateFilterNote Output outbytes
    DeflateFilterNote Ratio percent
    LogFormat '%h "%r" %{outbytes}n/%{inbytes}n (%{percent}n%%)' deflate
    CustomLog logs/deflate.log deflate env=!no-gzip
    </IfModule>
    

    Here we have an Apache configuration specific to mod_deflate that only takes effect if mod_deflate is already loaded by Apache. This configuration snippet tells Apache the following:

    • Sets the output filter for all files to DEFLATE, which is the mod_deflate code that compresses contents passed on to the filter
    • Sets environment variable no-gzip for common image files such as .gif, .jpg, .jpeg, .png
    • Sets environment variable no-gzip for common binary/compressed files such as .exe, .gz, .zip, .bz2, .sit and .rar
    • Sets environment variable no-gzip for PDF files with the .pdf extension
    • Restricts or disables compression for older browsers that do not properly support on-the-fly decompression on their end
    • For testing purposes, sets a note called inbytes, which stores the input (content) size in bytes
    • Sets the outbytes note, which stores the output (content) size in bytes, and the percent note, which stores the compression ratio (output / input) * 100
    • Defines a custom log format that allows us to create a custom log recording all compressed (!no-gzip) requests
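    To get a feel for the percent value before touching the server, the ratio (output / input) * 100 can be approximated offline. The sketch below uses Python's zlib module as a stand-in for mod_deflate; the helper name, the sample page, and the compression level are assumptions, and the exact numbers will differ slightly from what Apache logs.

```python
import zlib


def deflate_percent(body: bytes, level: int = 6) -> float:
    """Return (compressed size / original size) * 100 for the given content."""
    if not body:
        return 100.0  # nothing to compress; treat as no saving
    compressed = zlib.compress(body, level)
    return len(compressed) / len(body) * 100


# Hypothetical, highly repetitive HTML page.
html = b"<html><body>" + b"<p>Hello, Apache Hacker readers!</p>" * 200 + b"</body></html>"

percent = deflate_percent(html)
print(f"{len(html)} bytes in, output is {percent:.1f}% of the input")
```

    A lower percentage means a smaller response. Real pages are less repetitive than this sample, so expect a less dramatic figure in deflate.log.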

    Step 3: Testing your setup

    Now, using your favorite Web browser, request compressible pages of your site and monitor the deflate.log in the logs directory of your Apache installation. The deflate.log should show how good the compression ratio is. The smaller the output is relative to the input, the faster you are able to send out compressed contents to the Web clients.

    Why are compressed HTML pages better?

    In the Web’s request and response model, the bottlenecks are in the following order:

    1. Network latency/bandwidth/response
    2. Disk I/O
    3. Other System Calls

    Compressing the HTML contents adds CPU processing time on the server side but no additional disk I/O, and it reduces network response time because the compressed content is smaller than the original. It also adds decompression processing on the client side. So the only additional cost of compression and decompression is CPU processing on the server and the client. Since a modern CPU is hardly ever the bottleneck, the net benefit is a reduction in network response time.

    See Also
    Apache 2.0 mod_deflate documentation

    Making Vulnerability Scanners Happy with Your Apache Server

    Published November 28th, 2008 by kabir

    Problem Statement

    If you run e-commerce sites that accept credit cards, you are bound to run into Payment Card Industry Data Security Standard (PCI DSS) related issues if you run Apache with default configuration settings. Most third-party PCI compliance scanners that e-commerce sites use for vulnerability scanning will report a number of problems. Here we will discuss them and show you how to eliminate them.

    Trim down Apache signature

    Most commercial scanners will report that they can detect your Apache software version and other information from your default Apache configuration. For example, if you telnet to a Web server’s HTTP port (80) and enter HEAD / HTTP/1.0 followed by two newline characters, you will see the Web server signature information. Here is a sample of this request:

    $ telnet www.example.com 80
    Trying 100.101.102.103...
    Connected to www.example.com (100.101.102.103).
    Escape character is '^]'.
    HEAD / HTTP/1.0
    
    HTTP/1.1 200 OK
    Date: Fri, 28 Nov 2008 20:02:48 GMT
    Server: Apache/2.2.3 (CentOS)
    X-Powered-By: PHP/5.1.6
    Connection: close
    Content-Type: text/html
    

    As you can see, this Web server is running Apache 2.2.3 on the CentOS platform and even has PHP/5.1.6. PCI compliance scanners do not like to see such information: the more you hand out to potential attackers, the bigger your chance of being hit through a known vulnerability. Note that the X-Powered-By header is emitted by PHP rather than Apache; it is controlled by the expose_php setting in php.ini. For the Apache part, you can tell the server to show a minimal signature using the following directives in your Apache configuration (httpd.conf):

      ServerTokens Prod
      ServerSignature Off
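
    If you capture the response headers before and after the change, a small script can flag any that still reveal a version number. The following Python sketch is one way to do that; the function name and the choice of headers to inspect are assumptions, and the "before" text is the sample response shown earlier.

```python
import re

VERSION = re.compile(r"\d+\.\d+")           # anything that looks like a version number
WATCHED = ("server", "x-powered-by")        # headers inspected by this sketch


def version_leaks(raw_response: str) -> dict[str, str]:
    """Return watched headers whose value contains a version-like number."""
    leaks = {}
    for line in raw_response.splitlines()[1:]:   # skip the status line
        if not line.strip():
            break                                # blank line ends the headers
        name, _, value = line.partition(":")
        if name.strip().lower() in WATCHED and VERSION.search(value):
            leaks[name.strip()] = value.strip()
    return leaks


before = """HTTP/1.1 200 OK
Date: Fri, 28 Nov 2008 20:02:48 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.1.6
Connection: close
Content-Type: text/html
"""

# Hypothetical response after applying the directives above.
after = "HTTP/1.1 200 OK\nServer: Apache\nConnection: close\n"

print(version_leaks(before))
print(version_leaks(after))
```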
    

    Disable TRACK/TRACE support

    By default, Apache has the TRACE request method enabled. (TRACK is the equivalent method on Microsoft IIS; scanners usually report the two together.) The TRACE method makes the server echo the request it received back to the client. For example:

    $ telnet localhost 80
    Trying 127.0.0.1...
    Connected to k2.evoknow.com (127.0.0.1).
    Escape character is '^]'.
    TRACE / HTTP/1.0
    X: 100
    Y: 101
    
    HTTP/1.1 200 OK
    Date: Fri, 28 Nov 2008 22:09:28 GMT
    Server: Apache/2.2.3 (CentOS)
    Connection: close
    Content-Type: message/http
    
    TRACE / HTTP/1.0
    X: 100
    Y: 101
    

    Here you can see that we have sent a TRACE request with two custom headers (X, Y) and the server returned the data back in a message/http type response.

    The PCI compliance scanners also complain about the TRACK/TRACE support that is available in Apache, which you can turn off using either a mod_rewrite rule as shown below:

    <IfModule rewrite_module>
      # Block TRACE/TRACK XSS vector
      RewriteEngine On
      RewriteCond %{REQUEST_METHOD} ^TRAC(E|K)
      RewriteRule .* - [F]
    </IfModule>
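
    To double-check what the RewriteCond pattern covers, the same regular expression can be tried against a few method names. This Python sketch only exercises the pattern; the helper name is hypothetical, and it assumes the comparison is case-sensitive, as it is when no [NC] flag is given.

```python
import re

# Same pattern as the RewriteCond line above.
BLOCKED_METHOD = re.compile(r"^TRAC(E|K)")


def is_blocked(method: str) -> bool:
    """Return True if the request method would hit the [F] (forbidden) rule."""
    return BLOCKED_METHOD.search(method) is not None


for method in ("GET", "POST", "HEAD", "TRACE", "TRACK"):
    print(method, "blocked" if is_blocked(method) else "allowed")
```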
    

    Or you can use the TraceEnable off directive in your httpd.conf file to disable TRACE. Once it is disabled, verify that it is truly off by testing your site with a manual telnet session as shown below:

    [root@cassini conf]# telnet localhost 80
    Trying 127.0.0.1...
    Connected to k2.evoknow.com (127.0.0.1).
    Escape character is '^]'.
    TRACE / HTTP/1.0
    X: 100
    Y: 101
    
    HTTP/1.1 405 Method Not Allowed
    Date: Fri, 28 Nov 2008 22:19:26 GMT
    Server: Apache
    Allow:
    Content-Length: 223
    Connection: close
    Content-Type: text/html; charset=iso-8859-1
    
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html><head>
    <title>405 Method Not Allowed</title>
    </head><body>
    <h1>Method Not Allowed</h1>
    <p>The requested method TRACE is not allowed for the URL /.</p>
    </body></html>
    Connection closed by foreign host.
    

    Restrict Downloading of Files by Extension

    Published November 28th, 2008 by kabir

    Problem Statement

    Web developers often forget to delete .old, .bak, and .sql files when putting files on production servers. This can have serious security implications: a file called database.conf.php.old can easily be viewed in a Web browser using http://server/path/to/database.conf.php.old, because the .old extension means Apache does not process it as a PHP script. Such a scenario could spell security disaster for a site. Here we will show you how to block these extensions so that they are not downloadable/viewable via a Web browser.

    Disallowing file browsing/downloading for a given extension

    In your virtual host configuration file or .htaccess file add:

    <FilesMatch "\.(sql|bak|old)$">
        Order allow,deny
        Deny from all
    </FilesMatch>
    

    This tells Apache to not allow downloading of files with .sql, .bak, .old extensions.

    Disallowing files with a leading dot (period) character in the name

    In Linux and other Unix environments, files with a leading period are so-called hidden files. These files often contain history, commands and settings that are not to be shared with the public. It is best to set up your Apache Web server configuration as follows to disable browsing or downloading of these files:

    <FilesMatch "^\.">
        Order allow,deny
        Deny from all
    </FilesMatch>
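
    Before deploying the two FilesMatch patterns, it can help to try them against a list of file names. The Python sketch below does that under the assumption that the expressions are applied to the file name only (the last path component); the helper name and the sample paths are made up.

```python
import posixpath
import re

BACKUP_FILE = re.compile(r"\.(sql|bak|old)$")   # first FilesMatch pattern
HIDDEN_FILE = re.compile(r"^\.")                # second FilesMatch pattern


def is_denied(path: str) -> bool:
    """Return True if either pattern matches the file name part of path."""
    name = posixpath.basename(path)
    return bool(BACKUP_FILE.search(name) or HIDDEN_FILE.search(name))


for path in (
    "/path/to/database.conf.php.old",
    "/site/dump.sql",
    "/site/.bash_history",
    "/site/index.php",
    "/site/old.html",
):
    print(path, "denied" if is_denied(path) else "served")
```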
    

    Avoiding .htaccess is a Good Thing for Apache Performance

    Published November 28th, 2008 by kabir

    Problem Statement

    When you are running a busy Web site, you need to squeeze out every bit of performance you can. If your site is blessed with a lot of traffic and you are struggling to keep it fast, we hope you are not using .htaccess or any other file pointed to by the AccessFileName directive in your Apache configuration (httpd.conf). Let us explain why.

    Why .htaccess is bad for performance

    When you run a site with .htaccess enabled, the site is likely to be configured with an Apache configuration like below:

    
    AccessFileName .htaccess
    <Directory "/path/to/your/docroot">
       AllowOverride All
    </Directory>
    

    This tells Apache that you can have a .htaccess file anywhere in your Web site and that it can contain many different directives to customize how Apache handles your site. This sounds great, but the catch is that for every access to any page on your Web site, Apache now must also check whether a .htaccess file exists.

    Since disk I/O is the slowest component of your server response time, adding an additional disk I/O to read the .htaccess file makes every access slower. For example, say you have placed an .htaccess file in the document root (/var/data/web/site/htdocs) and you are accessing an image file at http://server/dir1/dir2/dir3/myphoto.png. Here is what Apache checks:

    Does /var/data/web/site/htdocs/.htaccess exist?
    Does /var/data/web/site/htdocs/dir1/.htaccess exist?
    Does /var/data/web/site/htdocs/dir1/dir2/.htaccess exist?
    Does /var/data/web/site/htdocs/dir1/dir2/dir3/.htaccess exist?
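
    The number of lookups grows with the depth of the requested path. A short Python sketch makes this concrete; it follows the example above, assumes overrides are allowed from the document root downward, and uses a hypothetical helper name.

```python
import posixpath


def htaccess_lookups(docroot: str, url_path: str) -> list[str]:
    """List the .htaccess files checked, from the docroot down to the file's directory."""
    directory = posixpath.dirname(url_path).strip("/")
    current = docroot.rstrip("/")
    lookups = [f"{current}/.htaccess"]
    for part in (directory.split("/") if directory else []):
        current = f"{current}/{part}"
        lookups.append(f"{current}/.htaccess")
    return lookups


checks = htaccess_lookups("/var/data/web/site/htdocs", "/dir1/dir2/dir3/myphoto.png")
for path in checks:
    print(path)
print(len(checks), "lookups for one image request")
```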
    

    As you can see, this can really hurt a busy site’s performance, as many unnecessary disk I/O requests are made just to check for the existence of the .htaccess file.

    How to avoid .htaccess performance penalty

    Here are your options:

    • Avoid .htaccess and place your configuration in your virtual host configuration file, which is loaded once by the Apache server at startup
    • If you must use .htaccess, consider placing it only inside the directory where you need it and enabling AllowOverride only for that directory

    Reducing Duplicate Contents between HTTP and HTTPS

    Published November 28th, 2008 by kabir

    Problem Statement

    Google penalizes Web sites that have duplicate contents. A site that runs in both HTTP and HTTPS mode has a good chance of getting penalized for duplicate contents if the site can be browsed in both modes. The googlebot may index both the HTTP and the HTTPS pages, creating a potential for duplicate contents even though you really have one set of pages that just happens to be available in both HTTP and HTTPS mode. In this article, we will show you how to reduce this duplicate content risk by using a simple mod_rewrite rule.

    Step 1: Creating a mod_rewrite rule

    Most e-commerce sites use HTTPS for shopping cart checkout pages where credit card information is collected or where the user is asked to log in. So if you enable HTTPS only for such pages and keep the rest of the site in standard HTTP mode, you will have no duplicate content issue, at least from the HTTP/HTTPS point of view.

    Say you want to redirect all HTTPS requests for every page in your Web site to the HTTP version except for the following:

    • http://server/something/checkout_[string]
    • http://server/something/login
    • http://server/admin/

    Here we will show you how such a setup can be done using a mod_rewrite rule. First add the following mod_rewrite rule to your Web server’s virtual host configuration file or the .htaccess file:

    RewriteCond %{SERVER_PORT} =443
    RewriteCond %{REQUEST_URI} !^/admin(.*)$
    RewriteCond %{QUERY_STRING} !(checkout_(.+)|login)$
    RewriteRule (.*) http://server/$1 [R=301,L]
    

    Now let's interpret how Apache Web server sees this mod_rewrite rule:

    If SERVER_PORT = 443 (i.e. HTTPS mode, meaning the page was requested as https://server/page) AND the requested URL does not start with /admin AND the query string does not end with checkout_[anything] or login, THEN redirect the request using a 301 redirect to the HTTP version of the same URL. All three RewriteCond lines must hold, because consecutive conditions are combined with AND unless the [OR] flag is used.

    Step 2: Testing your rule in HTTP and HTTPS mode

    To test the above setup, access your Web site using the following URLs:

    • https://server will redirect to http://server
    • https://server/page.html will redirect to http://server/page.html
    • https://server/index.php?main_page=login will remain in HTTPS mode, as it should
    • https://server/index.php?main_page=checkout_shipping will remain in HTTPS mode as well
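
    The test cases above can also be checked against a small model of the rule's logic. The Python sketch below mirrors the three RewriteCond lines; the function name and the "server" host name are placeholders, and it is a simplification rather than a reimplementation of mod_rewrite.

```python
import re


def redirect_target(port: int, uri: str, query: str, host: str = "server"):
    """Return the HTTP URL to 301-redirect to, or None if the request is left alone."""
    if port != 443:                                   # RewriteCond %{SERVER_PORT} =443
        return None
    if re.search(r"^/admin(.*)$", uri):               # RewriteCond %{REQUEST_URI} !^/admin(.*)$
        return None
    if re.search(r"(checkout_(.+)|login)$", query):   # RewriteCond %{QUERY_STRING} !(...)$
        return None
    url = f"http://{host}{uri}"
    # mod_rewrite carries the original query string over to the new URL.
    return f"{url}?{query}" if query else url


print(redirect_target(443, "/page.html", ""))
print(redirect_target(443, "/index.php", "main_page=login"))
print(redirect_target(443, "/index.php", "main_page=checkout_shipping"))
```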

    Switching robots.txt between HTTP and HTTPS mode

    Published November 28th, 2008 by kabir

    Problem Statement

    Google penalizes Web sites that have duplicate contents. So when your Web site has both HTTP and HTTPS (SSL) modes and they both point to the same contents, you have a good chance of being downgraded for duplicate contents, says an SEO expert that one of our customers loves to swear by. So in addition to making sure that requests for https://server/pages are automatically redirected with an HTTP 301 (Moved Permanently), we thought making the robots.txt contents different for HTTP and HTTPS would also help. Here is how we switched robots.txt between HTTP and HTTPS access.

    Step 1: Create a mod_rewrite rule for /robots.txt

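    Even though we dislike using mod_rewrite due to its performance issues, we use it on customer projects from time to time where performance is not too critical. We added the following rule to the virtual host configuration file for the server; it can also be added to a .htaccess file. Note that in virtual host context mod_rewrite matches the pattern against the URL path including its leading slash, so there the pattern needs to be ^/robots\.txt$ and the target /robots_ssl.txt; the form shown below is the one that applies in a .htaccess file.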

    RewriteEngine on
    
    # Rule: When robots.txt is requested in HTTPS (port 443) mode, send robots_ssl.txt instead
    RewriteCond %{SERVER_PORT} =443
    RewriteRule ^robots\.txt$ robots_ssl.txt [L]
    

    Step 2: Create a HTTPS version of robots.txt

    The HTTPS version of robots.txt is a separate file called robots_ssl.txt, which should have the following contents:

    User-agent: Googlebot
    Disallow: /
    

    This file tells Google’s Web site crawler not to index anything in the site in HTTPS mode. Of course, if you wish to advise all crawlers of the same thing, then change it to:

    User-agent: *
    Disallow: /
    

    Step 3: Manually test the changes

    Now, to test the setup, access the site as follows:

    • Point your web browser to http://server/robots.txt and see if you get the original robots.txt contents shown in the browser
    • Point your web browser to https://server/robots.txt and see if you get the robots_ssl.txt contents shown in the browser

    If the above tests show the appropriate contents, you are done. If not, you might not have set up the mod_rewrite rule correctly; check the rule again. Also, make sure you *have* mod_rewrite enabled in your Web server configuration. Of course, you should also check that it is installed.

    If you are like us and build mod_rewrite as part of the core Apache httpd process, you can test if it is installed by running:

    $ /path/to/bin/httpd -l | grep mod_rewrite
    

    Example:

    $ /home/apache/bin/httpd -l | grep mod_rewrite
    

    This shows the mod_rewrite.c module as part of the httpd binary, which is exactly what needs to be there for mod_rewrite to work.