Reducing Duplicate Content between HTTP and HTTPS
Published November 28th, 2008
Problem Statement
Google penalizes Web sites that have duplicate content. A site that runs in both HTTP and HTTPS mode has a good chance of being penalized if every page can be browsed in both modes. Googlebot may index both the HTTP and the HTTPS version of each page, which creates potential duplicate content even though you really have one set of pages that just happens to be available in both modes. In this article, we will show you how to reduce this duplicate content risk by using a simple mod_rewrite rule.
Step 1: Creating a mod_rewrite rule
Most e-commerce sites use HTTPS for shopping cart checkout pages, where credit card information is collected, and for pages where the user is asked to log in. So if you enable HTTPS only for such pages and keep the rest of the site in standard HTTP mode, you will have no duplicate content issue, at least from the HTTP/HTTPS point of view.
Say you want to redirect all HTTPS requests for every page in your Web site to the HTTP version, except for the following:
- http://server/something/checkout_[string]
- http://server/something/login
- http://server/admin/
Here we will show you how such a setup can be done using a mod_rewrite rule. First, add the following mod_rewrite rule to your Web server’s virtual host configuration file or the .htaccess file:
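# Enable the rewrite engine (skip this line if it is already on)
RewriteEngine On
# Only consider requests that arrived on the HTTPS port
RewriteCond %{SERVER_PORT} =443
RewriteCond %{REQUEST_URI} !^/admin(.*)$
RewriteCond %{QUERY_STRING} !(checkout_(.+)|login)$
# ^/? avoids a double slash when the rule is used in virtual host context
RewriteRule ^/?(.*)$ http://server/$1 [R=301,L]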
Now let's interpret how the Apache Web server sees this mod_rewrite rule:
If SERVER_PORT = 443 (i.e. HTTPS mode, meaning the page was requested as https://server/page) AND the requested URL does not start with /admin AND the query string does not end with checkout_[anything] or login, THEN redirect the request with a 301 redirect to the HTTP version of the same URL. Consecutive RewriteCond lines are combined with AND unless the [OR] flag is used, so all three conditions must hold before the redirect happens.
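The decision described above can be modeled outside Apache to check which URLs would be redirected. The following is a minimal Python sketch, not part of the Apache setup itself: the function name redirect_target and its parameters are made up for illustration, the host name "server" is the article's placeholder, and the two regular expressions are copied from the RewriteCond lines.

```python
import re
from typing import Optional

# Patterns copied from the RewriteCond lines in the article.
ADMIN_PATH = re.compile(r"^/admin(.*)$")
SECURE_QUERY = re.compile(r"(checkout_(.+)|login)$")


def redirect_target(port: int, path: str, query: str = "",
                    host: str = "server") -> Optional[str]:
    """Return the HTTP URL the rule would redirect to, or None if no redirect applies.

    Hypothetical helper: it mirrors the three RewriteCond tests, it is not an Apache API.
    """
    # Condition 1: only requests on the HTTPS port are candidates.
    if port != 443:
        return None
    # Condition 2: paths starting with /admin stay on HTTPS.
    if ADMIN_PATH.search(path):
        return None
    # Condition 3: query strings ending in checkout_<something> or login stay on HTTPS.
    if SECURE_QUERY.search(query):
        return None
    # All three conditions held: redirect to the same path on HTTP.
    # mod_rewrite carries the original query string over when the target has none.
    url = f"http://{host}{path}"
    return f"{url}?{query}" if query else url
```

For example, redirect_target(443, "/page.html") gives "http://server/page.html", while redirect_target(443, "/index.php", "main_page=login") gives None. Note that the query string pattern is anchored at the end, so under this model a request such as main_page=login&lang=en would still be redirected to HTTP; if your login or checkout URLs carry extra parameters, the pattern may need adjusting.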
Step 2: Testing your rule in HTTP and HTTPS mode
To test the above setup, request the following URLs and check the result:
- https://server should redirect to http://server
- https://server/page.html should redirect to http://server/page.html
- https://server/index.php?main_page=login should remain in HTTPS mode
- https://server/index.php?main_page=checkout_shipping should remain in HTTPS mode as well
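Browsers follow redirects silently and cache 301 responses, so it can help to run these checks from a script that does not follow redirects. Below is a minimal sketch using only the Python standard library. The helper names fetch_status and matches_expectation are made up for illustration, "server" is the article's placeholder host, and the script assumes the server presents a certificate that Python's default TLS settings accept.

```python
import http.client
from typing import Optional, Tuple
from urllib.parse import urlsplit


def fetch_status(url: str) -> Tuple[int, Optional[str]]:
    """Send one HEAD request and return (status, Location header) without following redirects."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=10)
    try:
        conn.request("HEAD", target)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def matches_expectation(status: int, location: Optional[str],
                        expected_redirect: Optional[str]) -> bool:
    """Compare a response with what the rule should do.

    expected_redirect is the HTTP URL we expect a 301 to point at,
    or None when the page should stay on HTTPS (no redirect at all).
    """
    if expected_redirect is None:
        # Staying on HTTPS means the response is not a 3xx redirect.
        return not 300 <= status < 400
    if status != 301 or location is None:
        return False
    # Ignore a trailing slash so http://server and http://server/ compare equal.
    return location.rstrip("/") == expected_redirect.rstrip("/")
```

A check for one of the cases above would then read: status, location = fetch_status("https://server/page.html"), followed by matches_expectation(status, location, "http://server/page.html"). Replace "server" with your own host name before running it.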
mahbub on November 30, 2008
Here is a simple spelling error: possiblity (possibility)
liz on December 4, 2008
I have gone through this.