How To Stop Search Engines From Indexing Specific Posts And Pages In WordPress
If you are developing a new website, it is usually best to block search engines from indexing it so that your incomplete website is not listed in search results. This can be done easily through the Reading settings page at http://www.yourwebsite.com/wp-admin/options-reading.php.
All you have to do is scroll down to the Search Engine Visibility section and enable the option entitled “Discourage search engines from indexing this site”.
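Behind the scenes, this checkbox simply toggles a core WordPress option named blog_public. If you ever want to check its state in code, here is a minimal sketch (the option name is part of WordPress core; the message text is just an illustration):

<?php
// The "Discourage search engines from indexing this site" checkbox stores
// its state in the 'blog_public' option: '1' = allow, '0' = discourage.
if ( '0' === get_option( 'blog_public' ) ) {
    echo 'Search engines are currently discouraged from indexing this site.';
}
?>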
Unfortunately, WordPress does not let you stop pages from being indexed on a page-by-page basis. Natively, your only options are to allow search engines to index everything or nothing at all.
Stopping search engines from indexing specific pages is necessary from time to time. For example, on my personal blog I block search engines from indexing my newsletter email confirmation page. I also stop them from indexing the page where my free eBook can be downloaded. Most blogs do not take the step to block search engines from indexing their download pages. This means that people can download eBooks and other digital files from private pages by simply doing a quick search online.
There are a number of different ways in which you can stop search engines from indexing posts and pages on your website. In this article, I would like to show you a few of the solutions that are available to you.
* All methods detailed in this article have been tested on a test WordPress installation and have been verified to work correctly
A Brief Overview of the Robots Meta Tag
Google advises website owners to block URLs using the robots meta tag. The robots meta tag follows this format:

<meta name="value" content="value">

For example, to block all search engines from indexing a page, you would use:

<meta name="robots" content="noindex">
If you want to block content from a specific search engine, you need to replace the value of robots with the name of that search engine’s spider. Some common search engine spiders are:
- googlebot – Google
- googlebot-news – Google News
- googlebot-image – Google Images
- bingbot – Bing
- teoma – Ask
All you have to do to block a specific crawler is replace robots with the name of the spider:

<meta name="googlebot-news" content="noindex">

Note that each robots meta tag can only name one crawler, so to block several specific crawlers you need to add a separate tag for each:

<meta name="googlebot-news" content="noindex">
<meta name="bingbot" content="noindex">
For reference, here is a list of the most common directives that are available to you:
- all – No restrictions on indexing or linking
- index – Show the page in search results and show a cached link in search results
- noindex – Do not show the page in search results and do not show a cached link in search results
- follow – Follow links on the page
- nofollow – Do not follow links on the page
- none – The same as using “noindex, nofollow”
- noarchive – Do not show a cached link in search results
- nocache – Do not show a cached link in search results
- nosnippet – Do not show a snippet for the page in search results
- noodp – Do not use the meta data from the Open Directory Project for titles or snippets for this page
- noydir – Do not use the meta data from the Yahoo! Directory for titles or snippets for this page
- notranslate – Do not offer translation for the page in search results
- noimageindex – Do not index images from this page
- unavailable_after: [RFC-850 date/time] – Do not show the page in search results after a date and time specified in the RFC 850 format
A few of these directives are redundant too. <meta name="robots" content="all">, for example, will give the same result as <meta name="robots" content="index, follow">. And there is no point using either of those meta tags, since search engines will index content and follow links by default anyway.
If you are trying to block search engines from indexing a page, the nofollow directive cannot be used on its own. The nofollow directive advises search engines not to follow the links on a page; the result is the same as applying the nofollow link attribute to every link on the page. You can use this to stop search engines from discovering, and subsequently crawling, the pages that are linked from it.
Consider a blog that only links to a download area from a thank you page. You could add a nofollow meta tag to the header of the thank you page so that search engine spiders never visit the download page. This would stop search engine spiders from crawling the download page and subsequently indexing it. All you would have to do would be to ensure that the thank you page was the only place that the download page was linked from.
Unquestionably, though, someone else will link to that download page, whether you like it or not. This means that the nofollow directive is ineffective on its own. I have checked the incoming traffic to my own blog and found direct links to my download page from black hat forums. It is near impossible to stop others from linking to a page once it is known to people other than yourself.
That is why you also need to use the noindex directive. The directive ensures that a page is not shown in search results. It also ensures that a cached link for the page is not displayed; therefore you do not need to use the noarchive directive if you are using noindex.
So to stop all search engines from indexing a page and stop them from following its links, we should add this to the header of our page:

<meta name="robots" content="noindex,nofollow">
If you wanted to remove a page from the index, but still wanted search engines to crawl the pages that are linked from it, you would use:

<meta name="robots" content="noindex">
By default, Googlebot will index a page and follow links to it. So there’s no need to tag pages with content values of INDEX or FOLLOW.

When using the robots meta tag on your website, please note that:
- Meta tags are not case sensitive. Therefore, <meta name="ROBOTS" content="NOINDEX"> is interpreted in the exact same way as <meta name="robots" content="noindex"> and <meta name="RoBoTS" content="nOinDeX">.
- If the robots.txt and meta tag instructions conflict, Google follows the most restrictive rule (I’m 99% sure other search engines follow this same rule, however I could not find any clarification about this issue from other search engines). Bear in mind that if robots.txt stops a spider from crawling a page, the spider will never see a robots meta tag placed on that page.
To ensure that only specific posts and pages are blocked, we need to use an if statement that applies the noindex directive only to specified pages. Let’s take a closer look at how we can do that.
Add the Robots Meta Tag to Your Theme Header: Method 1
I will show you three methods of adding a meta tag to your website by modifying your header.php template. The end result is the same for all three; however, you may prefer using one method over another.

In order to block a specific post or page, you need to know its post ID. The easiest way to find the ID of a page is to edit it. When you edit any type of page in WordPress, you will see a URL such as https://www.yourwebsite.com/wp-admin/post.php?post=15&action=edit in your browser address bar. The number denoted in the URL is the post ID. It refers to the row in the wp_posts database table.
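If you prefer to look an ID up in code, here is a minimal sketch using the get_page_by_path() function from WordPress core (the slug hello-world is just an example):

<?php
// Look up the ID of the blog post with the slug 'hello-world'.
$my_post = get_page_by_path( 'hello-world', OBJECT, 'post' );
if ( $my_post ) {
    echo $my_post->ID;
}
?>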
Once you know the ID of the post or page you want to block, you can block search engines from indexing it by adding the code below to the head section of your theme’s header.php template. That is, between <head> and </head>. You can place code anywhere within the head section; however I recommend placing it underneath, or above, your other meta tags, as it makes it easier to reference later.
<?php if ( $post->ID == X ) { echo '<meta name="robots" content="noindex,nofollow">'; } ?>

For example, to block the post with an ID of 15, you would use:

<?php if ( $post->ID == 15 ) { echo '<meta name="robots" content="noindex,nofollow">'; } ?>
You can block additional pages on your website by using the OR operator.
<?php if ( $post->ID == X || $post->ID == Y ) { echo '<meta name="robots" content="noindex,nofollow">'; } ?>

For example, to block the posts with IDs of 15, 137, and 4008:

<?php if ( $post->ID == 15 || $post->ID == 137 || $post->ID == 4008 ) { echo '<meta name="robots" content="noindex,nofollow">'; } ?>
You also need to check the source code of a page that you are not trying to block from search engines. This will verify that you have not blocked all pages on your website in error.
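As an aside, if you would rather not touch header.php at all, the same meta tag can be printed through the wp_head action from your theme’s functions.php file. Here is a minimal sketch; the function name and the post IDs are just examples:

<?php
// Print a noindex,nofollow meta tag for selected posts via the wp_head hook.
// Replace the IDs below with the posts you want to block.
function myprefix_noindex_selected_posts() {
    if ( is_single( array( 15, 137 ) ) ) {
        echo '<meta name="robots" content="noindex,nofollow">' . "\n";
    }
}
add_action( 'wp_head', 'myprefix_noindex_selected_posts' );
?>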
Add the Robots Meta Tag to Your Theme Header: Method 2
You can also block pages from search engines by utilizing WordPress conditional tags. In order to use this technique correctly, you need to use the appropriate conditional tag. For example, you would use is_single for a blog post and is_page for a WordPress page.

Once again, we need to add the code to the head section of our theme’s header.php template. In the example below, X denotes the ID of a blog post that has to be blocked from search engines.
<?php if ( is_single( X ) ) : ?>
<meta name="robots" content="noindex,nofollow">
<?php endif; ?>
Consider the first blog post that is added to WordPress. It has a post ID of 1, post title of “Hello World”, and a post slug of “hello-world”. We can therefore define the post in our code by using:
<?php if ( is_single( 1 ) ) : ?>
<?php if ( is_single( '1' ) ) : ?>
<?php if ( is_single( 'Hello World' ) ) : ?>
<?php if ( is_single( 'hello-world' ) ) : ?>
Conditional tags can also be combined using the OR operator, and both is_single and is_page accept an array of IDs, titles, and slugs:

<?php if ( is_single( 'big-announcement' ) || is_single( 'new-update-coming-soon' ) ) : ?>
<?php if ( is_page( array( 'about-page', 'Testimonials', '658' ) ) ) : ?>
<?php if ( is_single( 'big-announcement' ) || is_page( 'About' ) ) : ?>

Putting it all together, here is a complete example that blocks a group of posts and pages:
<?php if ( is_single( array( '45', '68', '159', '543' ) ) || is_page( array( 'about-page', 'Contact Us', '1287' ) ) ) : ?>
<meta name="robots" content="noindex,nofollow">
<?php endif; ?>
If you referenced the post or page title or slug in your code, the code would cease to work if someone modified that title or slug. Every time you changed a title or slug, you would need to update the meta tag code in the header.php template. This is why I recommend using post IDs. In the long term, it is a more practical solution if you are hiding many posts and pages.
Add the Robots Meta Tag to Your Theme Header: Method 3
Another technique you can use is to utilize the WordPress custom field feature. Hardeep Asrani explained this technique earlier this year in a tutorial entitled “How To Disable Search Engine Indexing On A Specific WordPress Post“.

The first thing you need to do is add the following code to the head section of your theme’s header.php template.
<?php
$noindex = get_post_meta( $post->ID, 'noindex-post', true );
if ( $noindex ) {
    echo '<meta name="robots" content="noindex,nofollow" />';
}
?>
Then, to block a post or page, add a custom field named noindex-post to it, with any non-empty value such as 1, using the Custom Fields panel in the post editor. Simply repeat this step for any post or page you want to block from search engines.
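If you would rather set the custom field in code than through the post editor, you can use the update_post_meta() function from WordPress core. A minimal sketch, assuming the post you want to block has an ID of 15:

<?php
// Flag post 15 as noindex by setting the 'noindex-post' custom field
// that the header.php snippet above checks for.
update_post_meta( 15, 'noindex-post', '1' );
?>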
I believe this is one of the most user friendly techniques that a developer can configure for a client, as it is so simple to block additional posts and pages. However, it does not give you a quick way of seeing which posts and pages are blocked from search engines and which are not. Therefore, if you use this technique and are blocking a lot of pages, it may be prudent to keep a note of every page that you have blocked.
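Alternatively, if you wanted a quick audit, a minimal sketch along these lines would list every post and page that carries the noindex-post custom field (the query uses standard WordPress functions; the snippet itself is just an illustration):

<?php
// List the titles of every post and page flagged with 'noindex-post'.
$blocked = get_posts( array(
    'post_type'   => array( 'post', 'page' ),
    'numberposts' => -1,
    'meta_key'    => 'noindex-post',
) );
foreach ( $blocked as $blocked_post ) {
    echo esc_html( $blocked_post->post_title ) . "\n";
}
?>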
Block Search Engines Using a WordPress Plugin
If you need to block search engines from more than a few posts and pages, you might find using a WordPress plugin a more practical solution. The plugin that I have used to do this in the past is PC Hide Pages.

To remove a page from a search engine using the plugin, all you have to do is select the page from a list of your pages. When you do this, the plugin applies the appropriate meta tag to the page in question. For me, it is one of the best solutions for removing pages from search engines, as you can see at a glance which pages you have hidden, and do it directly through the WordPress admin area (which is not something you can natively do with the robots.txt method).
The only downside to the plugin is that it only supports WordPress pages. It does not support blog posts or other custom post types. This is unlikely to be a problem for many of you, as content that needs to be hidden from search engines is generally published as a WordPress page, e.g. a thank you page or a download page.
If your website uses a popular SEO plugin, such as WordPress SEO or All in One SEO Pack, then you already have the functionality to remove content from search engines.
Yoast was one of the first developers to create a plugin that helped website owners block search engines. He later integrated his Robots Meta plugin into WordPress SEO.
The Titles & Metas settings area in WordPress SEO has a section entitled Sitewide meta settings. This section lets you apply the noindex directive to subpages of archives and prevent titles and snippets from Open Directory Project and Yahoo! Directory being used.
WordPress SEO gives you a huge amount of control over how search engines treat a page on your website. The first option controls whether a page is indexed on a search engine. Six additional robots meta tag directives can be applied, including follow, nofollow, none, and noarchive. You can also exclude a page from your website sitemap and set its sitemap priority. A 301 URL redirect can also be configured if you need to redirect traffic from that page to another location.
The general settings page of All in One SEO Pack has a section called Noindex Settings. In this section, you can apply the noindex meta tag to many different areas of your website; for example, categories, author archives, and tag archives. You can also stop titles and snippets from the Open Directory Project and Yahoo! Directory being used. As you can see, it offers a few more global options than WordPress SEO.
Just like WordPress SEO, All in One SEO Pack adds a settings area to the post editor page. In addition to applying noindex and nofollow, you can exclude the page from your sitemap and disable Google Analytics. On a post level, All in One SEO Pack offers less control than WordPress SEO.
Both WordPress SEO and All in One SEO Pack work in the same way as the custom field method I explained earlier i.e. by selecting “noindex,nofollow” via the post editor. If you are already using one of these plugins, you may want to use them to select which posts and pages should be hidden from search engines.
Stop Search Engines Crawling a Post or Page Using Robots.txt
The Robots Exclusion Standard determines what search engine spiders should index and what they should not index. To use it, you need to create a new text file and save the file as robots.txt.

The concept behind the robots.txt protocol is the same as that of the robots meta tag that I have discussed at length in this article. There are only a few basic rules.
- User-agent – The search engine spider that the rule should be applied to
- Disallow – The URL or directory that you want to block
With the Disallow rule, the URL or directory you block is defined using a relative path from your domain. Therefore, / would block search engines from indexing your whole website and /wp-admin/ would block search engines from your WordPress admin area.
Here are a few examples to help you understand how easy it is to use a robots.txt file to block search engines.
The code below will block search engines from indexing your whole website. Only add this to your robots.txt file if you do not want any pages on your website indexed.
User-agent: *
Disallow: /
To block search engines from indexing a specific blog post, reference the post’s permalink path:

User-agent: *
Disallow: /2014/06/big-announcement/

You can block a specific page in the same way:

User-agent: *
Disallow: /email-subscription-confirmed/
Another rule that is available to you is Allow. This rule lets you specify paths that a particular user agent is permitted to crawl, even when a broader rule blocks them. The example below shows you how this works in practice. The code will block all search engines, but it will allow Google Images to index the content inside your images folder.
User-agent: *
Disallow: /

User-agent: Googlebot-Image
Allow: /images/
Once you have created and saved your robots.txt file, you should upload it to the root of your domain i.e. www.yourwebsite.com/robots.txt.
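It is worth knowing that if no physical robots.txt file exists, WordPress serves a virtual one, which you can extend through the robots_txt filter. Here is a minimal sketch for your theme’s functions.php; the function name and the blocked path are just examples:

<?php
// Append a rule to the virtual robots.txt that WordPress generates.
function myprefix_extend_robots_txt( $output, $public ) {
    $output .= "Disallow: /email-subscription-confirmed/\n";
    return $output;
}
add_filter( 'robots_txt', 'myprefix_extend_robots_txt', 10, 2 );
?>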
Robots.txt is a relatively straightforward standard to understand. If you are looking for more help on creating a robots.txt file, you may want to check out the help pages from Bing and Google. However, I believe the best way to learn how to build a robots.txt file is to look at the robots.txt files of other websites. This can be done easily, as robots.txt files can be seen by anyone. All you have to do is visit www.nameofwebsite.com/robots.txt for any website you want to check. Note that some websites do not use robots.txt, so you may get a 404 error.
Here are some examples of Robots.txt files to illustrate how you can use it to control what search engines do:
- Amazon’s Robots.txt File
- Facebook’s Robots.txt File
- Google’s Robots.txt File
- YouTube’s Robots.txt File
How to Remove Content From Public View
Stopping search engines from indexing a page is not always the best solution. If you want to hide a page from the world, it may be more practical to restrict access to it. I spoke about this in great detail last month in my review of WordPress membership plugins.

A membership plugin such as Paid Memberships Pro, for example, will allow you to restrict access to content to those who are eligible. This is particularly useful for protecting downloads and premium content.
For a complete list of membership plugins, please check out my recent article “Using WordPress Membership Plugins To Create Your Own Website Membership“.
How to Remove a Page From Search Engine Results
Search engine crawlers occasionally do not see the noindex directive denoted on a page. Therefore, your pages may be indexed even though you advised crawlers not to index them. You may also have pages that were correctly indexed, but that you now want removed from search engines.

“Note that because we have to crawl your page in order to see the noindex meta tag, there’s a small chance that Googlebot won’t see and respect the noindex meta tag. If your page is still appearing in results, it’s probably because we haven’t crawled your site since you added the tag.” – Google

The most efficient way of removing a page from a search engine’s index is to use a search engine URL removal tool. In Google Webmaster Tools, you will see an option to remove URLs in the Google Index section.
Simply click on the “Create a new removal request” button and enter your URL. Be aware that you only need to enter the path that comes after your domain. For example, if you want to remove a page located at www.yourwebsite.com/news/big-announcement, you would enter news/big-announcement.
You can choose to remove a page from search results and the cache, or remove it from the cache only. There is also an option to remove a complete directory, which can be used to remove a whole website from Google’s search results.
Google will then display a message that states that the page or directory has been added for removal. Take this opportunity to double check that the URL you submitted is correct.
Removing a URL from Bing is even easier. Within their Bing Webmaster Tools service is the Bing Content Removal Tool.
To remove a page from Bing’s index, all you have to do is enter the page URL. Then select whether you want to remove the page from the index or remove an outdated cached version of the page.
Once you have submitted the URL, you will see a history of the pages you have submitted for removal.
Unfortunately, neither Google nor Bing offers an option to upload a CSV list of pages that you want removed from the index. Therefore, you need to submit requests one by one.
Final Thoughts
Unfortunately, not all search engines play nice. It is up to a search engine as to whether it honors your request not to index a page. The most popular search engines do follow the rules set out by website owners, whereas disreputable search engines, and nasty software from hackers and spammers, tend to do whatever they want.

I hope you have found this tutorial on stopping search engines from indexing your content useful. If you know of any other good techniques to stop search engines from indexing content on a WordPress powered website, please leave a comment below.