Where to FOLLOW or to NOFOLLOW

Published Wed. Oct. 31, 2012

Tell web crawlers which files and directories they can index and follow, and which advertising-based or untrusted links you don't want to pass link juice to

I started a job working for a company selling hazmat material handling equipment. I came on as a web developer and designer to help out with a rebranding and reworking of the site(s).

The site has great page rankings for certain key phrases in the industry, but the original hard html files that were generated in the mid-1990s are still being used. No templating…just copy the original file and modify it. It worked in the past, but this is why we started developing dynamic sites years ago. Updating any common element globally is rather cumbersome.

So I was comparing all the html files and noticed some have the
<meta name="robots" content="index, follow" />
in the head tag and other pages do not (along with the GA tracking code…some pages have the newest version, some use the older Urchin, and some are missing it).

This type of inconsistency is inevitable while using the old hard file standard.
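
A comparison like this could be scripted rather than done by eye. The following is only a minimal sketch in Python using the standard library's html.parser; the names RobotsMetaFinder and robots_content are made up for illustration, and it assumes each page's HTML has already been read into a string.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Remembers the content of the last <meta name="robots"> tag seen."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "meta":
            attr_map = dict(attrs)
            if (attr_map.get("name") or "").lower() == "robots":
                self.content = attr_map.get("content")


def robots_content(html_text):
    """Return the robots meta content string, or None if the tag is absent."""
    finder = RobotsMetaFinder()
    finder.feed(html_text)
    return finder.content


# Hypothetical page sources standing in for two of the old static files.
page_with_tag = '<html><head><meta name="robots" content="index, follow" /></head></html>'
page_without_tag = "<html><head><title>Products</title></head></html>"
```

Running robots_content over every file and listing the ones that return None would show which pages are missing the tag.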

But, foolishly, the boss and I thought that the pages without the meta robots index and follow tag would not be indexed by web crawlers.

So I did some real-world tests…I simply searched Google for one of our industry's keywords and noticed the page that was ranking (fairly high) DID NOT contain the meta tag for robots to follow and index.

I then quickly viewed the source of my own site (the one you're reading) and noticed I too have the meta robots tag (although in a nice php include so all pages follow suit when updated).

Upon further investigation, you do not need to add this meta tag to your site if you want web crawlers to index your pages and follow their outbound links. You ONLY need it if you want certain pages to be NOFOLLOW or NOINDEX. From what I understand, these meta tag variations, placed within the head tag of each page, are more explicit than disallowing a certain directory or file in your robots.txt file. A Disallow rule only asks crawlers not to fetch a URL; it does not by itself keep that URL out of search results if other pages link to it. Some bots ignore the file, and some blackhats will use it as a roadmap to find the stuff you're trying to hide. You could disallow a contact page with an email address (although you should be using a form), but email harvesting bots ignore the robots file anyway. You could also disallow your diary.php page since it's private to you and you don't want your wife reading it through a Google search (although it may be safer to put a NOINDEX meta robots tag inside a condition for those pages in your header template file).

You can wrap the meta robots tags in conditions in your common header template file so that NOFOLLOW and/or NOINDEX are only output when certain conditions are met…but remember the default is indexing and following.
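
As a rough illustration of that kind of condition, here is a minimal sketch in Python (my own header uses a php include, so treat this as the idea only). The path prefixes and the robots_meta function are hypothetical, not taken from any real site. It outputs nothing at all for ordinary pages, since index and follow are the defaults.

```python
from html import escape

# Hypothetical path prefixes, chosen only for illustration.
NOINDEX_PATHS = ("/diary", "/testing/")
NOFOLLOW_PATHS = ("/sponsors/",)


def robots_meta(path):
    """Return a robots meta tag for this path, or "" when the defaults apply."""
    directives = []
    if path.startswith(NOINDEX_PATHS):  # startswith accepts a tuple of prefixes
        directives.append("noindex")
    if path.startswith(NOFOLLOW_PATHS):
        directives.append("nofollow")
    if not directives:
        # Default behavior is index, follow, so no tag is needed.
        return ""
    content = escape(", ".join(directives), quote=True)
    return f'<meta name="robots" content="{content}" />'
```

The header template would call robots_meta with the current request path and print the result inside the head tag.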

You can still disallow a configuration directory, but remember, that is a dead giveaway to hackers that there is sensitive stuff in there. Sensitive stuff should be encrypted regardless.

I do still include a robots.txt file and direct bots to my SITEMAP xml file.
I'm still not sure if this is the same as "submitting" your sitemap to Google, but it has been working for me for years.
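
For reference, a robots.txt that disallows one directory and points bots at a sitemap might look like the text in the sketch below. The example.com URLs and the /config/ directory are placeholders, not my real setup. The sketch feeds the text to Python's standard urllib.robotparser to show how a well-behaved crawler would read it (site_maps() needs Python 3.8 or newer).

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt content; the domain and paths are assumptions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /config/

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Sitemap URLs a crawler would discover from the file.
sitemaps = parser.site_maps()

# What a polite crawler is asked to do for two sample URLs.
config_allowed = parser.can_fetch("*", "https://www.example.com/config/settings.php")
home_allowed = parser.can_fetch("*", "https://www.example.com/index.html")
```

Note that the Disallow line is readable by anyone who requests the file, which is exactly the roadmap problem mentioned above.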

NOFOLLOW on anchor tags

Those are two ways to direct a bot to index or not, per page or per directory. There is also a NOFOLLOW value we can place on anchor tags, using rel="nofollow". The main reason we mark certain a tags as nofollow is to prevent spam. Basically, if you post a link to your site on a million message boards, you won't get all their collective link juice. Some CMSs, like WordPress, set the links in posted comments to NOFOLLOW by default. If you control your form and approve and trust your commenters, you should spread the link juice love back to them.
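
Here is a minimal sketch of that per-link decision in Python. The comment_link function and its trusted flag are hypothetical; a real CMS will have its own way of deciding which commenters are trusted.

```python
from html import escape


def comment_link(url, text, trusted=False):
    """Build an anchor tag; links from untrusted sources get rel="nofollow"."""
    rel = "" if trusted else ' rel="nofollow"'
    return f'<a href="{escape(url, quote=True)}"{rel}>{escape(text)}</a>'


# An unapproved commenter's link versus an approved one.
untrusted = comment_link("https://example.com/", "My site")
approved = comment_link("https://example.com/", "My site", trusted=True)
```

Defaulting trusted to False means a link only passes link juice when someone has deliberately approved it.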

Many sites are now placing NOFOLLOW on links for advertisement banners. This is supposed to control spam, but I'm not sure the advertisers don't deserve the backlink since they are paying for it, right? Maybe if they have the money to advertise they already have enough authority?

It seems now it is standard to place the NOFOLLOW attribute on links in comments and advertisements.

Conclusion

It also seems standard not to include a robots meta tag if you want crawlers to INDEX and FOLLOW. If you want Google to NOINDEX a page (for example, on a testing site or a new site, so crawlers don't find duplicates of the original), you can disallow the directory or file in the robots.txt file, but it is more explicit to have NOINDEX in the robots meta tag on each page.

Keywords: INDEX, NOINDEX, FOLLOW, NOFOLLOW