Robots.txt: The Deceptively Important File All Websites Need
By Jill Caren
The robots.txt file helps major search engines understand where they’re allowed to go on your website.
But, while the major search engines do support the robots.txt file, they may not all adhere to the rules the same way.
Below, let’s break down what a robots.txt file is, and how you can use it.
What is a robots.txt file?
Every day, there are visits to your website from bots — also known as robots or spiders. Search engines like Google, Yahoo, and Bing send these bots to your site so your content can be crawled and indexed and appear in search results.
Bots are a good thing, but there are some cases where you don’t want the bot running around your website crawling and indexing everything. That’s where the robots.txt file comes in.
By adding certain directives to a robots.txt file, you’re directing the bots to crawl only the pages you want crawled.
However, it’s important to understand that not every bot will adhere to the rules you write in your robots.txt file. Google, for instance, ignores the Crawl-delay directive, so any rules you place in the file about crawling frequency have no effect on Googlebot.
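To make this concrete, here is a minimal sketch of how a well-behaved bot might read a set of directives, using Python's standard urllib.robotparser module. The rules, the example.com URLs, and the "ExampleBot" name are all made up for illustration, and real crawlers may interpret the same file differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a made-up crawler name.
allowed_blog = parser.can_fetch("ExampleBot", "https://example.com/blog/post")
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private/report")

# The parser reports the requested delay, but honoring it is up to each bot.
delay = parser.crawl_delay("ExampleBot")

print(allowed_blog, allowed_private, delay)  # True False 10
```

Note that the file only states a request. Nothing in it forces a bot to comply, which is why different crawlers can behave differently with the same rules.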
Do you need a robots.txt file?
No, a robots.txt file is not required for a website.
If a bot comes to your website and the site doesn’t have a robots.txt file, the bot will just crawl your website and index pages as it normally would.
A robots.txt file is only needed if you want to have more control over what is being crawled.
Some benefits to having one include:
- Help manage server overloads
- Prevent crawl waste from bots visiting pages you do not want them to crawl
- Keep compliant bots out of certain folders or subdomains (each subdomain needs its own robots.txt file)
Can a robots.txt file prevent indexing of content?
No, you cannot stop content from being indexed and shown in search results with a robots.txt file.
Not all robots will follow the instructions the same way, so some may index the content you set to not be crawled or indexed.
In addition, if other sites link to the content you are trying to keep out of the search results, search engines can still index it.
The reliable way to ensure your content is not indexed is to add a noindex meta tag to the page. This line of code goes in the head section of your page’s HTML and looks like this: <meta name="robots" content="noindex">
It’s important to note that if you want the search engines to not index a page, you will need to allow the page to be crawled in robots.txt. A bot that is blocked from crawling the page never sees the noindex tag.
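If you want to check whether a page actually carries the tag, a short script can help. The sketch below uses Python's standard html.parser module; the class name, the has_noindex helper, and the sample HTML strings are hypothetical and exist only for this example. It also ignores other ways of sending the directive, such as HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. a bare attribute), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Made-up sample pages for illustration.
hidden_page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
normal_page = "<html><head><title>Hello</title></head></html>"

print(has_noindex(hidden_page), has_noindex(normal_page))  # True False
```

A crawler can only run a check like this on pages it is allowed to fetch, which is the reason the page must stay crawlable.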
Where is the robots.txt file located?
The robots.txt file will always sit at the root domain of a website. As an example, our own file can be found at https://www.hubspot.com/robots.txt.
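Because the location is fixed, you can work out the robots.txt address for any page on a site. Here is a small sketch using Python's standard urllib.parse module; the robots_txt_url function name and the sample page path are assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path below is hypothetical.
print(robots_txt_url("https://www.hubspot.com/marketing/some-post?utm=x"))
# https://www.hubspot.com/robots.txt
```

The function keeps the full host name, so a subdomain resolves to its own robots.txt file instead of the main site's.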
On most websites, you should be able to access the actual file so you can edit it over FTP or by accessing the File …
Source: HubSpot Blog