It needs to include two simple things: clear instructions to bots, and your sitemaps.

Clear instructions to bots

Ideally, you'll need to analyse your log files to see which bots are crawling your site, blocking any bots that you don't want crawling it using the robots.txt file – I'll explain how to do this further down this blog. Then it's time to choose what parts of your site you DON'T want them to crawl, excluding these at page, section or folder level – it's entirely up to you. You can use wildcard instructions in your robots file to stop them crawling certain sections; a disallow rule on a folder, for instance, would stop any bot from crawling any of the pages in that folder. (You can call your folder something fancier if you like – keeping it simple works for me!)

Sitemaps

Listing your sitemap URLs here is a good way to ensure that Google (and other search engines) can find your files. I'd recommend including them ALL here – it's good practice to do so. There are, of course, several other ways to let Google know about your sitemaps. However, this one helps all search engines and other bots find your files.

Where and how you upload your robots.txt file is crucial. But firstly, you need to ensure that it's accurately named. I'm going to be crystal clear about this: it can't be named robot.txt or Robots.txt – only robots.txt will do. This is the ONLY file name that bots will look for. You'd be surprised at how easy it is to misspell such a simple filename when you're not concentrating, so check and check again before you do anything further.

Your robots.txt file must also be located in the root of your site – for example, for this site it's /robots.txt (although I'm beginning to sound a little bossy, it's for your own good). Some people try to store the file in a sub-folder, such as /files/robots.txt – but you'd be wasting your own time and Googlebot's with this, as they won't find and use it there.

Using a robots.txt checker can be a really handy way to make sure you're doing everything right, and there are several great robots.txt testers out there. Google offers a very good tester within Google Search Console, but because it's Google, it checks for Google bots only. Some of the actions you could take with such a checker include checking whether a particular URL is blocked by robots.txt. The great thing is that the robots.txt file allows you to be very specific and block sections, pages or even whole folders, so with most tools you can enter a URL, choose a bot, and it will tell you whether it's allowed to be crawled.

Why would you want to block all search engines from crawling your site?

After all, if they can't crawl your site they can't index it, so this might seem like a stupid idea on the face of it – but bear with me. There are some legitimate cases where you might not want bots crawling your site, and the most common reason is that the site in question is your staging/development site. To do this, you add a blanket "disallow everything" rule to your robots file.

There are some questions that I hear often about robots.txt files, so I thought I'd take a little time to go through them. You might already know some of this stuff, but some might come as a surprise.

Do robots have to follow the instructions?

Most friendly bots will honour the robots.txt file. Scrapers and other potential black-hat-type bots don't even look at the robots file and usually just ignore it – after all, it's in their best interests to do so.

What happens if I don't have a robots file?

Google has confirmed that if you don't have a robots file, they will assume it's OK to crawl your entire site. This means they could end up crawling sections of the site you don't want them to, which could waste crawl budget – and that's not a good thing.

How to block Screaming Frog from crawling my site?

You might notice when checking your logs that you are getting crawled a lot by Screaming Frog.
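To sketch how these directives fit together, here's an illustrative robots.txt. The domain, the /no-crawl/ folder and the sitemap paths are placeholders for this example; the user-agent token is the one Screaming Frog documents for its crawler, but verify it against the user-agent strings in your own logs:

```txt
# Block Screaming Frog's crawler from the whole site
User-agent: Screaming Frog SEO Spider
Disallow: /

# Every other bot: stay out of one folder, crawl the rest
User-agent: *
Disallow: /no-crawl/

# List ALL of your sitemaps so every bot can find them
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemaps/posts-sitemap.xml
```

Bear in mind that robots.txt is a request, not an enforcement mechanism – a crawler's operator can configure their tool to ignore it, which is why keeping an eye on your logs remains worthwhile.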
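For the staging/development case, the blanket "disallow everything" rule is the shortest robots.txt there is – two lines that ask every bot to stay away from the whole site:

```txt
# Ask ALL bots to stay away from the entire site
User-agent: *
Disallow: /
```

Just remember to remove or replace it when the site goes live – and don't rely on it alone to keep a staging site private, since only well-behaved bots will honour it.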
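If you'd rather script these checks than paste URLs into a web-based tester one at a time, Python's standard library ships a robots.txt parser that answers the same "can this bot fetch this URL?" question. A minimal sketch, using a made-up robots.txt and placeholder URLs for illustration:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration: block one bot entirely,
# keep everyone else out of /no-crawl/ only.
ROBOTS_TXT = """\
User-agent: Screaming Frog SEO Spider
Disallow: /

User-agent: *
Disallow: /no-crawl/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask "may this bot fetch this URL?" just like a web-based checker.
print(parser.can_fetch("Googlebot", "https://www.example.com/blog/post"))         # True
print(parser.can_fetch("Googlebot", "https://www.example.com/no-crawl/page"))     # False
print(parser.can_fetch("Screaming Frog SEO Spider", "https://www.example.com/"))  # False
```

This only tells you what a rule-following bot would do, of course – it can't make a scraper behave.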