Gazley Posted February 23, 2013
Hey guys, it's been a while! Are there any rules of thumb for the minimum set of directories that need to be left open in robots.txt so that search engines can crawl only what they need to see? By this, I mean all published pages and the images they contain. As PW is based on the front-controller pattern, I assume that pages can be reached via access to index.php in the root folder. I'd be interested in your thoughts/experiences. Cheers!
neildaemond Posted February 23, 2013
I use the XML sitemap generator module to create a sitemap.xml, then in robots.txt I just put: sitemap: http://www.domain.com/sitemap.xml
It seems to get indexed quite easily by Google...
Edit: wait, I submit the XML sitemap specifically to Google, but the sitemap line in robots.txt seems to help all those random bots crawling my pages (probably not the best thing, since those are probably just copying content to a third-party site...)
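A quick way to sanity-check a sitemap line like the one above is to feed the file to Python's standard-library robots.txt parser. This is only a sketch: the robots.txt body and the /about/ URL are made up for illustration, and `site_maps()` needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body: nothing but the sitemap line from the post.
ROBOTS_TXT = """\
sitemap: http://www.domain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Sitemap URLs declared in the file (None when there are none).
sitemaps = parser.site_maps()

# With no Disallow rules present, any URL should be reported as crawlable.
# The /about/ page is an assumed example URL.
allowed = parser.can_fetch("*", "http://www.domain.com/about/")
```

If `sitemaps` comes back as None, the line was not recognised, which usually points to a typo in the directive name.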
ryan Posted February 25, 2013
"Any rules of thumb as to exactly what are the minimum directories that need to be made open in robots.txt so that search engines can crawl only what they need to see? By this, I mean all published pages and the images they contain."
If you have a robots.txt, I would use it to specify what directories you want to exclude, not include. In a default ProcessWire installation, you do not need to have a robots.txt at all. It doesn't open up anything to crawlers that isn't public. You don't need to exclude your admin URL because the admin templates already have a robots meta tag telling crawlers to go away. In fact, you usually wouldn't want to have your admin URL in a robots file because that would be revealing something about your site that you may not want people to know. The information in robots.txt IS public and accessible to all. So use a robots.txt only if you have specific things you need to exclude for one reason or another. And consider whether your security might benefit more from a robots <meta> tag in those places instead.
As for telling crawlers what to include: just use a good link structure. So long as crawlers can traverse it, you are good. A sitemap.xml might help things along too in some cases, but it's not technically necessary. In most cases, I don't think it matters to the big picture. I don't use a sitemap.xml unless a client specifically asks for it. It's never made any difference one way or the other. Though others may have a different experience.
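To illustrate the "exclude, don't include" approach, here is a minimal sketch using Python's standard-library parser. The /private-downloads/ directory and the URLs are hypothetical, chosen only for the example; nothing here is specific to ProcessWire.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that excludes one directory and leaves the rest open.
# Note that anyone can read this file, so the path it names is public knowledge.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private-downloads/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url: str) -> bool:
    """Return True if a generic crawler ("*") may fetch the given URL."""
    return parser.can_fetch("*", url)
```

Under these assumptions, `is_crawlable("http://www.domain.com/about/")` is true while anything under /private-downloads/ is not. Well-behaved crawlers honour the rule, but it is only a request, which is why a robots meta tag or real access control may suit sensitive areas better.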
Gazley Posted February 25, 2013 (Author)
Thanks a lot for the replies! They really helped. Cheers.
gebeer Posted December 21, 2013
Hello, this may be a dumb question, but I'm fairly new to PW and have just finished my first project. Now I'm asking myself: where in the PW directory structure would I put the robots.txt file? My first guess would be /site/templates. My second is root. I'm asking this because I had problems with the favicon.ico. I still don't know where to put it so it gets picked up automatically by browsers. I tried various locations and ended up putting it in the templates dir, with a <link rel="icon" href="path/to/favicon.ico"> in the head section of my template. But I'd rather go without that link tag. Any enlightenment on the correct locations for robots.txt and favicon.ico would be great. Thanks, gerhard
teppo Posted December 21, 2013
First of all, robots.txt needs to be in your site root (not /site/, but one level above that) to work properly. It's also possible to place it somewhere else, but then you'll have to find a way to make your web server serve it from the site root, which can get tricky, depending on your web host and its capabilities. I'd go with root to keep things simple.
A favicon can be placed somewhere other than root, but root seems to work best, as some browsers look there automatically. I'm not sure how many browsers do this on their own, but the link tag is definitely recommended. Probably the only way to make sure that your favicon is found is a) placing it in the root directory and b) adding a link tag pointing to it.
Edit: Wikipedia provides some details about how different favicon settings are interpreted. This seems to support what I said above, i.e. placing the favicon in the site root and adding a link tag, though that link tag is at least partially optional (I'd still leave it in to make sure, though.)
Edited December 21, 2013 by teppo
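The "site root" point can be made concrete with a small sketch: whatever page a crawler or browser starts from, the well-known files are requested from the root of the host, not from the page's own directory. The helper name `root_file_url` and the example URL are hypothetical.

```python
from urllib.parse import urljoin


def root_file_url(page_url: str, filename: str) -> str:
    """Build the root-level URL for a well-known file such as robots.txt."""
    # A leading slash makes urljoin resolve against the host root
    # instead of the directory of page_url.
    return urljoin(page_url, "/" + filename)


# Assumed example page living somewhere below the root.
page = "http://www.domain.com/site/templates/home.php"
robots_url = root_file_url(page, "robots.txt")
favicon_url = root_file_url(page, "favicon.ico")
```

Here `robots_url` resolves to http://www.domain.com/robots.txt, so a copy of the file stored only under /site/templates/ would never be requested unless the web server is configured to map the root URL to it.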
gebeer Posted December 21, 2013
@teppo thanks a ton. The strange thing is that I tried the root for the favicon and it wouldn't get picked up by either FF or Chrome. I had purged my cache to ensure that it wasn't a cache issue. Not sure, maybe I should've closed and reopened the browser, too. I will give it another try. There's an interesting read, 'rel="shortcut icon" considered harmful', about link rel="shortcut icon" vs. rel="icon". It discusses HTML5 validation and browser compatibility issues. That is why I would like to go without the link tag in the head. Cheers, gerhard