robots.txt file has incorrect URLs
My mongrel servers all crashed, I believe due to memory pressure from indexing robots. While looking into this I found that in Redmine 0.8 the public/robots.txt file has the following lines:
The gantt and calendar lines are invalid, however, and will not block robots. If you navigate to the calendar or gantt for a particular project the actual URLs are:
Since robots.txt does not support wildcards, perhaps the URLs should be modified to allow a stock robots.txt to work. Something more like repository URLs:
#1 Updated by Jean-Philippe Lang over 8 years ago
Indeed, this file wasn't fixed when the URLs changed (e.g. /projects/gantt/foo => /projects/foo/issues/gantt). And as you said, the robots.txt cannot be fixed since wildcards are not supported.
In the future, all project-related URLs should start with /projects/foo/, and I don't really want to change this just for this purpose. What could be done:
- using wildcards: they are supported by Googlebot and Yahoo Slurp, the top 2 spiders here (maybe not the best solution)
- or adding an action that responds to /robots.txt and returns a valid file (no wildcards) for all visible projects.
What do you think?
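A minimal sketch of the second option, in plain Ruby rather than as an actual Redmine action. The method name `robots_txt_for` and the list of path suffixes are assumptions made for illustration; only the `issues/gantt` suffix follows the URL pattern quoted above.

```ruby
# Hypothetical sketch: build a wildcard-free robots.txt body by listing
# every blocked path explicitly for each visible project.
# The suffixes are assumptions, except issues/gantt, which follows the
# /projects/foo/issues/gantt pattern mentioned in this ticket.
DISALLOWED_SUFFIXES = %w[repository issues/gantt issues/calendar].freeze

def robots_txt_for(project_identifiers, suffixes = DISALLOWED_SUFFIXES)
  lines = ["User-agent: *"]
  project_identifiers.each do |identifier|
    suffixes.each do |suffix|
      lines << "Disallow: /projects/#{identifier}/#{suffix}"
    end
  end
  lines.join("\n") + "\n"
end

# Usage: one Disallow line per project and suffix, no wildcards needed.
puts robots_txt_for(%w[foo bar])
```

The file grows with the number of projects, which is the price of not relying on wildcard support.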
#2 Updated by Brad Schick over 8 years ago
It only takes one spider that doesn't accept wildcards to crash a site, so I don't think that is a good option.
Generating robots.txt sounds like a good idea (as long as it is cached). And at the risk of adding extra work, there could be settings to add/remove Redmine areas from it. That way people could simply check options like "repository", "repository diffs", "gantt charts", "wiki", etc. without having to think about their URLs.
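Roughly what I have in mind, as a plain Ruby sketch and not a patch. The class name `RobotsTxtCache`, the option labels, and the path suffixes they map to are all assumptions for illustration.

```ruby
# Hypothetical mapping from admin-facing option labels to path suffixes.
# These suffixes are assumptions, not a list taken from Redmine itself.
AREA_SUFFIXES = {
  "repository"   => "repository",
  "gantt charts" => "issues/gantt",
  "wiki"         => "wiki"
}.freeze

class RobotsTxtCache
  def initialize
    @cache = {}
  end

  # Returns the cached body for this combination of projects and areas,
  # building it only on first use.
  def fetch(projects, areas)
    key = [projects.sort, areas.sort]
    @cache[key] ||= build(projects, areas)
  end

  private

  def build(projects, areas)
    suffixes = areas.map { |area| AREA_SUFFIXES.fetch(area) }
    rules = projects.flat_map do |project|
      suffixes.map { |suffix| "Disallow: /projects/#{project}/#{suffix}" }
    end
    (["User-agent: *"] + rules).join("\n") + "\n"
  end
end

cache = RobotsTxtCache.new
puts cache.fetch(%w[foo], ["wiki", "gantt charts"])
```

A real version would also need to drop the cached copy whenever a project or one of these settings changes.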
#4 Updated by Brad Schick over 8 years ago
Looking into my site's performance again, I suspect that a fair amount of CPU time is wasted producing PDF, CSV, and Atom representations of all issues, issue lists, wiki pages, forum pages, etc.
Along with the discussed changes for robots.txt, it would be very helpful to have options to add rel="nofollow" attributes to potentially expensive links like these. Here is a discussion of that attribute: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=96569
#5 Updated by Jean-Philippe Lang over 8 years ago
An action that responds to /robots.txt is added in r2319.
I'm not sure that the rel="nofollow" attribute is appropriate here. Wikipedia says:
How the attribute is being interpreted differs between the search engines. While some take it literally and do not follow the link to the page being linked to, others still "follow" the link to find new web pages for indexing. In the latter case rel="nofollow" actually tells a search engine "Don't score this link" rather than "Don't follow this link." This differs from the meaning of nofollow as used within a robots meta tag, which does tell a search engine: "Do not follow any of the hyperlinks in the body of this document."
Yahoo Slurp for example will actually follow the link.
#7 Updated by Jean-Philippe Lang over 8 years ago
- Status changed from New to Closed
- Target version set to 0.9.0
- Affected version (unused) set to 0.8.0
- Resolution set to Fixed
The rel="nofollow" attribute was added in r2334. Spiders like Googlebot should no longer follow these links.
I'm closing this ticket since the original request is fixed.
Another solution would be to detect common bots (using the request's user-agent) and deny them access to these links.
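For illustration only, a link carrying the attribute can be built as below. This is a standalone sketch, not the code from r2334; the helper name `nofollow_link` is an assumption.

```ruby
require "erb"

# Hypothetical helper: render an anchor tag marked rel="nofollow",
# escaping both the target URL and the label.
def nofollow_link(label, href)
  escaped_href  = ERB::Util.html_escape(href)
  escaped_label = ERB::Util.html_escape(label)
  %(<a href="#{escaped_href}" rel="nofollow">#{escaped_label}</a>)
end

puts nofollow_link("PDF", "/issues/1.pdf")
```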
This could be done globally using a before_filter, maybe as a plugin.
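A rough sketch of the check such a filter could perform, written as plain Ruby so it runs outside Rails. The pattern list, the set of "expensive" formats, and both method names are assumptions; a real plugin would call something like `deny_for_bot?` from its before_filter and decide there how to respond.

```ruby
# Assumed, deliberately short list of bot markers; real user-agent
# strings vary widely, so this would need tuning per site.
BOT_PATTERN = /googlebot|slurp|crawler|spider/i

# Formats treated as expensive to render (taken from the PDF, CSV and
# Atom exports discussed earlier in this ticket).
EXPENSIVE_FORMATS = %w[pdf csv atom].freeze

# True when the user-agent string looks like a known bot.
def bot_request?(user_agent)
  user_agent.to_s.match?(BOT_PATTERN)
end

# True when a bot is asking for one of the expensive formats.
def deny_for_bot?(user_agent, format)
  bot_request?(user_agent) && EXPENSIVE_FORMATS.include?(format.to_s.downcase)
end

ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
puts deny_for_bot?(ua, "pdf")
puts deny_for_bot?(ua, "html")
```

User-agent strings are easy to spoof, so this only helps against well-behaved spiders that identify themselves.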