robots.txt: disallow crawling dynamically generated PDF
While the auto-generated robots.txt contains URLs for /issues (the HTML issue list), it doesn't contain the corresponding URLs for the PDF version.
At osmocom.org (where we use Redmine), we're currently seeing lots of robot requests for /projects/*/issues.pdf?.... as well as /issues.pdf?....
- Status changed from Closed to Reopened
- Resolution deleted
The robots.txt generated by Redmine 4.1 does not disallow crawlers from accessing "/issues/<id>.pdf" and "/projects/<project_identifier>/wiki/<page_name>.pdf".
I think the following line should be added to robots.txt.
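For illustration only, such rules might look like the following in the generated robots.txt (a sketch assuming Redmine is served at the site root; the actual prefix depends on the installation):

Disallow: /issues/*.pdf$
Disallow: /projects/*/wiki/*.pdf$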
- Subject changed from robots.txt misses issues.pdf to robots.txt: disallow dynamically generated PDF
- Target version set to Candidate for next minor release
Since dynamically generated PDFs contain no more information than the corresponding HTML pages and are useless to web surfers, they should not be indexed by search engines. In addition, generating a large number of PDFs in a short period of time places a heavy load on the server.
I suggest disallowing web crawlers from fetching dynamically generated PDFs such as /projects/*/wiki/*.pdf and /issues/*.pdf by applying the following patch. The patch still allows crawlers to fetch static PDF files attached to issues or wiki pages (/attachments/*.pdf).
diff --git a/app/views/welcome/robots.text.erb b/app/views/welcome/robots.text.erb
index 6f66278ad..9cf7f39a6 100644
--- a/app/views/welcome/robots.text.erb
+++ b/app/views/welcome/robots.text.erb
@@ -10,3 +10,5 @@ Disallow: <%= url_for(issues_gantt_path) %>
 Disallow: <%= url_for(issues_calendar_path) %>
 Disallow: <%= url_for(activity_path) %>
 Disallow: <%= url_for(search_path) %>
+Disallow: <%= url_for(issues_path(:trailing_slash => true)) %>*.pdf$
+Disallow: <%= url_for(projects_path(:trailing_slash => true)) %>*.pdf$
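With this patch applied, and assuming Redmine runs at the site root (a sub-URI installation would get the corresponding prefix), the generated robots.txt would gain entries along these lines:

Disallow: /issues/*.pdf$
Disallow: /projects/*.pdf$

These match URLs such as /issues/1234.pdf and /projects/foo/wiki/Bar.pdf, while static PDF attachments under /attachments/ remain crawlable. The "$" suffix anchors the match to the end of the URL, a wildcard pattern understood by major crawlers.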