Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=216882010-10-24T14:49:24ZВе Fio
<ul></ul><p>Aww, excuse me for putting this in "search engine"; I just realized the category doesn't actually fit this bug report. >.<</p> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=216892010-10-24T15:10:44ZВе Fio
<ul></ul><p>From searching the Google index, it also appears that they have not indexed /projects/project/issues, but they did index /projects/project/issues?tracker_id=1. Whether Googlebot follows the robots.txt mostly but not completely, I do not know; either way, that page is indexed where it shouldn't be.</p> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=216902010-10-24T16:16:26ZВе Fio
<ul></ul><p>You can just ignore comment 2.</p> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=217042010-10-25T10:39:49ZFelix Schäfer
<ul><li><strong>Category</strong> deleted (<del><i>Search engine</i></del>)</li></ul><p>Do you have any idea or example of how to prevent bots from crawling parametrized URLs?</p> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=218032010-10-25T18:15:35ZВе Fio
<ul></ul><p>Hi,</p>
<p>From what documentation I can get my hands on, this doesn't seem to be documented. I know that an entry like:<br />Disallow: /issues</p>
<p>will work; however, I am guessing that not disallowing it was intentional.</p>
<p>After searching a bit, however, I came across some rules that are said to work, though I haven't been able to verify them yet.<br /><pre>
Disallow: *sort=
Disallow: *&amp;sort=
Disallow: *?    # disallows every URL with a query string; not necessarily a good idea, just an example
Disallow: *sort=*
# If the above don't work (I've heard wildcards aren't supported), maybe something like:
Disallow: /issues?sort=
</pre></p>
<p>I'm 75% sure the ones with the wildcards will work, and 90% sure the example without the wildcard will work.</p>
<p>I tried to put in as many examples as I could; as I said, though, I haven't been able to verify them. Also, there may be more parameters that should be disallowed but that I missed (or that haven't been crawled yet). I'll keep on the lookout for more and update this report as needed. Hope that helps!</p>
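<p>As an aside, the original robots.txt standard only does literal prefix matching, so a parser that implements just the standard will honor the last rule above but silently ignore the wildcard ones. Python's stdlib <code>urllib.robotparser</code> implements only the original standard and can illustrate this (a sketch; the host and URLs are made up):</p>

```python
from urllib.robotparser import RobotFileParser

# Two of the candidate rules from above: one wildcard form, one plain prefix form.
robots_txt = """\
User-agent: *
Disallow: *sort=
Disallow: /issues?sort=
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The plain prefix rule is honored: URLs starting with /issues?sort= are blocked.
print(rp.can_fetch("*", "https://example.com/issues?sort=priority"))

# The wildcard rule is not understood by a standard-only parser, so a
# sorted per-project issue list slips through.
print(rp.can_fetch("*", "https://example.com/projects/foo/issues?sort=id"))

# Plain issue pages remain crawlable either way.
print(rp.can_fetch("*", "https://example.com/issues/6734"))
```

Crawlers that implement wildcard matching (a later Google extension) would behave differently, which is exactly why the wildcard rules are hard to rely on.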
<p>Please note: as you can see when visiting Redmine's robots.txt, it lists some URLs to disallow. It appears that Googlebot disregards a number of these even though it knows they're disallowed. I know this because Google Webmaster Tools showed me that the bot knows they're disallowed URLs, even though it visited them.</p> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=218092010-10-25T21:43:25ZFelix Schäfer
<ul></ul><p>Ве Fio wrote:</p>
<blockquote>
<p>From what documentation I can get my hands on, this doesn't seem to be documented. I know that an entry like:<br />Disallow: /issues</p>
<p>will work; however, I am guessing that not disallowing it was intentional.</p>
</blockquote>
<p>I guess so too; the one rule about the issue list is there to prevent bots from indexing content twice.</p>
<blockquote>
<p>After searching a bit however, I came across a bit of code that is said to work, but I haven't been able to verify it yet.<br />[...]</p>
<p>I'm 75% sure the ones with the wildcards will work, and 90% sure the example without the wildcard will work.</p>
</blockquote>
<p>So, some sort of "official" documentation would be nice, or at least confirmation that this works. Care to share your sources?</p>
<blockquote>
<p>I tried to put in as many examples as I could. Like I said, I couldn't and am unable to verify them though. Also, there may be more parameters that should be disallowed, but I missed (or they weren't yet navigated). I'll keep on the lookout for more, and update this report as needed. Hope that helps!</p>
<p>Please note: As you can see when visiting Redmine's robots.txt, it states some URL's to disallow. It appears that Googlebot disregards a lot of these even though it knows they're disallowed. I know this, because using Google Webmaster Tools, it showed me that the bot knows that they're disallowed URL's, even though it visited them.</p>
</blockquote>
<p>That's a problem you should tackle with google, not with us ;-)</p> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=218132010-10-26T07:56:07ZВе Fio
<ul></ul><p>This isn't "official", but I tried to find as many sources as I could; hopefully these will help. From what I read, the wildcards will work for some bots, but they are a bad idea because other bots won't follow them.<br /><a class="external" href="http://www.webmasterworld.com/forum93/823.htm">http://www.webmasterworld.com/forum93/823.htm</a><br /><a class="external" href="http://www.ihelpyou.com/forums/showthread.php?t=27849">http://www.ihelpyou.com/forums/showthread.php?t=27849</a><br /><a class="external" href="http://www.velocityreviews.com/forums/t608728-robots-txt-and-regular-expressions.html">http://www.velocityreviews.com/forums/t608728-robots-txt-and-regular-expressions.html</a></p>
<p>This one already follows the standard, so it's a safe fallback:<br /><pre>
Disallow: /issues?sort=
</pre></p>
<p>Conclusion: wildcards are too risky, but we already know the rule above will work, as it conforms to the standard. It's up to you whether you want to do something or nothing. ;)<br />Official documentation: <a class="external" href="http://www.robotstxt.org/robotstxt.html">http://www.robotstxt.org/robotstxt.html</a></p>
<p>Felix Schäfer wrote:</p>
<blockquote>
<p>That's a problem you should tackle with google, not with us ;-)</p>
</blockquote>
<p>Oh, I was just noting that so that you guys know about it. Like a "warning" :)</p> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=218142010-10-26T07:58:22ZВе Fio
<ul></ul><p>Oh, and if /issues?sort= isn't the only parameterized URL that bots might follow (there's other parameterized stuff on the pages), it would probably be good to disallow those too. I don't know all of the possible parameters, but you guys should. :)</p> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=239372011-01-12T03:33:59ZВе Fio
<ul></ul><p>An alternative, and much, much better, solution is to add a noindex meta tag to the pages that shouldn't be indexed. There are a lot of pages on Redmine that robots.txt doesn't cover, and Google is indexing them heavily.</p>
<pre>&lt;!-- tell robots not to index this page --&gt;
&lt;meta name="robots" content="noindex"&gt;
&lt;!-- this page is the same as this other page (good for when /issues/21/?reply=2 is the same as /issues/21/) --&gt;
&lt;link rel="canonical" href="/issues/21/"&gt;</pre>
<p>I highly suggest this gets implemented as soon as possible. :)</p> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=287422011-05-09T04:27:47ZAntoine Beaupré
<ul></ul><p>It seems like this could be easily fixed by the patch in <a class="issue tracker-3 status-1 priority-4 priority-default" title="Patch: add some additional URL paths to robots.txt (New)" href="https://www.redmine.org/issues/3754">#3754</a>.</p> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=925712019-06-24T13:59:30ZHarald Welte
<ul></ul><p>We've just observed that this issue still exists in Redmine 3.4. I couldn't find any rationale in this issue for why the related patch was never merged over the past 8 years.</p> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=983102020-06-22T19:38:08ZEduardo Ramos
<ul></ul><p>Harald Welte wrote:</p>
<blockquote>
<p>We've just observed that this issue still exists in Redmine 3.4. I couldn't find any rationale in this issue for why the related patch was never merged over the past 8 years.</p>
</blockquote>
<p>Still failing in Redmine 4.1.1 stable, with similar GETs on issues.<br />I am receiving requests from various bots which exhaust my Raspberry Pi's CPU:</p>
<p>172.162.119.114.in-addr.arpa domain name pointer petalbot-114-119-162-172.aspiegel.com.<br />146.168.229.46.in-addr.arpa domain name pointer crawl18.bl.semrush.com.<br />...</p> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=983172020-06-23T02:17:03ZGo MAEDA
<ul><li><strong>Category</strong> set to <i>SEO</i></li><li><strong>Target version</strong> set to <i>Candidate for next minor release</i></li></ul><p>Most people here think that the problem is that search engines index the URLs of filters and queries (/issues/?...) rather than single issue pages (/issues/123).</p>
<p>I agree that indexing "/issues/?..." URLs is a waste of computer resources. However, I think "/issues/123" URLs should be indexed (I usually search for issues in <a class="external" href="http://www.redmine.org">www.redmine.org</a> with Google).</p>
<p>The following patch disallows all URLs that have a query string (?...). It disallows indexing "/issues/?" pages while allowing indexing "/issues/123" pages. The main contents we want search engines to index are issues and wiki pages, so I think it is not a problem to disallow all URLs that have a query string.</p>
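<p>For intuition: the <code>/*?</code> rule in the patch below relies on Google's wildcard extension to the original standard ('*' matches any run of characters, a trailing '$' anchors the end, plain rules match by prefix). A toy matcher (a hypothetical helper for illustration, not Redmine code) sketches its effect:</p>

```python
import re

def google_style_match(rule: str, path: str) -> bool:
    """Match a robots.txt rule against a URL path-plus-query using the
    Google wildcard extension: '*' matches any character run, a trailing
    '$' anchors the end, and plain rules match by prefix."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Translate the rule into an anchored-at-start regular expression.
    pattern = "".join(".*" if ch == "*" else re.escape(ch) for ch in rule)
    if anchored:
        pattern += "$"
    return re.match(pattern, path) is not None

# "Disallow: /*?" blocks every URL carrying a query string...
assert google_style_match("/*?", "/issues?sort=updated_on")
assert google_style_match("/*?", "/projects/foo/issues?query_id=5")
# ...while single issue pages stay crawlable.
assert not google_style_match("/*?", "/issues/6734")
# Note that it also blocks pagination URLs such as /issues?page=2.
assert google_style_match("/*?", "/issues?page=2")
```

Real crawlers differ: Googlebot honors such wildcards, but bots implementing only the original standard fall back to literal prefix matching and ignore the rule entirely.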
<pre><code class="diff syntaxhl"><span class="gh">diff --git a/app/views/welcome/robots.text.erb b/app/views/welcome/robots.text.erb
index 6f66278ad..dbe9f04dd 100644
</span><span class="gd">--- a/app/views/welcome/robots.text.erb
</span><span class="gi">+++ b/app/views/welcome/robots.text.erb
</span><span class="p">@@ -1,4 +1,5 @@</span>
User-agent: *
<span class="gi">+Disallow: /*?
</span> <% @projects.each do |project| -%>
<% [project, project.id].each do |p| -%>
Disallow: <%= url_for(:controller => 'repositories', :action => :show, :id => p) %>
</code></pre> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=983302020-06-23T12:45:27ZEduardo Ramos
<ul></ul><p>Go MAEDA wrote:</p>
<blockquote>
<p>Most people here think that the problem is that search engines index the URLs of filters and queries (/issues/?...) rather than single issue pages (/issues/123).</p>
<p>I agree that indexing "/issues/?..." URLs is a waste of computer resources. However, I think "/issues/123" URLs should be indexed (I usually search for issues in <a class="external" href="http://www.redmine.org">www.redmine.org</a> with Google).</p>
<p>The following patch disallows all URLs that have a query string (?...). It disallows indexing "/issues/?" pages while allowing indexing "/issues/123" pages. The main contents we want search engines to index are issues and wiki pages, so I think it is not a problem to disallow all URLs that have a query string.</p>
<p>[...]</p>
</blockquote>
<p>Thank you! With that modification (Disallow: /*?) my Raspberry Pi is not so stressed (CPU is under 15%; before, it was about 90% due to crawler requests).<br />How could it be patched in a docker-compose layout? What I did was modify 'robots.text.erb' in the Redmine container and restart that container.</p> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=983312020-06-23T12:55:15ZGo MAEDA
<ul><li><strong>Target version</strong> changed from <i>Candidate for next minor release</i> to <i>4.0.8</i></li></ul><p>Eduardo Ramos wrote:</p>
<blockquote>
<p>Thank you! With that modification (Disallow: /*?) my Raspberry Pi is not so stressed (CPU is under 15%; before, it was about 90% due to crawler requests).</p>
</blockquote>
<p>Thank you for testing the patch and for giving feedback. I am setting the target version to 4.0.8.</p>
<blockquote>
<p>How could it be patched in a docker-compose layout? What I did was modify 'robots.text.erb' in the Redmine container and restart that container.</p>
</blockquote>
<p>I don't know much about Docker. I suggest you ask questions on the <a href="/projects/redmine/boards">forums</a>.</p> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=983322020-06-23T14:48:51ZGo MAEDA
<ul></ul><p>Updated the patch. The previous patch posted in <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Defect: robots.txt: disallow crawling issues list with a query string (Closed)" href="https://www.redmine.org/issues/6734#note-13">#6734#note-13</a> has a problem: it prevents crawlers from accessing "/issues?page=", which means crawlers can get only the first page of the issues list and will not index issues from the second page onward.</p>
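<p>Assuming <code>url_for(issues_path)</code> renders as <code>/issues</code>, the rules added below behave as follows under a parser that implements only the original prefix-matching standard, sketched here with Python's stdlib <code>urllib.robotparser</code> (host and URLs are made up):</p>

```python
from urllib.robotparser import RobotFileParser

# The three rules from the patch, as rendered when issues_path is "/issues".
robots_txt = """\
User-agent: *
Disallow: /issues?sort=
Disallow: /issues?query_id=
Disallow: /issues?*set_filter=
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Sorted and saved-query lists are blocked...
assert not rp.can_fetch("*", "https://example.com/issues?sort=updated_on")
assert not rp.can_fetch("*", "https://example.com/issues?query_id=5")
# ...while pagination and single issue pages stay crawlable, avoiding the
# "only the first page gets indexed" problem of the earlier patch.
assert rp.can_fetch("*", "https://example.com/issues?page=2")
assert rp.can_fetch("*", "https://example.com/issues/6734")
# Caveat: "?*set_filter=" uses a wildcard (a Google extension), so a
# strictly standard-only parser ignores that rule and this URL still passes.
assert rp.can_fetch("*", "https://example.com/issues?set_filter=1&sort=id")
```

The prefix rules also only match when the named parameter comes first in the query string, a limitation later revisited in #38201.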
<pre><code class="diff syntaxhl"><span class="gh">diff --git a/app/views/welcome/robots.text.erb b/app/views/welcome/robots.text.erb
index 6f66278ad..8c2732e00 100644
</span><span class="gd">--- a/app/views/welcome/robots.text.erb
</span><span class="gi">+++ b/app/views/welcome/robots.text.erb
</span><span class="p">@@ -10,3 +10,6 @@</span> Disallow: <%= url_for(issues_gantt_path) %>
Disallow: <%= url_for(issues_calendar_path) %>
Disallow: <%= url_for(activity_path) %>
Disallow: <%= url_for(search_path) %>
<span class="gi">+Disallow: <%= url_for(issues_path) %>?sort=
+Disallow: <%= url_for(issues_path) %>?query_id=
+Disallow: <%= url_for(issues_path) %>?*set_filter=
</span></code></pre> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=983362020-06-23T15:03:57ZGo MAEDA
<ul><li><strong>Related to</strong> <i><a class="issue tracker-2 status-5 priority-4 priority-default closed" href="/issues/31617">Feature #31617</a>: robots.txt: disallow crawling dynamically generated PDF documents</i> added</li></ul> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=983382020-06-23T15:42:05ZEduardo Ramos
<ul></ul><p>Go MAEDA wrote:</p>
<blockquote>
<p>Updated the patch. The previous patch posted in <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Defect: robots.txt: disallow crawling issues list with a query string (Closed)" href="https://www.redmine.org/issues/6734#note-13">#6734#note-13</a> has a problem: it prevents crawlers from accessing "/issues?page=", which means crawlers can get only the first page of the issues list and will not index issues from the second page onward.</p>
<p>[...]</p>
</blockquote>
<p>Tested OK. The CPU is even better: no bot activity registered in the Redmine logs, nor in the nginx access logs for Redmine.<br />It could be coincidence (no crawlers accessing right now); I will monitor it over the following hours anyway.</p>
<p>Thank you</p> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=983412020-06-24T02:43:07ZGo MAEDA
<ul><li><strong>Subject</strong> changed from <i>Robots index /issues (which isn't disallowed in robots.txt)</i> to <i>robots.txt: disallow crawling issues list with a query string</i></li></ul> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=984002020-06-28T07:33:29ZGo MAEDA
<ul><li><strong>File</strong> <i>6734.diff</i> added</li></ul><p>Added test code.</p> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=984012020-06-28T07:39:02ZGo MAEDA
<ul><li><strong>File</strong> <a href="/attachments/25625">6734.patch</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/25625/6734.patch">6734.patch</a> added</li></ul> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=984022020-06-28T07:39:10ZGo MAEDA
<ul><li><strong>File</strong> deleted (<del><i>6734.diff</i></del>)</li></ul> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=984302020-07-02T03:02:53ZGo MAEDA
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Resolved</i></li><li><strong>Assignee</strong> set to <i>Go MAEDA</i></li><li><strong>Resolution</strong> set to <i>Fixed</i></li></ul><p>Committed the patch.</p> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=984472020-07-04T00:54:33ZGo MAEDA
<ul><li><strong>Status</strong> changed from <i>Resolved</i> to <i>Closed</i></li></ul> Redmine - Defect #6734: robots.txt: disallow crawling issues list with a query stringhttps://www.redmine.org/issues/6734?journal_id=1091512023-01-21T03:48:54ZGo MAEDA
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-5 priority-4 priority-default closed" href="/issues/38201">Defect #38201</a>: Fix robots.txt to disallow issue lists with a sort or query_id parameter in any position</i> added</li></ul>