https://www.redmine.org/https://www.redmine.org/favicon.ico?16793021292011-10-26T08:36:32ZRedmineRedmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=332712011-10-26T08:36:32ZToshi MARUYAMA
<ul><li><strong>Assignee</strong> set to <i>Toshi MARUYAMA</i></li></ul> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=332772011-10-26T09:42:32ZToshi MARUYAMA
<ul><li><strong>Assignee</strong> deleted (<del><i>Toshi MARUYAMA</i></del>)</li></ul><p>I can not accept this patch.<br />Because "save" is for reducing DB access (<a class="issue tracker-1 status-5 priority-3 priority-lowest closed" title="Defect: Git adapter lost commits before 7 days from database latest changeset (Closed)" href="https://www.redmine.org/issues/7146">#7146</a>, <a class="issue tracker-1 status-5 priority-3 priority-lowest closed" title="Defect: git tab,browsing, very slow -- even after first time (Closed)" href="https://www.redmine.org/issues/6013">#6013</a>).</p>
<p>"save" need to update "extra_info" column of "repositories" table (<a class="changeset" title="scm: add "extra_info" column to repositories table and set serialize (#7146, #7047). This column..." href="https://www.redmine.org/projects/redmine/repository/svn/revisions/5762">r5762</a>).</p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=332782011-10-26T09:46:03ZToshi MARUYAMA
<ul></ul><blockquote>
<p>loading of large repositories to take a long time</p>
</blockquote>
<p>This is only in case of after upgrading from 1.1 to 1.2 or new branch at first time (<a class="issue tracker-1 status-5 priority-3 priority-lowest closed" title="Defect: Git: Too long in fetching repositories after upgrade from 1.1 or new branch at first time (Closed)" href="https://www.redmine.org/issues/8857">#8857</a>).</p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=332872011-10-26T15:53:14ZJohn Kubiatowicz
<ul></ul><p>The problem is that the existing code prevented me from importing a large repository into redmine because it <strong>timed out</strong>. The problem, as I mentioned, is that every save causes an event which is observed by the plugin that I am using.</p>
<p>Why is putting the save at the end a bad thing? It would seem to be strictly higher performing, since it accesses the database much less... It appears that the extra_info hash is accumulated as the code runs through the loop, thus this will not lose information....</p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=333152011-10-27T00:47:20ZMischa The Evil
<ul></ul><p>Follow-up on <a class="message" href="https://www.redmine.org/boards/1/topics/26997">Specific question about git scm implementation: follow on...</a></p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=333162011-10-27T00:48:51ZToshi MARUYAMA
<ul></ul><p>I add a comment of a "last_scmid" purpose at <a class="changeset" title="scm: git: recovery and improve comments of fetching from 1.1 about harmful influence that git doe..." href="https://www.redmine.org/projects/redmine/repository/svn/revisions/7658">r7658</a>.</p>
<p>For web server timeout, you can use "ruby script/runner 'Repository.fetch_changesets' -e production".<br />See <a class="external" href="http://www.redmine.org/issues/8857#note-15">http://www.redmine.org/issues/8857#note-15</a></p>
<p>For preventing "repositories" table update, you can split "extra_info" column to new table.</p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=333292011-10-27T16:32:17ZJohn Kubiatowicz
<ul></ul><p>Forgive me for being obtuse, but your comment doesn't seem to explain why you need to execute self.save in the loop over the revisions. Why can't it be at the end of the loop? It seems to work fine (been using it on a complex repository for a couple of days now). I'm getting the impression by your answers that you misunderstand my query -- I don't want to remove the save, just put it at the end (rather than looping over it).</p>
<p>As for executing it with script/runner, this doesn't work as shown by <a class="issue tracker-1 status-5 priority-3 priority-lowest closed" title="Defect: Git: Too long in fetching repositories after upgrade from 1.1 or new branch at first time (Closed)" href="https://www.redmine.org/issues/8857">#8857</a> -- there is still so much traffic that the process doesn't complete. Presumably I don't want to hack your code to split a separate table (and not sure that does the right thing, since I believe that the plugin wants to know when the repository information has been updated).</p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=333302011-10-27T16:58:43ZToshi MARUYAMA
<ul></ul><p>John Kubiatowicz wrote:</p>
<blockquote>
<p>Why can't it be at the end of the loop?</p>
</blockquote>
<p>Because "self.save" saves "last_scmid".</p>
<p>If "last_scmid" does not save in case of time out,<br />next time, redmine reads all revisions that already been saved in database.</p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=333412011-10-27T17:38:29ZEtienne Massip
<ul></ul><p>Toshi MARUYAMA wrote:</p>
<blockquote>
<p>If "last_scmid" does not save in case of time out,<br />next time, redmine reads all revisions that already been saved in database.</p>
</blockquote>
<p>Actually, if it timed out, then there may be something wrong already, and we can expect this situation to be rare, so re-fetching revisions is not that harmful?</p>
<p>By "all revisions", you mean all revisions that hadn't been fetched before the call to <code>#fetch_changesets</code>, don't you?</p>
<p>What you're saying is that thanks to the <code>save</code> in the loop, subsequent calls to <code>#fetch_changesets</code> can success?</p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=333452011-10-27T18:53:35ZJohn Kubiatowicz
<ul></ul><p>Toshi MARUYAMA wrote:</p>
<blockquote>
<p>John Kubiatowicz wrote:</p>
<blockquote>
<p>Why can't it be at the end of the loop?</p>
</blockquote>
<p>Because "self.save" saves "last_scmid".</p>
<p>If "last_scmid" does not save in case of time out,<br />next time, redmine reads all revisions that already been saved in database.</p>
</blockquote>
<p>Hm... This is a failure mode that never occurred to me. Several things occur to me here:</p>
<p>1) The database access presumably takes much longer than an individual loop (100x, 1000x?); consequently, you would seem to be making a timeout problem much worse by saving every loop -- even in a vanilla redmine without plugins. At minimum, you should try to figure out the blow-up factor and save only every n loops, where n is the expected ratio of time to save to the database vs parse the next revision ID from the output of git....<br />2) I would agree with Etienne here. This seems like a case where something is wrong already if you get an actual timeout. Wouldn't this rare case be better handled by the "ruby script/runner 'Repository.fetch_changesets' -e production" solution you suggested to me, rather than impacting the normal case by orders of magnitude (see <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Defect: permissions if not admin (Closed)" href="https://www.redmine.org/issues/1">#1</a>) in time?</p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=333492011-10-27T22:26:57ZToshi MARUYAMA
<ul><li><strong>File</strong> <a href="/attachments/6704">git-1.png</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/6704/git-1.png">git-1.png</a> added</li><li><strong>File</strong> <a href="/attachments/6705">git-2.png</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/6705/git-2.png">git-2.png</a> added</li><li><strong>File</strong> <a href="/attachments/6706">git.odg</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/6706/git.odg">git.odg</a> added</li></ul><p><img src="https://www.redmine.org/attachments/download/6704/git-1.png" alt="" /><br /><img src="https://www.redmine.org/attachments/download/6705/git-2.png" alt="" /></p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=333652011-10-28T09:59:01ZEtienne Massip
<ul><li><strong>File</strong> <i>git_brch_tx.patch</i> added</li></ul><p>Still, by "all revisions", you mean all revisions that hadn't been fetched before the call to <code>#fetch_changesets</code>, don't you?</p>
<p>Would you prefer this one?</p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=333672011-10-28T10:20:20ZEtienne Massip
<ul><li><strong>File</strong> <a href="/attachments/6709">git_brch_tx.patch</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/6709/git_brch_tx.patch">git_brch_tx.patch</a> added</li></ul><p>Cleaner one. Actually, I'm not sure the transaction block here is worth it since the <code>#find_by_name</code> is called anyway for each rev, removing it would save some <code>#save_revision</code> calls? Maybe then it would be needed inside <code>#save_revision</code>?</p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=333682011-10-28T10:27:14ZEtienne Massip
<ul><li><strong>File</strong> deleted (<del><i>git_brch_tx.patch</i></del>)</li></ul> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=333692011-10-28T10:30:51ZEtienne Massip
<ul><li><strong>File</strong> <a href="/attachments/6710">git_brch_no_tx.patch</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/6710/git_brch_no_tx.patch">git_brch_no_tx.patch</a> added</li></ul><p>What I meant.</p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=333702011-10-28T11:45:53ZToshi MARUYAMA
<ul></ul><p>Subversion and Mercurial have a sequential revision number,<br />So, Redmine can know <strong>last</strong> saved revision number from <strong>changesets</strong> table.</p>
<p>Subversion: (1, 2, 3 ....)<br /><a class="source" href="https://www.redmine.org/projects/redmine/repository/svn/entry/tags/1.2.1/app/models/repository/subversion.rb#L54">source:tags/1.2.1/app/models/repository/subversion.rb#L54</a></p>
<p>Mercurial: (0, 1, 2 ....)<br /><a class="source" href="https://www.redmine.org/projects/redmine/repository/svn/entry/tags/1.2.1/app/models/repository/mercurial.rb#L128">source:tags/1.2.1/app/models/repository/mercurial.rb#L128</a></p>
<p>But, Git does not have a sequential revision number.<br />And Git branch is the pointer to the specific revision.<br />So, Redmine can <strong>not</strong> know <strong>last</strong> saved revision from <strong>changesets</strong> table.</p>
<p>In order to know <strong>last</strong> saved revision per branch,<br />Redmine saves <strong>last</strong> saved revision id per branch to database.<br />Current implemention is <strong>repositories</strong> table.</p>
<p>So, saving "changeset" and "last saved revision id" need in transaction.</p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=333712011-10-28T12:32:12ZToshi MARUYAMA
<ul><li><strong>File</strong> <a href="/attachments/6711">git-branch.odg</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/6711/git-branch.odg">git-branch.odg</a> added</li><li><strong>File</strong> <a href="/attachments/6712">git-branch.png</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/6712/git-branch.png">git-branch.png</a> added</li></ul><p><img src="https://www.redmine.org/attachments/download/6712/git-branch.png" alt="" /></p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=333722011-10-28T13:05:26ZEtienne Massip
<ul></ul><p>While I find Toshi's method a bit paranoid, I understand his pov since <code>scm.save</code> would not be a bottleneck if this observer thing wasn't actually doing much after each call to <code>#save</code>.</p>
<p>The redmine_git_plugin actually extrapolates the fact that a <code>Repository#save</code> corresponds to the end of a revisions fetching operation.</p>
<p>What might be missing on RM side is maybe some hooks before and after calling <code>Repository#fetch_changeset</code>. The plugin could still use some RoR <code>alias_method_chain</code> technique, though (described in wiki in <a class="wiki-page" href="https://www.redmine.org/projects/redmine/wiki/Plugin_Internals#Wrapping-an-existing-method">Plugin_Internals</a>).</p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=333812011-10-28T16:34:57ZJohn Kubiatowicz
<ul></ul><p>Yes, I guess I can see this point of view -- although I still think it is optimizing for a failure case that shouldn't happen. Looking more carefully at this code, I believe that there is a bunch of DB traffic that happens per iteration already, when creating new changesets, which negates my performance argument from earlier.</p>
<p>I believe that one problem is that the plugin really only wants to know about major repository changes, not changeset alterations. Yes, I could wrap the fetch_changeset with something that disables the callback.</p>
<p>Question: instead of putting this information in the extra_info hash, why not use changeset entries, one for each branch. They could have the branch name for their identifier and the scmid for last_scmid. They could easily be distinguished from normal git changeset entries for which the identifier and scmid are equal. This would seem to work, no? (Although it is equivalent to the suggestion of putting last_scmid info in a separate table, which is perhaps the best solution...?)</p>
<p>As I think about this code, what happens if the last_scmid for a branch is nolonger on the branch? (i.e. because of, say "git commit --amend" and "git push -f" operations). It appears that "git log foo.bar" where foo is not an ancestor of bar just shows bar. The result would seem to miss the new (altered) history entries, since in this case it would be better to recognize this situation and set foo=>nil.</p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=333822011-10-28T16:36:29ZJohn Kubiatowicz
<ul></ul><p>Above, I meant "git log foo...bar", of course. In the case where foo is not an ancestor of bar, this only shows the bar log entry.</p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=333842011-10-28T17:18:51ZToshi MARUYAMA
<ul></ul><p>John Kubiatowicz wrote:</p>
<blockquote>
<p>why not use changeset entries, one for each branch.</p>
</blockquote>
There were serious fetching performance problems.
<ul>
<li>0.9.0:
<ul>
<li><a class="source" href="https://www.redmine.org/projects/redmine/repository/svn/entry/tags/0.9.0/app/models/repository/git.rb#L40">source:tags/0.9.0/app/models/repository/git.rb#L40</a></li>
<li><a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Defect: git: Very high CPU usage for a long time with large repository (Closed)" href="https://www.redmine.org/issues/4547">#4547</a></li>
<li><a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Defect: Git repository performance fall on parsing new commits after 0.9.1 update (Closed)" href="https://www.redmine.org/issues/4716">#4716</a></li>
</ul>
</li>
<li>0.9.3:
<ul>
<li><a class="source" href="https://www.redmine.org/projects/redmine/repository/svn/entry/tags/0.9.3/app/models/repository/git.rb#L40">source:tags/0.9.3/app/models/repository/git.rb#L40</a></li>
<li><a class="issue tracker-1 status-5 priority-3 priority-lowest closed" title="Defect: Git adapter lost commits before 7 days from database latest changeset (Closed)" href="https://www.redmine.org/issues/7146">#7146</a></li>
<li><a class="external" href="https://www.chiliproject.org/issues/214">https://www.chiliproject.org/issues/214</a></li>
<li><a class="issue tracker-1 status-5 priority-3 priority-lowest closed" title="Defect: git tab,browsing, very slow -- even after first time (Closed)" href="https://www.redmine.org/issues/6013">#6013</a></li>
</ul></li>
</ul>
<blockquote>
<p>(i.e. because of, say "git commit --amend" and "git push -f" operations).</p>
</blockquote>
<p>There is test for this situation.<br /><a class="source" href="https://www.redmine.org/projects/redmine/repository/svn/entry/tags/1.2.1/test/unit/lib/redmine/scm/adapters/git_adapter_test.rb#L190">source:tags/1.2.1/test/unit/lib/redmine/scm/adapters/git_adapter_test.rb#L190</a></p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=333852011-10-28T19:04:48ZJohn Kubiatowicz
<ul></ul><p>Toshi MARUYAMA wrote:</p>
<blockquote><blockquote>
<p>(i.e. because of, say "git commit --amend" and "git push -f" operations).</p>
</blockquote>
<p>There is test for this situation.<br /><a class="source" href="https://www.redmine.org/projects/redmine/repository/svn/entry/tags/1.2.1/test/unit/lib/redmine/scm/adapters/git_adapter_test.rb#L190">source:tags/1.2.1/test/unit/lib/redmine/scm/adapters/git_adapter_test.rb#L190</a></p>
</blockquote>
<p>Actually, (ironically?) this tested behavior is exactly the wrong behavior for the code in git.rb. When the last_scm is not on the respective branch, one actually needs to reset from_scm=>nil and try again (so that you get a potentially large number of revisions, rather than no revisions).</p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=333872011-10-28T23:04:36ZToshi MARUYAMA
<ul></ul><p>John Kubiatowicz wrote:</p>
<blockquote>
<p>one actually needs to reset from_scm=>nil and try again (so that you get a potentially large number of revisions, rather than no revisions).</p>
</blockquote>
<p>As I described at <a href="http://www.redmine.org/issues/4455#note-28" class="external">issue 4455 note 28</a> , history editing is rare case on <strong>shared</strong> repository.<br />We agreed Redmine does not need care history editing for <strong>shared</strong> repository at <a href="http://www.redmine.org/issues/4455#note-150" class="external">issue 4455 note 150</a> .</p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=368522012-03-16T16:29:33ZToshi MARUYAMA
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Closed</i></li><li><strong>Target version</strong> set to <i>1.4.0</i></li></ul><p><a class="changeset" title="scm: git: performance improvements in fetching revisions (#8857, #9472) Parse a revision for a g..." href="https://www.redmine.org/projects/redmine/repository/svn/revisions/9144">r9144</a> fixed this issue.</p> Redmine - Defect #9472: The git scm module causes an excess amount of DB traffic.https://www.redmine.org/issues/9472?journal_id=368572012-03-16T23:11:01ZToshi MARUYAMA
<ul><li><strong>Resolution</strong> set to <i>Fixed</i></li></ul>