Patch #25215

Re-use existing identical disk files for new attachments

Added by Jens Krämer 5 months ago. Updated 27 days ago.

Status:ClosedStart date:
Priority:NormalDue date:
Assignee:Jean-Philippe Lang% Done:

0%

Category:Attachments
Target version:3.4.0

Description

Uploaded files can already be associated with multiple attachment records through Attachment#copy. This patch adds an after_create hook to change the disk_filename and disk_directory attributes of a newly created attachment to point to an already existing, identical (filesize and digest) diskfile if one exists.

Database locks are used to guard against the deletion of the older attachment while the reference of the new attachment is changed.

Usefulness of this feature (i.e. how much space is saved) will vary a lot depending on external circumstances of course (one large scale setup we maintain at Planio saved around 15% / 60GB). Since the possiblity to have 1:n relationships of disk files to attachments already exists it just seems logical to make use of it for new attachments as well.

0001-reuse-existing-identical-disk-files-for-new-attachme.patch Magnifier - patch against current trunk (2.89 KB) Jens Krämer, 2017-02-28 04:08

0002-adds-file-equality-check-to-deduplication-hook.patch Magnifier (1.24 KB) Jens Krämer, 2017-03-02 08:36


Related issues

Related to Redmine - Feature #19289: Exclude attachments from incoming emails based on file co... New
Related to Redmine - Feature #23510: Reuse an exist attachment New
Duplicated by Redmine - Feature #15257: Attachment deduplication Closed
Blocked by Redmine - Patch #25240: Use SHA256 for attachment digest computation Closed
Blocked by Redmine - Patch #25590: prevent deadlocks in attachment deduplication Closed

Associated revisions

Revision 16458
Added by Jean-Philippe Lang 4 months ago

Reuse existing identical disk files for new attachments (#25215).

Patch by Jens Kraemer.

Revision 16459
Added by Jean-Philippe Lang 4 months ago

Adds file equality check to deduplication hook (#25215).

Patch by Jens Kraemer.

Revision 16460
Added by Jean-Philippe Lang 4 months ago

Delete the file after the change is committed (#25215).

Revision 16462
Added by Jean-Philippe Lang 4 months ago

Adds a test for when the file comparison fails (#25215).

History

#1 Updated by Jan from Planio www.plan.io 5 months ago

  • Target version set to Candidate for next minor release

#2 Updated by Go MAEDA 5 months ago

#3 Updated by Go MAEDA 5 months ago

  • Related to Feature #19289: Exclude attachments from incoming emails based on file content or file hash added

#4 Updated by Go MAEDA 5 months ago

#5 Updated by Jean-Philippe Lang 5 months ago

  • Status changed from New to Needs feedback

IMO, MD5 is too weak for this purpose and this could lead to potential vulnerabilities. The first that comes to my mind: attacker generates a malicious file and a legitimate file with the same MD5, he first uploads the malicious file then send the legitimate one to a user X who will eventually upload it => people downloading the later from user X will actually get the malicious file.
We should implement this after upgrading the digest to a safer hash function.
Please let me know what you think.

#6 Updated by Jens Krämer 5 months ago

Good point. Since we're also comparing file sizes, an attacker would have to create a collision while also matching the file sizes. Still not impossible I guess. Changing the hash function would make malicious collision creation harder, but still not impossible theoretically.

Maybe adding a real byte by byte comparison for the case of matching filesize / md5 would be the better way? Uploads take a while any way so the added computation time might not weigh in too much.

#7 Updated by Go MAEDA 5 months ago

What do you think about migrating to SHA-1? It is as fast as MD5 and much safer.

One problem is that migrating a existing Redmine instance may take a long time if it stores many large files.

#8 Updated by Jean-Philippe Lang 5 months ago

Jens Krämer wrote:

Good point. Since we're also comparing file sizes, an attacker would have to create a collision while also matching the file sizes. Still not impossible I guess.

MD5 collission with the same input size can be created in seconds with a standard computer. The file size comparaison does not make it safer.

Changing the hash function would make malicious collision creation harder, but still not impossible theoretically.

Not just harder, but practically impossible. Unlike MD5, there's no known way to easily generate SHA256 collisions.

Maybe adding a real byte by byte comparison for the case of matching filesize / md5 would be the better way? Uploads take a while any way so the added computation time might not weigh in too much.

That would be the safer option indeed. It also guarantees that the original file is not broken/missing, which is a good thing IMO before discarding the uploaded file. Replacing the MD5 with a safer hash function does make sense anyway.

#9 Updated by Jean-Philippe Lang 5 months ago

Go MAEDA wrote:

What do you think about migrating to SHA-1? It is as fast as MD5 and much safer.

SHA-1? No. Maybe SHA256 or SHA512 instead.

One problem is that migrating a existing Redmine instance may take a long time if it stores many large files.

Yes, this should not be done during a db migration for this reason. We can use a new hash function for new files and provide a task that updates the existing hashes.

#10 Updated by Jens Krämer 5 months ago

I gave the upgrade to SHA256 a try in #25240. Independent of that I'll add the bytewise comparison to this feature here next.

#11 Updated by Go MAEDA 5 months ago

  • Blocked by Patch #25240: Use SHA256 for attachment digest computation added

#12 Updated by Jens Krämer 5 months ago

Additional patch adding a FileUtils.identical? check to compare actual file contents before removing the duplicate.

#13 Updated by Jean-Philippe Lang 5 months ago

  • Target version changed from Candidate for next minor release to 3.4.0

#14 Updated by Hugo García 5 months ago

OK.

#15 Updated by Jean-Philippe Lang 4 months ago

  • Status changed from Needs feedback to Closed
  • Assignee set to Jean-Philippe Lang

Patches are committed, thanks. I think it's safer to delete the duplicate file after the change is committed to the DB (r16460).

#16 Updated by Go MAEDA 3 months ago

  • Related to Patch #25590: prevent deadlocks in attachment deduplication added

#17 Updated by Go MAEDA 3 months ago

  • Related to deleted (Patch #25590: prevent deadlocks in attachment deduplication)

#18 Updated by Go MAEDA 3 months ago

  • Blocked by Patch #25590: prevent deadlocks in attachment deduplication added

#19 Updated by Go MAEDA 3 months ago

  • Status changed from Closed to Reopened

A fix for this patch was submitted as #25590.

#20 Updated by Jean-Philippe Lang 3 months ago

  • Status changed from Reopened to Closed

Fix committed.

#21 Updated by Jean-Philippe Lang 27 days ago

  • Subject changed from re-use existing identical disk files for new attachments to Re-use existing identical disk files for new attachments

Also available in: Atom PDF