Patch #25215
closedRe-use existing identical disk files for new attachments
0%
Description
Uploaded files can already be associated with multiple attachment records through Attachment#copy
. This patch adds an after_create
hook to change the disk_filename
and disk_directory
attributes of a newly created attachment to point to an already existing, identical (filesize and digest) diskfile if one exists.
Database locks are used to guard against the deletion of the older attachment while the reference of the new attachment is changed.
Usefulness of this feature (i.e. how much space is saved) will vary a lot depending on external circumstances of course (one large scale setup we maintain at Planio saved around 15% / 60GB). Since the possiblity to have 1:n relationships of disk files to attachments already exists it just seems logical to make use of it for new attachments as well.
Files
Related issues
Updated by Jan from Planio www.plan.io over 7 years ago
- Target version set to Candidate for next minor release
Updated by Go MAEDA over 7 years ago
- Has duplicate Feature #15257: Attachment deduplication added
Updated by Go MAEDA over 7 years ago
- Related to Feature #19289: Exclude attachments from incoming emails based on file content or file hash added
Updated by Go MAEDA over 7 years ago
- Related to Feature #23510: Reuse an exist attachment added
Updated by Jean-Philippe Lang over 7 years ago
- Status changed from New to Needs feedback
IMO, MD5 is too weak for this purpose and this could lead to potential vulnerabilities. The first that comes to my mind: attacker generates a malicious file and a legitimate file with the same MD5, he first uploads the malicious file then send the legitimate one to a user X who will eventually upload it => people downloading the later from user X will actually get the malicious file.
We should implement this after upgrading the digest to a safer hash function.
Please let me know what you think.
Updated by Jens Krämer over 7 years ago
Good point. Since we're also comparing file sizes, an attacker would have to create a collision while also matching the file sizes. Still not impossible I guess. Changing the hash function would make malicious collision creation harder, but still not impossible theoretically.
Maybe adding a real byte by byte comparison for the case of matching filesize / md5 would be the better way? Uploads take a while any way so the added computation time might not weigh in too much.
Updated by Go MAEDA over 7 years ago
What do you think about migrating to SHA-1? It is as fast as MD5 and much safer.
One problem is that migrating a existing Redmine instance may take a long time if it stores many large files.
Updated by Jean-Philippe Lang over 7 years ago
Jens Krämer wrote:
Good point. Since we're also comparing file sizes, an attacker would have to create a collision while also matching the file sizes. Still not impossible I guess.
MD5 collission with the same input size can be created in seconds with a standard computer. The file size comparaison does not make it safer.
Changing the hash function would make malicious collision creation harder, but still not impossible theoretically.
Not just harder, but practically impossible. Unlike MD5, there's no known way to easily generate SHA256 collisions.
Maybe adding a real byte by byte comparison for the case of matching filesize / md5 would be the better way? Uploads take a while any way so the added computation time might not weigh in too much.
That would be the safer option indeed. It also guarantees that the original file is not broken/missing, which is a good thing IMO before discarding the uploaded file. Replacing the MD5 with a safer hash function does make sense anyway.
Updated by Jean-Philippe Lang over 7 years ago
Go MAEDA wrote:
What do you think about migrating to SHA-1? It is as fast as MD5 and much safer.
SHA-1? No. Maybe SHA256 or SHA512 instead.
One problem is that migrating a existing Redmine instance may take a long time if it stores many large files.
Yes, this should not be done during a db migration for this reason. We can use a new hash function for new files and provide a task that updates the existing hashes.
Updated by Jens Krämer over 7 years ago
I gave the upgrade to SHA256 a try in #25240. Independent of that I'll add the bytewise comparison to this feature here next.
Updated by Go MAEDA over 7 years ago
- Blocked by Patch #25240: Use SHA256 for attachment digest computation added
Updated by Jens Krämer over 7 years ago
- File 0002-adds-file-equality-check-to-deduplication-hook.patch 0002-adds-file-equality-check-to-deduplication-hook.patch added
Additional patch adding a FileUtils.identical?
check to compare actual file contents before removing the duplicate.
Updated by Jean-Philippe Lang over 7 years ago
- Target version changed from Candidate for next minor release to 3.4.0
Updated by Jean-Philippe Lang over 7 years ago
- Status changed from Needs feedback to Closed
- Assignee set to Jean-Philippe Lang
Patches are committed, thanks. I think it's safer to delete the duplicate file after the change is committed to the DB (r16460).
Updated by Go MAEDA over 7 years ago
- Related to Patch #25590: prevent deadlocks in attachment deduplication added
Updated by Go MAEDA over 7 years ago
- Related to deleted (Patch #25590: prevent deadlocks in attachment deduplication)
Updated by Go MAEDA over 7 years ago
- Blocked by Patch #25590: prevent deadlocks in attachment deduplication added
Updated by Go MAEDA over 7 years ago
- Status changed from Closed to Reopened
A fix for this patch was submitted as #25590.
Updated by Jean-Philippe Lang over 7 years ago
- Status changed from Reopened to Closed
Fix committed.
Updated by Jean-Philippe Lang over 7 years ago
- Subject changed from re-use existing identical disk files for new attachments to Re-use existing identical disk files for new attachments