Feature #306

Full Text Search of files

Added by Ross Manning over 10 years ago. Updated about 1 month ago.

Status: New
Start date:
Priority: Normal
Due date:
Assignee: -
% Done: 0%

Category:Search engine
Target version:4.1.0
Resolution:

Description

It would be great if we could get an option to search the contents of attached files/documents.

0001-moves-shellout-method-to-Utils-Shell.patch (1.99 KB) Jens Krämer, 2017-06-21 11:13

0003-store-fulltext-in-the-attachment-model-and-make-it-s.patch (6.38 KB) Jens Krämer, 2017-06-21 11:13

0004-turns-attachment-search-on-by-default.patch (998 Bytes) Jens Krämer, 2017-06-21 11:14

0005-simplify-wording-for-search-form-options-de-en.patch (1.71 KB) Jens Krämer, 2017-06-21 11:14

0002-implements-fulltext-extraction-for-attachments.patch (207 KB) Jens Krämer, 2017-06-21 11:15

0006-moves-text-extraction-to-an-ActiveJob-Job.patch (3.46 KB) Jens Krämer, 2017-06-23 05:29


Related issues

Duplicated by Redmine - Feature #818: Fulltext search include content of files Closed 2008-03-09
Duplicated by Redmine - Feature #4862: Search engine doesn't look inside documents Closed 2010-02-17

History

#1 Updated by Oleg Lozinskij over 8 years ago

Ross Manning wrote:

It would be great if we could get an option to search the contents of attached files/documents.

Any updates on this feature?

Cheers!

#2 Updated by Jens Goldhammer over 8 years ago

Maybe you can use ActAsSolr (http://acts-as-solr.rubyforge.org/) or ActAsFerret (http://rm.jkraemer.net/projects/activity/aaf) for it...

#3 Updated by S Reid about 8 years ago

Has anyone tried the above, or does anyone have other suggestions?

#4 Updated by Jean-Philippe Lang almost 8 years ago

  • Category set to Search engine

#5 Updated by Emmanuel Bastien almost 8 years ago

FYI, I was pleased with acts_as_xapian features, even if I'm not sure that there is still a maintainer for this project.

#6 Updated by Anonymous almost 5 years ago

+100 for this one.

This is a really interesting feature!

#7 Updated by Dipan Mehta over 4 years ago

+1. Very useful.

#8 Updated by Terence Mill over 4 years ago

It's worth having a look at RSolr, which can easily integrate with a Solr instance embedded in JRuby.

if RUBY_PLATFORM =~ /java/

  require 'rsolr-direct'
  ::RSolr.load_java_libs

  ::Sunspot.session.instance_variable_set(:@connection,
    ::RSolr.connect(
      :direct,
      :solr_home => File.join(RAILS_ROOT, 'solr'),
      :data_dir =>  File.join(RAILS_ROOT, 'solr', 'data', RAILS_ENV)
    )
  )

end

#9 Updated by Jens Krämer 6 months ago

  • File 0001-moves-shellout-method-to-Utils-Shell.patch added
  • File 0002-implements-fulltext-extraction-for-attachments.patch added
  • File 0003-store-fulltext-in-the-attachment-model-and-make-it-s.patch added
  • File 0004-turns-attachment-search-on-by-default.patch added
  • File 0005-simplify-wording-for-search-form-options-de-en.patch added

I've been looking into this for Planio recently and we would like to contribute our code for this feature.

Our patch adds a fulltext column to the attachments table, which is filled with the plain text representation of the attachment, as far as possible, after the attachment has been created. As of now, the following text extractors are implemented:

  • XML based office documents (LibreOffice / OpenOffice / MS Office) (through RubyZIP / Nokogiri)
  • Old binary MS office formats (using the external catdoc, catppt and xls2csv commands)
  • PDF (using pdftotext)
  • RTF (uses the external unrtf command)
  • plain text, CSV

Other formats could be added, e.g. to extract image metadata through ImageMagick.
External commands come with sensible defaults and can be configured through configuration.yml. The whole feature may be turned off in the same place.
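
A minimal sketch of how configurable external commands with defaults might be wired up. This is not the code from the patch: the YAML keys (enabled, commands), the {file} placeholder and both method names are assumptions made for illustration, and only two of the external commands named above are listed.

```ruby
require 'open3'
require 'yaml'

# Hypothetical defaults. '{file}' is replaced by the attachment's path.
DEFAULT_CONFIG = {
  'enabled'  => true,
  'commands' => {
    'rtf' => ['unrtf', '--text', '{file}'],
    'doc' => ['catdoc', '{file}']
  }
}.freeze

# Merge settings read from YAML over the defaults, command by command.
def load_extraction_config(yaml_text)
  overrides = YAML.safe_load(yaml_text) || {}
  DEFAULT_CONFIG.merge(overrides) do |key, default, override|
    key == 'commands' ? default.merge(override) : override
  end
end

# Return the plain text of a file, or nil when extraction is disabled,
# the format is unknown, or the external command is missing or fails.
def extract_text(path, config)
  return nil unless config['enabled']

  ext = File.extname(path).delete('.').downcase
  return File.read(path) if %w[txt csv].include?(ext)

  argv = config['commands'][ext]
  return nil unless argv

  cmd = argv.map { |arg| arg == '{file}' ? path : arg }
  out, status = Open3.capture2(*cmd) # argument list, so no shell is involved
  status.success? ? out : nil
rescue SystemCallError
  nil # e.g. the binary is not installed or the file does not exist
end
```

Passing the command as an argument list rather than a single string keeps file names with spaces or shell metacharacters from being interpreted by a shell.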
The fulltext column is added to the list of columns searched when attachment search is active. We also chose to enable attachment search by default and changed the wording of the option slightly to reflect the fact that attachment contents are now searched as well.
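
To illustrate the idea of adding one more column to the searched set, here is a rough sketch that builds a SQL condition over a list of columns. It is not Redmine's search code: the method name and the filename/description base columns are assumptions, and LIKE wildcards inside tokens are not escaped.

```ruby
# Build a SQL fragment plus bind values that require every token to
# appear in at least one of the given columns (case-insensitive).
def search_conditions(columns, tokens)
  per_token = tokens.map do |_token|
    '(' + columns.map { |col| "LOWER(#{col}) LIKE LOWER(?)" }.join(' OR ') + ')'
  end
  binds = tokens.flat_map { |token| ["%#{token}%"] * columns.size }
  [per_token.join(' AND '), binds]
end

base_columns = %w[filename description] # assumed pre-existing columns
columns = base_columns + ['fulltext']   # the column added by the patch
sql, binds = search_conditions(columns, ['invoice'])
```

With this shape, searching attachment contents costs one extra OR branch per token and needs no external index, which is the "good enough" trade-off discussed in this comment.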

We think this is a "good enough" solution for many if not most Redmine installations, compared to more complex external indexing solutions.

Given that most attachment uploads happen asynchronously through JavaScript, the added processing time for text extraction should be barely noticeable to the user. Going further, one could think about pushing that work into an ActiveJob worker so administrators can decide whether text extraction should happen inline or whether they want to set up e.g. DelayedJob or any other deferred processing backend for this.

Please let me know what you think!

#10 Updated by Jens Krämer 6 months ago

  • File 0002-implements-fulltext-extraction-for-attachments.patch added

Here is an updated version of patch 0002, with improved exception handling in the TextExtractor.

#12 Updated by Jan from Planio www.plan.io 6 months ago

  • File deleted (0001-moves-shellout-method-to-Utils-Shell.patch)

#13 Updated by Jan from Planio www.plan.io 6 months ago

  • File deleted (0003-store-fulltext-in-the-attachment-model-and-make-it-s.patch)

#14 Updated by Jan from Planio www.plan.io 6 months ago

  • File deleted (0004-turns-attachment-search-on-by-default.patch)

#15 Updated by Jan from Planio www.plan.io 6 months ago

  • File deleted (0005-simplify-wording-for-search-form-options-de-en.patch)

#16 Updated by Jan from Planio www.plan.io 6 months ago

  • File deleted (0002-implements-fulltext-extraction-for-attachments.patch)

#17 Updated by Jan from Planio www.plan.io 6 months ago

  • File deleted (0002-implements-fulltext-extraction-for-attachments.patch)

#18 Updated by Jan from Planio www.plan.io 6 months ago

  • Target version set to Candidate for next major release

This feature was high on the priorities list at Planio, so I assume it would be popular with Redmine users as well. I'd like to propose this feature for the next major release.

By the way, many of the binaries needed for text extraction are available on Windows as well. So this feature should be mostly cross-platform:

As far as I know, only catdoc/catppt/xls2csv are not available on Windows, but they are only required for the "old" binary MS Office formats (doc, ppt, xls). The newer XML-based formats (docx, pptx, xlsx) are supported using rubyzip and nokogiri and don't need external binaries.
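
An administrator could check which of these binaries are present before relying on a format. The sketch below is one possible way to do that with the Ruby standard library only. The helper names and the format-to-binary table are assumptions for illustration and are not part of the patch.

```ruby
# Return the full path of an executable found on PATH, or nil.
# PATHEXT is honoured so that names like catdoc.exe resolve on Windows.
def which(command, env = ENV)
  exts = env['PATHEXT'] ? env['PATHEXT'].split(';') : ['']
  env.fetch('PATH', '').split(File::PATH_SEPARATOR).each do |dir|
    exts.each do |ext|
      candidate = File.join(dir, command + ext)
      return candidate if File.file?(candidate) && File.executable?(candidate)
    end
  end
  nil
end

# Formats that depend on an external binary, as listed in this thread.
REQUIRED_BINARIES = {
  'doc' => 'catdoc',
  'ppt' => 'catppt',
  'xls' => 'xls2csv',
  'rtf' => 'unrtf'
}.freeze

# File extensions whose extractor binary cannot be found on PATH.
def unsupported_formats(env = ENV)
  REQUIRED_BINARIES.reject { |_ext, bin| which(bin, env) }.keys
end
```

Calling unsupported_formats with no arguments inspects the current environment; passing a hash with a PATH key makes the check easy to try against a specific directory.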

#19 Updated by Jens Krämer 6 months ago

Here's an additional patch which moves the text extraction into an ActiveJob job. By default these are executed inline, so the behaviour does not change. However users can now set up DelayedJob or one of the other possible ActiveJob backends to benefit from text extraction in the background.
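
The difference between inline and background execution can be modelled with a plain Ruby stand-in. This sketch deliberately does not use ActiveJob or reproduce the patch. The two runner classes and the extract_job lambda are hypothetical and only show why the default behaviour stays unchanged while a deferred backend postpones the work.

```ruby
# Runs the job immediately in the caller's thread (the default case).
class InlineRunner
  def enqueue(job, *args)
    job.call(*args)
  end
end

# Collects jobs and runs them later, standing in for a background backend.
class DeferredRunner
  def initialize
    @queue = []
  end

  def enqueue(job, *args)
    @queue << [job, args]
  end

  # Process everything that has been queued so far, oldest first.
  def work_off
    until @queue.empty?
      job, args = @queue.shift
      job.call(*args)
    end
  end
end

fulltexts = {}
# Hypothetical job body: a real one would load the attachment and extract its text.
extract_job = ->(attachment_id) { fulltexts[attachment_id] = "text of attachment #{attachment_id}" }

InlineRunner.new.enqueue(extract_job, 110) # fulltexts[110] is set right away

deferred = DeferredRunner.new
deferred.enqueue(extract_job, 111)         # nothing stored yet
deferred.work_off                          # now fulltexts[111] is set
```

Because the upload code only ever calls enqueue, switching between the two modes is a deployment decision rather than a code change.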

#20 Updated by Go MAEDA 6 months ago

The patch from Planio is very interesting. Thank you for sharing the patch.

I tried the patch, but the attachments.fulltext column was not updated in my environment. I found an error "uninitialized constant Redmine::TextExtractor::ZippedXmlHandler::Zip" in the log. Could you let me know how I can fix the problem?

log/development.log:

[ActiveJob] [ExtractFulltextJob] [b85b3211-0396-4ffa-a99d-ff877ea2ad6f] Performing ExtractFulltextJob from Inline(text_extraction) with arguments: 110
[ActiveJob] [ExtractFulltextJob] [b85b3211-0396-4ffa-a99d-ff877ea2ad6f]   Attachment Load (0.2ms)  SELECT  "attachments".* FROM "attachments" WHERE "attachments"."id" = ? LIMIT 1  [["id", 110]]
[ActiveJob] [ExtractFulltextJob] [b85b3211-0396-4ffa-a99d-ff877ea2ad6f] error in fulltext extraction: uninitialized constant Redmine::TextExtractor::ZippedXmlHandler::Zip
[ActiveJob] [ExtractFulltextJob] [b85b3211-0396-4ffa-a99d-ff877ea2ad6f] Performed ExtractFulltextJob from Inline(text_extraction) in 28.16ms
[ActiveJob] Enqueued ExtractFulltextJob (Job ID: b85b3211-0396-4ffa-a99d-ff877ea2ad6f) to Inline(text_extraction) with arguments: 110

about the attachment:

2.3.3 :001 > Attachment.find(110)
  Attachment Load (0.3ms)  SELECT  "attachments".* FROM "attachments" WHERE "attachments"."id" = ? LIMIT 1  [["id", 110]]
 => #<Attachment id: 110, container_id: 33, container_type: "Issue", filename: "test.docx", disk_filename: "170624114527_test.docx", filesize: 133663, content_type: "application/vnd.openxmlformats-officedocument.word...", digest: "0215fd360f2759b605151f171741b1a503f77d2bda5234d4ea...", downloads: 0, author_id: 1, created_on: "2017-06-24 02:45:27", description: "", disk_directory: "2017/06", fulltext: nil>

bin/about:

Environment:
  Redmine version                3.3.3.devel.16682
  Ruby version                   2.3.3-p222 (2016-11-21) [x86_64-darwin16]
  Rails version                  4.2.8
  Environment                    development
  Database adapter               SQLite
SCM:
  Subversion                     1.9.5
  Darcs                          2.12.0
  Mercurial                      3.8.4
  Cvs                            1.12.13
  Bazaar                         2.7.0
  Git                            2.11.0
  Filesystem
Redmine plugins:
  no plugin installed

#21 Updated by Go MAEDA 6 months ago

Go MAEDA wrote:

I tried the patch, but the attachments.fulltext column was not updated in my environment. I found an error "uninitialized constant Redmine::TextExtractor::ZippedXmlHandler::Zip" in the log. Could you let me know how I can fix the problem?

The workaround for the error:

diff --git a/lib/redmine/text_extractor.rb b/lib/redmine/text_extractor.rb
index 33f922f8d..ad78f69e5 100644
--- a/lib/redmine/text_extractor.rb
+++ b/lib/redmine/text_extractor.rb
@@ -1,3 +1,5 @@
+require 'zip'
+
 module Redmine
   class TextExtractor

#22 Updated by Mischa The Evil 5 months ago

I like (the proposed implementation of) this feature. +1 from me...

#23 Updated by Go MAEDA 5 months ago

  • Target version changed from Candidate for next major release to 4.1.0

I think this is a very important and long-awaited feature.
Let's discuss implementing this feature for 3.5.0.

#24 Updated by Go MAEDA 5 months ago

  • Subject changed from Full Text Search of files? to Full Text Search of files

#25 Updated by Mitsuyoshi Kawabata 4 months ago

+1

#26 Updated by Hirofumi Kadoya 4 months ago

+1

#27 Updated by okkez _ 4 months ago

I have considered implementing this feature, too.
This patch is nice and great work.

How about supporting only plain text as a first step?
I want to customize and extend the text extraction method via plugins or something similar.

#28 Updated by Kush Suryavanshi about 1 month ago

+1. It would be great if this happens in 3.5.0.
