Defect #20730

Fix tokenization of phrases with non-ascii chars

Added by Jens Krämer over 2 years ago. Updated over 2 years ago.

Status: Closed
Priority: Normal
Assignee: Jean-Philippe Lang
% Done: 0%
Category: Search engine
Target version: 3.0.6
Resolution: Fixed
Start date:
Due date:
Affected version:

Description

\w only matches ASCII characters; we should either use [:alnum:] instead, or simply match every character other than " inside the phrase. Test case included.
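The distinction can be seen directly in Ruby, where \w is ASCII-only by default while the POSIX bracket class [[:alnum:]] is encoding-aware and matches non-ASCII letters (a minimal illustration, not Redmine's search code):

```ruby
# frozen_string_literal: true

phrase = '日本語 テスト'

# \w matches only [a-zA-Z0-9_] in Ruby, so Japanese text yields no tokens:
p phrase.scan(/\w+/)          # => []

# [[:alnum:]] respects the string's UTF-8 encoding and matches kanji/kana:
p phrase.scan(/[[:alnum:]]+/) # => ["日本語", "テスト"]
```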

fix-tokenization-for-phrases-with-non-ascii-characte.patch (1.39 KB) Jens Krämer, 2015-09-13 05:53

Associated revisions

Revision 14662
Added by Jean-Philippe Lang over 2 years ago

Fix tokenization of phrases with non-ascii chars (#20730).

Patch by Jens Krämer.

History

#1 Updated by Go MAEDA over 2 years ago

  • Tracker changed from Patch to Defect
  • Target version set to 3.1.2

+1

Search keyword '"日本語 テスト"' (written in Japanese) matches both "日本語 テスト" and "日本語テスト" in the current trunk, but it should not match the latter.

expected:

Redmine::Search::Fetcher.new('"日本語 テスト"', ...).tokens => ['日本語 テスト']

actual:

Redmine::Search::Fetcher.new('"日本語 テスト"', ...).tokens => ['日本語', 'テスト']

This behavior can be fixed by this patch.
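A phrase-aware tokenizer along the lines the patch describes can be sketched as follows. This is an illustrative simplification, not Redmine's actual Fetcher implementation: quoted phrases are kept whole by matching any run of non-quote characters, while bare words use the encoding-aware [[:alnum:]] class.

```ruby
# frozen_string_literal: true

# Split a search query into tokens, keeping double-quoted phrases intact.
# Hypothetical helper for illustration; Redmine's real code lives in
# Redmine::Search::Fetcher#tokens.
def tokens(question)
  # '"[^"]*"' grabs a whole quoted phrase (any chars except the quote);
  # '[[:alnum:]]+' grabs a bare word, including non-ASCII letters.
  question.scan(/"[^"]*"|[[:alnum:]]+/).map { |t| t.delete('"') }
end

p tokens('"日本語 テスト"')  # => ["日本語 テスト"]  (phrase preserved)
p tokens('日本語 テスト')    # => ["日本語", "テスト"] (two separate words)
```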

#2 Updated by Jean-Philippe Lang over 2 years ago

  • Status changed from New to Closed
  • Assignee set to Jean-Philippe Lang
  • Target version changed from 3.1.2 to 3.0.6
  • Resolution set to Fixed

Patch applied, thanks.
