Defect #12641

Diff outputs become ??? in some non ASCII words.

Added by Toshi MARUYAMA almost 5 years ago. Updated over 4 years ago.

Status: Closed
Start date:
Priority: Normal
Due date:
Assignee: Toshi MARUYAMA
% Done: 0%
Category: I18n
Target version: 2.3.0
Resolution: Fixed
Affected version: 2.1.4

Description

An example is r11052 in #12640#note-2.

diff-r11052.png (20 KB) Toshi MARUYAMA, 2012-12-19 09:30

unified_diff.rb.diff - Correct UTF-8 parsing (787 Bytes) Filou Centrinov, 2013-03-05 00:16

unified_diff.rb.2.diff - Set utf-8 encoding (621 Bytes) Filou Centrinov, 2013-03-05 12:54


Related issues

Related to Redmine - Patch #12640: Russian "about_x_hours" translation change Closed

Associated revisions

Revision 11544
Added by Toshi MARUYAMA over 4 years ago

remove unnecessary h() from diff filename (#12641)

On Rails3, escaping is default.

Revision 11545
Added by Toshi MARUYAMA over 4 years ago

move utf8 encoding from view to UnifiedDiff (#12641)

Revision 11546
Added by Toshi MARUYAMA over 4 years ago

code cleanup (#12641)

Revision 11547
Added by Toshi MARUYAMA over 4 years ago

set html encoding utf8 at Diff class (#12641)

Revision 11549
Added by Toshi MARUYAMA over 4 years ago

fix that diff outputs become ??? in some non ASCII words (#12641)

Contributed by Filou Centrinov.

Revision 11550
Added by Toshi MARUYAMA over 4 years ago

svn propset svn:eol-style native to fixtures (#12641)

Revision 11551
Added by Toshi MARUYAMA over 4 years ago

Merged r11544, r11545, r11546, r11547, r11549 from trunk to 2.3-stable (#12641)

fix that diff outputs become ??? in some non ASCII words.

Contributed by Filou Centrinov.

Revision 11552
Added by Toshi MARUYAMA over 4 years ago

2.3-stable: svn propset svn:eol-style native to fixtures (#12641)

History

#1 Updated by Filou Centrinov over 4 years ago

The problem is that, for example, the following diff lines

- часа" 
+ часов" 

are parsed in Redmine as UTF-8 like this:

\xD1\x87\xD0\xB0\xD1\x81\xD0<span>\xB0</span>&quot;
\xD1\x87\xD0\xB0\xD1\x81\xD0<span>\xBE\xD0\xB2</span>&quot;

This is wrong because the leading byte \xD0 belongs to the two-byte Cyrillic character "а" inside the <span> tag, yet it ends up outside the <span> tag. The characters are therefore misinterpreted and displayed as "?".

Correct UTF-8 would be:

\xD1\x87\xD0\xB0\xD1\x81<span>\xD0\xB0</span>&quot;
\xD1\x87\xD0\xB0\xD1\x81<span>\xD0\xBE\xD0\xB2</span>&quot;

So for the first line we get "...<span>\xD0\xB0</span>..." instead of "...\xD0<span>\xB0</span>...". The attached patch searches back for the last leading byte if the first non-matching byte is a continuation byte (and not a leading byte or a single-byte character).

A continuation byte has the binary format 10xxxxxx, so we can detect it with myContinuationByte.ord.between?(128, 191).

This problem always occurs when the first differing bytes of the two lines are continuation bytes. Another example, in Japanese, can be found in #13350.
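The back-off described in this note can be sketched roughly as follows. This is only an illustration, not the attached patch: the helper names continuation_byte? and safe_common_prefix_length are hypothetical, and the sketch assumes both lines are valid UTF-8.

```ruby
# Hypothetical helpers for illustration; not taken from Redmine's unified_diff.rb.

# A UTF-8 continuation byte has the bit pattern 10xxxxxx (128..191).
def continuation_byte?(byte)
  byte.between?(128, 191)
end

# Length in bytes of the common prefix of a and b, moved back so that
# the prefix never ends in the middle of a multi-byte character.
def safe_common_prefix_length(a, b)
  a_bytes = a.bytes
  b_bytes = b.bytes
  max = [a_bytes.size, b_bytes.size].min

  # Naive byte-wise comparison: this alone can stop after a leading byte.
  i = 0
  i += 1 while i < max && a_bytes[i] == b_bytes[i]

  # If the first differing byte is a continuation byte, step back
  # until the split point sits in front of a leading byte.
  i -= 1 while i > 0 &&
               [a_bytes[i], b_bytes[i]].compact.any? { |byte| continuation_byte?(byte) }
  i
end

# For the two lines above (часа" and часов") the naive prefix is 7 bytes,
# which cuts the 4th character in half; the adjusted prefix is 6 bytes.
# safe_common_prefix_length("часа\"", "часов\"")  # => 6
```

The same back-off would be needed for the common suffix, which is not shown here.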

#2 Updated by Filou Centrinov over 4 years ago

A much better way to fix this problem is to set a UTF-8 encoding. :-)
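As a rough illustration of why the encoding matters (a sketch only, presumably close in spirit to the second patch but not copied from it): once a string is tagged as UTF-8, Ruby compares whole characters, so a multi-byte character can no longer be split. The helper name common_prefix_chars is hypothetical.

```ruby
# Hypothetical helper for illustration; not taken from Redmine's unified_diff.rb.

# Number of leading characters shared by a and b, after tagging both
# strings as UTF-8 so that comparison works per character, not per byte.
def common_prefix_chars(a, b)
  a_chars = a.dup.force_encoding('UTF-8').chars
  b_chars = b.dup.force_encoding('UTF-8').chars

  i = 0
  i += 1 while i < a_chars.size && i < b_chars.size && a_chars[i] == b_chars[i]
  i
end

# Even when the input arrives tagged as binary (ASCII-8BIT), the result
# is counted in characters: "час" is shared, the 4th character differs.
# common_prefix_chars("часа\"".b, "часов\"".b)  # => 3
```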

#3 Updated by Filou Centrinov over 4 years ago

The affected version is also 2.3 (devel)

#4 Updated by Toshi MARUYAMA over 4 years ago

  • Category set to I18n
  • Assignee set to Toshi MARUYAMA
  • Target version set to 2.4.0

#5 Updated by Toshi MARUYAMA over 4 years ago

  • Target version changed from 2.4.0 to 2.3.0

#6 Updated by Toshi MARUYAMA over 4 years ago

  • Status changed from New to Closed
  • Resolution set to Fixed

Committed in, thanks.
