Defect #24616

Should not replace all invalid utf8 characters (e.g in mail)

Added by Pavel Rosický 3 months ago. Updated about 1 month ago.

Status: Closed
Start date:
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: I18n
Target version: 3.4.0
Resolution: Fixed
Affected version:

Description

Hello,
I have an email that is encoded in UTF-8 but contains an invalid character. In this case, Redmine converts the content to US-ASCII and then back to UTF-8. This step replaces every non-ASCII character with "?", not just the invalid one. Why?

1) Failure:
MailHandlerTest#test_invalid_utf8 [/test/unit/mail_handler_test.rb:548]:
Expected: "Здравствуйте?" 
  Actual: "?????????????" 

In Redmine::CodesetUtil.replace_invalid_utf8(str) and Redmine::CodesetUtil.to_utf8(str, encoding), I changed

        str = str.encode("US-ASCII", :invalid => :replace, :undef => :replace, :replace => '?').encode("UTF-8")

to

        str = str.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => '?')

All tests pass with this change.
  Redmine version                3.3.1.stable
  Ruby version                   2.1.5-p273 (2014-11-13) [x64-mingw32]
  Rails version                  4.2.7.1
  Environment                    production
  Database adapter               Mysql2
SCM:
  Git                            2.10.1
  Filesystem
Redmine plugins:
  no plugin installed
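
For reference, the failure reported above can be reproduced outside of Redmine with a few lines of plain Ruby. This is only a sketch: via_us_ascii and via_utf8 are hypothetical names standing in for the old and the proposed conversion, and the trailing \xFF byte is an assumed example of an invalid character, not the actual byte from the email. The outputs shown are for Ruby 2.1 and later.

```ruby
# Hypothetical stand-ins for the two conversions discussed above.
# These are not Redmine's actual method names.
def via_us_ascii(str)
  str.encode("US-ASCII", :invalid => :replace, :undef => :replace, :replace => '?').encode("UTF-8")
end

def via_utf8(str)
  str.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => '?')
end

# Assumed input: valid Cyrillic text followed by one invalid byte.
subject = "Здравствуйте\xFF".dup.force_encoding("UTF-8")

p subject.valid_encoding?  # => false
p via_us_ascii(subject)    # => "?????????????"
p via_utf8(subject)        # => "Здравствуйте?"
```

The first variant loses the whole subject line, while the second keeps the readable text and replaces only the broken byte.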

invalid_utf8_test.patch - spec (1.49 KB) Pavel Rosický, 2016-12-15 01:42

defect-24616.diff - fix + tests (generated from Pavel Rosický's contribution) (1.73 KB) Go MAEDA, 2016-12-29 07:52

Associated revisions

Revision 16271
Added by Toshi MARUYAMA about 1 month ago

add more tests (#24616)

Revision 16272
Added by Toshi MARUYAMA about 1 month ago

unify duplicate codes (#24616)

Revision 16273
Added by Toshi MARUYAMA about 1 month ago

do not replace all invalid utf8 (#24616)

Revision 16274
Added by Toshi MARUYAMA about 1 month ago

additional test for mail by Pavel Rosický (#24616)

Revision 16275
Added by Toshi MARUYAMA about 1 month ago

svn propset svn:eol-style native test/fixtures/mail_handler/invalid_utf8.eml (#24616)

History

#1 Updated by Go MAEDA 3 months ago

Looks good to me.

# valid UTF-8 string
text = "こんにちは" 
p text.valid_encoding?  # => true

# making invalid UTF-8 string
text.force_encoding('ASCII-8BIT')
text[-1] = 0xff.chr
text.force_encoding("UTF-8")
p text.valid_encoding?  # => false
p text                  # => "こんにち\xE3\x81\xFF" 

# Current code of Redmine
p text.encode("US-ASCII", :invalid => :replace, :undef => :replace, :replace => '?').encode("UTF-8")
# => "??????" 

# Fixed code by Pavel Rosický
p text.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => '?')
# => "こんにち??" 

#2 Updated by Toshi MARUYAMA 3 months ago

  • Target version deleted (3.3.2)

Did you run the whole test suite?
Especially this test:
source:tags/3.3.1/test/unit/lib/redmine/codeset_util_test.rb

#3 Updated by Toshi MARUYAMA 3 months ago

Pavel Rosický wrote:

Hello,
I have an email that is encoded in UTF-8 but contains an invalid character. In this case, Redmine converts the content to US-ASCII and then back to UTF-8. This step replaces every non-ASCII character with "?", not just the invalid one. Why?

You can see the purpose of this function here:
source:tags/3.3.1/test/unit/lib/redmine/codeset_util_test.rb#L68

#4 Updated by Pavel Rosický 3 months ago

Thanks Toshi, I rechecked and all tests pass.

source:tags/3.3.1/test/unit/lib/redmine/codeset_util_test.rb#L68
In this case, my change has no effect on the result, because the string contains just one invalid UTF-8 character and everything else in it is plain ASCII.

# current
s1.encode('us-ascii', :invalid => :replace, :undef => :replace, :replace => '?').encode('utf-8')
# => "Texte encod? en ISO-8859-1."
# patched
s1.encode('utf-8', :invalid => :replace, :undef => :replace, :replace => '?')
# => "Texte encod? en ISO-8859-1."

But when a string combines valid non-ASCII characters with invalid UTF-8 bytes, the current code replaces both with "?". Try Go MAEDA's example.
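
The reason both kinds of characters are lost is that the US-ASCII conversion hits two different error conditions, and the current code replaces on both. A small sketch (conversion_error is a hypothetical helper written for this illustration) makes the distinction visible by converting without any replace options and reporting which error is raised:

```ruby
# Hypothetical helper: returns the class of the error raised by a plain
# conversion, or nil if the conversion succeeds.
def conversion_error(str, target)
  str.encode(target)
  nil
rescue EncodingError => e
  e.class
end

valid_non_ascii = "こ"                                # valid UTF-8, but no US-ASCII equivalent
invalid_bytes   = "\xFF".dup.force_encoding("UTF-8")  # not valid UTF-8 at all

p conversion_error(valid_non_ascii, "US-ASCII")  # => Encoding::UndefinedConversionError
p conversion_error(invalid_bytes, "US-ASCII")    # => Encoding::InvalidByteSequenceError
p conversion_error("abc", "US-ASCII")            # => nil
```

With US-ASCII as the target, :undef => :replace covers the first case (every valid non-ASCII character) and :invalid => :replace covers the second, so the two options together turn all non-ASCII content into "?".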

#5 Updated by Toshi MARUYAMA about 1 month ago

$ irb
1.9.3-p551 :001 > text = "こんにち\xE3\x81\xFF" 
 => "こんにち\xE3\x81\xFF" 
1.9.3-p551 :002 > text =  text.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => '?')
 => "こんにち\xE3\x81\xFF" 
1.9.3-p551 :003 > text.valid_encoding?
 => false 
$ irb
2.3.3 :001 > text = "こんにち\xE3\x81\xFF" 
 => "こんにち\xE3\x81\xFF" 
2.3.3 :002 > text =  text.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => '?')
 => "こんにち??" 
2.3.3 :003 > text.valid_encoding?
 => true 
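
Since the transcript above shows that a UTF-8 to UTF-8 encode leaves the string untouched on 1.9.3, one way to avoid relying on that behavior is to transcode through a different Unicode encoding and back. The following is only a sketch of that idea, not necessarily what was committed: replace_invalid_utf8_sketch is a hypothetical name, and the choice of UTF-16LE as the intermediate encoding is an assumption. The output shown was reasoned for Ruby 2.x and later; the same behavior on 1.9.3 is assumed, not verified here.

```ruby
# Hypothetical helper, not Redmine's code: replace only the invalid bytes
# of a UTF-8 string by transcoding to UTF-16LE and back.
def replace_invalid_utf8_sketch(str)
  str = str.dup.force_encoding("UTF-8")
  return str if str.valid_encoding?
  # A real transcoding step forces the UTF-8 decoder to run, so invalid
  # byte sequences are detected and replaced.
  str.encode("UTF-16LE", :invalid => :replace, :undef => :replace, :replace => '?').encode("UTF-8")
end

text  = "こんにち\xE3\x81\xFF"
fixed = replace_invalid_utf8_sketch(text)

p fixed                  # => "こんにち??"
p fixed.valid_encoding?  # => true
```

On Ruby 2.1 and later, String#scrub('?') is a more direct tool for the same job, but it is not available on 1.9.3.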

#6 Updated by Toshi MARUYAMA about 1 month ago

Pavel Rosický wrote:

Hello,
I have an email that is encoded in UTF-8 but contains an invalid character. In this case, Redmine converts the content to US-ASCII and then back to UTF-8. This step replaces every non-ASCII character with "?", not just the invalid one. Why?

For compatibility with the Ruby 1.8.7 behavior.
source:tags/2.6.9/lib/redmine/codeset_util.rb

#7 Updated by Toshi MARUYAMA about 1 month ago

  • Subject changed from encoding error if email contains an invalid utf8 character to Should not replace all invalid utf8 characters
  • Category changed from Email receiving to I18n

#8 Updated by Toshi MARUYAMA about 1 month ago

  • Subject changed from Should not replace all invalid utf8 characters to Should not replace all invalid utf8 characters (e.g in mail)

#9 Updated by Toshi MARUYAMA about 1 month ago

  • Status changed from New to Closed
  • Target version set to 3.4.0
  • Resolution set to Fixed

I have committed r16273 in a form that passes on Ruby 1.9.3.
I don't want to change the behavior on the stable branches.
