Opened 2 years ago

Closed 2 years ago

Last modified 7 months ago

#16298 closed defect (fixed)

String length is incorrect when converting from UTF-16 to UTF-32 wxString

Reported by: plorkyeran Owned by: VZ
Priority: normal Milestone: 3.1.0
Component: base Version: dev-latest
Keywords: Cc:
Blocked By: Blocking:
Patch: yes

Description

auto str = wxString::FromUTF8("\xf0\x9f\x98\x84");
printf("%s\n", str.utf8_str().data());
assert(str.size() == 1);

wxMBConvUTF16 conv;
auto str2 = wxString(conv.cMB2WX((const char *)u"\xd83d\xde04"));
printf("%s\n", str2.utf8_str().data());
assert(str2.size() == 1);

This correctly prints the same character both times, but the second assert fails due to str2.size() being 2, and with longer strings anything that operates based on the string length ends up reading uninitialized memory.

The cause of this is that wxMBConvUTF16straight::ToWChar skips calculating the required buffer size by taking advantage of that UTF-32 is never more code units than UTF-16 and allocating a larger buffer than needed when surrogate pairs are involved, but wxMBConv::cMB2WC assumes the buffer size is the actual string size and discards the length returned by the actual conversion. The attached patch fixes this by making wxMBConv::cMB2WC shrink the buffer to the length actually used by the conversion.

Attachments (1)

utf16-conv-length.patch download (973 bytes) - added by plorkyeran 2 years ago.

Download all attachments as: .zip

Change History (8)

Changed 2 years ago by plorkyeran

comment:1 Changed 2 years ago by vadz

  • Milestone set to 3.1.0

Thanks for finding this problem!

I'm not sure about the fix though, I'm tempted to fix wxMBConvUTF16::ToWChar(NULL) return value instead, it clearly doesn't satisfy the requirements to this method described in the comment in wxMBConv. What do you think?

We also clearly need a test case for this in source:wxWidgets/trunk/tests/strings/unicode.cpp

comment:2 Changed 2 years ago by plorkyeran

Making ToWChar(NULL) calculate the actual length is definitely the way to go if that's what the interface is documented as doing. It'd be very difficult to construct scenarios where the performance hit is meaningful.

comment:3 Changed 2 years ago by VZ

  • Owner set to VZ
  • Resolution set to fixed
  • Status changed from new to closed

In 76618:

Fix wrong resulting string length in UTF-16 to wchar_t conversion.

Don't optimize the returned length for surrogate-less case, this does save a
pass of the string but at the price of returning a wrong result, which is not
worth it, just compute the really required length exactly.

Closes #16298.

comment:4 Changed 2 years ago by VZ

In 76622:

Fix wrong resulting string length in UTF-16 to wchar_t conversion.

Don't optimize the returned length for surrogate-less case, this does save a
pass of the string but at the price of returning a wrong result, which is not
worth it, just compute the really required length exactly.

Closes #16298.

comment:5 Changed 2 years ago by VZ

In 78022:

(The changeset message doesn't reference this ticket)

comment:6 Changed 18 months ago by Vadim Zeitlin <vadim@…>

In 52c8a3684c42baeb8b83008452e5990ec6e0bfa5/git-wxWidgets:

Disable surrogate-related unit test for MSW.

This test can't work when the in-memory representation is UTF-16, as we can't
convert surrogates to anything else in this case.

This fixes the unit tests broken since r76622, see #16298.

This is the backport of 0bb7676889d2a2031d68595f72bf2680dd415470 from master.

comment:7 Changed 7 months ago by Vadim Zeitlin <vadim@…>

In 5a92181ac1515006d464ec4d6b39465ada146df6/git-wxWidgets:

Fix the length returned by UTF-32 conversion for non-BMP input

Don't optimize the required length as this is a tiny gain resulting in big
problems with the strings containing surrogates for which the actual result is
shorter than the length returned, resulting in extra NUL bytes at the end of
the converted buffer.

This is similar to 3410aa372fe142bdb8e047eb303bf70b7adba039 (see #16298) but
for UTF-32 and not UTF-16.

Closes #17070.

Note: See TracTickets for help on using tickets.