Opened 5 months ago

Closed 5 months ago

Last modified 11 days ago

#16298 closed defect (fixed)

String length is incorrect when converting from UTF-16 to UTF-32 wxString

Reported by: plorkyeran Owned by: VZ
Priority: normal Milestone: 3.1.0
Component: base Version: dev-latest
Keywords: Cc:
Blocked By: Blocking:
Patch: yes

Description

auto str = wxString::FromUTF8("\xf0\x9f\x98\x84");
printf("%s\n", str.utf8_str().data());
assert(str.size() == 1);

wxMBConvUTF16 conv;
auto str2 = wxString(conv.cMB2WX((const char *)u"\xd83d\xde04"));
printf("%s\n", str2.utf8_str().data());
assert(str2.size() == 1);

This correctly prints the same character both times, but the second assert fails due to str2.size() being 2, and with longer strings anything that operates based on the string length ends up reading uninitialized memory.

The cause of this is that wxMBConvUTF16straight::ToWChar skips calculating the required buffer size by taking advantage of that UTF-32 is never more code units than UTF-16 and allocating a larger buffer than needed when surrogate pairs are involved, but wxMBConv::cMB2WC assumes the buffer size is the actual string size and discards the length returned by the actual conversion. The attached patch fixes this by making wxMBConv::cMB2WC shrink the buffer to the length actually used by the conversion.

Attachments (1)

utf16-conv-length.patch download (973 bytes) - added by plorkyeran 5 months ago.

Download all attachments as: .zip

Change History (6)

Changed 5 months ago by plorkyeran

comment:1 Changed 5 months ago by vadz

  • Milestone set to 3.1.0

Thanks for finding this problem!

I'm not sure about the fix though, I'm tempted to fix wxMBConvUTF16::ToWChar(NULL) return value instead, it clearly doesn't satisfy the requirements to this method described in the comment in wxMBConv. What do you think?

We also clearly need a test case for this in source:wxWidgets/trunk/tests/strings/unicode.cpp

comment:2 Changed 5 months ago by plorkyeran

Making ToWChar(NULL) calculate the actual length is definitely the way to go if that's what the interface is documented as doing. It'd be very difficult to construct scenarios where the performance hit is meaningful.

comment:3 Changed 5 months ago by VZ

  • Owner set to VZ
  • Resolution set to fixed
  • Status changed from new to closed

In 76618:

Fix wrong resulting string length in UTF-16 to wchar_t conversion.

Don't optimize the returned length for surrogate-less case, this does save a
pass of the string but at the price of returning a wrong result, which is not
worth it, just compute the really required length exactly.

Closes #16298.

comment:4 Changed 5 months ago by VZ

In 76622:

Fix wrong resulting string length in UTF-16 to wchar_t conversion.

Don't optimize the returned length for surrogate-less case, this does save a
pass of the string but at the price of returning a wrong result, which is not
worth it, just compute the really required length exactly.

Closes #16298.

comment:5 Changed 11 days ago by VZ

In 78022:

Disable surrogate-related unit test for MSW.

This test can't work when the in-memory representation is UTF-16, as we can't
convert surrogates to anything else in this case.

This fixes the unit tests broken since r76622, see #16298.

Note: See TracTickets for help on using tickets.