Opened 2 years ago

Last modified 15 months ago

#14720 confirmed defect

wxTextInputStream doesn't work correctly with its default wxConvAuto

Reported by: Sentosa
Owned by:
Priority: normal
Milestone: 3.2.0
Component: wxOSX-Cocoa
Version: 2.9.4
Keywords:
Cc:
Blocked By:
Blocking:
Patch: no

Description

If you use wxExecute() to capture the output of a command on OS X (wxOSX-Cocoa, 64-bit), non-ASCII Unicode characters come out wrong in the resulting wxArrayString.

  • Name a file äöüàéè.txt
  • Take the exec sample and pass ls as the command.

wxString cmd = _T("ls");
wxArrayString output, errors;
int code = wxExecute(cmd, output, errors);

In output you see that äöüàéè is not shown correctly.

It happens with SDK 10.5, 10.6, 10.7 and 10.8.

In wxWidgets 2.8 (Carbon, 32-bit, Unicode build) it was working.

Attachments (2)

äöüàéè.txt (29 bytes) - added by Sentosa 2 years ago.
Exec wxWidgets sample.png (45.3 KB) - added by Sentosa 2 years ago.
output

Change History (14)

comment:1 Changed 2 years ago by vadz

  • Milestone 2.9.5 deleted
  • Summary changed from wxExecute UNICODE problem to wxExecute() Unicode output not converted correctly in wxOSX

What's the encoding used for the file names under OS X? E.g. what does ls | od -c show for this file?

comment:2 Changed 2 years ago by Sentosa

  • Standard OSX installation
  • File generated with $ date > äöüàéè.txt

$ ls | od -c
0000000 a ̈ o ̈ u ̈ a ̀ e ́ e
0000020 ̀ . t x t \n
0000027

See attached file and picture.

comment:3 Changed 2 years ago by vadz

Ugh, I should have probably asked for "od -t x1", sorry. I thought "-c" would use octal escapes for non-ASCII characters.

Anyhow, I strongly suspect it's another case of NFC vs NFD. I.e. we should recompose the stuff we get from wxExecute().

comment:4 Changed 2 years ago by Sentosa

$ ls | od -t x1
0000000 61 cc 88 6f cc 88 75 cc 88 61 cc 80 65 cc 81 65
0000020 cc 80 2e 74 78 74 0a
0000027

$ ls
äöüàéè.txt

comment:5 Changed 2 years ago by vadz

  • Milestone set to 3.0
  • Status changed from new to confirmed

Thanks, that's it. cc 88 is the "combining diaeresis" in UTF-8 so it's really in decomposed form. The surprising thing is that it worked in 2.8 as this should have been just as true there.
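
To make the byte-level reading above concrete, here is a minimal standalone sketch (standard C++ only, no wxWidgets) that decodes the start of the dump from comment:4 into code points. DecodeUtf8 is a hypothetical helper written for this illustration, not a wxWidgets function, and it deliberately skips the overlong-form and surrogate checks a real decoder needs.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 byte string into Unicode code points.
// Simplified on purpose: it rejects truncated or malformed sequences but does
// not check for overlong encodings or surrogates.
std::vector<char32_t> DecodeUtf8(const std::string& bytes)
{
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size())
    {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;

        // The lead byte tells us how many bytes the sequence occupies.
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > bytes.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k)
        {
            const unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }

        out.push_back(cp);
        i += len;
    }
    assert(i == bytes.size()); // every byte was consumed exactly once
    return out;
}
```

Feeding it the bytes 61 cc 88 6f cc 88 gives U+0061 U+0308 U+006F U+0308, i.e. each base letter followed by a separate combining diaeresis rather than the precomposed U+00E4 and U+00F6.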

Anyhow, the really problematic thing is that it's not clear at which level the string should be recomposed. We can't do it in wxExecute() itself because it works with bytes, not strings. We could do it in the ReadAll() helper in source:wxWidgets/trunk/src/common/utilscmn.cpp, but this seems like a rather ad hoc hack and won't help if you're reading from the child process directly instead of using the overload taking wxArrayString. It would be better to do it at the wxProcess level, i.e. recompose strings in wxPipeInputStream, but this is again impossible because it only reads bytes, not strings.
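
As a rough illustration of why recomposition has to happen on decoded text rather than on raw bytes, here is a standalone sketch that works on code points. It is not a real NFC implementation: ComposeSketch and its table are hypothetical and cover only the six base + mark pairs from this ticket's file name. Proper normalization needs the full Unicode composition data, for example from ICU or, on OS X, CFStringNormalize().

```cpp
#include <cassert>
#include <string>

// One canonical composition: base letter + combining mark -> precomposed letter.
struct Composition { char32_t base; char32_t mark; char32_t composed; };

// Hypothetical table limited to the pairs occurring in "äöüàéè".
static const Composition kCompositions[] = {
    { U'a', 0x0308, 0x00E4 }, // a + COMBINING DIAERESIS    -> ä
    { U'o', 0x0308, 0x00F6 }, // o + COMBINING DIAERESIS    -> ö
    { U'u', 0x0308, 0x00FC }, // u + COMBINING DIAERESIS    -> ü
    { U'a', 0x0300, 0x00E0 }, // a + COMBINING GRAVE ACCENT -> à
    { U'e', 0x0301, 0x00E9 }, // e + COMBINING ACUTE ACCENT -> é
    { U'e', 0x0300, 0x00E8 }, // e + COMBINING GRAVE ACCENT -> è
};

// Replace each known (base, mark) pair by its precomposed form and copy
// everything else through unchanged.
std::u32string ComposeSketch(const std::u32string& in)
{
    std::u32string out;
    for (char32_t c : in)
    {
        bool composed = false;
        if (!out.empty())
        {
            for (const Composition& entry : kCompositions)
            {
                if (entry.base == out.back() && entry.mark == c)
                {
                    out.back() = entry.composed;
                    composed = true;
                    break;
                }
            }
        }
        if (!composed)
            out.push_back(c);
    }
    assert(out.size() <= in.size()); // composing never lengthens the text
    return out;
}
```

The input is already a sequence of code points, which is exactly what is not available at the wxExecute() or wxPipeInputStream level, where only bytes are seen.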

I'm really not sure what to do here, any ideas?

comment:6 Changed 2 years ago by Sentosa

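It works this way:

#include <wx/process.h>
#include <wx/txtstrm.h>
#include <wx/utils.h>

void ExecCommand(const wxString &cmd, wxArrayString &output)
{
    wxProcess p;
    p.Redirect();   // redirect the child's standard streams so its output can be read
    wxExecute(cmd, wxEXEC_SYNC, &p);

    wxInputStream *i = p.GetInputStream();
    if(i)
    {
        // Decode explicitly as UTF-8 instead of relying on the default wxConvAuto.
        wxTextInputStream t(*i, " \t", wxConvUTF8);

        while(!i->Eof())
        {
            output.Add(t.ReadLine());
        }
    }
}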

comment:7 Changed 2 years ago by vadz

I don't see how it can work... you should still be getting the decomposed form in wxString. OTOH it's true that I didn't look at your screenshot carefully enough before, and now that I do, I don't understand what <i!> (U+00EC, c3 ac in UTF-8) is doing there either. Somebody (possibly me...) really needs to debug it further.

comment:8 Changed 2 years ago by Sentosa

I think this
wxTextInputStream t(*i, " \t", wxConvUTF8);
should be the default, i.e. behave the same as
wxTextInputStream t(*i);
but it does not. The second one gives wrong characters.

comment:9 Changed 2 years ago by vadz

It should be the same because wxConvAuto (used by default) should try UTF-8 and only fall back to something else if decoding using UTF-8 fails. I have no idea why it would fail though, which is why it needs to be debugged. If you can do this, it would be very welcome.

comment:10 Changed 2 years ago by vadz

The real question remains the one from comment:5 and I still have no good answer for it.

Any ideas/help?

comment:11 Changed 15 months ago by vadz

  • Milestone changed from 3.0 to 3.2
  • Summary changed from wxExecute() Unicode output not converted correctly in wxOSX to wxTextInputStream doesn't work correctly with its default wxConvAuto

I did debug it and actually the problem is simpler than NFC/NFD. Or, rather, there is this problem too but there is one in wxTextInputStream which breaks everything before we can get to that problem.

As comment:8 noticed, wxTextInputStream behaves differently when constructed with wxConvUTF8 than when using wxConvAuto, which is the default. This happens because when we try reading the 0xcc byte (the first byte of the UTF-8 encoding of U+0308 COMBINING DIAERESIS), wxConvAuto can't decode it using the default UTF-8 and falls back to Latin-1, which it then uses for decoding all the rest.

I'm going to use a hack here to at least make this particular case work. The real problem is that wxConvAuto doesn't work with the wxTextInputStream approach: it needs to be given enough text to really determine the encoding, but here it's fed byte by byte. I'm really not sure what to do about it, but this must be fixed somehow.
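
The failure mode can be reproduced with a toy model in standard C++. Neither class below is wxWidgets code and both names are hypothetical: NaiveAutoDecoder mimics a converter that has to judge each byte in isolation and switches to Latin-1 for good on the first byte it cannot decode, while BufferingUtf8Decoder shows one possible alternative that holds back an incomplete sequence until its remaining bytes arrive. This is only a sketch of the idea, not a proposal for how wxConvAuto should be changed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length of a UTF-8 sequence according to its lead byte, 0 if not a lead byte.
std::size_t Utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

// Toy model of the bug: every byte is decoded on its own.
class NaiveAutoDecoder
{
public:
    void Feed(unsigned char byte)
    {
        // A lone non-ASCII byte is never complete UTF-8, so fall back for good.
        if (byte >= 0x80)
            m_latin1 = true;
        // ASCII and Latin-1 both map a byte to the code point of the same value.
        m_text.push_back(byte);
    }
    bool FellBack() const { return m_latin1; }
    const std::u32string& Text() const { return m_text; }

private:
    bool m_latin1 = false;
    std::u32string m_text;
};

// Possible alternative: keep the bytes of an unfinished sequence pending.
class BufferingUtf8Decoder
{
public:
    // Returns false if the byte cannot be part of well-formed UTF-8.
    bool Feed(unsigned char byte)
    {
        if (m_pending.empty())
        {
            m_needed = Utf8SequenceLength(byte);
            if (m_needed == 0)
                return false;
        }
        else if ((byte & 0xC0) != 0x80)
        {
            return false;
        }

        m_pending.push_back(static_cast<char>(byte));
        if (m_pending.size() < m_needed)
            return true; // not an error, just wait for more bytes

        assert(m_needed >= 1 && m_needed <= 4);
        static const unsigned char leadMasks[] = { 0x7F, 0x1F, 0x0F, 0x07 };
        char32_t cp = static_cast<unsigned char>(m_pending[0]) & leadMasks[m_needed - 1];
        for (std::size_t k = 1; k < m_needed; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(m_pending[k]) & 0x3F);

        m_text.push_back(cp);
        m_pending.clear();
        return true;
    }
    const std::u32string& Text() const { return m_text; }

private:
    std::size_t m_needed = 0;
    std::string m_pending;
    std::u32string m_text;
};
```

Fed the bytes 61 cc 88 one at a time, the naive model falls back at 0xcc and produces three code points (U+0061, U+00CC, U+0088), whereas the buffering one produces U+0061 followed by U+0308.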

comment:12 Changed 15 months ago by VZ

(In [74946]) Fix capturing non-ASCII output using wxExecute().

Explicitly use wxConvLibc with wxTextInputStream to make sure we correctly
decode non-ASCII data in the subprocess output.

This is a hack, the real solution would be to make wxTextInputStream work
properly with wxConvAuto.

See #14720.
