Ticket #10063 (closed defect: worksforme)

Opened 6 weeks ago

Last modified 5 weeks ago

regex find fails under Unicode, succeeds in regular build

Reported by: widgets
Owned by:
Priority: normal
Milestone:
Component: wxMSW
Version: 2.8.9
Keywords: unicode regex
Cc:
Blocked By:
Patch: no
Blocking:

Description

I'm using MS VC++ Express 2008 to compile Dave Silvia's regex tester program against the DLL build of wxWidgets 2.8.9. I'm using it to verify a problem I first hit in my own Unicode app build; see the wxWidgets forum: http://wxforum.shadonet.com/viewtopic.php?t=21267

The regular expression is one of these two:
Received: from ([[:alnum:][:punct:][:space:]]+?) for <
Received: from ([[:alnum:][:punct:][:space:]]+) for <

The test text:
(excluding the ++++ delimiters)
+++++++++++++++++++++++++++++++++
From - Tue Oct 07 12:09:18 2008
X-Account-Key: account5
X-UIDL: <000301c92868$5f299400$0b2be050@JBAENA>
X-Mozilla-Status: 0001
X-Mozilla-Status2: 10000000
X-Mozilla-Keys:
Return-Path: <rioux2005@…>
Received: from priv-edmwaa02.teldplanet.net ([204.209.205.55])

by priv-edmwes50.teld.net
(InterMail vM.7.08.02.00 201-2186-121-20061213) with ESMTP
id <20081007103518.EQVT3012.priv-edmwes50.teld.net@…>
for <somebody@…>; Tue, 7 Oct 2008 04:35:18 -0600

Received: from 80.224.43.11.static.user.ono.com (80.224.43.11.static.user.ono.com [80.224.43.11])

by priv-edmwaa02.teld.net (BorderWare Security Platform) with ESMTP id 805A06263C193DFF
for <somebody@…>; Tue, 7 Oct 2008 04:35:11 -0600 (MDT)

From: Chae <rioux2005@…>
+++++++++++++++++++++++++++++++++

Compiled without Unicode support, the tester finds the string as expected with either regex. Compiled with Unicode support, it fails to find it. I could recompile my own app without Unicode support, and I will if I have to, but I think the regex tester would work quite well as a test app for this problem.

Attachments

console.cpp (0.6 kB) - added by mweth 5 weeks ago.
regex.txt (0.8 kB) - added by mweth 5 weeks ago.
console.zip (2.8 kB) - added by widgets 5 weeks ago.

Change History

  Changed 6 weeks ago by widgets

  • status changed from new to closed
  • resolution set to fixed

I ticked 2.8.8 in the submission since the tracker seemed to reject my ticket without the field and there was no 2.8.9 option.

  Changed 6 weeks ago by widgets

  • status changed from closed to reopened
  • resolution deleted

I failed to notice the 'close' check box - the problem still exists.

  Changed 6 weeks ago by robind

  • version changed from 2.8.8 to 2.8.9

in reply to: ↑ description   Changed 6 weeks ago by mweth

I tried it with the non-greedy pattern:
"Received: from ([[:alnum:][:punct:][:space:]]+?) for <"
using wxRE_ADVANCED

And on my system too it matched differently with and without Unicode.

Without Unicode the first match was:
Received: from priv-edmwaa02.teldplanet.net ([204.209.205.55])

by priv-edmwes50.teld.net
(InterMail vM.7.08.02.00 201-2186-121-20061213) with ESMTP
id <20081007103518.EQVT3012.priv-edmwes50.teld.net@priv-

edmwaa02.teld.net>

for <

But with Unicode it misses that one and the first match was:
Received: from 80.224.43.11.static.user.ono.com
(80.224.43.11.static.user.ono.com [80.224.43.11])

by priv-edmwaa02.teld.net (BorderWare Security Platform) with

ESMTP id 805A06263C193DFF

for <

The difference is [:punct:]. In a non-Unicode build it corresponds to
ispunct(), while in a Unicode build it corresponds to the Unicode
punctuation classes.

The characters that aren't matching are '<' and '>' as their
category is "Symbol, Math [Sm]" rather than one of the punctuation
classes.
http://www.fileformat.info/info/unicode/char/003C/index.htm
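
To make the mismatch concrete, here is a minimal sketch in plain standard C++ (not wxWidgets code, and not how the library itself decides). It assumes the default "C" locale for ispunct(), and it approximates the Unicode side with a hand-written list of the ASCII characters whose Unicode general category is a Symbol (Sc, Sk or Sm) rather than Punctuation. All helper names are made up for illustration.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: ASCII characters that ispunct() accepts in the "C"
// locale but that Unicode classifies as Symbols (Sc, Sk, Sm), not as
// Punctuation. The list is written by hand for this example.
bool is_ascii_symbol_not_punct(char c) {
    static const std::string symbols = "$+<=>^`|~";
    return symbols.find(c) != std::string::npos;
}

// Rough ASCII-only approximation of a category-based [:punct:].
bool is_unicode_style_punct(char c) {
    return std::ispunct(static_cast<unsigned char>(c)) != 0
        && !is_ascii_symbol_not_punct(c);
}

// Collect the ASCII characters on which the two definitions disagree.
std::string punct_disagreements() {
    std::string out;
    for (int c = 0; c < 128; ++c) {
        const bool c_locale = std::ispunct(c) != 0;
        const bool unicode_style = is_unicode_style_punct(static_cast<char>(c));
        if (c_locale != unicode_style) out.push_back(static_cast<char>(c));
    }
    return out;
}
```

Calling punct_disagreements() lists the characters where the two definitions differ. '<' and '>' are among them, which is consistent with the first Received: block (the one containing an id in angle brackets) no longer matching.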

Also note that even in non-Unicode mode it won't necessarily work in
all locales, as ispunct() is locale dependent.

Can you just use this instead?
"Received: from (.+?) for <"

follow-up: ↓ 6   Changed 6 weeks ago by widgets

Thanks for looking into this issue.

The problem I had was that it wouldn't find either string.

As you explained it likely is a difference in locale, although I'll have to see what mine is set at. ;-)
I've never had to even think about this issue, let alone find out what mine is set to

It would explain why the regexTester seems to behave the same way as my app.

I have meanwhile converted my app to ANSI build so that I can proceed while I resolve this issue. In the process I have found that I will have to change my regex in any case since I will have to change the way I attempt to handle the various possibilities.

This regex was my first cut at such a beastie as well as my first Unicode app, so I have had to change my approach a few times already. :-)

As it is, the differences in what is considered 'punctuation' by ANSI versus Unicode versus e-mail header syntax rules probably means that I'd better define my own set of characters which the e-mail header rules consider legal in this context.
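
Both workarounds mentioned in this thread, the catch-all capture suggested earlier and an explicit character set, can be sketched with std::regex from the C++ standard library. This is an illustration under assumptions, not the wxRegEx code from the ticket: [\s\S] stands in for '.' because '.' in std::regex's default ECMAScript grammar does not match line terminators, and the explicit set is a hand-picked ASCII list rather than a rendering of the e-mail header rules. The function and variable names are hypothetical.

```cpp
#include <regex>
#include <string>

// Returns capture group 1 of the first match, or "" when nothing matches.
std::string first_capture(const std::string& text, const std::regex& re) {
    std::smatch m;
    return std::regex_search(text, m, re) ? m[1].str() : std::string();
}

// Catch-all variant: [\s\S] matches any character, newlines included.
const std::regex kAnyChar(R"re(Received: from ([\s\S]+?) for <)re");

// Explicit-set variant: only the characters listed here may appear in the
// capture. The list is an assumption made for this example; adjust it to
// whatever the headers are allowed to contain.
const std::regex kExplicitSet(
    R"re(Received: from ([A-Za-z0-9\s.\[\]()<>@:;,_-]+?) for <)re");
```

Because neither pattern relies on [:punct:], the result should not depend on how a given build or locale classifies '<' and '>'. For example, first_capture(headers, kExplicitSet) returns the text between "Received: from " and the first " for <", or an empty string if a character outside the set gets in the way.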

So the only issue I have to resolve now is how to set my locale, or at least keep the issue in mind while sorting out the best approach for my app.

in reply to: ↑ 5   Changed 5 weeks ago by mweth

  • status changed from reopened to closed
  • resolution set to worksforme

> The problem I had was that it wouldn't find either string.
>
> As you explained it likely is a difference in locale,

No, actually I was saying that the Unicode build doesn't depend on the
locale.

I've just rechecked that it does find a match with a Windows build. It
works here so I guess you must have something else wrong too, such as
a character set conversion or something like that. I can attach the
test program I used if that helps.

> As it is, the differences in what is considered 'punctuation' by
> ANSI versus Unicode versus e-mail header syntax rules probably means
> that I'd better define my own set of characters which the e-mail
> header rules consider legal in this context.

Yes it isn't great that [:punct:] differs for '<' and '>' between
Unicode and non-Unicode. Looking at the Unicode standard, it seems
that the way we have it is indeed what they recommend:

http://unicode.org/reports/tr18/#Compatibility_Properties

though they've also got a compatibility compromise that does include
most math symbols.

follow-up: ↓ 8   Changed 5 weeks ago by widgets

As far as the 'locale' issue goes, I had been wondering, but since I am not too familiar with it or its impact on Unicode, I very much appreciate the clarification.

If you could attach the test program, I would be very grateful - and perhaps it might help someone else down the road as well.

While in this instance, I thought it best to reorganize my code to avoid this specific problem I do want to get more comfortable with Unicode and regexes and so I would be interested in pursuing the issue for my own satisfaction and education.

In any case, "Thank you"

Changed 5 weeks ago by mweth

Changed 5 weeks ago by mweth

in reply to: ↑ 7   Changed 5 weeks ago by mweth

> If you could attach the test program, I would be very grateful - and perhaps it might help someone else down the road as well.

Well, it's not all that helpful, but maybe it will be of some help isolating the problem.

You can copy the two files into samples\console, overwriting console.cpp, then
recompile the sample and run it like so:

C:\WX_2_8_BRANCH\samples\console>vc_mswud\console "Received: from ([[:alnum:][:punct:][:space:]]+?) for <" < regex.txt
Received: from 80.224.43.11.static.user.ono.com
(80.224.43.11.static.user.ono.com [80.224.43.11])

by priv-edmwaa02.teld.net (BorderWare Security Platform) with

ESMTP id 805A06263C193DFF

for <

HTH,
Mike

follow-up: ↓ 10   Changed 5 weeks ago by widgets

Again, many thanks, Mike.

I have compiled the program as you suggested - I made a separate VS project - but unfortunately I am having no luck, with either the release or the debug version, compiled either monolithic or using DLLs.

The program compiles without problems, but I don't get the expected output running it either from within the IDE or from the command line :-(

At this time, though, I'm not sure it is worth chasing; perhaps with time I'll stumble across the reason, although it sort of bothers me to leave it unresolved.

in reply to: ↑ 9   Changed 5 weeks ago by mweth

> I have compiled the program as you suggested - I made a separate VS project - but unfortunately I am having no luck, with either the release or the debug version, compiled either monolithic or using DLLs.
>
> The program compiles without problems, but I don't get the expected output running it either from within the IDE or from the command line :-(

When you ran it, did you redirect the text into it?

e.g. vc_mswud\console "pattern" < regex.txt

> At this time, though, I'm not sure it is worth chasing; perhaps with time I'll stumble across the reason, although it sort of bothers me to leave it unresolved.

If you attach an example program that illustrates the problem I'll look at it.

Changed 5 weeks ago by widgets

follow-ups: ↓ 12 ↓ 13   Changed 5 weeks ago by widgets

Replying to mweth:

> > I have compiled the program as you suggested - I made a separate VS project - but unfortunately I am having no luck, with either the release or the debug version, compiled either monolithic or using DLLs.
> >
> > The program compiles without problems, but I don't get the expected output running it either from within the IDE or from the command line :-(
>
> When you ran it, did you redirect the text into it?
>
> e.g. vc_mswud\console "pattern" < regex.txt

Yes, that is how I tried to run it.
But both the release and debug versions throw exceptions complaining about unbalanced [] brackets.
The debug version also throws an assert about a failed regex compile first.

> > At this time, though, I'm not sure it is worth chasing; perhaps with time I'll stumble across the reason, although it sort of bothers me to leave it unresolved.
>
> If you attach an example program that illustrates the problem I'll look at it.

I built it as a monolithic Unicode exe, but it is too large (just less than 500KB, zipped) to attach.
If you want me to attach the DLL version, I can do that.

I'll attach the zipped up project file and sources, they are only 3KB

in reply to: ↑ 11   Changed 5 weeks ago by mweth

> > When you ran it, did you redirect the text into it?
> >
> > e.g. vc_mswud\console "pattern" < regex.txt
>
> Yes, that is how I tried to run it.
> But both the release and debug versions throw exceptions complaining about unbalanced [] brackets.
> The debug version also throws an assert about a failed regex compile first.

The pattern comes from the command line.

> > > At this time, though, I'm not sure it is worth chasing; perhaps with time I'll stumble across the reason, although it sort of bothers me to leave it unresolved.
> >
> > If you attach an example program that illustrates the problem I'll look at it.
>
> I built it as a monolithic Unicode exe, but it is too large (just less than 500KB, zipped) to attach.
> If you want me to attach the DLL version, I can do that.
>
> I'll attach the zipped up project file and sources, they are only 3KB

Yes, just attach the source, but that is my own test program, which we know works. What I meant was that if you want to attach a small example program that illustrates the problem you were having initially, then I'll look at it.

in reply to: ↑ 11   Changed 5 weeks ago by mweth

> I have compiled the program as you suggested - I made a separate VS project -

Something else that just occurred to me: if you used your own project rather than the console sample's one, you must remember to compile it for the console subsystem. This is because the program uses cin and cout, and when Windows launches a GUI program it leaves those unopened.

Anyway, it seems like this test program is just distracting you from your original issue, so attach your own test program if you would like me to look at it.

follow-up: ↓ 15   Changed 5 weeks ago by widgets

Thanks for pointing that out; I had forgotten about that detail and it does make a big difference for running the program.

As my latest test, I did copy the console project and then substituted your console.cpp.

If I compile it as is, it compiles as ANSI and it runs as you found.

If I compile it as Unicode, by defining the extra preprocessor definition wxUSE_UNICODE and changing the linker input libs to wxbase28ud_net.lib wxbase28ud.lib (I deleted the odbcxxx.lib since I didn't compile that option), then it does not run and I get the output

08:31:55: Error: Invalid regular expression 'Received: from ([[:alnum:][:punct:][:space:]]+?) for <': brackets [] not balanced
08:31:55: Debug: ..\..\src\common\regex.cpp(642): assert "IsValid()" failed in wxRegEx::Matches(): must successfully Compile() first

Another sample program is Dave Silvia's regexTester; last I tried it the Unicode version showed similar problems while the ANSI build worked as advertised.

in reply to: ↑ 14 ; follow-up: ↓ 16   Changed 5 weeks ago by widgets

Replying to widgets:

> Thanks for pointing that out; I had forgotten about that detail and it does make a big difference for running the program.
>
> As my latest test, I did copy the console project and then substituted your console.cpp.
>
> If I compile it as is, it compiles as ANSI and it runs as you found.
>
> If I compile it as Unicode, by defining the extra preprocessor definition wxUSE_UNICODE and changing the linker input libs to wxbase28ud_net.lib wxbase28ud.lib (I deleted the odbcxxx.lib since I didn't compile that option), then it does not run and I get the output
>
> {{{
> 08:31:55: Error: Invalid regular expression 'Received: from ([[:alnum:][:punct:][:space:]]+?) for <': brackets [] not balanced
> 08:31:55: Debug: ..\..\src\common\regex.cpp(642): assert "IsValid()" failed in wxRegEx::Matches(): must successfully Compile() first
> }}}
>
> Another sample program is Dave Silvia's regexTester; last I tried it the Unicode version showed similar problems while the ANSI build worked as advertised.

I'm hoping this is how I edit my earlier reply.

After working with my own program a bit more, I've decided to simply stick with extended RE since I was able to make things work, but in the process I realized that I did have a couple of problems with the little test program you had made up for me.

One - I was using VS Express 2005 instead of 2008 and - most importantly I was using the ANSI regex lib, instead of wxregexu[d].lib.

After I changed that the program now runs as expected.

I apologize for the trouble I've put you to and I very much appreciate your taking the time to resolve this issue.

in reply to: ↑ 15   Changed 5 weeks ago by mweth

> One - I was using VS Express 2005 instead of 2008 and - most
> importantly I was using the ANSI regex lib, instead of
> wxregexu[d].lib.
>
> After I changed that the program now runs as expected.

Terrific!

> I apologize for the trouble I've put you to and I very much
> appreciate your taking the time to resolve this issue.

No, the example you attached did indeed show the problem; I should
have looked at it properly, and then I could have said something more
helpful.

Anyway, I'm glad you solved it!
