Opened 5 years ago

Closed 4 years ago

#10375 closed optimization (fixed)

Inconsistent String comparison results

Reported by: ZenJu Owned by:
Priority: high Milestone: 2.9.2
Component: wxMSW Version: 2.8.9
Keywords: locale collation regression Cc:
Blocked By: Blocking:
Patch: no

Description

Hi,

I'm programming on a tool where performance of string comparison is extremely important. I had thought that the implementation of

int wxString::CmpNoCase(const wxString& s) const

was reasonably fast.

But then I tried the Windows API

int lstrcmpi(
LPCTSTR lpString1, address of first string
LPCTSTR lpString2
address of second string
);


which is about 50 % faster in my tests. So my suggestion is, to simply replace the implementation in the windows build with the second one.

Best regards, ZenJu

Attachments (1)

string.patch download (764 bytes) - added by ZenJu 5 years ago.
Performance optimization for wxString::CmpNoCase()

Download all attachments as: .zip

Change History (22)

comment:1 Changed 5 years ago by vadz

  • Keywords simple added
  • Milestone 2.8.9 deleted
  • Priority changed from normal to low
  • Status changed from new to confirmed

We should use strcmpi() or strcasecmp() here, I really don't know why are we reimplementing such basic functionality ourselves. And strcmpi() should be about the same as lstrcmpi().

If you could please make a patch to the trunk version of CmpNoCase() doing this, it would be appreciated.

Changed 5 years ago by ZenJu

Performance optimization for wxString::CmpNoCase()

comment:2 Changed 5 years ago by ZenJu

Hi Vadim,

I've created a patch for the windows build.

Best regards, ZenJu

comment:3 Changed 5 years ago by VZ

  • Resolution set to fixed
  • Status changed from confirmed to closed

(In [58155]) use wcs(n)casecmp() if available; use wxStricmp() to implement wxString::CmpNoCase() as it's significantly more efficient than wx code (closes #10375)

comment:4 Changed 5 years ago by ZenJu

  • Resolution fixed deleted
  • Status changed from closed to reopened

Hi,

in my testcase I am sorting an array of 50000 strings and do a binary search afterwards. When I use wxStricmp() as compare function I get an (average) CPU-time of 2400 ms, when I use lstrcmpi() 1700 ms.

I don't know why, but the windows variant of case-insensitive string comparison still is a lot faster. Can you confirm this?

Regards, ZenJu

comment:5 Changed 5 years ago by vadz

  • Status changed from reopened to infoneeded_new

What compiler do you use? wxStricmp() is just an inline function calling wxCRT_StricmpW() which should be just _wcsicmp() with MSVC so it should be really just as efficient (at least in release build where inline functions are effectively inlined). For other Windows compilers we could indeed use lstrcmpi[AW]() for wxCRT_Stricmp[AW]() implementation, and implement just wxCRT_Strncimp[AW]() in wx itself...

comment:6 Changed 5 years ago by ZenJu

  • Status changed from infoneeded_new to new

Hi Vadim,

I am using Mingw based on gcc 3.4.5 and have tested in release build. You're right, it just redirects to _wcsicmp():

wchar.h:
_CRTIMP int cdecl MINGW_NOTHROW _wcsicmp (const wchar_t*, const wchar_t*);

It seems _wcsicmp() is slower than lstrcmpi() at least for GCC. If this advantage is only for GCC I think it would be nice to redirect to the latter in case of the GCC compiler.

comment:7 Changed 5 years ago by vadz

  • Resolution set to wontfix
  • Status changed from new to closed

In fact mingw32 _wcsicmp() should be the same one as MSVC one as mingw32 uses MSVC runtime. And it's amazing that it should be noticeable slower than the Windows function, I don't know how to explain it. But I'm still reluctant to replace it with lstrcmpi() as it doesn't return quite the same thing, see its docs in the MSDN, notably the part about "using word sort rather than string sort". And as I'm not sure that this is what we want I prefer to be safe even if it's slower...

comment:8 Changed 5 years ago by ZenJu

  • Resolution wontfix deleted
  • Status changed from closed to reopened

Hi vads,

I have a simple example for testing:

    wxString c = wxT("afds902rlkjdafasdfasdfklj2309f23f32fsadfasdfsadf3r3rfr3f3fasdfsdf");
    wxString b = wxT("afds902rlkjdafasdfasdfklj2309f23f32fsadfasdfsadf3r3rfr3fdfasdfsdf");

    for (int i = 0; i <= 10000000; ++i)
        if (lstrcmpi(c.c_str(), b.c_str()) == 0)
            throw;

    for (int i = 0; i <= 10000000; ++i)
        if (_wcsicmp(c.c_str(), b.c_str()) == 0)
            throw;

In my tests the second loop takes even 5 times longer than the first one. So if you can confirm this, I think it would be a great performance optimization.

PS: concerning word sort: CompareString() with flag SORT_STRINGSORT is nearly just as fast as lstrcmpi() and could be used instead.

comment:9 Changed 5 years ago by vadz

  • Keywords optimization added
  • Milestone set to 2.9.1

[Sorry for taking so much time to get back to this]

I'd like to be able to verify the results because this continues to look very surprising to me, I feel that there must be something we're missing here that explains such huge differences between Windows and CRT versions. So it would be very helpful if you could make a patch to tests/benchmarks/strings.cpp adding the different cases. Otherwise I'll try to find time for this myself later.

comment:10 Changed 5 years ago by ZenJu

Hi,

I don't know how to setup wxWidgets benchmarks, but here's the windows function I use for fast case-insensitive string comparison. Hope it helps:

int compareStringsWin32(const wchar_t* a, const wchar_t* b, const int aCount =  -1, const int bCount = -1)
{
    const int rv = CompareString(
                       LOCALE_USER_DEFAULT, //locale identifier
                       NORM_IGNORECASE | SORT_STRINGSORT,	  //comparison-style options
                       a,  	            //pointer to first string
                       aCount,	            //size, in bytes or characters, of first string
                       b,	            //pointer to second string
                       bCount); 	     //size, in bytes or characters, of second string
    if (rv == 0)
        throw RuntimeException(wxString(wxT("Error comparing strings!")));
    else
        return rv - 2; //convert to C-style string compare result
}

Regards, ZenJu

comment:11 Changed 5 years ago by VZ

(In [60001]) handle embedded NULs correctly in wxString::CmpNoCase() in all builds and not just UTF-8 one; incidentally improve its performance under Windows (see #10375)

comment:12 Changed 5 years ago by vadz

  • Resolution set to fixed
  • Status changed from reopened to closed

Actually using CompareString() is more correct than using _wcsicmp() as the latter doesn't handle embedded NULs while the former does. So I do use it now, please let me know if you still have any problems.

Thanks!

comment:13 Changed 5 years ago by ZenJu

Hi,

just one small thing: I see the call to CompareString() looks like this:

    switch ( ::CompareString(LOCALE_USER_DEFAULT, NORM_IGNORECASE, 
 		                             m_impl.c_str(), m_impl.length(), 
 		                             s.m_impl.c_str(), s.m_impl.length()) ) 
 		    {

I suggest you add the parameter SORT_STRINGSORT. In my software FreeFileSync I had some very hard to debug error which turned out to be some issue with string sorting for the swedish locale, where CompareString() did not behave as a strikt weak ordering function when used in "word-sort" mode (without the SORT_STRINGSORT parameter.)

Regards, ZenJu

comment:14 Changed 5 years ago by ZenJu

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:15 Changed 5 years ago by vadz

  • Status changed from reopened to infoneeded_new

Could you please provide more details? From reading the documentation it seems that we want to use the (usual) "word sort" rather than "string sort" and it certainly doesn't mention that the former doesn't define a strict weak ordering...

TIA!

comment:16 Changed 5 years ago by ZenJu

  • Status changed from infoneeded_new to new

The issue was that std::sort() for a vector of strings did not work as expected when using CompareString() with word sort as sorting function. Unfortunately this error seemed to be locale dependent, if I either used parameter SORT_STRINGSORT or LOCALE_INVARIANT it did not occur. I couldn't reproduce it on my PC but had to do a remote analysis with a user from sweden.

I had a look on the MSN page http://msdn.microsoft.com/en-us/library/ms647476.aspx comment "Inconsistent results when comparing Thai". So I think there is some risk that word sort could introduce some errors concerning strikt weak ordering.

-ZenJu

comment:17 Changed 4 years ago by VZ

  • Resolution set to fixed
  • Status changed from new to closed

(In [63941]) Use string sort order with CompareString() in wxString::CmpNoCase().

Using the default word sort order may fail to define a strict weak order using
this function, thus breaking algorithms such as std::sort which rely on its
properties. It's also more consistent with the fallback manual implementation.

Closes #10375.

comment:18 Changed 4 years ago by vadz

  • Keywords locale collation regression added; simple optimization removed
  • Milestone changed from 2.9.1 to 2.9.2
  • Priority changed from low to high
  • Resolution fixed deleted
  • Status changed from closed to reopened
  • Summary changed from Performance improvement: String comparison to Inconsistent String comparison results

Unfortunately using SORT_STRINGSORT was a mistake. Now when comparing "!" and "'" we get different results depending on whether the comparison is case-sensitive or not (which doesn't make any sense as neither string is case-sensitive) and also different results under MSW (where "'" now comes first) and elsewhere (where "!" comes first, as per ASCII ordering).

Apparently removing this flag is not an option neither because of the (albeit still unclear/irreproducible) problem of comment:16. So I think we should just forget about using ::CompareString() and use inefficient but correct manual version.

Any other ideas?

comment:19 Changed 4 years ago by ZenJu

Hi,

CompareString with SORT_STRINGSORT has other issues as well, for example special German characters are handled incorrectly:
"weiß" and "weiss" are evaluated as the same string.

I myself have the usecase of requiring an efficient and correct string comparison for filenames. I don't know if this is too much a restriction for wxString, but for a case-insensitive comparison of filenames the fastest working functions are:

LCMapString + binary comparison: existing with Windows 2000 upwards
CompareStringOrdinal : the fastest function I know of, unfortunately not before Windows Vista.

Regards, ZenJu

comment:20 Changed 4 years ago by disc

Since 2.9 r63941 I'm seeing unexpected results with sorting as well. My situation is static (translation) tables which must have a sorted order and the startup integrity test fails on MSW after CmpNoCase started using CompareString.
Similar to comparing ! and ', I have problems with \n (10) and a space (32) :

wxASSERT(wxT('\n') < wxT(' ')); // OK
wxASSERT(wxString(wxT("\n")).Cmp(wxT(" ")) < 0); // OK
wxASSERT(wxStricmp(wxT("\n"), wxT(" ")) < 0); // OK

wxASSERT(wxString(wxT("\n")).CmpNoCase(wxT(" ")) < 0); // assert on wxMSW 2.9 (Ansi and Unicode)

(I could add some unit tests if we agree the current MSW behaviour is not desired).

The same occurs with at least using \r and \t instead of \n. And it happens regardless of using LOCALE_INVARIANT and/or SORT_STRINGSORT as well as a bunch of other flag combinations I tried. The problem is solved when I use the manual comparison in CmpNoCase.
I would vote for consistency (among platforms and wx versions) rather than efficiency.

In addition maybe there's also a place for 'locale/GUI sorting' (sometimes you want "weiß" and "weiss" to be equal), for example in the form of extra functions CmpLocale and CmpLocaleNoCase (which for non-MSW platforms would now just call the Cmp and CmpNoCase functions).

comment:21 Changed 4 years ago by VZ

  • Resolution set to fixed
  • Status changed from reopened to closed

(In [65572]) Don't use native MSW functions in wxString::CmpNoCase().

While the native CompareString() is much more efficient than MSVC CRT version
of _wcsicmp(), it gives unexpected results for non-letter characters, so don't
use it but use the slow but correct wxStricmp() instead.

At least don't use char-by-char comparison (in non-UTF-8 case) as it's the
slowest possible implementation of this function, the new one using
wxStricmp() is 3 times faster (by comparison, using CompareString() is 16
times faster still -- but wrong).

Closes #10375.

Note: See TracTickets for help on using tickets.