#18137 closed defect (fixed)

WxThread: TestDestroy() and Pause() may cause MS debug version of free() to wait forever

Reported by: andreamrl Owned by: Vadim Zeitlin <vadim@…>
Priority: normal Milestone:
Component: base Version: 3.0.2
Keywords: wxThread Pause TestDestory Cc:
Blocked By: Blocking:
Patch: yes

Description (last modified by vadz)

I think I've found a really cumbersome bug that, in certain specific conditions, makes calling wxThread::Pause() unsafe if that wxThread calls TestDestroy() in its loop (that is indeed what it is expected to do)..

The "specific conditions" I'm talking about are:
I'm on Windows (ver 8), I'm compiling on MSVC 2013, and apparently I'm using debug version of malloc()/free() that, AFAICT, perform sanity checks i.e. calling RtlValidateHeap() internally.

The scenario is the following:
I have a wxThread, that at a certain point I want to stop. Before really stopping that wxThread by calling its Pause() method (from the main thread) I make sure that my wxThread gives up of doing its stuff (so it does not hold any wxMutex, etc..), and I force it to basically only loop over TestDestroy(). Note that wxWidget documentation advise to make sure that a wxThread does periodically call TestDestroy to make sure Pause() does work.

Very often happens that:
-I stop my wxThread calling its Pause() method in the main thread
-That thread stops
-Then in the main thread I call free() on a buffer I want to get rid of, and the main thread remains stuck forever inside the free() function.

What seems is happening here is that:
-The wxThread stops inside TestDestroy(), more specifically after it has entered a CriticalSection, as shown in call stack [1]. So the thread is stopped inside a CriticalSection - that seems wrong to me.

-The free() function in turn calls debug sanity check stuff i.e. HeapValidate() that in turns try entering a CriticalSection, as shown in call stack [2], but it remains stuck there because of the wxThread already stopped there

It seems to me that, because wxWidget documentation advise to call TestDestroy in a wxThread on which we want to be able to call Pause(), then wxWidget implementation of TestDestroy() must guarantee not to hold a lock when the thread becomes halted by calling wxPause().

Thanks,
Andrea

[1] Call stack of my wxThread worker

 	ntdll.dll!RtlpAllocateHeap()	Unknown
 	ntdll.dll!RtlAllocateHeap()	Unknown
 	ntdll.dll!RtlDebugAllocateHeap()	Unknown
 	ntdll.dll!RtlpAllocateHeap()	Unknown
 	ntdll.dll!RtlAllocateHeap()	Unknown
 	ntdll.dll!RtlpAddDebugInfoToCriticalSection()	Unknown
 	ntdll.dll!RtlpWaitOnCriticalSection()	Unknown
>	ntdll.dll!RtlpEnterCriticalSectionContended()	Unknown
 	graph64.exe!wxCriticalSection::Enter() Line 165	C++
 	graph64.exe!wxCriticalSectionLocker::wxCriticalSectionLocker(wxCriticalSection & cs) Line 308	C++
 	graph64.exe!wxThread::TestDestroy() Line 1253	C++
 	graph64.exe!ClientThread::Entry() Line 91	C++
 	graph64.exe!wxThread::CallEntry() Line 356	C++
 	graph64.exe!wxThreadInternal::DoThreadStart(wxThread * thread) Line 574	C++
 	graph64.exe!wxThreadInternal::WinThreadStart(void * param) Line 606	C++
 	msvcr120d.dll!_callthreadstartex() Line 376	C
 	msvcr120d.dll!_threadstartex(void * ptd) Line 359	C
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown

[2] Call stack of the main thread:

 	ntdll.dll!NtWaitForSingleObject()	Unknown
 	ntdll.dll!RtlpWaitOnCriticalSection()	Unknown
 	ntdll.dll!RtlpEnterCriticalSectionContended()	Unknown
 	ntdll.dll!RtlValidateHeap()	Unknown
 	KernelBase.dll!HeapValidate()	Unknown
 	msvcr120d.dll!_CrtIsValidHeapPointer(const void * pUserData) Line 2034	C++
 	msvcr120d.dll!_free_dbg_nolock(void * pUserData, int nBlockUse) Line 1322	C++
 	msvcr120d.dll!_free_dbg(void * pUserData, int nBlockUse) Line 1265	C++
 	msvcr120d.dll!free(void * pUserData) Line 50	C++
>	graph64.exe!LedPanel::Stop() Line 276	C++
 	graph64.exe!MainWindow::Stop_() Line 478	C++
 	graph64.exe!MainWindow::Stop() Line 466	C++
 	graph64.exe!MainWindow::OnStop(wxCommandEvent & __formal) Line 515	C++
....

Change History (8)

comment:1 Changed 19 months ago by vadz

  • Description modified (diff)
  • Resolution set to invalid
  • Status changed from new to closed

There are 2 things here: first, Pause() (and also Destroy()) and TestDestroy() are pretty poor APIs that I don't recommend using at all. I don't know why do you think you need to call Pause() to stop the thread, but I'm almost certain that this is not the case, so the real solution is to just stop doing this.

Second, TestDestroy() being blocked trying to acquire the critical section just means that another thread is currently inside this critical section. I don't see which one is it, but you should be able to debug it easily by checking the owning thread field of the critical section struct. You can't have a deadlock with just a single synchronization object, so the heap critical section inside ntdll.dll doesn't explain the problem on its own. But you need to debug this further to understand what does or, alternatively, reproduce the problem with minimal changes in the thread sample. Unless you can do it, there doesn't seem to be any reasons to believe that the problem is really in wxThread code rather than in your application and I can't debug the latter, sorry.

comment:2 Changed 19 months ago by andreamrl

  • Cc andrea.merello@… added

Hi

Thank for your recommendation; but may you please suggest which better API is one supposed to use in order to suspend a wxThread instead of calling Pause() ?

.. The reason I'm using this poor API is just because, according to the documentation, it seems the only one available..

Still, as long as this API is available and documented, I suppose that it should also work correctly.. Otherwise IMHO it would be better to remove it..

I will try to produce a minimal example that reproduces the issue, although I'm not sure it will successfully hit the bug on all systems (possibly this is timing dependant).

BTW I'm also not sure that I've explained this thing well enough, because you talk about a deadlock, but indeed I'm not talking about any deadlock here (as you said, it would require two locks i.e. taken in the wrong order). And it's not TestDestroy() blocked trying to acquire the critical section.

Instead, it seems to me that TestDestroy() successfully acquires the critical section, and, after that, the wxThread gets suspended by the Pause() call, while it is still inside the critical section. That is the wxThread will never exit the critical section (until we resume the thread), causing any other attempt to enter the critical section to wait forever.

Note that DOC says that on Win32 implementation, Pause() could stop the thread anywhere (so on this platform TestDestroy() does not probably matter for the purpose of making Pause() working, still we probably need to call it for ensuring Pause() works on other platforms).

So it seems that Pause() ends up in stopping the wxThread (often) in TestDestroy() just by chance and because of how the code is made (basically looping on TestDestroy() only when we are about to suspend the wxThread).

IMHO The point here is that:

  • One must call TestDestroy() to ensure Pause() works on all platforms.
  • TestDestroy() does take a critical section.
  • On platforms in which Pause() could stop the thread anywhere, it seems that it could happen that the wxThread gets suspended while it is in the critical section (inside TestDestroy).

.. And this seems pretty wrong to me. It seems a no-way-out. The fact that in my case the free() function get stuck seems just one possible outcome of this..

Andrea

comment:3 Changed 19 months ago by vadz

  • Resolution invalid deleted
  • Status changed from closed to reopened

Pause() is indeed the only way to suspend a thread, but you wrote that you wanted to stop it and I still don't see why do you need to suspend it first?

But you're right that I misunderstood you originally, sorry, maybe there is indeed a bug with a critical section being held by suspended thread here. Just to check that I understand you correctly now: the sequence of actions here is

  1. Worker thread calls TestDestroy()
  2. Main thread calls Pause()
  3. Worker tries to acquire wxThread::m_critsect...
  4. ... but main gets m_critsect first
  5. Which makes worker start waiting on the CS and it enters into some CS inside system heap code in order to do it.
  6. And suspends the worker

If so, it's not obvious what could we do to avoid this. The only solution I see is to use an atomic variable for the state instead of protecting it with a CS, this should take care of this scenario.

Alternatively, and simpler, we might use TryEnter() -- it's not like using Enter() really buys us anything as there is unavoidable race condition here anyhow, because the actual state can always change between TestDestroy() call and when its result is used.

So, all this being said, could you please check if this patch:

  • src/msw/thread.cpp

    diff --git a/src/msw/thread.cpp b/src/msw/thread.cpp
    index 0805e1e19a..62c0c2b577 100644
    a b bool wxThread::IsPaused() const 
    11951195
    11961196bool wxThread::TestDestroy()
    11971197{
    1198     wxCriticalSectionLocker lock(const_cast<wxCriticalSection &>(m_critsect));
     1198    wxCriticalSection& cs = const_cast<wxCriticalSection &>(m_critsect);
     1199
     1200    bool cancelled;
     1201    if ( cs.TryEnter() )
     1202    {
     1203        cancelled = m_internal->GetState() == STATE_CANCELED;
     1204        cs.Leave();
     1205    }
     1206    else // Another method of this class is changing the state right now.
     1207    {
     1208        // Assume we hadn't been cancelled yet: this is less wrong than
     1209        // assuming that we had been and we can't start blocking on the CS now
     1210        // as the other function inside the CS right now could be Pause(), and
     1211        // if it suspends this thread while we're waiting on a CS, it could
     1212        // result in deadlocks later as the code inside Windows blocking on the
     1213        // CS itself acquires some other CS inside heap-related code in
     1214        // ntdll.dll and we shouldn't be suspended while inside this CS.
     1215        cancelled = false;
     1216    }
    11991217
    1200     return m_internal->GetState() == STATE_CANCELED;
     1218    return cancelled;
    12011219}
    12021220
    12031221// ----------------------------------------------------------------------------

fixes the problem for you?

comment:4 Changed 19 months ago by andreamrl

  • Cc andrea.merello@… removed

Sorry, it's my fault: I've wrote that I wanted to stop the wxThread, but I meant I wanted to suspend it..

The scenario you depicted is even more cumbersome than the one I have imagined..

I'm not honestly exactly sure about how to interpret the call stack in terms of 'who does succeed entering CI and who does wait'; but seeing the call NtWaitForSingleObject() only on the main thread let me suppose that the main thread has to wait, while the wxThread entered flawlessy in the CI.

IMHO it can simply happen that the wxThread gets suspended in the critical section if either:

  • Pause() missed a critical section enter/exit - My 1st thought, but I see that's not the case.
  • SuspendThread() Win32 API does not guarantee that the thread is already really suspended when SuspendThread() returns. I place my bet here.

So maybe:
1- main thread calls Pause(), enters CI and calls SuspendThread(), then exit CI.
2- wxThread still lives enough to enter TestDestroy(), enter CI, and then it gets really suspended.

SuspendThread() doc is not clear on it, but googling I found other people speculating about it being asynchronous().

https://blogs.msdn.microsoft.com/oldnewthing/20150205-00/?p=44743

Said that, I'll test your patch ASAP (that might be tomorrow) and I'll tell you the outcome.

Andrea

comment:5 Changed 17 months ago by vadz

  • Description modified (diff)

I think you're right and SuspendThread() is indeed async. However this means that we basically can't ensure that the thread doesn't get suspended inside a CS, so nothing can be done about it...

Unless we use the trick mentioned in Raymond Chen's post above and call GetThreadContext(). So what about the following patch instead?

  • src/msw/thread.cpp

    diff --git a/src/msw/thread.cpp b/src/msw/thread.cpp
    index 0805e1e19a..5c94613f46 100644
    a b bool wxThreadInternal::Suspend() 
    875875        return false;
    876876    }
    877877
     878    // Calling GetThreadContext() forces the thread to actually be suspended:
     879    // just calling SuspendThread() is not enough, it just asks the scheduler
     880    // to suspend the thread at the next opportunity and by then we may already
     881    // exit wxThread::Pause() and leave m_critsect, meaning that the thread
     882    // could enter it and end up suspended inside a CS, which will inevitably
     883    // result in a deadlock later.
     884    CONTEXT ctx;
     885    // We don't really need the context, but we still must initialize it.
     886    ctx.ContextFlags = CONTEXT_FULL;
     887    if ( !::GetThreadContext(m_hThread, &ctx) )
     888    {
     889        wxLogLastError(wxS("GetThreadContext"));
     890    }
     891
    878892    m_state = STATE_PAUSED;
    879893
    880894    return true;

comment:6 Changed 17 months ago by vadz

  • Patch set
  • Status changed from reopened to confirmed

comment:7 Changed 14 months ago by vadz

If there are no objections, I'm going to merge this PR soon.

comment:8 Changed 13 months ago by Vadim Zeitlin <vadim@…>

  • Owner set to Vadim Zeitlin <vadim@…>
  • Resolution set to fixed
  • Status changed from confirmed to closed

In 5d06593ae/git-wxWidgets:

Ensure that wxThread::Pause() suspends the thread immediately

Surprisingly, ::SuspendThread() doesn't actually do it, but only
schedules the thread for suspension in some undetermined near future.
This could result in problems as Pause() could exit, releasing the CS
protecting wxThread internal data in the main thread, and allowing the
worker thread to enter it (e.g. in its TestDestroy()) before being
actually suspended, meaning that the thread got suspended inside a CS,
which resulted in later deadlocks.

Fix this by calling ::GetThreadContext() which is supposed to ensure
that the scheduler does really suspend the thread before it returns (as
it's impossible to get the context of a thread while it's running).

Closes #18137.

Note: See TracTickets for help on using tickets.