Sunday, February 19, 2012

PyGTK - DrawingArea, Notebook Tabs, and Deadlocks

Despite my previous rant on GTK, I still need to finish up some work with this toolkit. For example, today we'll briefly discuss yet another example of the bugginess of this toolkit: random deadlocks and "crash by assertions".


Yesterday, I wasted quite a bit of time trying to debug some intermittent, non-deterministic freezes that showed up on one of my test machines but on none of the others I was using. In these cases, the main window would sometimes freeze (showing just a blank grey screen) while loading.

While trying to reproduce these problems by clamping down on any places where threading issues might have arisen, I happened to notice a far worse bug. All of the widgets on the main window (scrollbars, tabs, sliders) were non-responsive, with clicks in certain places causing non-recoverable glitches. Yet, at the same time, a few widgets (toolbar buttons, menu items) still worked! To make matters even weirder, all the controls on a secondary window worked flawlessly throughout.

After hours of fruitlessly retracing the parts I had believed were problematic, commenting out things added shortly before I noticed the bug, and checking out different revisions, I finally narrowed the problem down to a particular commit. The culprit, I eventually found, was in the least expected place!

Custom Widgets and Deadlocks...
I had written a custom widget using gtk.DrawingArea as a base. This custom widget was included as part of a layout that went on a "Notebook" tab thingy (*). The Notebook was being used to emulate the "CardLayout" from Java Swing (with no tabs shown, and no border either); the tab that the custom widget went on was a secondary tab that got shown "later" in response to some other actions taking place elsewhere. The layout manager hosting the Notebook had a show_all() call when everything was in place.
(*) TBH, I can't understand why they chose such a strange name for this, though then again, there's always confusion about what to call the widget which hosts several tabs, the thingies which let you select a tab, and whether a tab includes the content area too. For clarity, I usually refer to them as TabHost, TabLabel/TabPane, and Tab.
As it turns out, calling set_size_request() on the custom widget results in a deadlock situation: all the other widgets which happen to be on the same screen layout as that widget stop taking any events, so nothing responds. I haven't been able to find anything else about this problem, so for now I've just commented out this call to get things working, although the widget draws a bit too large or too small at times.

Crash by Assertion
Another one of the problems I've been running into is this cryptic error:
Gdk:ERROR:gdkregion-generic.c:1114:miUnionNonO: assertion failed: (r->x1 < r->x2)
Apparently this is quite infamous in a few other GTK apps, in particular in some of their File-A-Bug systems. Although I haven't been able to pinpoint exactly when this occurs, it seems to happen a lot when I have some thread continuously updating some UI widgets and then put another window (e.g. the console associated with the app) partially over the app for a few seconds.
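
One common way to keep a worker thread from touching widgets directly is to have it post updates to a queue that only the UI thread drains. The following is a minimal stdlib-only sketch of that hand-off, not a confirmed fix for the assertion above. The names worker, drain and run_demo are hypothetical, and a plain list stands in for the widget. It is written against Python 3 (on Python 2 the module is called Queue). In a PyGTK app, something like gobject.timeout_add() or gobject.idle_add() could be used to call drain() from the main loop.

```python
import queue
import threading


def worker(updates, n_items):
    """Background thread: produce values, but never touch a widget directly."""
    for i in range(n_items):
        updates.put(("progress", i))
    updates.put(("done", None))


def drain(updates, apply_update):
    """Run on the UI thread: apply every queued update.

    Returns True if the worker's "done" marker was seen during this call.
    """
    finished = False
    while True:
        try:
            kind, value = updates.get_nowait()
        except queue.Empty:
            return finished
        if kind == "done":
            finished = True
        else:
            apply_update(value)


def run_demo(n_items=5):
    updates = queue.Queue()
    seen = []  # stands in for a widget being updated
    thread = threading.Thread(target=worker, args=(updates, n_items))
    thread.start()
    # The demo simply waits for the worker; a real UI would instead call
    # drain() periodically from its main loop.
    thread.join()
    finished = drain(updates, seen.append)
    return finished, seen
```

The point of the design is that only one thread ever calls apply_update, so the toolkit never sees concurrent access to the same widget.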


I've said it many times, but software released for people to use should NEVER result in "crash by assertion". FFMPEG used to do this (I don't know about now, since I haven't needed to do another FFMPEG Codec+Container dance to find a working combo for a while): some obscure MSVC "Assertion Failure" error msgbox would pop up when trying to use certain codec/container combos.

I understand that some people (especially prominent among academics who are really really into pre/post-conditions and formal proofs of everything) consider assertions beneficial. Indeed, they probably do have uses, but only during the development phase on the particular software developer's computer. For libraries used by other applications, this means that assertions should NOT cause problems and/or crashes, even if the situation is impossibly convoluted. However, for some strange reason, I've found that they are never really isolated to "debug" builds only, as some manuals and specs would have you believe.

Impossible Shutdown
Another bug I've been fighting with this toolkit is applications not shutting down properly. Namely, the console doesn't go away after gtk.main_quit() and all other shutdown activities have been performed, even though nothing appears and zero CPU activity is occurring.
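
One possible cause of this kind of hang (an assumption, not something confirmed here) is a background thread that is still alive: the Python interpreter waits for every non-daemon thread before exiting. Below is a minimal sketch of a worker with an explicit stop flag, plus a helper that lists any threads that would block shutdown. The names Poller and leftover_threads are hypothetical, and the code targets Python 3.

```python
import threading


class Poller(threading.Thread):
    """Hypothetical background worker with an explicit stop flag."""

    def __init__(self, interval=0.01):
        super().__init__()
        # A daemon thread cannot keep the interpreter alive on its own.
        self.daemon = True
        self._stop_requested = threading.Event()
        self.interval = interval
        self.ticks = 0

    def run(self):
        # wait() returns True as soon as stop() sets the flag,
        # so the loop ends promptly instead of sleeping a full interval.
        while not self._stop_requested.wait(self.interval):
            self.ticks += 1

    def stop(self, timeout=1.0):
        """Ask the thread to finish; return True if it has actually exited."""
        self._stop_requested.set()
        self.join(timeout)
        return not self.is_alive()


def leftover_threads():
    """Names of non-daemon threads (other than the main one) still running."""
    main = threading.main_thread()
    return [t.name for t in threading.enumerate()
            if t is not main and not t.daemon]
```

Calling stop() on each worker before quitting the main loop, and then printing leftover_threads(), would at least show whether a lingering thread is what keeps the console open.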

Somehow, I managed to get this to work eventually on the lab octocore I was using for a while, but this didn't hold for my own machine or the test machines this stuff needs to work on. Gah!

I've recently been dealing with quite a few really insane, non-deterministic, difficult-to-pinpoint concurrency bugs. If anyone knows of any good static analysers for Python code which can plot these things out graphically (and preferably highlight the problem spots), I'd be interested in hearing about them!
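
This is not a static analyser, but as a runtime aid, the standard library can report where every Python thread currently is, which sometimes helps when a freeze is in progress. A minimal sketch, assuming CPython (sys._current_frames() is a CPython-specific function); the name format_thread_stacks is hypothetical.

```python
import sys
import threading
import traceback


def format_thread_stacks():
    """Return a text report with the current call stack of every Python thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for ident, frame in sys._current_frames().items():
        lines.append("Thread %s (%s):" % (names.get(ident, "?"), ident))
        # format_stack() yields entries ending in newlines; strip them
        # so the report can be joined uniformly.
        lines.extend(entry.rstrip("\n") for entry in traceback.format_stack(frame))
    return "\n".join(lines)
```

The report only covers Python-level frames, so a thread stuck inside C code shows just the Python call that entered it.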

1 comment:

  1. Hello Aligorith

    It's been a while since I last looked at PyGTK, although I think that it's been abandoned in favor of GObject introspection.

    Anyway, from my experience, even though I think the GTK API is annoying, verbose etc... it did work in the end (but I finally switched to Qt for other reasons).

    I think you are stacking layers of potential errors. Firstly, Python has its ways of doing things that don't always play well with the C objects it wraps (each object has two faces: the Python face and the C face and, for example, it can occur that one of them gets deleted and not the other -> crash). You also add reference counting that can prevent proper deletions (use weakrefs).

    Secondly, threading an application can be a headache. Threading a graphical application is even worse (take care never to create a widget outside the main thread, and use events to communicate from a secondary thread to the main thread).

    You're mixing both so it can become touchy! I don't know how you work, but here's what I'd do:
    - in "A references B and B back-references A" situations, if A doesn't strictly need B to live, make the back reference a weak reference.
    - as a general rule, if you can use a weak ref instead of a strong ref in a complex system, then use the weak ref.
    - all widgets in main thread
    - "loosely coupled communication" == signal-based communication from non-main threads (simply puts a new event on the event loop of the main thread, which is then directed to the correct widget)
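
The weak back-reference suggestion in the list above can be sketched as follows. This is a minimal illustration using the standard weakref module; the classes Panel and Child are hypothetical stand-ins for "A" and "B", not anything from PyGTK.

```python
import weakref


class Panel:
    """Hypothetical parent object (A) that owns its children strongly."""

    def __init__(self):
        self.children = []

    def add(self, child):
        self.children.append(child)
        child.set_parent(self)


class Child:
    """Hypothetical child (B) whose back-reference to the parent is weak."""

    def __init__(self):
        self._parent_ref = None

    def set_parent(self, parent):
        # Storing a weak reference avoids a Panel <-> Child reference cycle.
        self._parent_ref = weakref.ref(parent)

    @property
    def parent(self):
        # Calling the weak reference returns None once the parent is gone.
        return self._parent_ref() if self._parent_ref is not None else None
```

Because the child does not keep the parent alive, dropping the last strong reference to the Panel lets it be freed, and code that uses child.parent has to handle the None case.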

    I also experienced some issues with __del__ methods. In the end I removed all of them. If I remember correctly, in some situations they prevent proper deletion of the objects that implement them. They are also bad for inheritance.

    These are my ideas, and sorry, no I don't know any good static analyser that could help for this :). However, I use GDB to debug programs which mix Python and C[++] and it turns out to be just enough for me.