Older blog entries for IlyaM (starting at number 4)

23 Aug 2007 (updated 24 Aug 2007 at 10:21 UTC) »

libxml++ vs xerces C++


When I was reading "API: Design Matters" I recalled an example of a good API versus a bad one. Actually my example is more about good API documentation versus bad API documentation, but I suspect the two are correlated. It is definitely hard to write good documentation if your API sucks.

So my story is that I had a task to read XML data in a C++ application. The XML data was small and the performance of this part of the application was not critical, so the simplest way to read the data looked like loading a DOM tree for the XML document and using the DOM API, maybe with a couple of simple XPath queries. It was the first time I needed to do this in C++; I had no previous experience with any C++ XML libraries. So I do a Google search (or maybe it was apt-cache search, I don't remember) and the first thing I find is Xerces-C++. Quote from the project's website:
Xerces-C++ makes it easy to give your application the ability to read and write XML data.
Sounds good, just what I need. So I dig into the documentation and find it completely unhelpful, as it is just Doxygen autogenerated undocumentation. Fine, I can read code, so let's check the sample code then. I open it and find that the shortest example of how to parse XML into a DOM tree and access data in the tree (DOMCount) consists of two files which are more than 600 lines long in total. Huh? I don't want to read 15 pages of code just to learn how to do two simple things: parse XML into DOM and get data from DOM. Other examples are even worse. Several files and several classes just to read and print freaking XML (DOMPrint). You've got to be kidding me. It cannot be that hard.

I don't really want to waste hours learning an API I'm unlikely to ever use again. After all, I don't write much C++ code, and I definitely don't write much C++ code that needs XML. So it's time to search further. The next hit is libxml++. It is a C++ wrapper over the popular C XML library libxml2. This time there is actually some documentation that does try to explain how to use the library. And this documentation contains an example which, while being just about 150 lines, manages to demonstrate most of the library's DOM API.

End result: I finish my code to read my XML data in the next 30 minutes using libxml++. It is simple, short and it works.

So what's wrong with Xerces-C++? There is no introduction-level documentation at all. The examples look too complex for the problems they are supposed to solve. And the reason for this is that the API is just bad: it requires writing unnecessarily complex client code.

Update: boris corrected me about the lack of introduction-level documentation in a comment on this blog post. It turned out I had missed it. As a weak excuse I'll blame bad navigation on the project's site :)

Syndicated 2007-08-23 20:54:00 (Updated 2007-08-24 10:00:03) from Ilya Martynov

23 Aug 2007 (updated 23 Aug 2007 at 21:10 UTC) »

4 silly mistakes in use of MySQL indexes


1. Not learning how to use EXPLAIN SELECT

I'm really surprised how many developers use MySQL all the time and yet do not know or understand how to use EXPLAIN SELECT. Several times I've seen developers propose serious architectural changes to their code to minimize, partition or cache data in their database when the actual solution was to spend 30 minutes thinking over the output of EXPLAIN SELECT and adding or changing a couple of indexes.

2. Wasting space with redundant indexes

If you have a multicolumn index, you don't need a separate index on a leftmost prefix of its columns. It is easier to explain with an example:
    CREATE TABLE table1 (
        col1 INT,
        col2 INT,
        PRIMARY KEY (col1, col2),
        KEY (col1)
    );
The index on col1 is redundant, as any search on col1 can use the primary key. It just wastes disk space and might make queries which change this table a bit slower.

There is one catch, though. See below.

3. Incorrect order of columns in index

The order of columns in a multicolumn index is important. From the MySQL documentation:
MySQL cannot use an index if the columns do not form a leftmost prefix of the index.
Example:
    CREATE TABLE table2 (
        id INT PRIMARY KEY,
        col1 INT,
        col2 INT,
        col3 INT,
        KEY (col1, col2)
    );
MySQL won't use any index for a query like
    SELECT * FROM table2 WHERE col2=123
EXPLAIN SELECT shows this instantly. If you want this query to run faster, either change the order of columns in the index or add another index.
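
To get a feel for why the leftmost prefix matters, here is a rough analogy in C++ (it is only an analogy, not how MySQL is implemented). I'm assuming a std::map keyed by a (col1, col2) pair as a stand-in for the index; the names CompositeIndex, countByCol1 and countByCol2 are made up for this sketch. Entries are sorted by col1 first, so a lookup by col1 can jump straight to the matching range, while a lookup by col2 alone has to visit every entry.

```cpp
#include <climits>
#include <cstddef>
#include <map>
#include <utility>

// Hypothetical stand-in for an index on (col1, col2): entries are kept sorted
// by col1 first and by col2 second, and each entry maps to a row id.
typedef std::map<std::pair<int, int>, int> CompositeIndex;

// Lookup by the leftmost column: matching entries are contiguous in the
// ordering, so we jump to the first one and stop as soon as col1 changes.
std::size_t countByCol1(const CompositeIndex& index, int col1) {
    std::size_t count = 0;
    CompositeIndex::const_iterator it =
        index.lower_bound(std::make_pair(col1, INT_MIN));
    for (; it != index.end() && it->first.first == col1; ++it) {
        ++count;
    }
    return count;
}

// Lookup by col2 only: matching entries are scattered across the ordering,
// so every entry is visited (the analogue of a full scan).
// *visited receives the number of entries that had to be examined.
std::size_t countByCol2(const CompositeIndex& index, int col2,
                        std::size_t* visited) {
    std::size_t count = 0;
    *visited = 0;
    for (CompositeIndex::const_iterator it = index.begin();
         it != index.end(); ++it) {
        ++*visited;
        if (it->first.second == col2) {
            ++count;
        }
    }
    return count;
}
```

With a handful of entries the difference is invisible, but the second function always touches the whole container, which is roughly what EXPLAIN SELECT is telling you when it reports no usable key.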

4. Not using multicolumn indexes when you need to

MySQL generally uses only one index per table at a time, so if you query by several columns of the table you may need to add a multicolumn index. Example:
    CREATE TABLE table3 (
        id INT PRIMARY KEY,
        col1 INT,
        col2 INT,
        col3 INT,
        KEY (col1)
    );
A query like
    SELECT * FROM table3 WHERE col1=123 AND col2=456
would use the index on col1 to reduce the number of rows to check, but MySQL can do much better if you add a multicolumn index which covers both col1 and col2. The effect of adding such an index is very easy to see with EXPLAIN SELECT.

Syndicated 2007-08-16 12:22:00 (Updated 2007-08-23 21:02:00) from Ilya Martynov

volatile and threading


Until recently I didn't have much experience writing multi-threaded programs in C++, so when I tried I found myself really confused about how multi-threaded programs mix with volatile variables. So I did a little research and the quick summary is: this topic is confusing. It looks like if you put locks around global variables shared between threads, you shouldn't need to care about the volatile qualifier. That is definitely the case under POSIX threads and most likely with other threading libraries as well. If you don't use locks and rely on atomic operations instead, it seems that you have to use the volatile qualifier for shared global variables, but as far as portability is concerned it is a grey area.

The longer story is below:

Suppose we have a piece of code which waits for a certain external condition to happen. The code could look like
    bool gEvent = false;

    void waitLoop() {
        while (!gEvent) {
            sleep(1);
        }
        ...
    }
Let's assume that this is a single-threaded program and the external condition we are waiting for is a Unix signal. The signal handler is very simple; it just sets gEvent to true:
    void wakeUp() {
        gEvent = true;
    }
The problem with the code above is that the compiler may optimize away the check of the condition inside waitLoop(), assuming from local analysis of the code that gEvent never changes. The fix is to declare gEvent with the volatile qualifier, which basically tells the compiler that the variable can change at any time and that it is unsafe to perform any optimization based on analysis of the local code:
    volatile bool gEvent = false;
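
For the single-threaded signal case, here is a minimal self-contained sketch that can actually be compiled. It differs from the snippets above in a few assumed details: the flag is a volatile std::sig_atomic_t rather than a bool (that is the type the C and C++ standards guarantee is safe to write from a handler), the handler takes the signal number as an argument, and the names gSignalSeen, onSignal and raiseAndCheck are made up for illustration.

```cpp
#include <csignal>

// Flag written by the signal handler. volatile std::sig_atomic_t is the type
// the standard guarantees can be safely assigned from a handler.
volatile std::sig_atomic_t gSignalSeen = 0;

// Signal handlers receive the signal number as their only argument.
extern "C" void onSignal(int /*signum*/) {
    gSignalSeen = 1;
}

// Installs the handler for SIGTERM, delivers the signal to this process,
// and reports whether the flag change is visible afterwards.
bool raiseAndCheck() {
    gSignalSeen = 0;
    std::signal(SIGTERM, onSignal);
    std::raise(SIGTERM);  // the handler runs before raise() returns
    return gSignalSeen == 1;
}
```

In a real program the signal would of course arrive from outside while a loop like waitLoop() is polling the flag; raising it ourselves just keeps the sketch deterministic.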
Let's take another example. The code is the same, but this time it is a multi-threaded program where one thread waits for another. So waitLoop() runs inside one thread and wakeUp() is eventually called from another. Is the code still correct? Probably yes, if we keep the volatile qualifier and if the operations which read or write the gEvent variable can be considered atomic. The latter assumption seems to hold on most (all?) platforms.
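
Compilers that support C++11 offer another way to sidestep the "is it atomic?" question: std::atomic, which makes access to the flag from several threads well-defined without a lock. This is a different technique from the volatile approach discussed here, shown only as a sketch; it assumes a C++11 toolchain, and the names gEventAtomic, waitLoopAtomic, wakeUpAtomic and runAtomicDemo are invented for the example.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Shared flag. std::atomic makes reads and writes from different threads
// well-defined without a lock.
std::atomic<bool> gEventAtomic(false);

// Waiting side: polls the flag until another thread sets it.
void waitLoopAtomic() {
    while (!gEventAtomic.load()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}

// Signalling side: publishes the event.
void wakeUpAtomic() {
    gEventAtomic.store(true);
}

// Runs the waiter in a second thread, wakes it up from the calling thread,
// and returns the final value of the flag once the waiter has exited.
bool runAtomicDemo() {
    gEventAtomic.store(false);
    std::thread waiter(waitLoopAtomic);
    wakeUpAtomic();
    waiter.join();
    return gEventAtomic.load();
}
```

Depending on the platform you may need to link with the threading library (for example -pthread with GCC).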

But what if we cannot treat the operations which read or write the shared variable as atomic? For example, it might be an instance of a more complex type, such as a class which carries more information than just whether the event has happened or not:
    struct EventInfo {
        EventInfo(bool happened = false, const string& source = "")
            : fHappened(happened), fSource(source)
        {}
        bool fHappened;
        string fSource;
    };

    // Conceptual sketch: as written, a compiler rejects the std::string
    // accesses below because gEventInfo is volatile.
    volatile EventInfo gEventInfo;

    void waitLoop() {
        while (!gEventInfo.fHappened) {
            sleep(1);
        }
        const string& eventSource = gEventInfo.fSource;
        ...
    }

    void wakeUp() {
        gEventInfo = EventInfo(true, "wakeUp");
    }
This code is still OK for a single-threaded program where wakeUp() is a signal handler, but it is unsafe for a multi-threaded program where wakeUp() runs in a separate thread, as operations on gEventInfo cannot be treated as atomic anymore.

So how do we fix it? We should surround the places where the code reads or writes gEventInfo with locks to make sure only one thread accesses gEventInfo at a time. I'll use the Boost thread library in the example.
    boost::mutex gMutex;

    void waitLoop() {
        string eventSource;

        for (bool eventHappened = false; !eventHappened; ) {
            {
                boost::mutex::scoped_lock lock(gMutex);
                eventHappened = gEventInfo.fHappened;
                eventSource = gEventInfo.fSource;
            }
            sleep(1);
        }
        ...
    }

    void wakeUp() {
        boost::mutex::scoped_lock lock(gMutex);

        gEventInfo = EventInfo(true, "wakeUp");
    }
Comparing this code with the earlier examples, it looks like we still need to declare the gEventInfo variable as volatile, but it turns out we don't really need to. Quote from Threads Cannot be Implemented as a Library [PDF]:
In practice, C and C++ implementations that support Pthreads generally proceed as follows:
  1. Functions such as pthread_mutex_lock() that are guaranteed by the standard to “synchronize memory” include hardware instructions (“memory barriers”) that prevent hardware reordering of memory operations around the call.
  2. To prevent the compiler from moving memory operations around calls to functions such as pthread_mutex_lock(), they are essentially treated as calls to opaque functions, about which the compiler has no information. The compiler effectively assumes that pthread_mutex_lock() may read or write any global variable. Thus a memory reference cannot simply be moved across the call. This approach also ensures that transitive calls, e.g. a call to a function f() which then calls pthread_mutex_lock(), are handled in the same way more or less appropriately, i.e. memory operations are not moved across the call to f() either, whether or not the entire user program is being analyzed at once.
So at least if you are using POSIX threads (boost::threads under Linux uses them), your code is probably safe without volatile as long as you use locks around global variables shared between several threads. It is a good question whether this example code is portable to other platforms; after all, boost::threads supports threading libraries other than POSIX, which may have different rules for mutexes and locks. I haven't researched this yet, as for now I don't really care about other platforms.
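
For reference, here is a compact, self-contained variant of the locked version that compiles on its own. It is only a sketch under a few assumptions: it uses std::mutex and std::lock_guard from C++11 in place of boost::mutex and scoped_lock, it polls every millisecond instead of every second, and the names gGuardedEvent, waitForEvent, signalEvent and runMutexDemo are made up. Note that the shared variable is not volatile.

```cpp
#include <chrono>
#include <mutex>
#include <string>
#include <thread>

struct EventInfo {
    EventInfo(bool happened = false, const std::string& source = "")
        : fHappened(happened), fSource(source) {}
    bool fHappened;
    std::string fSource;
};

std::mutex gMutex;
EventInfo gGuardedEvent;  // deliberately not volatile; guarded by gMutex

// Polls the shared state under the lock and returns the event source once
// the event has happened. The lock is released before each sleep.
std::string waitForEvent() {
    for (;;) {
        {
            std::lock_guard<std::mutex> lock(gMutex);
            if (gGuardedEvent.fHappened) {
                return gGuardedEvent.fSource;
            }
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}

// Publishes the event under the same lock.
void signalEvent() {
    std::lock_guard<std::mutex> lock(gMutex);
    gGuardedEvent = EventInfo(true, "wakeUp");
}

// Resets the state, signals from a second thread, and waits for the event
// in the calling thread. Returns the source reported by the signaller.
std::string runMutexDemo() {
    {
        std::lock_guard<std::mutex> lock(gMutex);
        gGuardedEvent = EventInfo();
    }
    std::thread signaller(signalEvent);
    std::string source = waitForEvent();
    signaller.join();
    return source;
}
```

Calling runMutexDemo() should hand back the string set by the signalling thread, with every access to the shared EventInfo happening under gMutex.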

Some interesting links on this topic:
  • A Memory model for C++: FAQ - briefly mentions why the volatile keyword is insufficient to ensure synchronization between threads and links to papers for further reading.
  • http://www.artima.com/cppsource/threads_meeting.html - Not much to read there but I love this quote: "Not all the dragons were so easily defeated, unfortunately. Among the issues guaranteed to waste at least 20 minutes of group time with little or nothing to show ... What does volatile mean?" (this in context of multi-threaded programs). If C++ experts cannot agree on this ...
  • Another person gets confused over the use of volatile and threads. Interesting discussion on comp.programming.threads.

Syndicated 2007-07-31 14:58:00 (Updated 2007-08-12 00:44:39) from Ilya Martynov

boost::thread and boost::mutex tutorial

For most of Boost's libraries the documentation requires you to read everything from start to end before you can write any code. Compare that with most CPAN modules, where you can usually start using a module after quickly scanning the synopsis and maybe the description parts of its POD documentation. POD documentation as a rule has good examples right at the top of the page. Boost's documentation usually doesn't.

So I was looking for basic usage examples for the boost::thread and boost::mutex classes, and initially I couldn't find any because I was using the wrong search keywords. In the end I figured out how to use boost::thread and boost::mutex in my application the hard way, by reading the Boost documentation without relying on any examples. But afterwards I did find a very good article on this topic with many simple examples: The Boost.Threads Library on Dr.Dobb's. So I'm posting this link here for Google. It is in the top 10 hits for some relevant keywords but not for others (for example for boost thread mutex tutorial), and this is why I missed it initially. If my blog post helps any Boost.Threads newbie get started, then I would consider the time spent writing this post not wasted.

Syndicated 2007-07-25 16:35:00 (Updated 2007-08-02 20:49:34) from Ilya Martynov

Starting a new blog

I used to have a blog at use.perl.org but I was just too lazy to write often. One problem is that it seems you need a certain discipline to keep doing that. Also I just didn't like that site's blogging engine. It looked too simple and offered little control.

At a certain point I tried to switch over to a blog on my personal site, but instead of actually blogging I got carried away designing a "perfect" system for my blog. I spent hours evaluating different software and I had very exotic requirements, like being able to use SCM software to store my posts. That implied I needed blog software which uses raw files to store posts. I ended up hacking together something monstrous that was a combination of Blosxom, darcs and make. And it wasn't that convenient to use either. In the end I probably spent much more time setting all this up than actually blogging.

So now I want to start from scratch: pick some blogging engine which doesn't get in the way and discipline myself to actually write periodically. From my experience with learning new programming languages, you learn much faster when you have an actual project you are trying to implement in the new language. In a similar vein, I'd expect it to be much easier to find new topics for my blog each day if I have a certain new fun project on my mind. And this new project is going to be teaching myself OCaml. Let's see how it goes.

Syndicated 2007-07-18 11:46:00 (Updated 2007-07-18 12:19:25) from Ilya Martynov
