6 Mar 2009 ta0kira   » (Apprentice)

Belated Response to appenwar's 'tokenizers' Blog

Others with more experience than I have already made most of the points well; therefore, I'll stick to my own. This has less to do with what you actually said and more to do with the principle.

One thing that always irritates me is how gcc will ignore C/C++ file extensions and take a guess, or default to C++. For example, a .h will only be taken as C if it is included strictly through a chain of C files, and only if you don't use g++. One must therefore include the awkward #ifdef __cplusplus \ extern "C" { because some people don't know how to use the correct file extensions; otherwise, you might have linking problems if your header is actually backed by a C source. If a C file has to use a C feature not carried over to C++ (e.g. the .sym = member initializer), you can't #include that file in a C++ file even with extern "C". Conversely, you can get away with not qualifying structure variables with struct in a C header as long as only C++ files include it. All of this leads to less concise code, all because the ambiguity is considered acceptable. I do concede that early C++ used .h extensions for the standard headers, so it's partly lack of foresight.
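To make the guard and the initializer concrete, here is a minimal sketch of a header backed by a C source, shown as a single listing for brevity. The names point, point_sum, point_make_on_x_axis, and POINT_H are hypothetical and exist only for illustration. Note that C++20 later added a restricted form of designated initializers, so the incompatibility applies to earlier C++ standards.

```c
/* --- header part (would normally live in point.h) --- */
#ifndef POINT_H
#define POINT_H

#ifdef __cplusplus
extern "C" {   /* ask a C++ compiler to use C linkage for these names */
#endif

/* "struct" is written out so the header is valid C as well as C++. */
struct point {
    int x;
    int y;
};

int point_sum(const struct point *p);
struct point point_make_on_x_axis(int x);

#ifdef __cplusplus
}
#endif

#endif /* POINT_H */

/* --- source part (would normally live in point.c) --- */
int point_sum(const struct point *p)
{
    return p->x + p->y;
}

struct point point_make_on_x_axis(int x)
{
    /* C99 designated initializer: members not named (here, y) are
       zero-initialized. Pre-C++20 compilers reject this syntax, which
       is why such code has to stay in the C source and out of any
       header that a C++ file might include. */
    struct point p = { .x = x };
    return p;
}
```

Keeping the designated initializer in the source part means a C++ file can still include the header part; only the declarations cross the language boundary.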

Today I finally got around to using libxml2, which struck me as extensively (yet somehow poorly) documented and extremely ambiguous. On the other hand, it will save me from having to write my own compliant parser for the ~1.4M lines of XML I need to convert and load into a database. This has little to do with libxml2 not accepting partial errors, because the data I received was probably exported from SQL using the same library. I'd actually copy the trees created by libxml2 into a more usable structure if they weren't going right into a database, but XML is meant as a format, not as a run-time representation.

If someone is actually hand-writing proper XML, chances are they're missing the point (or they're dealing with a software interface that misses the point). Additionally, if someone is using software other than libxml2 to generate XML, they're either missing the point or they lack the appropriate language bindings. That being said, I use my own library to assemble and parse "XML-like" structures (closer to HTML, I guess) for IPC. It wouldn't make sense for me to use formal XML for the application, and especially not libxml2. Though the formats are very similar, the run-time organization used by libxml2 isn't anywhere near suitable for what I use the data for. Then again, I don't need any sort of standardization because the data doesn't go anywhere outside of the application. It's a symmetrical system because data importation and exportation are designed concurrently to complement each other, which I can only assume is the case with libxml2.

Something many formal projects lack (software and otherwise) is an explicit correlation between the core purposes of the project and the aspects of implementation (yes, I'm guilty, too). If I were to author something comparable to XML, I'd explicitly state that it isn't meant to be hand-written and that it's primarily intended to allow data transfer between applications with different maintainers. At the point of deciding whether or not to accept simple errors, I'd defer back to those principles and conclude that errors should not be accepted. If I were to author something like HTML, on the other hand, I would account for hand-written code and acknowledge that rendering with errors is better than rejecting a file. All too often projects are approached with founding principles, yet they fail to rationally extrapolate those principles to the level of implementation (guilty, again).
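As a rough illustration of carrying a stated principle down to the implementation, the sketch below makes error tolerance an explicit policy rather than an accident of the parser. It is not how libxml2 or any real parser works; the names check_tags, POLICY_STRICT, and POLICY_LENIENT are hypothetical, and the checker only understands bare <name> and </name> tags.

```c
#include <string.h>

/* STRICT suits a machine-to-machine format; LENIENT suits hand-written markup. */
enum policy { POLICY_STRICT, POLICY_LENIENT };

#define MAX_DEPTH 16
#define MAX_NAME  32

/* Returns 0 if the text is accepted under the given policy, -1 if rejected.
   Empty or over-long names and excessive nesting are rejected under either
   policy, just to keep the sketch short. */
int check_tags(const char *text, enum policy policy)
{
    char stack[MAX_DEPTH][MAX_NAME];
    int depth = 0;
    const char *p = text;

    while ((p = strchr(p, '<')) != NULL) {
        int closing = (p[1] == '/');
        const char *name = p + 1 + closing;
        const char *end = strchr(name, '>');
        size_t len;

        if (end == NULL)   /* unterminated tag */
            return policy == POLICY_STRICT ? -1 : 0;

        len = (size_t)(end - name);
        if (len == 0 || len >= MAX_NAME)
            return -1;

        if (!closing) {
            if (depth == MAX_DEPTH)
                return -1;
            memcpy(stack[depth], name, len);
            stack[depth][len] = '\0';
            depth++;
        } else if (depth > 0 &&
                   strlen(stack[depth - 1]) == len &&
                   memcmp(stack[depth - 1], name, len) == 0) {
            depth--;       /* closing tag matches the innermost open tag */
        } else if (policy == POLICY_STRICT) {
            return -1;     /* mismatched or stray closing tag */
        }
        /* Under the lenient policy a stray closing tag is simply ignored. */

        p = end + 1;
    }

    /* Tags left open are an error only under the strict policy. */
    if (depth != 0 && policy == POLICY_STRICT)
        return -1;
    return 0;
}
```

With this sketch, check_tags("<a><b></a>", POLICY_STRICT) rejects the input while the same text passes under POLICY_LENIENT, which mirrors the XML-versus-HTML distinction: the decision is made once, from the format's purpose, and the code just follows it.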

Rather than getting into everything already brought up, I'll leave it at that.

Kevin Barry
