Linux deobfuscation project

Posted 18 Jul 2002 at 17:18 UTC by mslicker

Regardless of the particular merits of Linux, a large body of device drivers has been written for it. Linux now serves as both a functional project and a reference for operating system writers, driver developers, etc. As a functional project, the source code may serve as a proper medium of communication. As a reference, the information is virtually unreadable, obfuscated by the strange structure, bizarre coding style, and layers of abstraction present in the source code.

The number of devices supported by Linux is so great that when designers of new operating systems embark on creating a system, they often opt for building on top of Linux, despite the incredible drawbacks that come from this approach. If Linux were a suitable reference, this would increase the number of systems, advance systems with fewer developers, and infuse new ideas into free software.

Furthermore, for some devices, Linux may be the only reference for programming a device. This is due partly to the fact that manufacturers' support of "open source" means the development of a Linux driver, not releasing the specifications as they should have done.

My idea is to mechanically convert the Linux sources to a form that at once reflects their true structure and more directly represents their semantics. A particular machine configuration will be assumed, so preprocessing can be done in its entirety. Comments and macros will be left as hypertext annotations. File structure should be completely broken down, leaving just a call graph (or forest). Types will be completely stripped and left as an annotation, with a similar graph representation.
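As a rough illustration of the "call graph (or forest)" idea, here is a minimal Python sketch. It assumes the function bodies have already been extracted and preprocessed by some other tool; the function names, the FUNCTIONS table, and the regex-based call detection are all made up for the example and are no substitute for a real C parser.

```python
import re

# Hypothetical, already-preprocessed input: function name -> body text.
# A real tool would get this from a C parser, not hand-written strings.
FUNCTIONS = {
    "driver_init": "probe_device(); register_irq();",
    "probe_device": "read_reg(); read_reg();",
    "register_irq": "",
    "read_reg": "",
    "unused_helper": "read_reg();",
}

# Naive pattern: an identifier followed by "(". It also matches keywords
# such as "if (", so results are filtered against the known function names.
CALL = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def build_call_graph(functions):
    """Map each function to the sorted list of known functions it calls."""
    graph = {}
    for name, body in functions.items():
        callees = {c for c in CALL.findall(body) if c in functions}
        graph[name] = sorted(callees)
    return graph


def roots(graph):
    """Functions that no function calls: the roots of the forest."""
    called = {c for callees in graph.values() for c in callees}
    return sorted(n for n in graph if n not in called)


def render(graph, node, depth=0, path=()):
    """Return indented lines for the tree under `node`, stopping at recursion."""
    if node in path:
        return ["  " * depth + node + " (recursive)"]
    lines = ["  " * depth + node]
    for callee in graph[node]:
        lines.extend(render(graph, callee, depth + 1, path + (node,)))
    return lines


if __name__ == "__main__":
    graph = build_call_graph(FUNCTIONS)
    for root in roots(graph):
        print("\n".join(render(graph, root)))
```

Each root gets its own tree, so a function reachable from several roots appears more than once; that duplication is what makes the result a forest instead of a single graph.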

Currently this only exists as an idea. If you are interested in funding this work, please contact me. I say this because it is not something easily done. It would require some time to do right, and that is perhaps the only thing preventing me from doing so.

Below, I invite your comments on this idea. What are sources of obfuscation? What is the best way to present the information?

Reverse Engineering, posted 18 Jul 2002 at 18:10 UTC by garym » (Master)

While I wouldn't be so scathing in my criticism of the code, I can share some of these concerns; in the course of the (now abandoned) Linux Kernelbook Project, we used Rigi to distill the source code into an object/method model and discovered, as others have noted, that the structure of Linux fits the social structure of the kernel as much as it fits the technical requirements. That's probably to be expected and as it should be, but it makes for a real quagmire of innocuous interconnections for newcomers coming to the code, especially those who do not speak English and cannot readily pester the linux-kernel list for answers.

While we gave up, we did get tremendous interest for our project to discover and document the actual structure of Linux, especially from China and India but also here in the West. At the time, during the great Linux boom, several publishers were eager to replace the aging 2.0-kernel textbooks with 2.4-based updates, and we had enough interest from Macmillan to actually receive contracts. While the funding was welcome, it was also our kiss of death: intellectual property issues working with a publisher drove away precisely those people we needed to make the project work. After about four months of beating my head against a wall, the publisher pulled out and I gave up -- if you're interested, welcomes any potential moderators or contributors who'd like to restart it.

I don't follow, posted 18 Jul 2002 at 23:18 UTC by movement » (Master)

I'm not at all clear on what you want to do, or why you think it might be beneficial. You seem to have picked a straw man, and then ask for funding and for somebody else to actually fill in the details of the project ("What are sources of obfuscation? What is the best way to present the information?").

Could you explain a little more clearly (with examples) what problem you are planning to solve, and how?

Re: I don't follow, posted 19 Jul 2002 at 00:13 UTC by mslicker » (Journeyer)

No, I don't require the input of anyone else; however, if people would like to provide their input here, they are welcome to.


The work would require constructing a C parser and some logic to translate into the desired form. Additionally, the program will require human input to organise the resulting source in a logical manner. Organization will most likely be with respect to logical modules: driver x, y, subsystem z, etc. The final output will be browsable HTML, with navigation to each subsystem, call graphs, type graphs, and perhaps cross-references, with comments and other information described above as annotations.

Will this result in source which is much easier to read? I think so, just from observing my own methods with regard to understanding what is really going on in Linux.
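To make the intended output a little more concrete, here is a small Python sketch of the last stage only: turning already-extracted data into a browsable HTML page. Everything in it is an assumption for illustration: the CALLS, COMMENTS and MODULES tables stand in for what the parser and the human organiser would produce, and the page layout (one heading per module, one list item per function, comments as title tooltips) is just one possible choice.

```python
from html import escape

# Hypothetical data; a real run would get these from the parser and from
# the human-supplied grouping into logical modules.
CALLS = {"driver_init": ["probe_device"], "probe_device": []}
COMMENTS = {"driver_init": "Entry point; runs once at load time."}
MODULES = {"driver x": ["driver_init", "probe_device"]}


def function_entry(name, calls, comments):
    """One list item: an anchor for the function plus links to its callees."""
    links = ", ".join(
        f'<a href="#{escape(c)}">{escape(c)}</a>' for c in calls[name]
    ) or "none"
    note = comments.get(name)
    # The original source comment is kept as an annotation (a tooltip here).
    title = f' title="{escape(note)}"' if note else ""
    return (
        f'<li id="{escape(name)}"{title}>'
        f"<code>{escape(name)}</code> calls: {links}</li>"
    )


def render_page(modules, calls, comments):
    """Render every module as a heading followed by its functions."""
    parts = ["<html><body>"]
    for module, names in modules.items():
        parts.append(f"<h2>{escape(module)}</h2>")
        parts.append("<ul>")
        parts.extend(function_entry(n, calls, comments) for n in names)
        parts.append("</ul>")
    parts.append("</body></html>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(render_page(MODULES, CALLS, COMMENTS))
```

Clicking a callee link jumps to that function's entry on the same page, which is the kind of navigation a call graph view would need.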

To those interested in funding, I think the work will take two to three weeks to complete. It probably won't get done by me without a funding source.

Linux as a device reference, posted 19 Jul 2002 at 03:10 UTC by tk » (Observer)

Your aim seems to be to allow easier extracting of device programming information from the kernel sources. I'm also interested in this -- I've been wondering how to glean information about programming PCMCIA devices (OK, there are books on this topic...).

How difficult can it be to plot the kernel's structure? Writing a GNU C parser should be quite tractable: the (draft) C89 grammar and the grammar of the extensions are both publicly known. For creating visual representations of textual graphs, VCG comes in handy.

I'm not sure though if plotting the structure will make it easier to gather device information. In some ways, it seems overkill (e.g. there's no need to know about how page tables work if you just want to tweak devices). In other ways, it seems insufficient (e.g. the drivers/pcmcia/ code has so many layers in it that it's hard to figure out anything at all).

Rigi is what you want, posted 19 Jul 2002 at 03:11 UTC by garym » (Master)

What you describe was already mostly done using Rigi. Call graphs, use of defined types and structs, mapping of code to files and to dir trees, basically a network graph where you click on a line that binds two entities and it takes you straight to the source code.

The bad news is that it's maybe not as useful as you might think ;)

Rigi, posted 19 Jul 2002 at 03:47 UTC by mslicker » (Journeyer)

No, I don't want purely a graph. There should be some intelligence behind the organization, not simply a dump of the source in one form or another. If skill is not applied you may well end up with something much worse than what you started with.

The Google directory in particular is as useful as it is perhaps because it is human-organized. How much can be automated and how much needs human intervention I have not yet determined. For the idea to be practical for one person, the majority should be automated. I do believe the idea is practical for one person.

In the process I also want to convert the source into an easier to read format. The C preprocessor complicates reading the source greatly. That is why I would like to do all pre-processing in the translation process. For this a machine configuration will be assumed; the process can be repeated for other configurations. Only the most basic aspects of the C language will remain. There is also the possibility of inventing a simpler notation which expresses the same semantics.
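To show what fixing a machine configuration buys, here is a minimal Python sketch that resolves conditional blocks against an assumed set of defined symbols. It is deliberately tiny: it understands only #ifdef, #ifndef, #else and #endif written without a space after the "#", and it does no macro expansion or #if expression evaluation, all of which a real preprocessor pass would need. The source snippet and its function names are made up.

```python
def resolve_conditionals(lines, defined):
    """Keep only the lines active when exactly the names in `defined` are set."""
    out = []
    stack = []  # one (parent_active, condition) pair per open conditional
    active = True
    for line in lines:
        words = line.split()
        directive = words[0] if words else ""
        if directive in ("#ifdef", "#ifndef"):
            cond = (words[1] in defined) == (directive == "#ifdef")
            stack.append((active, cond))
            active = active and cond
        elif directive == "#else":
            parent_active, cond = stack[-1]
            active = parent_active and not cond
        elif directive == "#endif":
            active, _ = stack.pop()  # back to the enclosing block's state
        elif active:
            out.append(line)
    return out


# Made-up source fragment with one nested conditional.
SOURCE = [
    "void init(void) {",
    "#ifdef CONFIG_PCI",
    "    pci_setup();",
    "#ifndef CONFIG_SMP",
    "    uniprocessor_setup();",
    "#endif",
    "#else",
    "    legacy_setup();",
    "#endif",
    "}",
]

if __name__ == "__main__":
    for config in ({"CONFIG_PCI"}, set()):
        print(sorted(config))
        print("\n".join(resolve_conditionals(SOURCE, config)))
```

Running it once per configuration gives one straight-line version of the source per configuration, with no conditionals left for the reader to untangle.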

tk, Thanks for the VCG link.
