Linux deobfuscation project
Posted 18 Jul 2002 at 17:18 UTC by mslicker
Regardless of the particular merits of Linux, a large body of device drivers has been written for it. Linux now serves as both a functional project and a reference for operating system writers, driver developers, etc. As a functional project the source code may serve as a proper medium of communication. As a reference, the information is virtually unreadable, obfuscated by the strange structure, bizarre coding style, and layers of abstraction present in the source code.
The number of devices supported by Linux is so great that when designers of new operating systems embark on creating a system they often opt for building on top of Linux, despite the incredible drawbacks that come from this approach. If Linux were a suitable reference, this would increase the number of systems, advance systems with fewer developers and infuse new ideas into free software.
Furthermore, for some devices, Linux may be the only reference for programming the device. This is due partly to the fact that manufacturers' support of "open source" means developing a Linux driver rather than releasing the specifications, as they should have done.
My idea is to mechanically convert the Linux sources to a form that at once reflects their true structure and more directly represents their semantics. A particular machine configuration will be assumed, so preprocessing can be done in its entirety. Comments and macros will be left as hypertext annotations. File structure should be completely broken down, leaving just a call graph (or forest). Types will be completely stripped and left as an annotation, with a similar graph representation.
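To give a rough idea of the call graph step, here is a toy sketch in Python. It assumes source that is already preprocessed and has one function definition per line, which real kernel code does not satisfy, and it uses regular expressions where a real tool would need a proper C parser. The function names in the sample are made up for illustration.

```python
import re
from collections import defaultdict

# Toy, already-preprocessed C source; all names here are hypothetical.
SOURCE = """
static int probe_card(void) { return read_reg(0x10); }
static int read_reg(int r) { return inb(r); }
int init_driver(void) { if (probe_card()) setup_irq(); return 0; }
"""

# Keywords that look like calls because they are followed by "(".
KEYWORDS = {"if", "while", "for", "switch", "return", "sizeof"}
# Assumption: a whole definition fits on one line as "type name(args) { body }".
DEFINITION = re.compile(r"^[\w\s\*]*?\b(\w+)\s*\([^)]*\)\s*\{(.*)\}\s*$")
CALL = re.compile(r"\b(\w+)\s*\(")


def call_graph(source):
    """Map each defined function to the set of names it calls."""
    graph = defaultdict(set)
    for line in source.splitlines():
        match = DEFINITION.match(line.strip())
        if not match:
            continue
        name, body = match.groups()
        graph[name]  # register the function even if it calls nothing
        for callee in CALL.findall(body):
            if callee not in KEYWORDS:
                graph[name].add(callee)
    return dict(graph)


if __name__ == "__main__":
    for caller, callees in sorted(call_graph(SOURCE).items()):
        print(caller, "->", ", ".join(sorted(callees)))
```

The file structure plays no part in the result: only the caller and callee names survive, which is the point of breaking the files down into a graph.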
Currently this only exists as an idea. If you are interested in funding this work please contact me. I say this because it is not something easily done. It would require some time to do right, and that is perhaps the only thing preventing me from doing so.
Below, I invite your comments on this idea. What are sources of obfuscation? What is the best way to present the information?
While I wouldn't be so scathing in my criticism of the code, I can share
some of these concerns; in the course of the (now abandoned) Linux
Kernelbook Project, we used Rigi to distill the source code into an
object/method model and discovered, as others have noted, that the
structure of Linux fits the social structure of the kernel as much as it
fits the technical requirements. That's probably to be expected and as
it should be, but it makes for a real quagmire of innocuous
interconnections for newcomers coming to the code, especially those who
do not speak English and cannot readily pester the linux-kernel list for
answers.
While we gave up, we did get tremendous interest for our project to
discover and document the actual structure of Linux, especially
from China and India but also here in the west. At the time, during the
great Linux boom, several publishers were eager to replace the aging
2.0-kernel textbooks with 2.4-based updates, and we had enough interest
from Macmillan to actually receive contracts. While the funding was
welcome, it was also our kiss of death: intellectual property issues working with a publisher drove away precisely those people we needed to make the project work. After about four months of beating my head against a wall, the publisher pulled out and I gave up -- if you're interested, kernelbook.sourceforge.net welcomes any
potential moderators or contributors who'd like to restart it.
I don't follow, posted 18 Jul 2002 at 23:18 UTC by movement »
(Master)
I'm not at all clear on what you want to do, or why you think
it might be beneficial. You seem to have picked a straw man,
and then ask for funding and for somebody else to actually
fill in the details of the project ("What are sources of obfuscation?
What is the best way present the information?").
Could you explain a little more clearly (with examples) what
problem you are planning to solve, and how?
No, I don't require the input of anyone else, however if people would
like to provide their input here they are welcome to.
Details
The work would require constructing a C parser, and some logic to
translate into the desired form. Additionally, the program will require
human input to organise the resulting source in a logical manner.
Organization will most likely be with respect to logical modules: driver
x, y, subsystem z, etc. The final output will be browsable HTML, with
navigation to each subsystem, call graphs, type graphs, and perhaps a
cross reference, with comments and other information described above as
annotations.
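As a rough sketch of the output stage only, assuming the parser has already produced a call graph (a mapping from function name to callees) and a table of the original comments, a subsystem page could be generated along these lines. The function names, the page layout and the sample data are all made up for illustration; nothing here is an existing tool.

```python
from html import escape


def render_function(name, callees, note):
    """Return an HTML fragment for one function.

    The original comment is kept as a hover annotation (title attribute)
    and each callee becomes a link to its own anchor on the page.
    """
    links = ", ".join(
        '<a href="#%s">%s</a>' % (escape(c), escape(c)) for c in sorted(callees)
    )
    return '<h3 id="%s" title="%s">%s</h3>\n<p>calls: %s</p>\n' % (
        escape(name), escape(note), escape(name), links or "nothing"
    )


def render_subsystem(title, graph, notes):
    """Render one subsystem page from a call graph and per-function comments."""
    body = "".join(
        render_function(name, graph[name], notes.get(name, ""))
        for name in sorted(graph)
    )
    return "<html><body>\n<h2>%s</h2>\n%s</body></html>\n" % (escape(title), body)


# Hypothetical data standing in for what the parser would produce.
page = render_subsystem(
    "driver x",
    {"init_driver": {"probe_card"}, "probe_card": set()},
    {"probe_card": "returns < 0 on failure"},
)
```

Grouping functions into subsystems is the part that would need the human input; the rendering itself is mechanical.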
Will this result in source which is much easier to read? I think so,
just from observing my own methods with regard to understanding what is
really going on in Linux.
To those interested in funding, I think the work will take two to three
weeks to complete. It probably won't get done by me without a
funding source.
Your aim seems to be to allow easier extracting of device programming
information from the kernel sources. I'm also interested in this -- I've
been wondering how to glean information about programming PCMCIA devices (OK, there are books on this
topic...).
How difficult can it be to plot the kernel's structure? Writing a GNU C
parser should be quite tractable: the (draft) C89 grammar and the grammar of
the extensions are both publicly known. For creating visual representations
of textual graphs, VCG comes in
handy.
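For what it's worth, turning a call graph into VCG input is the easy part. Below is a minimal Python sketch that writes the textual graph format as I remember it (graph, node and edge blocks with title, sourcename and targetname attributes); check it against the VCG documentation before relying on it. The graph contents are made-up names.

```python
def to_vcg(graph, title="call graph"):
    """Serialise {caller: set_of_callees} as a VCG-style textual graph."""
    nodes = set(graph)
    for callees in graph.values():
        nodes |= set(callees)  # callees defined elsewhere still need a node

    lines = ["graph: {", '  title: "%s"' % title]
    for name in sorted(nodes):
        lines.append('  node: { title: "%s" }' % name)
    for caller in sorted(graph):
        for callee in sorted(graph[caller]):
            lines.append(
                '  edge: { sourcename: "%s" targetname: "%s" }' % (caller, callee)
            )
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical call graph; a real one would come from the parser.
    print(to_vcg({"init_driver": {"probe_card"}, "probe_card": {"read_reg"}}))
```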
I'm not sure though if plotting the structure will make it easier to gather
device information. In some ways, it seems overkill (e.g. there's no need to
know about how page tables work if you just want to tweak devices). In other
ways, it seems insufficient (e.g. the drivers/pcmcia/ code
has so many layers in it that it's hard to figure out anything at all).
What you describe was already mostly done using Rigi. Call graphs, use
of defined types and structs, mapping of code to files and to dir trees,
basically a network graph where you click on a line that binds two
entities and it takes you straight to the source code.
The bad news is that it's maybe not as useful as you might think ;)
Rigi, posted 19 Jul 2002 at 03:47 UTC by mslicker »
(Journeyer)
No, I don't want purely a graph. There should be some intelligence
behind the organization, not simply a dump of the source in one form or
another. If skill is not applied you may well end up with something much
worse than what you started with.
The Google directory, in particular, is perhaps as useful as it is because
it is human organized. How much can be automated and how much
needs human intervention I have not yet determined. For the idea
to be practical for one person, the majority should be automated. I do
believe the idea is practical for one person.
In the process I also want to convert the source into an easier-to-read
format. The C preprocessor complicates reading the source greatly. That
is why I would like to do all pre-processing in the translation process. For this a machine configuration will be assumed; the process can be repeated for other configurations. Only the most basic aspects of the C language will remain. There is also the possibility of inventing a simpler notation which expresses the same semantics.
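To illustrate what fixing a configuration buys, here is a small Python sketch that resolves conditional compilation against an assumed set of defined symbols. It handles only #ifdef, #ifndef, #else and #endif; #if expressions, #define, #include and macro expansion are left out, and in practice the existing C preprocessor would do this job. The sample lines and function names are invented for the example.

```python
def resolve_conditionals(lines, defined):
    """Keep only the lines that are active under the assumed configuration."""
    output = []
    stack = []  # one entry per open conditional: True if that branch is active
    for line in lines:
        words = line.split()
        directive = words[0] if words else ""
        if directive in ("#ifdef", "#ifndef"):
            present = words[1] in defined
            stack.append(present if directive == "#ifdef" else not present)
        elif directive == "#else":
            stack[-1] = not stack[-1]  # flip only the innermost conditional
        elif directive == "#endif":
            stack.pop()
        elif all(stack):  # active only if every enclosing branch is active
            output.append(line)
    return output


# Hypothetical driver fragment with one configuration-dependent branch.
SOURCE = [
    "#ifdef CONFIG_PCI",
    "probe_pci();",
    "#else",
    "probe_isa();",
    "#endif",
    "register_driver();",
]

with_pci = resolve_conditionals(SOURCE, {"CONFIG_PCI"})
without_pci = resolve_conditionals(SOURCE, set())
```

Each configuration yields plain, branch-free source, so repeating the process for another configuration is just a matter of passing a different symbol set.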
tk, Thanks for the VCG link.