Data on Sourceforge Projects

Posted 28 Nov 2001 at 05:34 UTC by renster

Thought the advogato community might be interested in a short report I have written based on sourceforge project data. It gives a general overview of projects housed at sourceforge. Any comments are greatly appreciated.


Some cool data here, posted 28 Nov 2001 at 15:05 UTC by MikeCamel » (Journeyer)

I know that I'm referenced in this, but despite this, you should have a look at it. There's some very interesting data and information about projects, number of developers, number of administrators, age, maturity and the like, and renster's done some great work on looking at possible correlations. We (the community) need to understand better how to create successful projects and what that requires, and this sort of research is very important as an underpinning.

please try to filter out empty projects, posted 28 Nov 2001 at 17:20 UTC by Zooko » (Master)

The majority of SourceForge projects have no source code and will never have any source code. If you could somehow filter out all the projects that have no source code that would help focus your research on actual live projects.

I was trying to figure out which programming languages are most used by open source/software libre hackers, and I didn't want the zillions of dead-on-arrival projects to influence my results, so I just looked at the "top 50 most active" and the "top 100 most downloaded": popularity of programming languages among open source hackers.

Regards,

Zooko

computation of one more statistic, posted 28 Nov 2001 at 18:01 UTC by sej » (Master)

Total Projects: 29,905
Active Projects: 7,205

Consider that: Active_Projects/100 == 20 / (100 - Percentile_20th_Most_Active_Project)
Then: Active_Projects == (100 * 20) / (100 - Percentile_20th_Most_Active_Project)
So with a value of 99.7224 for the Audacity project on today's SF home page, you get: Active_Projects == 2000 / 0.2776 == 7,205
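
For anyone who wants to redo this with a different day's numbers, here is a minimal Python sketch of the same arithmetic. The function name and parameters are mine, not anything SourceForge provides, and it assumes (as above) that the percentile shown is computed over active projects only.

```python
def estimate_active_projects(rank: int, percentile: float) -> float:
    """Estimate the number of ranked (active) projects from one project's
    activity rank and its activity percentile.

    Rearranges: active / 100 == rank / (100 - percentile)
    """
    top_share = 100.0 - percentile  # percent of projects at or above this rank
    if top_share <= 0:
        raise ValueError("percentile must be below 100")
    return 100.0 * rank / top_share


# Figures from the comment above: Audacity, 20th most active, 99.7224 percentile.
estimate = estimate_active_projects(20, 99.7224)
print(round(estimate))
```

The result is only as precise as the published percentile, so treat it as a rough estimate rather than an exact count.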

getting there from the LSMs, posted 29 Nov 2001 at 03:49 UTC by gregorsamsa » (Master)

We, at UNC, started working from the LSMs, which gave us completed projects (albeit occasionally in alpha), for our "A Quantitative Profile of a Community of Open Source Linux Developers", which will *finally* appear in 2002 in Communications of the ACM.

While I love what renster has done, the critique that SourceForge by its nature (a nature that I think is a fine and wonderful thing) hosts more great ideas than code by raw count is a valid one. I think it would be interesting, for example, to plot which admins worked on projects with code separately from those with just good ideas.

There is a *lot* of good stuff in the numbers in renster's report. Now it remains for us to tease out just what the numbers are saying.

re: getting there from the LSMs, posted 29 Nov 2001 at 04:14 UTC by renster » (Journeyer)

There are certainly some more numbers to play around with. I was looking for a way to graphically present a network map to show relationships among projects, i.e. how closely projects are related based on the number of shared admins and/or developers. It isn't really in my area of interest and I didn't want to stray too far, but it would certainly provide some more information. Unfortunately I don't have data on all the developers. I collected information on project admins only. Another extension would be to track project founders and the number of projects they are involved in over time.
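
As a rough illustration of the shared-admin idea, the sketch below counts how many admins each pair of projects has in common; those counts could serve as edge weights for a network map. The function name, project names and admin names are all made up for the example and are not taken from the report's data.

```python
from collections import Counter
from itertools import combinations


def shared_admin_counts(project_admins):
    """Map each pair of projects to the number of admins they share.

    project_admins: dict of project name -> iterable of admin names.
    Returns a Counter keyed by (project_a, project_b), names sorted.
    """
    # Invert the mapping: admin -> set of projects they administer.
    admin_projects = {}
    for project, admins in project_admins.items():
        for admin in set(admins):
            admin_projects.setdefault(admin, set()).add(project)

    # Every pair of projects under the same admin gains one shared admin.
    edges = Counter()
    for projects in admin_projects.values():
        for a, b in combinations(sorted(projects), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical sample data, for illustration only.
sample = {
    "proj_a": ["alice", "bob"],
    "proj_b": ["bob", "carol"],
    "proj_c": ["alice", "bob"],
}
for pair, weight in sorted(shared_admin_counts(sample).items()):
    print(pair, weight)
```

Pairs that never share an admin simply do not appear in the result, which keeps the edge list small when most projects are unrelated.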

I have other data on all-time statistics for projects. I think the next stage is to try and work with some more fine-grained data on statistics by month over the life of the project. Then it will be possible to track activity and get some feel for the projects and where they are heading: growing, declining, etc. The trouble with some of the summary data is that it is incorrect. Some of the numbers don't add up, and you can see discussion about this issue on the sourceforge forums. It will be necessary, for the sake of ensuring data integrity, to go through the bug tracking information to follow through on things like bug submission and bug resolution to get a better understanding of project 'performance'. I think this would require some cooperation from sourceforge staff and I'm not sure about their stance on the whole research thing as yet (hint, hint to any sourceforge staff reading this).

I am currently looking at user comments from slashdot (and perhaps advogato, when I get around to gathering them) that mention open source software development, and coding them to get a handle on the nature of open source software development and the factors that are mentioned as being associated with success and failure. It's a big coding task, but I hope it will provide a good starting list of factors to send out to experts and others who can then provide me with more feedback, details and direction.

thanks for your comments so far.

Lancashire article in First Monday, posted 11 Dec 2001 at 03:35 UTC by gregorsamsa » (Master)

David Lancashire, a Ph.D. candidate in the Department of Political Science at the University of California, Berkeley, has a new article in First Monday at http://firstmonday.org/issues/issue6_12/lancashire/ with some great charts and maps that would be helpful for analysing your numbers too.

Here's his abstract:
The nexus of open source development appears to have shifted to Europe over the last ten years. This paper explains why this trend undermines cultural arguments about "hacker ethics" and "post-scarcity" gift economies. It suggests that classical economic theory offers a more succinct explanation for the peculiar international distribution of open source development: hacking rises and falls inversely to its opportunity cost. This finding throws doubt on the Schumpeterian assumption that the efficiency of industrial systems can be measured without reference to the social institutions that bind them.
