Version Control Systems: The Next Generation

Posted 14 Aug 2000 at 07:36 UTC by Simon

I'm building a new version control system, and I need input from people I want to use it: people like you. If we're going to get version control right, how should we do it?

Recently, I've been playing about with Perforce, a rather nice version control system. It's very easy to use, it's got a bunch of neat features, but it's not free software. So, initially, I set off to create myself a free equivalent of Perforce; something to fix up (what I see as) the deficiencies in CVS and keep Perforce's good bits. I'm calling the system `Perverse' - the Perl Version System.

It's gone beyond that, though. This is the chance I have to create a free version control system and get it right. (Read: "dammit!") To do this, I need input from the people I want to use the system: the developers, you guys. You can see the current wishlist at http://perverse.sourceforge.net/wishlist.html, but I think Advogato is a good place for us to have a discussion on version control and how it should be done.

Yes, I know that CVS does this and Aegis does that and Bitkeeper does the other; now's the time to think about what we want from a perfect version control system, not a currently existing one.

Then all me and the boys have to do is go away and write it...


bleh. go for Subversion, posted 14 Aug 2000 at 08:24 UTC by gstein » (Master)

Great. Yet Another Version Control System. And echoes of a newbie... "Wow. Wouldn't this be great?! I'm gonna build <this>! Okay. Now everybody tell me what it needs! Oh, and who wants to help? I don't have any code, so a lot of help would be great!"

On the other hand, Subversion is a project funded by Collab.Net. It is being led by Karl Fogel, one of the original CVS developers. A complete design is there, and initial coding is in the works. There are several dedicated developers (Karl and several others) producing the code, rather than somebody's "wish" to see something better.

Seriously. By all means, Perverse and other systems can and should be built. Exploration of ideas through multiple development streams. That is what Open Source is all about. But I'm going to place my bet on Karl and friends.

[ disclaimer: yes, I'm also working on Subversion. But I'm only doing the network coding. Karl, Ben, and Jim are defining what Subversion really "is". ]

Subversion vs Others, posted 14 Aug 2000 at 16:03 UTC by eivind » (Master)

My personal view, after having worked on version control system design for use in Open Source, is that Subversion is a waste of time. It is an incremental improvement over CVS, but it avoids the thorny and truly important issue: Distributed Branching.

This is very difficult to retrofit properly onto a design. Not spending the extra time up front to support it means that Subversion will IMO be, at best, a short-term solution that has to be replaced shortly (possibly with another version control system also named 'Subversion' in order to keep market share). At worst, it will be actively destructive for the Open Source community, by being a block on distributed branching and raising the bar for replacement systems in a way that makes them later in coming.

Eivind.

Distributed branching is not obviously a good thing..., posted 14 Aug 2000 at 17:16 UTC by danwang » (Master)

If you just want to keep a bunch of geographically distributed developers working on the same project with tight coordination, distributed branching does not seem so critical...

In other cases, I can see where it would be useful, but I can also see how it would make things worse by making "code forking" easier...

distributed repositories, easier branching, better add/remove/rename support, posted 14 Aug 2000 at 17:42 UTC by sej » (Master)

It would be nice to have distributed repositories that can support asynchronous programming teams. One team should be able to selectively incorporate changes from a remote repository on a commit-by-commit basis.

I've never tried cvs branching, but I saw the cheat sheet someone wrote up for themselves on all the steps involved. I'd like branching to work much like asynchronous distributed repositories, with support for partial merges of two branches. Sometimes two branches (or two repositories) will never synch up completely, because one branch (or repository) contains custom modifications that don't belong elsewhere. Direct support for this kind of usage would be nice. You can do it today with patch, but that takes lots of discipline, and is prone to error.
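The selective, commit-by-commit incorporation described above can be sketched roughly as follows. This is only an illustration under assumed names: `Change`, `Repository`, `incorporate` and `decline` are hypothetical and do not correspond to any existing tool. The `declined` set models changes that will never be merged because they are custom modifications belonging to the other side.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Change:
    """One commit in a repository (hypothetical model, not a real VCS format)."""
    change_id: str
    description: str


@dataclass
class Repository:
    changes: list = field(default_factory=list)   # changes in commit order
    declined: set = field(default_factory=set)    # remote change ids we never want

    def known_ids(self):
        return {c.change_id for c in self.changes}

    def pending_from(self, remote):
        """Remote changes we have neither incorporated nor declined."""
        known = self.known_ids()
        return [c for c in remote.changes
                if c.change_id not in known and c.change_id not in self.declined]

    def incorporate(self, remote, wanted_ids):
        """Take only the selected remote changes, in the remote's commit order."""
        picked = [c for c in self.pending_from(remote) if c.change_id in wanted_ids]
        self.changes.extend(picked)
        return picked

    def decline(self, change_id):
        """Mark a remote change as a customisation that does not belong here."""
        self.declined.add(change_id)
```

Because the repository remembers both what it took and what it declined, the two sides never have to synch up completely, and a later `pending_from` call lists only the changes nobody has decided about yet.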

Finally, I've always found that any attempt to change the MANIFEST of a source distribution (add new files, delete old files, rename or migrate files) always required extra patience with cvs. You have to delete files first, then remember what they were to invoke "cvs remove". Is there any rename capability? How about moving a directory?

All praise to Brian Berliner (and Larry Wall and whoever wrote rcs) for making me fairly happy with my diff management tools. As well as Brian Hogencamp for the cvs wrapper scripts I use every day. Are you the next in line?

Distributed branching, posted 14 Aug 2000 at 17:45 UTC by eivind » (Master)

It is not critical for a small project with few developers and tight coordination. OTOH, for that kind of project, almost anything will work.

For a large project, distributed branching is good in that it makes a much larger continuum of forks possible: forks can live for a while before being folded back into the old project, or a fork and its parent can share the minor changes (bugfixes and minor, clear improvements).

Distributed branching also has another major advantage: easier scalability, and "routing around" bad management. In a world where development is done in local branches and merged into "larger" (more used) branches, whoever does the best integration wins (technically).

Forks are bad due to wasted effort; a large part of the reason for the wasted effort is the lack of good enough tools for handling forks. Small forks and rejoins can be very healthy for a project (e.g., the gcc -> egcs -> gcc development).

Eivind.

CVS wish list, posted 14 Aug 2000 at 18:24 UTC by imp » (Master)

OK. Here's my CVS wishlist. I agree with other posters that we should do an incremental change to CVS.

First, the import/vendor branch mechanism is weak. Once a file leaves the vendor branch, it can never return. Well, that's not entirely true: you can set the default branch back to the vendor branch, but then you lose the ability to get the right file with the -D option. So I'd like support for tracking the date/time when default branches changed, so that you could make this work.

Second, CVS is too slow. It takes forever to cruise through the FreeBSD cvs repo. A better way to manage the repository is needed. Perforce is nice because it KNOWS all the changes to the repo relative to the current working tree. cvsup is quite a bit faster than CVS because it has done many good optimizations.

There needs to be something like Perforce change numbers where each change collects a number of files. This will allow bonehead mistakes to be detected and avoided more easily (eg, forgetting to commit a file, maybe in a different directory).
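As a rough sketch of the change-number idea, the following groups several file edits under one number and applies them together or not at all. `Depot`, `Changeset` and the `required` argument are invented for illustration; in particular, `required` is just one assumed way of catching a forgotten file and is not how Perforce itself works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Changeset:
    number: int
    message: str
    files: dict  # path -> new content


class Depot:
    """In-memory stand-in for a repository (hypothetical, for illustration)."""

    def __init__(self):
        self.files = {}     # path -> current content
        self.history = []   # committed changesets, oldest first

    def submit(self, message, edits, required=()):
        """Commit all edits under one change number, or none of them."""
        missing = [p for p in required if p not in edits]
        if missing:
            # Reject the whole change before touching any file.
            raise ValueError(f"changeset is missing files: {missing}")
        change = Changeset(len(self.history) + 1, message, dict(edits))
        self.files.update(change.files)
        self.history.append(change)
        return change.number
```

With one number per change, "which files went in together?" becomes a lookup in `history` rather than a guess based on timestamps and log messages.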

Branching and merging need to be easier in CVS. If you've ever done lots of branching in both CVS and Perforce, you know how much more painful it is to do on CVS than Perforce. If branches were easier, it would be easier to work collaboratively on experimental things in a distributed manner (eg NEWCARD would be a branch, not in the mainline).

One could argue for days about the explicit lock vs. catch-all change model differences between CVS and Perforce. I love and hate both of them.

One thing that needs to be considered is remote distribution as well as remote mirroring. Perforce makes the mirroring hard.

look at teamware, posted 14 Aug 2000 at 18:37 UTC by sergent » (Journeyer)

Look at TeamWare from Sun. This is the system that BitKeeper was (mostly) based on. You should be able to use their try and buy thing to get a 30 day license for it if you want to experiment to see how it works... you can certainly read the docs on docs.sun.com.

Its primary deficiency, in my mind, is the lack of support for anything other than filesystem mounts to talk to remote workspaces. Being able to use something like TeamWare but with a protocol in between would be ideal in my mind.

TeamWare gets the "distributed branching" thing very, very right, in my mind... it's just that the distribution part only works properly with NFS over local networks.

Obscure sublanguages..., posted 14 Aug 2000 at 18:39 UTC by Uruk » (Apprentice)

This kind of reminds me of the discussion on autoconf and automake...

What I want to do as a developer is develop. What I don't want to do as a developer is spend a lot of time wrapped up in the picky details of developing that don't have anything to do with the functionality of the program. What I mean by these things are autoconf, automake, worrying about portability, using cvs, pushing and pulling files all over the internet, etc. I just want to code. After you learn the UNIX shell, sed, awk, perl, C, java, python, automake, autoconf, and numerous other languages and sublanguages, it's not fun or a challenge to keep learning obscure syntax that is only good for one thing, it's pretty friggin' annoying.

Granted, some of this is unavoidable. I'm not going to stand up and say that autoconf and automake are great, but they are a necessary evil, because portability is still an issue. What I don't like is how every single program, as it increases in flexibility, ends up tacking on its own extension language and/or scripting facility. The myriad of different things a programmer has to know just to write a program that can successfully write "Hello World" over a telnet connection and compile on more than about 3 OS's is really ridiculous.

So, in my opinion, what do I want in a versioning system? I want it to be controllable through a language that is already known (something like perl or guile would be good, but for god's sake, don't create another language or scripting facility), and I want the tool to get out of my way. A versioning system has nothing to do with the functionality of the program it is managing and as such should take as little time to administer, use, install, update, troubleshoot, etc. as possible.

CVS is a kick ass system, and I actually read the CVS manual (I think Per Cederqvist wrote it, or?) and learned it, but I would have much preferred something simpler. Because really, you can't use CVS without also using diff, patch, and many other tools. Sure UNIX is cumulative, but the learning curve (not in difficulty, but merely in time) is way too steep for a tool that doesn't actually contribute to the program but only the organization of the source code.

All of that ranting said, I don't know how to build a program that would fit my own needs, but then again, I don't have to, because I'm not proposing to implement a versioning system. :) What I do know is that if it uses some sort of language or scripting, it should use something widely known, it should stay as simple as possible, and using basic UNIX shell knowledge, the user should be able to sit down in front of it and pretty much figure it out. You'll know you failed if you end up with a 200 page manual, 8,000 features, a nifty scripting language that when used feels like trying to drive a nail with a wrench, and 10 O'Reilly books on your program. Because if it warrants that much work and documentation in order to use it, it probably isn't done well.

I'm not down on complex programs, after all, hell I'm an emacs bigot. I'm just against losing a hefty portion of time that would otherwise be productive learning tools that are merely a means to an end. A versioning system is not an end in itself (like, say, nethack) it's a tool that I want to use and put away. Please make sure that your version of this tool isn't like Microsoft's paperclip. :)

subversion clarifications, posted 14 Aug 2000 at 20:00 UTC by jrobbins » (Master)

Two quick clarifications on the subversion project:

First, "funded by CollabNet" is too strong. It is being hosted at subversion.tigris.org and CollabNet is bearing the hosting costs. Also, some of the core people working on it are CollabNet employees. But then again, some of the most important people working on it are employed elsewhere. Actually, there is probably a good overlap between core subversion contributors and advogato users :)

Second, distributed branching has been raised as a desirable feature from the start. The decision was made to focus on "cvs without the bugs" first, simply to get something up and working. Work on the design has explicitly tried to keep options open for things like distributed branches.

Also, I invite interested developers to participate in all the projects on tigris.org. The "twin goals" of the site are to develop a project hosting infrastructure, and to host open source software engineering tool projects. The contributions of the more senior open source developers of advogato are very welcome.

Re: CVS wishlist, posted 14 Aug 2000 at 20:28 UTC by rillian » (Master)

I'd also rather people just work on improving CVS.

My wishlist is more specific than imp's. To wit:

  1. cvs 'add' and 'remove' shouldn't require write access to the repository. The changes should be kept within the working tree until commit time. You should also be able to make a cvs 'diff' containing added/removed files. This would make it much easier to submit/merge patches by/from anonymous users.
  2. There should be a cvs 'move' or 'rename'.
  3. Directories should be treated on equal footing with files. This combined with the 'move' functionality will finally make it possible to re-arrange your tree without doing manual surgery on the repository (and consequently losing the associated revision history). Apparently this is difficult to do, since I understand there have been many rejected proposals in this vein. I'd start by trying to extend the attic functionality to handle directories and the 'move' command simultaneously. Not that I've tried. :)
  4. The user interface isn't very obvious; but I don't see much that can be done without breaking everyone's scripts other than simplifying things for novice users. The above will let us get rid of -d. Would be nice if we could safely get rid of -P on updates as well.

If someone were to tackle the above, I'd be grateful.

PRCS, posted 14 Aug 2000 at 21:04 UTC by matt » (Journeyer)

I was very intrigued by PRCS; especially its Xdelta filesystem. I very much dislike the fact that it relies on Berkeley DB 3, which has an icky license. Actually, I'm most intrigued by Xdelta; I can see many uses for it beyond source code control. (Think wiki with WayBackMode yet not relying on a source code version control system, which is, IMO, too beefy for its needs.)

I have to agree with the apparent consensus here that we don't need yet another version control system project. There are plenty of half-finished ones out there already.

Keep it simple, posted 14 Aug 2000 at 23:32 UTC by philhunt » (Journeyer)

The thing I most dislike about CVS is it is too complex to learn.

I suggest that your Perverse system should have both a command line and a GUI interface. The command line should be written first, and the GUI version should run command-line scripts to operate.

Firstly, you should write the complete documentation needed to use the command line interface; this should include the concepts involved, and the actual commands. If this comes to more than about 3 pages of A4, it is too complicated. Review this on the web until you get a simple, coherent, consistent user interface.

Some collected replies, posted 15 Aug 2000 at 00:05 UTC by Simon » (Master)

bleh. go for Subversion.
Yeah, who needs innovation anyway. You've got Minix, what are you complaining about?

Yet Another Version Control System.
Yep, the point of asking here is to find out what would stop it becoming yet another VCS.

And echoes of a newbie...
We'll let posterity decide that one... :)

Now everybody tell me what it needs!
Hmm. A lot of Open Source software is written to scratch an itch; this is all well and good, and a lot of really good software was written this way. Unfortunately, there's a lot of really bad software written this way too, because developers have been developing in a vacuum scratching an itch that they and nobody else have. If I'm going to scratch other people's itches, I don't think it's unreasonable for me to ask people what they are...

Oh, and who wants to help?
No, sir, I did not say that, and with good reason. Adding manpower to a late software program just makes it later. I'm more than happy to (in fact, I'd rather) write this myself, but then I'm grateful for the small team of very competent developers who have volunteered to help with this without me calling. But no more!

I don't have any code
No, sir, this is not true. I do not have very much code, but it's enough to self-host the project. I'm one of these rare types who doesn't believe in releasing a project while it's incomplete.

But I'm going to place my bet on Karl and friends.
I'm happy to bet that what they do will be better for some people, but not all of them.

Distributed branching does not seem so critical...
I'd say it was critical because I've come across several scenarios where you have sites that are separate and cannot talk to each other over the network. rsync and cvsup will give you read-only slave sites, but no way of getting slave changes back into the master repository.

Is there any rename capability? How about moving a directory?
CVS will bitch at you forever more if you try and do this; Perforce will bitch at you once. Why haven't people made this easy? Ho hum. :)

First, the import/vendor branch mechanism is weak.
I think the vendor model is not merely weak but actively broken.

There needs to be something like Perforce change numbers where each change collects a number of files.
Collecting multiple file edits into a single atomic change is a necessity.

I suggest that your Perverse system should have both a command line and a GUI interface.
Yes, absolutely; I'm concentrating on exporting a sensible API so people can write their own clients and reporting systems, and so have a system that fits around their existing SCCS/RCS/whatever shell hacks.

Firstly, you should write the complete documentation needed to use the command line interface; this should include the concepts involved, and the actual commands. If this comes to more than about 3 pages of A4, it is too complicated.
Oh, it'll be longer than that not because it's complicated but because I like writing. A cheat sheet should fit on a side of A4; for the beginnings of a user manual for the interface I'm planning, please see the Quickstart manual.

Think Robustness, posted 15 Aug 2000 at 01:05 UTC by dej » (Journeyer)

To anyone working on version control: consider robustness.

Can a user (inadvertently or otherwise) corrupt the repository through an otherwise legal set of commands? If the power fails halfway through a commit, will the repository be usable when the machine comes back up? What happens if the repository disk becomes full?
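One common way to survive the power-failure and disk-full cases for a single repository file is to write a temporary file next to the target and rename it into place. The sketch below assumes a POSIX filesystem, where the rename is atomic; `atomic_write` is a name made up for this example, not part of any existing VCS.

```python
import os
import tempfile


def atomic_write(path, data: bytes):
    """Replace `path` with `data` so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem) as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())      # push the bytes to disk before renaming
        os.replace(tmp_path, path)      # atomic rename on POSIX filesystems
    except BaseException:
        # Disk full, interrupted, etc.: discard the partial file, keep the old one.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

This only protects one file at a time. A commit that touches many files would need something more on top, such as a journal or a single pointer that is switched to the new state last.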

CVS falls down in these areas. For example:

  • mkdir dir2; cvs add dir2
  • cp -r dir1/* dir2
  • (change files in dir2)
  • cvs commit dir2

You have likely screwed yourself here, since the "CVS" directory contains metadata, and it was copied into "dir2". One of the metadata files is a pointer to the repository directory. When "dir2" is committed, the contents of "dir1" in the repository will be overwritten. A new user may not realize this, and even experienced users are prone to oversights at times. I have done this, and a co-worker with 20+ years experience has done this. This problem is also messy to correct.
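A tool could at least warn about this situation. The sketch below walks a working tree and flags directories whose CVS/Repository file names a different directory. It is a heuristic under a stated assumption: that each working directory is named after the last component of the repository path recorded for it, which is not true of every checkout. `suspicious_cvs_dirs` is a hypothetical helper, not part of CVS.

```python
import os


def suspicious_cvs_dirs(top):
    """Yield (directory, recorded_path) where CVS/Repository names a different directory.

    Heuristic only: assumes each working directory is named after the last
    component of the repository path recorded in its CVS/Repository file.
    """
    for dirpath, dirnames, _ in os.walk(top):
        if "CVS" in dirnames:
            dirnames.remove("CVS")          # do not descend into metadata itself
            repo_file = os.path.join(dirpath, "CVS", "Repository")
            if os.path.isfile(repo_file):
                with open(repo_file) as f:
                    recorded = f.read().strip()
                expected = os.path.basename(os.path.abspath(dirpath))
                if os.path.basename(recorded.rstrip("/")) != expected:
                    yield dirpath, recorded
```

Run before a commit, a check like this would report "dir2" as pointing at "dir1" in the example above, instead of letting the commit overwrite the wrong repository directory.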

At my previous job, I wrote a version control system for Mentor Graphics IC design data. It was designed to be used by the typical absent-minded engineer. The software was setuid root. It gave up root privileges while keeping the ability to switch between the user's UID and a project-specific database UID. All metadata was kept in the repository, owned by the database UID. It was simply not possible for users to inadvertently goof up. Coupled with an atomic commit scheme in the repository, the system was almost bulletproof.

I have tried to introduce CVS to new users. I start off with the basics: checkout and commit. I don't get into branching or anything else. Inevitably, they try something that either damages the repository or "poisons" their work directory requiring a knowledgeable user to fix.

Please don't have history repeat itself. And never underestimate the stupidity or resourcefulness of your users. (My version control system checked that the invoking user owns his home directory and will refuse to continue if it is owned by someone else. This check has been triggered through production use. :-)

Version Control Wish List, posted 15 Aug 2000 at 03:42 UTC by chromatic » (Master)

The one thing I've been missing in my seven or eight months of CVS experience is the ability to commit a bunch of files, but give them individual messages. It would also be nice if it built my changelog out of those commits, but now I'm just talking crazy talk. (Why not use a regexp or a list of files to check out? Auto-tar and gzip on export? Ack, too many ideas!)

It would be kinda nice to be able to review changes checked in by contributors before accepting them, but I guess there's always the possibility to revert to a prior version.

Writing a version control system is hard, posted 15 Aug 2000 at 06:40 UTC by Bram » (Master)

A word to the wise - CVS is lacking in features because writing a version control system is very, very difficult.

To begin with, there are a million fiddly weird cases you have to deal with. What do you do when someone removes a file and someone else modifies it? What if someone deletes a directory when someone else adds a file to it? How do you do merges? All of these are very deep and tricky issues to deal with.

Less obvious from a developer standpoint, but glaringly obvious to the end user, is that a version control system must be very, very stable. A single corrupted repository will get people to swear off a system forever, and you *cannot* break backwards compatibility with people's old scripts.

We're not talking about simple 'try to not have so many bugs' levels of stability here, I mean setting up the version control system so that it runs the entire regression testing suite every time you try to commit and rejects you if it fails.
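The "run the whole regression suite on every commit" gate can be sketched in a few lines. The names `guarded_commit` and `CommitRejected` are invented for this example, and the test command is whatever the project uses; nothing here is tied to a particular version control system.

```python
import subprocess


class CommitRejected(Exception):
    """Raised when the regression suite fails and the commit must not proceed."""


def guarded_commit(test_command, do_commit):
    """Run the regression suite; call do_commit() only if it exits with status 0."""
    result = subprocess.run(test_command)
    if result.returncode != 0:
        raise CommitRejected(f"tests failed with exit status {result.returncode}")
    return do_commit()
```

The commit itself is passed in as a callable so that the gate stays independent of how the repository is actually written.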

Writing a version control system is a project I'd reserve for people with a lot of experience writing very tricky system code. It isn't a 'fun' project unless you honestly enjoy working on difficult aggravating code just because it's difficult aggravating code.

Make the interface process-driven, posted 16 Aug 2000 at 08:01 UTC by cdegroot » (Master)

We evaluated ClearCase, and decided to stick with CVS. The only thing that a top-notch tool like ClearCase can do that CVS can't, is cleanly handling file moves. That must be fixed in a CVS replacement, because moves and renames are important when you start to do a lot of refactoring.

I am writing a little shell around CVS, called Trug (trug.sourceforge.net). It is process-driven, in that the developer says: "I'm now going to start to work on bug #123"; "Please backup my work"; "I'm finished with bug #123"; "I now want to integrate the work of bug #123 with the main branch"; etcetera. Trug takes care of the branching and merging commands. I think that is the way version control should be - support a process and support it in a very straightforward manner. Trug will support one process, the one I think is useful for our company. A more general tool may want to offer some "process builder's toolkit".

What's wrong with CVS?, posted 18 Aug 2000 at 22:10 UTC by aaronl » (Master)

I hear a lot of people complain about the deficiencies of CVS. I use CVS a lot and I love it. What exactly is wrong with CVS in its current state?

Wrong with CVS, posted 20 Aug 2000 at 17:54 UTC by eivind » (Master)

In answer to aaronl's question of what's wrong with CVS, I've tried to set up a list of the most important points. This list is by no means comprehensive; there are a whole bunch of minor things that can be improved, and there may be issues that are major to some projects that I have not covered. And probably none of these issues are major for all projects; they are just major for some, and subtly influence how the projects run.

  • Lack of versioned directory structure.
    The direct mapping of the client view to the repository layout[1] blocks reorganization of software projects. This is especially important for medium size projects that went into CVS at "birth", as they usually haven't had a really careful engineering of the directory layout (or the requirements have changed along the way). Small projects can live with any layout; large projects have usually gone through this and ended up with a fairly workable, stable layout (though I still see a bunch of problems in FreeBSD's source layout that we would have fixed if it were cheap).

  • Lack of distributed branching.
    This is seldom a problem for small projects, and usually a problem for large projects, especially combined with CVS' access control model. This flaw gives people with commit access to a project a much more comfortable working environment than people outside it. That draws a much harder line between developers, so efficient development requires giving out commit privileges - which in turn creates the need to do social integration and risk evaluation for those developers. This tends to limit CVS-using projects to a single, close community, which stops scaling.

  • Branches are expensive and a kludge
    Branches in CVS are based on tags and perverted version numbers. Creating a branch requires touching all files involved in the branch (which for a full branch is all files in the repository). They are fairly slow, as the repository needs to walk differences back from the head revision and up the branch to get the newest branch revision. They also are not safe - the tag for a branch can be moved with 'cvs tag -F', which will break work on the branch.

  • Repeated merges are not handled automatically
    When you merge between branches, CVS keeps no record of your merge. This means that repeated merges between branches must rely on external metadata (which can be handled by separate tools, but usually is just based in the developer's head.) This leads to a lot of pain in projects where branches are heavily used.

  • CVS' less-than-brilliant access control model
    CVS does not have a built-in detailed access control model, but requires[2] the developers to get shell access on the machine. There are some kludges out to get more fine-grained access control (e.g., the CVSROOT/avail and CVSROOT/access stuff FreeBSD uses), but these are not effective against malicious users. This means you need rather good screening before giving commit access even to limited parts of your repository, which again makes it hard to scale projects.

  • CVS is fairly slow
    There are a bunch of reasons for this, but the point probably speaks for itself.

  • Metadata is not versioned
    If you move a tag or rename a file, the information on how it was is *gone*. Not good. You can 'cvs remove' and copy a file, but this also messes up tags.

  • Commits are not atomic
    This makes it a pain if your link disappears during a commit or similar.

  • There is no integration with job tracking
    Use of external job tracking systems requires the developer to remember to close jobs after commit, and does not make the change atomic.
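To make the "repeated merges" point in the list above concrete: the missing piece is a record of how far each branch has already been merged into each other branch. A minimal sketch follows, assuming (purely for illustration) that revisions on a branch are numbered 1, 2, 3, ... and that merges always take a contiguous range; `MergeLog` is a hypothetical name.

```python
class MergeLog:
    """Remembers, per (source, target) branch pair, the last source revision merged."""

    def __init__(self):
        self._merged_up_to = {}   # (source, target) -> revision number

    def revisions_to_merge(self, source, target, source_head):
        """Return the source revisions not yet merged into target."""
        last = self._merged_up_to.get((source, target), 0)
        return list(range(last + 1, source_head + 1))

    def record_merge(self, source, target, source_head):
        """Note that everything up to source_head has now been merged."""
        self._merged_up_to[(source, target)] = source_head
```

With this record kept in the repository rather than in the developer's head, a second merge from the same branch only has to consider the revisions added since the first one.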

[1] For anything beyond a toy project, using modules to simulate renames is likely to lead to disaster. It may be possible to use modules to get renames to work with a wrapper, but this has a series of problems, including but not limited to the fact that modules are not branchable.

[2] Setting 'cvs server' as the shell for a user does not avoid shell access; CVS is not made to be trusted that way. You can get some protection through chroots, but getting detailed access control for your repository in the face of potentially hostile developers is not really feasible - and even trying to do so is so much work that projects do not do it, which is the main point. "Impossible" and "So inconvenient that people do not do it when they need it" are effectively equivalent for tools.

Eivind.

Two more CVS limitations, posted 22 Aug 2000 at 17:58 UTC by nelsonminar » (Master)

Two more CVS limitations for the list:

  • No changesets
    There is no way to check in changes to a group of files and mark them as one atomic change. This is crucial when modifying projects of any size; the .h and .c file have to go together, you know?
  • No file move/rename
    CVS (and Perforce!) don't cleanly support renaming or moving files. Not such a huge deal in a C project, but a real problem for Java where the filename is the class name.

there's more to source control than just the source..., posted 10 Sep 2000 at 05:39 UTC by dannys42 » (Apprentice)

Okay, here's some random thoughts...

One thing that seems to come up a lot when I use CVS is being able to control other aspects of files. As other people have mentioned, being able to move files and directory structures around, and having the control system understand that. You should be able to move directories around, modify files, and all that in one commit. And you should be able to check out the project from before any of those moves, modifies, and whatever, and the control system will put it all in the right place.
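One way to picture this is a repository that versions the layout itself: every commit stores a complete map from path to a stable file identity, so a move changes the path but not the identity. The sketch below is an in-memory toy under that assumption; `VersionedTree` and its method names are hypothetical.

```python
class VersionedTree:
    """Each commit stores a full path -> file-id map, so layout is versioned too."""

    def __init__(self):
        self._snapshots = [{}]     # revision 0 is the empty tree
        self._next_id = 1

    def commit(self, adds=(), moves=(), removes=()):
        """Apply adds, moves (old, new) and removes as ONE new revision."""
        tree = dict(self._snapshots[-1])   # work on a copy; publish only at the end
        for path in adds:
            tree[path] = self._next_id
            self._next_id += 1
        for old, new in moves:
            # Move a file, or every file under a directory prefix.
            for path in [p for p in tree if p == old or p.startswith(old + "/")]:
                tree[new + path[len(old):]] = tree.pop(path)
        for path in removes:
            del tree[path]
        self._snapshots.append(tree)
        return len(self._snapshots) - 1

    def checkout(self, revision):
        """Return the layout exactly as it was at the given revision."""
        return dict(self._snapshots[revision])
```

Checking out a revision from before a move simply returns the older map, so the files land where they used to be, while the file ids let history follow a file across renames.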

Another thing to keep handy is being able to understand file permissions, user/group ownerships, links, named pipes, etc. etc. The file permissions and such, obviously would have to be handled as some sort of "checkout final" (as opposed to the usual "checkout working"). But as with all things, this should be optional.

Finally, the control system should allow for easy integration with a "policy" system. (perhaps the previous idea of file permissions/ownerships and such fall under the policy control system). For example, a project should be able to say only certain users can add files to a directory, or only certain users can modify these files, etc. etc. But these issues are project policies, not source control.
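A policy layer of this kind could be kept separate from the version control core and consulted before each operation. The sketch below is one assumed design: rules attach to directories, the nearest enclosing directory with a rule decides, and the absence of any rule means the action is allowed. `Policy`, `allow` and `permits` are hypothetical names.

```python
import posixpath


class Policy:
    """Maps directory prefixes to the users allowed to perform an action there."""

    def __init__(self):
        self._rules = {}   # (action, directory) -> set of user names

    def allow(self, action, directory, users):
        self._rules.setdefault((action, directory.rstrip("/")), set()).update(users)

    def permits(self, user, action, path):
        """Use the rule on the nearest enclosing directory; no rule means allowed."""
        directory = posixpath.dirname(path)
        while True:
            users = self._rules.get((action, directory))
            if users is not None:
                return user in users
            if directory in ("", "/"):
                return True
            directory = posixpath.dirname(directory)
```

For example, after `allow("add", "src/sys", {"alice"})` only alice may add files anywhere below src/sys, while modifications there and additions elsewhere stay unrestricted.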

And as with all open source projects, the program should have a good API, and allow for loadable modules, to leverage many people writing small components.
