5 Dec 2012 danstowell

How to remove big old files from git history

I've been storing a lot of my files in a private git repository for a long time now. Back when I started my PhD, I threw all kinds of things into it, including PDFs of handy slides, TIFF images I generated from data, journal-article PDFs... ugh. Mainly a lot of big bloaty files that I didn't really need to be long-term-archived (because I had already archived the nice small files that generated them - scripts, data tables, tex files).

So now, many years on, I know FOR SURE that I don't need any trace of those darn PDFs in my archive, and I want to delete them from the git history. Not just delete them from the current version - that's easy ("git rm") - but delete them from history, so that my git repository can be nice and compact and easy to take around with me.

NOTE: Deleting things from the history is a very tricky operation! ALL of your commit IDs get changed, and if you're sharing the repos with anyone you're quite likely to muck them up. Don't do it casually!

But how can you search your git history for big files, inspect them, and then choose whether to dump them or not? There's a stackoverflow question about this exact issue, and I used a script from one of the answers, but to be honest it didn't get me very far. It was able to give me the names of many big files, but when I constructed a simple "git-filter-branch" command based on those filenames, it chugged through, rewriting history, and then failed to give me any helpful size difference. It's quite possible that it failed because of things like files moving location over time, and therefore not getting 100% deleted from the history.

Luckily, Roberto is a better git magician than I am, and he happened to be thinking about a similar issue. Through his git-and-shell skills I got my repository down to 60% of its previous size, and cleared out all those annoying PDFs. Roberto's tips came in some github gists and tweets - so I'm going to copy-and-link them here for posterity...

  1. Make a backup of your repository somewhere.
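
     For example, something like this (a minimal sketch - the paths are whatever suits you; a mirror clone keeps every branch and tag):

    git clone --mirror myrepo myrepo-backup.git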

  2. Create a ramdisk on which to do the rewriting - this makes it go MUCH faster, since rewriting can be a slow process. (For me it reduced the job from two days to two hours.)

    mkdir repo-in-ram
    # pick a tmpfs size comfortably bigger than your repo
    sudo mount -t tmpfs -o size=2048M tmpfs repo-in-ram
    cp -r myrepo repo-in-ram/
    cd repo-in-ram/myrepo/
    
  3. This command gets a list of all blobs ever contained in your repo, along with their associated filenames (quite slow), so that we can check the filenames later. (It reads the packfiles, so if your repo still has loose objects you may need to run "git gc" first to get everything packed.)

    git verify-pack -v .git/objects/pack/*.idx | grep tree | cut -c1-40 | xargs -n1 -iX sh -c "git ls-tree X | cut -c8- | grep ^blob | cut -c6-" | sort | uniq > blobfilenames.txt
    
  4. This command gets the top 500 biggest blobs in your repo, ordered by the size they occupy when compressed in the packfile. Each output line is a blob SHA followed by its packed size:

    git verify-pack -v .git/objects/pack/*.idx | grep blob | cut -c1-40,48- | cut -d' ' -f1,3 | sort -n -r --key 2 | head -500 > top-500-biggest-blobs.txt
    
  5. Go through that "top-500-biggest-blobs.txt" file and inspect the filenames - the file itself only holds blob SHAs and sizes, so match the SHAs against the filename list from step 3. Are there any you want to keep? If so DELETE the line - this file is going to be used as the list of things that will get deleted. What I actually did was use LibreOffice Calc to cross-tabulate the filenames against the blob IDs, but you can do it from the shell too (see the sketch below).
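
     If you'd rather stay in the shell, here's a rough equivalent of that cross-tabulation (a sketch, assuming the two files from steps 3 and 4; the output filename is just illustrative):

    # print each candidate blob's SHA, packed size, and any filenames it has had
    while read sha size ; do
        names=$(grep "^$sha" blobfilenames.txt | cut -c42- | sort -u | tr '\n' ' ')
        echo "$sha $size $names"
    done < top-500-biggest-blobs.txt > top-500-with-names.txt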

  6. Create this file somewhere, named "replace-with-sha.sh", and make it executable. For each blob listed in the file you give it, it deletes the matching file from the commit being rewritten, leaving a tiny placeholder that records the blob's SHA:

    #!/usr/bin/env sh
    # $1 = file listing the blobs to delete (blob SHA first on each line)
    # $2 = the commit being rewritten (git-filter-branch passes $GIT_COMMIT)
    TREEDATA=$(git ls-tree -r $2 | grep ^.......blob | cut -c13-)
    while IFS= read -r line ; do
        sha=$(echo "$line" | cut -d' ' -f1)  # keep just the SHA; ignore the size column
        echo "$TREEDATA" | grep "^$sha" | cut -c42- | xargs -n1 -iX sh -c "echo $sha > 'X.REMOVED.sha' && rm 'X'" &
    done < $1
    wait
    
  7. Now we're ready for the big filter. This will invoke git-filter-branch, using the above script to trim down every single commit in your repos. Note the absolute paths (adjust them to wherever you put the script and the list) - the tree filter runs in a temporary working directory, so relative paths wouldn't resolve:

    git filter-branch --tree-filter '/home/dan/stuff/replace-with-sha.sh /home/dan/stuff/top-500-biggest-blobs.txt $GIT_COMMIT' -- --all
    
  8. (Two hours later...) Did it work? Check that nothing is crazy screwed up.
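
     A few quick sanity checks (generic git commands, nothing specific to this recipe) - note that the on-disk size won't actually shrink until the cleanup in the next step:

    git log --stat | less    # spot-check that recent commits look right
    git count-objects -v     # object counts and pack sizes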

  9. Git probably still has some of the old data hanging around from before the rewrite. I'm not sure all three of these lines are needed, but certainly the last one:

    rm -rf .git/refs/original/
    git reflog expire --expire=now --all
    git gc --prune=now
    

After I'd run that last "gc" line, that was the point at which I could tell that I'd successfully got the disk space right down.

If everything's OK at this point, you can copy the repos from the ramdisk back to replace the one in its official location.
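
For example (paths are illustrative - and keep the old copy around until you're confident):

    cd ../..                 # step back out of the ramdisk so it can be unmounted
    mv /home/dan/stuff/myrepo /home/dan/stuff/myrepo-old
    cp -r repo-in-ram/myrepo /home/dan/stuff/
    sudo umount repo-in-ram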

Now, when you next pull or push that repository, please be careful. You might need to rebase your latest work on top of the rewritten history.
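
For instance, in another clone that has a commit or two made since the rewrite, something along these lines should bring it across (a sketch - it assumes your work sits on a branch named master):

    git fetch origin
    git rebase origin/master    # replay your local commits onto the rewritten history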

Syndicated 2012-12-05 05:25:35 (Updated 2012-12-05 05:39:09) from Dan Stowell
