Thursday, November 21, 2013

Thoughts on Rebase-Based Git Workflows - Rebase Considered Harmful

Personally, I'm not a fan of the "rebase" tool in Git workflows. I tried it when I first started using Git over 2 years ago, abandoned it not long afterwards, and have finally had to (grudgingly) start using it again for Blender's repos. IMO, it ends up being more of a hassle than a benefit.

In all my other repos, I've settled on using a GitFlow-inspired workflow, which I think works better in the long run for distributed/parallel development streams with small granular commits - something we're slowly migrating towards. This post discusses some of my concerns with rebase-based workflows, and the kinds of issues that arise from using them.

The whole point of rebase is inherently to make it seem that everything occurred in an orderly, linear chronology (à la the old centralised version control systems - CVS and SVN). That, and keeping the bunch of commits an author may have made since the last push to the public repo packed together so that they can be found more easily.

Sure, there are times when this can be somewhat helpful: for example, if two developers are working on two independent threads of work in the same branch (e.g. bug fixes in master). Developer A pushes 1 or more commits before Developer B has finished her 2-part fix. As a result, when Developer B pulls in the changes from the public repo prior to pushing her changes, Git ends up "zip-merging" the two lines of development together, interleaving Developer A's commit in between the two commits from Developer B.

As long as there are no conflicts between the two sets of changes, rebase works fine. However, if part of Developer A's commit overlaps with B's changes, B will have to fix these conflicts half-way through the rebase process. But unlike a merge, where the fixes are recorded in a merge commit, with rebase any fixes that Developer B had to make to get her code patched up in response to changes in the public repo are kind of silently "absorbed" into the commit stream.

One of my gripes with rebase-based workflows is that they try to erase the fact that several branches may have co-evolved - maybe divergently, but still, they were developed at the same time and may have needed some cross-pollination at some point to patch over conflicts. In other words, using rebase destroys the context surrounding the relative evolution histories of the two divergent branches. This includes information such as when they started being worked on and/or diverging, the relative order of changes made on either side and hence why they ended up conflicting, when merges occurred and what happened with conflicts needing to be resolved (i.e. how those conflicts were resolved).

That last point is something that we need to be very careful about, as that is precisely the place where errors end up creeping into the code. The fact of the matter is that quite often, developers end up merging the code when they're close to wanting to push back to the server (or perhaps at the start of a work session). In such cases, the developer in question may actually be in a slight hurry to get the merge over with - either to tend to the next thing, or just out of excitement to get back to whatever they were planning on working on (of course, this doesn't cover every case, but it's also not likely to be a rare event). Their focus is elsewhere, and any distractions during this routine process (much like taking out the trash or letting a horde of updates get installed) aren't really welcome. As a result, if they end up encountering some conflicts, especially in an area they are not entirely familiar with, they may end up just "fudging" some solution together to get that out of the way.

For example, after slogging through a wash of 300 lines of conflicted code (purely because another dev had just performed some form of large-scale automated code cleanup on pieces of code that they'd subsequently rewritten in parallel - for reference, this happens in our codebase a lot, and it has also happened in my own codebases too), they perhaps end up overlooking one of the few cases where some other important change was actually made, resolving the conflict like they did in all 299 prior cases - delete whatever was in the other guy's code! In this case, we may not have any obvious clue that this just happened, nor the development history needed to piece together what the correct merge resolution should have been when we discover the error later (hopefully not too much later, depending on the robustness of the test methodologies in place).

Related to this issue of lost co-evolution history is the issue of timestamps on commits. Unless you're looking at the local repo where the rebased branch lives, it's often not easy to see at a glance when those commits actually happened - thus making it harder to determine the evolution history. Sure, this info is still there somewhere (buried in the "Author" time, as opposed to the "Committer" time that is shown in the log summaries). But a casual inspection of the logs will only reveal that Developer C made some 5 to 20 commits within the space of 1 second. Clearly this can't be too healthy.
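To make the "Author" vs "Committer" time distinction concrete, here is a minimal sketch in a throwaway repo. The branch names, file names, identity and the 2013-11-01 date are all made up for the demo; the only point is that a rebase keeps the author date but stamps the rewritten commit with a fresh committer date.

```shell
#!/bin/sh
# Sketch only: throwaway repo, hypothetical names and dates.
set -e
export GIT_AUTHOR_NAME="Dev C" GIT_AUTHOR_EMAIL="devc@example.com"
export GIT_COMMITTER_NAME="Dev C" GIT_COMMITTER_EMAIL="devc@example.com"

demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main   # pin the branch name for the demo

old="2013-11-01 10:00:00 +0000"

# commit_old FILE MSG: commit FILE with author and committer dates set to $old
commit_old() {
    echo "$2" > "$1"
    git add "$1"
    GIT_AUTHOR_DATE="$old" GIT_COMMITTER_DATE="$old" git commit -q -m "$2"
}

commit_old base.txt "Initial commit"

git checkout -q -b feature
commit_old feature.txt "Feature work"

git checkout -q main
commit_old main.txt "Someone else's work on main"

# Transplant the feature commit onto the new tip of main.
git checkout -q feature
git rebase -q main

# The author date survives the rebase; the committer date is "now".
author_date=$(git log -1 --date=short --format=%ad feature)
committer_date=$(git log -1 --date=short --format=%cd feature)
echo "author:    $author_date"
echo "committer: $committer_date"
```

If your log viewer only shows one of the two, "git log --format=fuller" prints both dates for every commit.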

Then there is the issue of being able to identify which feature a series of commits was related to. Quite often, when developers work on a feature and end up performing a series of granular commits, the log messages for some of these might not actually make that much sense on their own (without the identifying umbrella of the branch that they were committed in).

With rebase, these are firstly transplanted onto the tip of the branch they're getting merged into, and then secondly become one with the branch that they've been merged with (if repo policies insist that they keep a linear history going). The problem then is that in this case, Git actually has no concept of when the branch started anymore, and only has information about when it ended (i.e. the last commit made to the branch before it was merged into and became part of the main branch).

Thus, if a developer were to later try to find the series of commits that led to a particular feature, it's no longer as simple as finding the relevant branch, scrolling back to the first commit in that branch (i.e. where the fork starts to diverge), and then making sense of the commits in that branch relative to the branch's name. Instead, they may even be faced with a series of commits which make relatively little sense (like "Got it working at last!").

My own workflow has evolved to try to account for this problem. That is: development for each feature requiring more than 1 quick commit is done in a separate branch, and then merged back in using "git merge --no-ff" to ensure that a merge commit is generated, and the branches are kept separate/easily identifiable + searchable.
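As a rough sketch of that workflow (run in a throwaway repo; the branch name "feature/widget", the file names and the commit messages are all hypothetical), the merge commit produced by "--no-ff" keeps both ends of the branch reachable, so the feature's commits can be listed as a group later on:

```shell
#!/bin/sh
# Sketch only: throwaway repo, hypothetical branch and file names.
set -e
export GIT_AUTHOR_NAME="Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Dev" GIT_COMMITTER_EMAIL="dev@example.com"

demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main   # pin the branch name for the demo

echo base > base.txt
git add base.txt
git commit -q -m "Initial commit"

# Do the feature work in its own branch, in small granular commits.
git checkout -q -b feature/widget
echo one > widget.txt
git add widget.txt
git commit -q -m "Start widget"
echo two >> widget.txt
git commit -q -am "Got it working at last!"

# Merge back with --no-ff so a merge commit is always created,
# even though main has not moved and a fast-forward was possible.
git checkout -q main
git merge -q --no-ff -m "Merge branch 'feature/widget'" feature/widget

# The merge commit has two parents: ^1 is main before the merge,
# ^2 is the tip of the feature branch.
parent_count=$(git cat-file -p HEAD | grep -c '^parent ')
echo "parents of merge commit: $parent_count"

# Everything that belonged to the feature, found via the merge commit
# (this still works after the branch name itself has been deleted).
feature_commits=$(git log --format=%s 'HEAD^1..HEAD^2')
echo "$feature_commits"

# A summary view of main that hides the granular commits.
mainline=$(git log --first-parent --format=%s main)
echo "$mainline"
```

With this layout, a vague message like "Got it working at last!" is still readable in context, since it sits under a merge commit that names the feature.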

Confusion for Newbies and Cognitive Load for Everyone Else
IMO, a rebase-heavy workflow ends up being quite confusing to work with and/or pick up. For newbies to Git/DVCS's, it means that they're now juggling one more concept about how to manage their commits depending on where those commits are located and/or whether other commits have happened since. In many other domains, this is otherwise known as a superhighway to mode error hell.

There are several highly confusing scenarios which happen here, at least one of which can end up in a somewhat ugly state of affairs:
1) A developer forgets whether the branch they're working on has been pushed to the official repo or not. They proceed to grab/update their branch with changes from the main branch that they forked off. Now, since they can't remember whether they've shared this code (they probably assume that they haven't!), they proceed to perform a rebase operation, then try to push. Whoops!
2) A developer has finished working on a feature in a branch. They pull the latest changes from the main repo, and then dutifully rebase their local copy to get all their changes up at the tip of the branch ready for pushing. To be sure, the developer then proceeds to recompile and double-check that nothing was broken. A few minutes later, having done this, they find that another 2 developers have committed/pushed changes to the main repo. So, they're forced to pull/update again. But lo and behold: after doing a rebase, Git has now ended up making two copies of the commits the dev was working on - a copy of the branch pre-rebase_1 as a separate branch located back a few days, and an inlined set of the merged commits stuck to the tip of the branch (but with no branch pointer now). [For reference, this is the exact scenario I encountered earlier this evening!]

What we see here is some highly confusing behaviour, where users need to do things differently in order to please the machine into letting them accomplish the same basic goal: "update the working copy of my branch with the changes from upstream". To achieve this goal, they need to recall at least 2 different sets of commands (with slight differences between how these work), try to pick the right one based on maintaining and recalling details about the current state, and then finally execute these without making procedural errors midway. In other words, this is a pretty shoddy user experience/workflow!
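To illustrate just how much state the developer is expected to track in scenario 1, here is a sketch of the check they would need to run (or remember the answer to) before choosing a command. It is only an illustration in a throwaway repo: the helper name "is_published", the branch name "topic" and the local bare repo standing in for the official server are all made up for the demo.

```shell
#!/bin/sh
# Sketch only: throwaway repos, hypothetical names throughout.
set -e
export GIT_AUTHOR_NAME="Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Dev" GIT_COMMITTER_EMAIL="dev@example.com"

demo=$(mktemp -d)
cd "$demo"
git init -q --bare origin.git           # stands in for the official repo
git init -q work
cd work
git symbolic-ref HEAD refs/heads/main   # pin the branch name for the demo
git remote add origin ../origin.git

echo base > base.txt
git add base.txt
git commit -q -m "Initial commit"
git push -q origin main

git checkout -q -b topic
echo t > topic.txt
git add topic.txt
git commit -q -m "Topic work"

# is_published BRANCH: succeeds if some remote-tracking branch already
# contains the tip of BRANCH, i.e. the commits have been shared.
is_published() {
    [ -n "$(git branch -r --contains "$1")" ]
}

# Before pushing: nothing on the server contains the topic commit,
# so rewriting it with rebase would not affect anyone else.
if is_published topic; then before="merge"; else before="rebase"; fi

git push -q origin topic

# After pushing: the same branch now needs the other set of commands.
if is_published topic; then after="merge"; else after="rebase"; fi

echo "before push: $before is safe"
echo "after push:  use $after"
```

Note that this only knows about the remote-tracking refs as of the last fetch or push, so even the scripted version of the check depends on state the developer has to keep fresh.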

If, however, we stuck to a plain old pull -> merge/merge-no-ff -> push type of thing, we'd reduce the complexity of the task a lot.

Potential Benefits for Rebase?
It's not all bad for rebase, I guess. There comes a point when you've got quite a few branches going in parallel with a feature-branch approach, all frequently merging to and from each other, that things start getting a bit messy and hard to follow at times. Having said that, I have also found that in cases like this, the filtering tools available as part of ANY decent log viewer for Git will make light work of that clutter.

Then there are the people who like to "squash" or "recombine" their commits in different ways to make themselves look more coherent than they actually were during the development process. To each his own, but personally, I'm not that fond of this type of tinkering with version history. It's like these guys are trying to be time-butchering Heston Blumenthals.

(On a side note, while I'm against this sort of history manipulation, I'm perfectly fine with the use of "git push --force" and/or history rewriting to get rid of files that shouldn't have been committed - things like evil binary temp files that somehow slipped in, sensitive info, or stuff with potentially incriminating legal consequences. That, or in the event that someone manages to wipe out the entire repo or large swathes of it by "accident" - malice or no malice intended.)

1 comment:

  1. Lawrence D’Oliveiro, March 18, 2015 at 2:43 PM

    Rebase is for private branches. For example, I work on some patches for submission to an upstream project, then use git-diff to generate the patch file which I post. Then, some changes happen upstream which break my patches. I rebase my private branch, fix up the conflicts, generate a new patch with git-diff, and submit that to replace the previous patch. And so on. Then, when (some version of) my patch finally gets accepted, I delete my private branch.

    At no point does anyone see this branch but me. It lives only in my copy of the repo.