Sunday, March 01, 2009

Re: clarification on git, central repositories and commit access lists

http://lwn.net/Articles/246381/

Is Linus rude, authoritarian and egocentric?
Debatable

What is certain is that he is a brilliant programmer and a constructive guy who does a lot and helps a lot




From: Linus Torvalds
To: Adam Treat
Subject: Re: clarification on git, central repositories and commit access lists
Date: Mon, 20 Aug 2007 11:41:05 -0700 (PDT)
Message-ID:
Cc: kde-core-devel-AT-kde.org

On Sun, 19 Aug 2007, Adam Treat wrote:
>
> I just watched your talk on git and wanted to ask for clarification on a
> few points. Many of us in the KDE community are interested in git and
> some even contemplate using git as the official SCM tool in the future.

As you are probably aware, some people have tried to import the whole KDE
history into git. Quite frankly, the way git works (tracking whole trees
at a time, never single files), that ends up being very painful, because
it's an "all or nothing" approach.

So I'm hoping that if you guys are seriously considering git, you'd also
split up the KDE repository so that it's not one single huge one, but with
multiple smaller repositories (ie kdelibs might be one, and each major app
would be its own), and then using the git "submodule" support to tie it
all together.

> However, I think a few issues have been confused and want to see if you
> can clarify.

Sure.

> Your talk focused heavily on the evils of a central repository versus
> the benefits of a distributed model. However, I wonder if what you
> actually find distasteful is not a central repository per se, but rather
> designing an SCM that relies upon *communication* with a central
> repository to do branching/merging or offline development.

I certainly agree that almost any project will want a "central" repository
in the sense that you want to have one canonical default source base that
people think of as the "primary" source base.

But that should not be a *technical* distinction, it should be a *social*
one, if you see what I mean. The reason? Quite often, certain groups would
know that there is a primary archive, but for various reasons would want
to ignore that knowledge: the reasons can be any of

- Release management: you often want the central "development" repository
to be totally separate from the release management tree. Yes, you
approximate that with branches, but let's face it, the people involved
usually have a lot of overlap, but the overlap is not total, and the
*interest* isn't necessarily the same.

For an example of "release management", think of multiple different
vendors. They would probably always start with your "central" release
tree (which in turn may well be different from your central development
tree!), but vendors invariably have their own timetables and customer
issues, so they usually need to make decisions that may not even make
sense for the "official" tree.

An example of this in the kernel is how my tree is the central
development tree, then we have the "stable" tree (which is a *separate*
thing, maintained totally separately, but obviously based on my
releases), and then each vendor tends to have their own "release
trees". They are all different, they all have different policies and
reasons for existence, and they are *all* "central" depending on who
looks at them.

- Branching. Yes, you can branch in a truly centralized model too, but
it's generally a "big issue" - the branches are globally visible
things, and you need permission from the maintainers of the centralized
model too.

Both of those are *horrible* mistakes: the "globally visible" part
means that if you're not sure this makes sense, you're much less likely
to begin a branch - even if it's cheap, it's still something that
everybody else will see, and as such you can't really do "throwaway"
development that way. And let's face it, many cool ideas turn out to be
totally idiotic, but it might take a long time until it's obvious that
it was a bad idea.

So you absolutely need *private* branches, that can become "central" for
the people involved in some re-architecting, even if they never ever
show up in the "truly central" repository. That's a huge deal for
development.

The other problem is the "permission from maintainers" thing: I have an
ego the size of a small planet, but I'm not _always_ right, and in that
kind of situation it would be a total disaster if everybody had to ask
for my permission to create a branch to do some re-architecting work.

The fact that anybody can create a branch without me having to know
about it or care about it is a big issue to me: I think it keeps me
honest. Basically, the fundamental tool we use for the kernel makes
sure that if I'm not doing a good job, anybody else can show people
that they do a better job, and nobody is really "inconvenienced".

Compare that to some centralized model, and something like the gcc/egcs
fork: the centralized model made the fork so painful that it became a
huge political fight, instead of just becoming an issue of "we can do
this better"!

There are other reasons for having a *social* network that tends to have
one or two fairly central nodes, but not having a *technical* limitation
that enforces that. But the above are the two biggest and most important
reasons, I think.

> After all, your repository acts as a de-facto central repository of the
> linux kernel in as much as everyone pulls from it. Without such a
> central place to pull the linux kernel would not exist, rather what
> you'd have is a bunch of forks which perhaps merge with each other from
> time to time.

Well, I do want to make it clear that we *do* have such forks that pull
from each other too. So the kernel actually does use the technology, it's
just that you have to be involved in the particular subprojects to even
know or care about it!

So it's not strictly true that there is a single "central" one, even if
you ignore the stable tree (or the vendor trees). There are subsystems
that end up working with each other even before they hit the central tree
- but you are right that most people don't even see it. Again, it's the
difference between a technical limitation, and a social rule: people use
multiple trees for development, but because it's easier for everybody to
have one default tree, that's obviously what most people who aren't
actively developing do.

To put this in a KDE perspective: it would make tons and tons of sense to
have one central place (kde.org) that most developers know about, and
where they would fetch their sources from. But for various reasons (and
security is one of them), that may not be the main place where most "core
developers" really work. You would generally want to have separate places
that are secure, and those separate places may be *different* for
different developer groups.

For a kernel example: the "public" git tree is on the public kernel.org
servers (including "git.kernel.org"), but that is actually not a machine
that any developers really ever push to directly.

Many kernel developers use other kernel.org machines (because we have the
infrastructure), but others will use their own setups entirely, because
they might have issues like bandwidth (ie kernel.org may be reasonably
well connected, but while it has mirrors elsewhere, the main machines are
in the US, so some European developers prefer to just use servers that are
closer).

So if you look at my merge messages, for example, you'll see things like
merges from lm-sensors.org, git.kernel.dk, ftp.linux-mips.org, oss.sgi.com
etc etc. The point being that yes, there is a central place that people
know about, but at the same time, much of the *development* really happens
outside that central place!

> For any software project to exist as opposed to a bunch of forks I think
> you *have to have* a central repository from which everyone pulls, no?
> Of course many branches might exist, but those branches must pull from a
> central repository if they want to share *at least some* common code.

Practically speaking, you'd generally have one or a few central
repositories, yes. But no, it really doesn't have to be a single one. And
I'm not just talking about mirroring (which is really easy with a
distributed setup), I'm literally talking about things like some people
wanting to use the "stable" tree, and not my tree at all, or the vendor
trees.

And they are obviously *connected*, but it doesn't have to be a totally
central notion at all.

Think of the git trees as people: some people are more "central" than
others, but in the end, the kernel is actually fairly unusual (at least
for a big project) in having just *one* person that is so much in the
"center" that everybody knows about him.

In most other projects, you literally would have different groups that
handle different parts. In the KDE group, for example, there really is no
reason why the people who work on one particular application should ever
use the same "central" repository as the people who work on another app
do.

You'd have a *separate* group (that probably also maintains some central
part like the kdelibs stuff) that might be in charge of *integrating* it
all, and that integration/core group might be seen to outsiders as the
"one central repository", but to the actual application developers, that
may actually be pretty secondary, and as with the kernel, they may
maintain their own trees at places like ftp.linux-mips.org - and then just
ask the core people to pull from them when they are reasonably ready.

See? There's really no more "one central place" any more. To the casual
observer, it *looks* like one central place (since casual users would
always go for the core/integration tree), but the developers themselves
would know better. If you wanted to develop some bleeding edge koffice
stuff, you'd use *that* tree - and it might not have been merged into the
core tree yet, because it might be really buggy at the moment!

This is one of the big advantages of true distribution: you can have that
kind of "central" tree that does integration, but it doesn't actually have
to integrate the development "as it happens". In fact, it really really
shouldn't. If you look at my merges, for example, when I merge big changes
from somebody else who actually maintains them in a git tree, they will
have often been done much earlier, and be a series of changes, and I only
merge when they are "ready".

So the core/central people should generally not necessarily even do any
real development at all: the tree that people see as the "one tree" is
really mostly just an integration thing. When the koffice/kdelibs/whatever
people decide that they are ready and stable, they can tell the
integration group to pull their changes. There's obviously going to be
overlap between developers/integrators (hopefully a *lot* of overlap), but
it doesn't have to be that way (for example, I personally do almost *only*
integration, and very little serious development).

> A central repository is also necessary for projects like KDE to enable
> things like buildbots and commit mailing lists.

I disagree.

Yes, you want a central build-bot and commit mailing list. But you don't
necessarily want just *one* central build-bot and commit mailing list.

There's absolutely no reason why everybody would be interested in some
random part of the tree (say, kwin), and there's no reason why the people
who really only do kwin stuff should have to listen to everybody else's
work. They may well want to have their *own* build-bot and commit mailing
list!

So making one central one is certainly not a mistake, but making *only* a
central one is. Why shouldn't the groups that do specialized work have
specialized test-farms? The kernel does. The NFS stuff, for example, tends
to have its own test infrastructure.

Also, it's a mistake to think that one site has to do everything. That's
not what we do in the kernel, for example. Yes, we have kernel.org, and
it's reasonably central, but that doesn't mean that everything has to, or
even should, happen within that organization.

So we've had people do build-bots and performance regressions, and
specialized testing *outside* of kernel.org. For example, Intel and others
have done things like performance regression testing that required
specialized hardware and software (eg TPC-C performance numbers).

So we do commit mailing lists from kernel.org, but (a) that doesn't mean
that everything else should be done from that central site and (b) it also
doesn't mean that subprojects shouldn't do their *own* commit mailing
lists. In fact, there's a "gitstat" project (which tracks the kernel, but
it's designed to be available for *any* git project), and you can see an
example of it in action at

http://tree.celinuxforum.org/gitstat

(or get the source code from sourceforge), and the point is that all of
this was done entirely *outside* the kernel.org framework.

So centralized is not at all always good. Quite the reverse: having
distributed services allows *specialized* services, and it also allows the
above kind of experimental stuff that does some (fairly simple, but maybe
it will expand) data-mining on the project!


> These tools are important to the way we work and provide for many eyes
> constantly reviewing changes to the codebase as well as regular
> regression testing across diverse platforms. In the future, whether git
> or svn, I see no advantages in getting rid of a central repository from
> which everyone pulls. I wonder whether you really disagree.

So I do disagree, but only in the sense that there's a big difference
between "a central place that people can go to" and "ONLY ONE central
place".

See? Distribution doesn't mean that you cannot have central places - but
it means that you can have *different* central places for different
things. You'd generally have one central place for "default" things
(kde.org), but other central places for more specific or specialized
services!

And whether it's specialized by project, or by things like the above
"special statistics" kind of thing, or by usage, is another matter! For
example, maybe you have kde.org as the "default central place", but then
some subgroup that specializes in mobility and small-memory-footprint
issues might use something like kde.mobile.org as _their_ central site,
and then developers would occasionally merge stuff (hopefully both ways!).

> In your talk you also focus on the evils of commit access lists,
> comparing and contrasting with the web of trust the kernel uses where
> you have no commit access lists at all. However, isn't the kernel model
> just a special case? The linux kernel has a de-facto commit access list
> of one: you.

No, really. It doesn't. It's the one you see from the outside, but the
fact is, different sub-parts of the kernel really do use their own trees,
and their own mailing lists. You, as a KDE developer, would generally
never care about it, so you only _see_ the main one.

> This might work well for the kernel, but I fail to see how this really
> reduces politics. Many are still constantly pushing and arguing to
> merge their branches upstream into your repository. Would having a
> central repository where you and all your trusted lieutenants push their
> changes really be very different?

Yes it would be. You only see the end result now. You don't see how those
lieutenants have their own development trees, and while the kernel is
fairly modular (so the different development trees seldom have to interact
with each other), they *do* interact. We've had the SCSI development tree
interact with the "block layer" development tree, and all you ever see is
the end result in my tree, but the fact is, the development happened
entirely *outside* my tree.

The networking parts, for example, merge the crypto changes, and I then
merge the end result of the crypto _and_ network changes.

Or take the powerpc people: they actually merge their basic architecture
stuff to me, but their network driver stuff goes through Jeff Garzik - and
you as a user never even realize that there was another "central" tree for
network driver development, because you would never use it unless you had
reported a bug to Jeff, and Jeff might have sent you a patch for it, or
alternatively he might have asked if you were a git user, and if so,
please pull from his 'e1000e' branch.

For an example of this, go to

http://git.kernel.org/

and look at all the projects there. There are lots of kernel subprojects
that are used by developers - exactly so that if you report a bug against
a particular driver or subsystem, the developer can tell you to test an
experimental branch that may fix it.
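
To picture the networking/crypto case described above, here is a toy model in Python. It is only a sketch: the `Tree` class, the commit ids, and the idea that a pull is a plain set union are all assumptions made for illustration, not how git stores or merges anything.

```python
# Toy model: a "tree" is just a named set of commit ids, and pulling
# copies over every commit the other tree has. Not git's real data model.
class Tree:
    def __init__(self, name, commits=()):
        self.name = name
        self.commits = set(commits)

    def commit(self, commit_id):
        self.commits.add(commit_id)

    def pull(self, other):
        # A pull brings in everything the other tree contains,
        # including work it previously pulled from third parties.
        self.commits |= other.commits


# Hypothetical starting point shared by all three trees.
crypto = Tree("crypto", {"base"})
net = Tree("net", {"base"})
mainline = Tree("mainline", {"base"})

crypto.commit("crypto-1")
net.commit("net-1")

net.pull(crypto)      # subsystem trees interact outside mainline
mainline.pull(net)    # mainline sees only the combined end result

print(sorted(mainline.commits))
```

In this sketch the mainline tree ends up with the crypto change without ever pulling from the crypto tree directly, which is the situation an outside observer of the main tree never notices.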

> The KDE community has a very large commit access list and it is quite
> easy to join. Having a central git repository with a large set of
> committers would seem to map well with our community. I fail to see any
> harm in this model. The web of trust would still exist, it would just
> be much larger and more inclusive than the model the kernel uses. I
> wonder if you disagree.

Hey, you can use your old model if you want to. git doesn't *force* you to
change. But trust me, once you start noticing how different groups can
have their own experimental branches, and can ask people to test stuff
that isn't ready for mainline yet, you'll see what the big deal is all
about.

Centralized _works_. It's just *inferior*.

> Another sticking point is the performance implications of a git
> repository managing something the size of the KDE project. I understand
> the straightforward solution: just define content boundaries with a
> separate git repo for each submodule: kdelibs.git, kdebase.git,
> kdesupport.git, etc, etc. And then have a super git repo with hooks
> that point to these submodules. However, I think this leads to a few
> problems.
>
> What if I want to make a commit to kdelibs that will require changes in
> other modules for them to compile. I will no longer be able to make a
> single atomic commit with changes to multiple submodules, right?

Sure you will. It's hierarchical, though.

What happens is that you do a single commit in each submodule that is
atomic to that *private* copy of that submodule (and nobody will ever see
it on its own, since you'd not push it out), and then in the supermodule
you make *another* commit that updates the supermodule to all the changes
in each submodule.

See? It's totally atomic. Anybody that updates from the supermodule will
get one supermodule commit, and when that in turn fetches all the
submodule changes, you never have any inconsistent state.
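
A rough way to see why the supermodule commit makes this atomic, sketched in Python. The `SuperCommit` class, the `checkout` helper, and the ids `a1`, `a2`, `b1`, `b2` are hypothetical; the only thing modelled is that one supermodule commit names exactly one commit per submodule.

```python
from dataclasses import dataclass


# Toy model of a supermodule commit; all names here are made up.
@dataclass
class SuperCommit:
    message: str
    pins: dict  # submodule name -> commit id it must be checked out at


def checkout(super_commit):
    # Updating to a supermodule commit selects every submodule commit
    # together, so the combination is one that was committed as a unit.
    return dict(super_commit.pins)


before = SuperCommit("before API change", {"kdelibs": "a1", "kdebase": "b1"})

# Step 1: commit in each submodule separately (new ids a2 and b2).
# Step 2: one supermodule commit records both new ids at once.
after = SuperCommit("change API and adapt callers",
                    {"kdelibs": "a2", "kdebase": "b2"})

reachable_states = [checkout(before), checkout(after)]
for state in reachable_states:
    print(state)
```

Nobody who updates from the supermodule in this model can land on the mixed state (new kdelibs with old kdebase), because no supermodule commit names it.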

> Also, won't we lose history when moving files/content between
> submodules?

Yes. If you move stuff between repositories, you do lose history (or
rather, it breaks it as far as git is concerned - you still obviously have
both *pieces* of history, but to see it, you'd have to manually go and
look).

The point of submodules is that they are totally independent entities in
their own right, so that you can develop on a submodule without having to
even know about or care about the supermodule.

Git actually does perform fairly well even for huge repositories (I fixed
a few nasty problems with 100,000+ file repos just a week ago), so if you
absolutely *have* to, you can consider the KDE repos to be just one single
git repository, but that unquestionably will perform worse for some things
(notably, "git annotate/blame" and friends).

But what's probably worse, a single large repository will force everybody
to always download the whole thing. That does not necessarily mean the
whole *history* - git does support the notion of "shallow clones" that
just download part of the history - but since git at a very fundamental
level tracks the whole tree, it forces you to download the whole "width"
of the tree, and you cannot say "I want just the kdelibs part".
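
The difference between limiting the *history* and limiting the *width* can be sketched with a toy model in Python. Everything here is an assumption for illustration (the commit ids, the file paths, the `shallow_clone` helper); it only mirrors the idea that each commit is a snapshot of the whole tree.

```python
# Toy model: every commit is a snapshot of the WHOLE tree plus a parent link.
# The ids and paths are invented for the example.
history = [
    {"id": "c1", "parent": None,
     "tree": {"kdelibs/a.cpp": 1, "koffice/b.cpp": 1}},
    {"id": "c2", "parent": "c1",
     "tree": {"kdelibs/a.cpp": 2, "koffice/b.cpp": 1}},
    {"id": "c3", "parent": "c2",
     "tree": {"kdelibs/a.cpp": 2, "koffice/b.cpp": 2}},
]
by_id = {c["id"]: c for c in history}


def shallow_clone(tip, depth):
    """Walk back at most `depth` commits from `tip`; each keeps its full tree."""
    out, cur = [], tip
    while cur is not None and len(out) < depth:
        commit = by_id[cur]
        out.append(commit)
        cur = commit["parent"]
    return out


clone = shallow_clone("c3", depth=1)
print([c["id"] for c in clone])   # the history is cut short
print(sorted(clone[0]["tree"]))   # but the snapshot still spans every directory
```

Cutting the depth drops older commits, yet the one commit that remains still carries both the kdelibs and the koffice files.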

> And how will we break up the existing history between all of these
> submodules?

There are a few options for that.

One is to just import the SVN history per directory in the first place,
but that makes it hard to then tie the history together in the
supermodule.

The better approach is probably to import the *whole* thing (which will
require a rather beefy machine), and then split it up from within git.
There are various tools on the git side to basically rewrite the history
in other formats, including splitting up a bigger repository (google for
"git-split", for example).

But I certainly won't lie to you: importing all the history of KDE is
going to be a fairly big project, and it will require people who have good
git knowledge to set it up. I suspect (judging by some noises I've seen on
the git mailing list and irc channel) that you have those kinds of people
already, but it may well be a good idea to _avoid_ doing it as one big
"everything at once" kind of event.

So seriously, I would suggest that if there is currently some smaller part
of the KDE SVN tree, and the people who work on that part are already more
familiar with git than most KDE people necessarily are, I suspect that the
best thing to do is to convert just that piece first, and have people
migrate in pieces. Because any SCM move is going to be a learning process
(the CVS->SVN one is much easier than most, since they really are largely
just different faces of the same coin - no real changes in how things
fundamentally work as far as the user experience is concerned).
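
The "import the whole thing, then split it up" approach mentioned above can be pictured with a small Python sketch. This is not the git-split tool or any real git command; `split_history`, the commit messages, and the paths are hypothetical, and a history is modelled as a plain list of (message, whole-tree snapshot) pairs.

```python
# Toy model of splitting one big history by directory.
def split_history(commits, prefix):
    """Return the history of `prefix` alone, with paths made relative.

    Commits that leave that subtree unchanged are dropped.
    """
    result, previous = [], {}
    for message, tree in commits:
        subtree = {path[len(prefix):]: version
                   for path, version in tree.items()
                   if path.startswith(prefix)}
        if subtree != previous:
            result.append((message, subtree))
            previous = subtree
    return result


# Invented whole-repository history: each entry snapshots every file.
commits = [
    ("import", {"kdelibs/core.cpp": 1, "koffice/doc.cpp": 1}),
    ("fix koffice", {"kdelibs/core.cpp": 1, "koffice/doc.cpp": 2}),
    ("update kdelibs", {"kdelibs/core.cpp": 2, "koffice/doc.cpp": 2}),
]

kdelibs_history = split_history(commits, "kdelibs/")
for message, tree in kdelibs_history:
    print(message, tree)
```

The kdelibs-only history keeps the two commits that touched kdelibs and forgets the koffice fix, which is roughly what a per-module repository would want to inherit.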

> Finally, a couple points... CVS/SVN might be stupid and moronic, but I
> think it is good to note they are not nearly as bad as some other SCM's.
> Many SCM's used by some of the largest codebases in the world are still
> lock-based. If you think it is difficult to branch/merge using a
> central server, remember that some poor folks can't even *change a
> single file* without asking the central server for permission.

Sure. Crap exists. That doesn't make CVS/SVN _good_. It just means that
there are even worse things out there.

> It is also good to note that a free distributed SCM was not available
> until recently. The kernel community might have had a special deal with
> BitKeeper, but the same didn't apply to all open source projects AFAIK.
> When KDE moved to svn it was the best tool for the job. That might have
> changed when git became easier to use, but at the time it was simply too
> big of a barrier for new developers and too new. And from what I
> understand git support on other platforms is a recent development.

Git works pretty well on any random unix (although most users are on
Linux, with a reasonable minority on OS X - everything else tends to be
pretty spotty, and can at times require that you add compiler options
etc).

The native Windows support is pretty recent, and still in flux. It's now
apparently quite usable, although I don't think there's any real
integration with any native Windows development environments (ie it's all
either command line or the "native" git visualization tools like git-gui
or gitk).

Linus

