Thursday, December 29, 2005

Perfect... Versioning

This is based on stuff I posted in a thread on the Perl module-authors mailing list in Jan 2004. While I did a little bit of work on it at the time, I was barking up the wrong tree a little bit and didn't produce anything very useful. This posting is the result of the ideas swimming around in my head for two years. No working code is provided yet but it should be pretty easy :)

This is the second draft, reposted on 2006-01-01. Thanks to Rob Ewaschuk for comments on the first draft.

The goal

When I write the perfect language, it'll have a versioning system quite different to other languages. It will be possible for an author to say "I developed this using SomeModule 1.2.3, so when you're loading modules make sure that the version of SomeModule you load is a suitable match for 1.2.3". The user need not have 1.2.3 installed. As long as whatever version is installed is compatible then everything should just work. It will be possible to express this easily. So if the author used a bunch of module from Apache httpd 2.3.4 then he can just declare the version number of the overall httpd package and not have to declare versions for all the individual modules he used.

The problem with the status quo

At the moment versioning usually involves attaching a number or a pseudo-number like 1.3.5 to a collection of code, for example, a library. The client code (the code which uses the library) may get a chance to specify a minimum version that is acceptable, so in Perl you can write require SomeModule 1.3.5; and it will search for SomeModule.pm, load it and check the $VERSION variable to make sure it is greater than or equal to 1.3.5.

The first problem here is quite superficial, we are not comparing integers, we are not even comparing floats we are comparing triples (or n-tuples) of numbers so we need a definition of "greater than or equal to". Not a big problem, you look at the 2 versions, one component at a time. If all components are the same then they are equal, otherwise you find the first component where there is a difference and use that to decide which is greater.

The second problem (and this is not superficial) is that this definition of greater than is almost completely useless for deciding whether a particular version use suitable for the client trying to use it. In the example above, Perl will quite happily load version 7.1.9 of SomeModule even though the interface has changed radically from 1.3.5 and if you're lucky, the program will die cleanly. If you're unlucky, the program will not die cleanly but will corrupt your data or format your hard disk.

An example of another approach is the Apache project's versioning rules. They give a meaning to each of the 3 numbers in a version: MAJOR.MINOR.PATCH . Versions which differ only in their PATCH number should be compatible, versions which differ in their MINOR number should be compatible in one direction. That is, the version with the higher MINOR number should be compatible with the other but may have extra features. Versions which differ in their MAJOR number may be completely incompatible.

This second system would seem to be better and maybe it is but it also has problems. Let's say that the client was written to use SomeModule 1.3.5. Then it should work happily as long as the available version of SomeModule is 1.x.y where x > = 3. Several things can go wrong here. If 1.3.2 is installed then, according to the rules, that should be OK but in reality it may not be. There's the question of why were 1.3.3, .4, and .5 were released. If it was just to improve performance then there's probably no problem but if there is a serious bug in 1.3.3 then it doesn't matter that they are API compatible, we really don't want to accept it. In general, a simple version checker can only go by the rules and the rules only cover a tiny range of the information that's needed to make a decision. So Apache's rules give you a definition of "compatible" which is purely about interfaces and has some serious pitfalls when you try to apply it to functionality.

It might seem that Apache's rules could be tightened a little to improve the situation. Just like the Perl scheme above, We could err on the side of caution and say that a lower PATCH number is never acceptable. Of course there's no guarantee that a higher PATCH number is acceptable either, there's the distinct possibility that 1.3.7 introduced a nasty bug but we go ahead and use it anyway. The only way to guarantee compatibility is to insist on using the exact version that the developer used. Since lots of developers can use lots of different versions, of a module, this level of version strictness leads to "DLL hell", the requirement to have lots of different versions of shared modules one version for each application that uses it. A related problem comes from the MINOR number. The developer has used the latest bleeding edge version of SomeModule even though he hasn't used any of the bleeding edge features. A version checker may refuse to load version 1.2.5 because the minor number is too low when actually, all the requested features are available 1.2 .

The next problem is that to follow this version system strictly is very difficult. Whenever a change is introduced that breaks compatibility with an older version, the MAJOR version number should be increased. Apache's HTTP server is at version 2 after nearly (or is it more than?) 10 years of development. There are two possibilities

  1. Apache's interface (not just function prototypes and header files but the behaviour and meanings too) is incredibly stable
  2. Apache's developers have have not been sticking to their own versioning rules
I don't know which is the case in reality but neither of them are great. The second would be bad for quite obvious reasons. The first is bad because it's highly likely that something somewhere in the code base could really do with some fixing up in way that would break compatibility. Let's call this broken code the BadAPI module. No matter how obscure the BadAPI module is, fixing it would require bumping up the MAJOR version of the entire httpd package. Where does this leave all the things which use version 2.x.y? Firstly, the things that actually use the BadAPI module need to decide whether to change entirely to version 3 or to continue to support version 2. Even though version 2 has a big problem, they've managed to work around it up until now and it's the version that everyone is using and will be using for a long time to come so dropping support for it is likely not an option. Secondly, the vast majority, those that that never even knew about the BadAPI module, will work equally well with versions 2 and 3. Should they accept either version? What about future versions? This could get messy very quickly but will make their users happier.

So for a package that doesn't want to be held back by compatibility with broken interfaces, the MAJOR number could increase quite quickly over time. To minimise trouble for end users, clients should be able to load any compatible version of the package. This brings me neatly to another Perlism, only.pm. This is a clever module that seems to provide a neat solution to this problem (but I don't believe that this is the right problem to solve). With this module a client can say: # Only use SomeModule if version is between 0.30 and 0.50 # but not 0.36; or if version is >= to 0.55. use only SomeModule => '0.30-0.50 !0.36 0.55-'; and it will load a acceptable version of SomeModule.pm. So now we have a convenient way of specifying what the acceptable versions are and we can refuse to load a version 0.36 because it has a known problem. However there are 2 problems with this. The first is specific to Perl - we're using greater than as a test for compatibility, so, 7.12 being greater 0.55 means that it will be considered an acceptable version when in reality it's nothing like the old version. The second is more general - even if we used Apache's MAJOR.MINOR.PATCH system, only.pm cannot not predict the future and so cannot know about versions which come out after the client was written. It needs to be told by the client what to do in this case. In the example above the client has specified that anything above 0.55 is OK so on the Apache system, we would accept 0.57.23 but reject 1.2.3. It could just as easily have said that only the versions listed above are OK and that future versions should not be trusted at all. Either way is likely to lead to trouble for the end user, either an unacceptable version will be loaded or a perfectly acceptable version will be refused. The reality is that a human author may be better at predicting the future than a simple version checker but it's still a fundamentally impossible task.

Another Perlism is version-Limit which makes a fair bash at solving part of the problem. It provides a mechanism for a module author to provide useful information about the current version and so can prevent loading of a bad version in some cases. The main problem is that it assumes that compatibility has been retained unless the author specifies otherwise, a potentially dangerous assumption that can be avoided with a different system. It also puts something of a burden on a module author to keep all the compatibility information accurate and even then the information only relates to the current version, errors are not fixable and bugs found later cannot be taken into account.

How it should be

So what now? Have I invented an algorithm for predicting the future? Of course not. The point is that it should not be the client's responsibility to predict the future. For a given client and a given version of a module it is possible to figure out whether they will work together (assuming no hidden bugs). It's just not possible to do this when all you have is a version number. That's why I said that only.pm is solving the wrong problem, in fact it's trying to solve an unsolvable problem - there simply isn't have enough information available. The most a version number can tell you is that compatibility has been lost, that's all, it can claim that compatibility has been retained but it's frequently wrong (a misleading claim that will last forever) and it can never tell you that compatibility has been lost for one component but retained for another. However, the information does exist. The author of the module knows (in principle) what has changed from version to version, what remained compatible and what broke. At the moment, in a best case scenario, a tiny snippet of this information is available in machine readable form in the version number and the rest is in human readable form in the change log, the revision control system or just the author's head. If there was an easy way to express this information so that it could be used by a version checker then there would be no need to try to predict the future - all the necessary information would be present at module loading time.

What's needed here? First off, what is "all the necessary information"? There are two sides to this question, what information must the client supply and what information must the module supply?

The client needs to say what features of the module it will use and what version of those features it is expecting. This raises the question of what is a feature. If the client simply says "I was developed to run with Apache httpd 2.3.5" then we're back in the same boat, breaking BadAPI's interface will break the interface of the whole package and even if the client doesn't use BadAPI it should refuse to load the new version. So there needs to be a way to specify a finer level of granularity, for example, the client could say that I used functions from these files or just these 7 functions. Module authors could also break their modules down into meaningful components and provide names for these components. One component could include parts from multiple files, a single file could include parts from multiple components and components could overlap. The client should be able to get as low level as it likes, listing off all the functions, structures and named constants individually. That sounds pretty tedious but since this is the perfect language, it could be automatically provided by the compiler.

As well as breaking its interfaces into sensible pieces, the module needs to provide a matrix giving a yes or a no for the compatibility of every pair of versions. The version checker should be able to tell from this matrix that the bar module in version 3.1.2 is compatible with the one in 2.3.5 and so 3.1.2 is a perfectly acceptable version for this client. With this information we can have a function suitable(component, requested_version, released_version) which returns a boolean and using this function we can select the "best" version of the module from all those that are available.

How to get there

This matrix contains a lot of information and it grows as c * v2, where v is the number of versions released and c is the number of components. Supplying all this information sounds like a huge burden on the author of a module but it need not be. Also, storing and managing this information need not be a burden either.

With each release of the module, the author would only have to add some information about how this release relates to other earlier releases. This incremental information can be used to construct the full compatibility matrix. How little can we get away with? For each component we need to know how this release relates to every other release. What are the possible relations between 2 releases of a component. I already said that suitable() returns a boolean but actually it's useful to consider slightly more complex relationships. By adding a little more complexity we can supply less information. For two versions, A and B we can have

  • A = B, they're identical
  • A < B, B is a suitable replacement for A
  • A > B, same as B < A, A is a suitable replacement for B
  • A ! B, neither of them is a suitable replacement for the other
and these have nearly all of nice properties you would normally associate with =, <. and >. The difference is that it is possible to find two versions that cannot be compared. That is, they are not equal to each other and neither of them is greater than the other, they are incomparable and that's where the ! comes in [1]. We can use these rules to chain pieces of information together to make more information. We can figure out how version A relates to version B without having to directly state that relationship. The rules are:
  • if A = B and B = C then A = C
  • if A < B and B < C then A < C (same for >)
  • if A = B and B < C then A < C (same for >)
  • if A = B and B ! C then A ! C
The final rule is that everything is assumed to be incomparable until proven otherwise. The reason for this is that loading the wrong version could be disastrous whereas refusing to load a module that is compatible is just inconvenient and can be fixed just by providing more information.

Take as an example, a module that implements various features of a Dog, it has 3 components, Barking, Biting and LegHumping and there have been several releases of this module with varying degrees of compatibility. Lets say that the releases have just been given simple, single-integer version numbers, 1, 2, 3 and so on. How can we express useful information about these releases? Say, for example, that version 2 kept compatibility for all three components but added some new Barking features (maybe a growl() function) so we could say (in Perl) something like this {Barking => ">1", Biting => "=1", LegHumping => "=1"} better still, if we defined a Dog component that was made up from the other three, we could say {Dog => "=1", Barking => ">1"} meaning that all of Dog stayed the same except Barking which stayed compatible but gained some extra features. From this information we can deduce a 2x2 compatibility matrix for all three components. The matrices for Biting and LegHumping are not very interesting. They're all 1s because anything requiring version 1 or version 2 of Barking or LegHumping should be happy with either version. The matrix for Barking looks like this

Requested
1 2
Available 1 1 0
2 1 1

Let's say in version 3 we realise that Barking was done all wrong and we should totally change the interface, we add the following information about version 3: {Dog => "=2", Barking => "!2"} and that gives us a compatibility matrix that looks like

Requested
1 2 3
Available 1 1 0 0
2 1 1 0
3 0 0 1

So it's possible to deduce a lot of information from a little. A module author could simply define %VINFO = ( 1 => Init(Dog => ["Barking", "Biting", "LegHumping"], 2 => { Dog => "=1", Barking => ">1" }, 3 => { Dog => "=2", Barking => "!2" }, ); somewhere and the version checker would do the rest of the work.

This raises the interesting question of where this information should live. It might seem like a good idea to bundle it along with the rest of the module but then we would miss an interesting opportunity. No code has no bugs and since this version info is code, the chances are it's going to have bugs in it. What kind of bugs can version information have? It could incorrectly assert that 2 versions are or are not compatible and it could forget to assert that 2 compatible versions are compatible. These bugs can arise in 2 ways. Firstly they can be due to a mistake in constructing %VINFO, secondly they can arise when bugs are found in the module's code and 2 versions thought to be compatible turn out not to be, presumably because one of them has a bug. If the version info is bundled with the module then a user must upgrade the module just to get the version info. This is not desirable - it should be possible to have the latest version info without the latest code. For example, let's say we're up to version 5 of our Dog module and we have %VINFO = ( 1 => Init(Dog => ["Barking", "Biting", "LegHumping"], 2 => { Dog => "=1", Barking => ">1" }, 3 => { Dog => "=2", Barking => "!2" }, 4 => { Dog => "=3" }, # new algorithm for performance improvements 5 => { Dog => "=4" }, # even better algorithm for more performance improvements ); and then we find out that actually, the changes we made for version 4 introduced a bug and that our dogs would no longer whether they're biting the hand that fed them. It's not a big deal because version 5 uses a totally different algorithm and does the check correctly but still, anybody using a version 4 Dog for Biting could be in trouble. We shouldn't need to release a new version of the Dog module because we haven't changed the code, we just need to release a new set of version info: %VINFO = ( 1 => Init(Dog => ["Barking", "Biting", "LegHumping"], 2 => { Dog => "=1", Barking => ">1" }, 3 => { Dog => "=2", Barking => "!2" }, 4 => { Dog => "=3", Biting => "bug" }, # new algorithm for performance improvements but has a bug 5 => { Dog => "=4", Biting => "=3" }, # even better algorithm for more performance improvements, bug was fixed ); So now, users can just keep up to date with the version info and they will be informed whenever a bug is discovered in a version they are using. Even when there is a bug, there's no need to upgrade the package unless they are actually using the buggy component. Of course it's possible that the version information supplied by the author will be just plain wrong, so it should also be possible to adjust and augment the information both site-wide by installing adjustments/extra information alongside the "official" version and at the point of use if, for example, the client knows that a certain version is compatible enough for its purposes even though it may be marked as incompatible.

There are further tricks and options that are desirable to have. For example, a component may have have had multiple interfaces over it's history and it's possible that a particular version would be compatible with more than one of them. So it should be possible to say: 10 => { Dog => "=9", Biting => [">5", ">7"] } # is compatible with both versions even though they are not compatible with one-another Similarly it could be useful to provide annotations and other information like deprecation or warnings that should be output on loading, perhaps about security or performance. This kind of information exists in parallel to the compatibility information and does not influence the result. It would be useful to have a convenient way to propagate this information from version to version but that's a problem that can be addressed afterwards, for the moment we'll assume that all such information has been provided explicitly [2]. It would also be useful to be able to provide information for multiple versions at a time, like: ["10-17", "22"] => { Dog => "=9" } } although one problem with this is that it means that the keys now need to be parsed rather than simply treating them as a single version number and in Perl in particular, it's not possible to have anything but a scalar as the key of a hash. A way around this is to make VINFO an array of pairs, the first being the version number(s), the second being the information. Since this is purely for convenience of the module author, for the sake of simplicity, I'm going to stick with specifying information for a single version at a time, it has no effect on the tricky parts of the solution.

Implementation and efficiency [3]

In the Dog example, there are 4 components and 5 versions which gives 4 5x5 matrices. That's 100 boolean values. Already that's quite a lot of data to compute every time a module is loaded. A real module could have hundreds of releases and tens of components, so computing the entire compatibility matrix for each component each time is something we want to avoid. In fact it is possible to perform a large amount of the computation at the time a version module is installed or when the version information is updated. The results could then be stored in such a way that little or no calculation need be done to check compatibility. It is also worth noting that we need only store compatibility information related to versions which are actually installed, which will usually reduce requirements significantly.

There are several simple things that reduce the amount of work needed and simplify the algorithm. First off, the initial VINFO data can contain information about multiple components and each entry can provide information about one, some or all of these components. The first step is to separate this information out by component and once that's done there is no interaction between the components. Also, as part of this step the extra information about bugs, performance etc is removed and stored somewhere else. What we end up with is, for each component, an array, indexed by version number where each element of this array is one or more of "=X", "<X", ">X" and possibly "!X" where X is version. From now on we'll just deal with a single component and we'll call this array RELATIONS.

We can think of RELATIONS as a database of facts. At the moment, the database is indexed somewhat randomly. This indexing came from concerns about human convenience for data input but it's not very useful for efficiently extracting compatibility information. The next thing we need to do is rearrange these facts to make them more useful. With each rearrangement it would be useful to check for "obvious" contradictions - the earlier they are detected, the better the information we can provide to the author and the easier they are to debug. An obvious contradiction occurs whenever RELATIONS[X] (including INCOMPARABLES[X] - see below) contains 2 different relations for Y.

The first rearrangement is to simplify RELATIONS by searching for any occurrence of "=". If we find that X = Y for some X and Y then we choose one of them, let's say Y and remove Y from RELATIONS and replace all occurrences of Y with X. If there are other declared relations for Y then we must append these to RELATIONS[X]. We must also note somewhere that Y = X so that if we are asked about Y we know that we should rephrase the question in terms of X. A typical component, should have long periods of stability during which the interface does not change. For example, if it is a relatively stable part of a larger module then it may remain completely untouched while other parts undergo significant changes. In this case, removing all the "=" facts could reduce the size of RELATIONS significantly.

The next rearrangement is to remove all occurrences of "!" from RELATIONS and put them in another ARRAY called INCOMPARABLES. If X ! Y then we should add 2 entries, appending Y to INCOMPARABLES[X] and X to INCOMPARABLES[Y].

Next we should replace all ">" facts with "<" facts. Examine all the entries in RELATIONS. If RELATIONS[X] includes ">Y" then delete this fact and add "<X" to RELATIONS[Y].

At this stage all the hard work has been done. RELATIONS consists entirely of "<" facts. Given a requested version R we can look in RELATIONS[R] to find X1, X2, ..., Xn such that R < Xi for each i. That means Xi is a suitable replacement for R for each i. We can then use each of these Xi as indices in RELATIONS to find further suitable replacements. Eventually we will have a list of all the suitable replacements for R. If any of these replacements occur in INCOMPARABLES[R] then we've hit a contradiction. We must also check for another type of contradiction. It's possible that the author has declared (directly or indirectly) X < Y and Y < X. In this case the algorithm above will loop forever. To avoid this, we need to keep track of previously seen versions. Happily if we memoize the calculation, we get this error checking for free.

Clearly there are dangers here in terms of bugs and contradictions so it is essential that a comprehensive test framework be provided. Before releasing updated version information it is necessary to exercise all combinations of inputs to ensure no contradictions occur and it should be easy for an author to test that certain relations are or are not implied by a particular set of relations.

So now we can use RELATIONS to find all the released versions which are suitable for a given requested version, this allows us to generate the full compatibility matrix for a component and, in principle, we're done. In practice, doing all of this whenever a module is about to be loaded is going to be terribly slow.

The ultimate goal is to be able to get an answer to suitable(component, requested_version, released_version) and to do this efficiently. In practice, for a particular installation the number of installed releases of a package will be small and so the number of values we might use for released_version will be small. This means we can produce an optimised set of compatibility matrices that don't include any information about releases that are not installed. There are a variety of ways of organising this information efficiently. One way is to store the matrix directly - for each possible requested version we could have a bit-vector, I bits long, where I is the number of installed versions. Another way is to map each version to a sorted list of compatible installed versions. Yet another is to store a sorted list of incompatible versions. Which is most efficient really depends on the pattern of compatibility and the number of installed versions. It should be possible to choose at install time which of these is most suitable for the package being installed.

That's it, all that remains is to shut up and code.

[1] Versions form a partial order. Regular numbers (things that represent amounts and sizes - ints, floats etc but not complex number) form a total order because any two numbers are comparable, that is they are either equal or one of them is greater than the other. [back]

[2] This is actually information about the implementation, rather than the interface. [back]

[3] This whole thing is presumably a one-liner in Prolog. [back]

Random notes for stuff I didn't elaborate on

Random thoughts that probably make sense only to me.

  • "bug", "deprecated", "slow" - bug numbers, bug closures
  • ["17-20" => {Barking => "=16"}, ["21" => {Barking => ">20"}], ["18-20" => {Barking => "bug"}]
  • Overriding the published compatibility matrix, site-wide and use specific.
  • Objectify all strings and other input in the constructors.
  • find a compatible version
  • use a binary tree with ranges, when you check left, compare with the max, check right, compare with the min.
  • use [] instead of {}
  • I would guess (hope!) that stability is the common case so we could assume everything stays compatible with the previous version except components that are specifically mentioned. There are multiple ways of figuring out which was the previous version depending on your version numbering scheme but to avoid getting bogged down, let's say that the previous version needs to be specified explicitly (even with one of the automatic scheme's it will be necessary to provide explicit overrides anyway). So, now everything is relative to a with everything being
  • any component that is not specifically mentioned remains compatible with the "closest prior" version. There are multiple ways to define "closest prior". The easiest way to do this is to use a consistent version numbering scheme where as each new version is released you either increase the last component (e.g. 2.3.5 -> 2.3.6) or add one or more components (e.g. 2.3.5 -> 2.3.5.1), increase one of the earlier components and drop all those that follow (e.g. 2.3.5 -> 2.4)
  • To find the closest we take the last component of the version number and reduce it by by 1, if that's an existing version then that's the closest prior, if not, repeat until you either find or you reach 0. If you reach 0, start reducing the the next component of the version number and so on. The closest prior version to 2.6.29 is probably going to be 2.6.28 assuming it was ever released.

No comments: