Sunday, August 31, 2014

The Freeze Period

Today I’d like to talk about a subject on which I have already had some heated discussions in several different settings: The goal and usefulness of a freeze period.

Usually the intention of a freeze period is to freeze the code base in one of the test environments for a set period, say one or two weeks, before the release to production, with the aim of avoiding regression bugs after the production release. The argument is that running for a period of time in a lower environment without regression issues is a reliable indication that there will be no regression issues after the production release either. I have several observations with respect to this line of reasoning:

  • In order for this to be valid, you need to be sure that the infrastructure of your lower environment is pretty much, if not exactly, the same as your production environment. Less performant hardware, for instance, may make it difficult to deduce whether you have a performance issue with your application. Less physical memory in your lower environment will lead to disk swapping earlier than in your production environment, so you may conclude you have a performance issue when in fact there is none. The same goes for network equipment, processor speed and so on. If you use load balancing in production, you need to use it in your lower environment as well.
  • If your application manages data, then the data in the lower environment must be sufficiently close to the production data, both in quantity and in quality, to detect application issues that are triggered by unexpected but erroneously allowed data. If your lower environment misses production data configurations, at some point this will trigger a production incident.
  • And then there is the matter of usage and load. These too need to resemble the actual production situation as closely as possible for the freeze period to support any meaningful conclusions.
  • How long is your freeze period going to be? If you want to catch all incidents that would happen once a year, you would have to organize a full parallel production run on the exact same infrastructure and dataset for an entire year. This obviously has some cost. If you do that for only one week, you will detect all issues that occur on a weekly basis, half of the issues that happen once every two weeks and less than a quarter of the issues that happen once a month. Of the yearly incidents you would detect less than two percent, and you don’t know which two percent: the highest impact or the lowest. Conversely, if your code changes would give rise to an incident once a year, your chance of detecting it during a one-week parallel production run is less than two percent (see the short calculation after this list).
  • So suppose you have almost the same infrastructure except for server memory, a third of your production data, and you emulate 150% of the production load for several hours per day for a week using 2 standard use cases, when in fact more than 10 exist. No issues are detected. How confident are you that there will be no issues in production?
  • Another problem is what to do when an issue is in fact detected. If you take the freeze to be an actual hard code freeze, this raises the question of whether to fix the issue or not. And if you do, are you then going to break the freeze and re-deploy to your lower environment, possibly missing a KPI? Do you then start counting your freeze period from zero again? And if you do, are you then going to postpone your production release, for which go-to-market activity has already started, missing another KPI? Problems, problems, problems. Usually this situation is resolved by a senior manager grudgingly giving permission to “break the freeze” without postponing the actual release date.
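
To make the arithmetic behind these detection percentages explicit, here is a minimal sketch (my own illustration; it assumes an issue surfaces at a uniformly random moment within its recurrence interval, so the chance of catching it is simply the fraction of that interval covered by the freeze):

```python
def detection_chance(interval_weeks: float, freeze_weeks: float = 1.0) -> float:
    """Probability that an issue recurring once per interval surfaces
    inside the freeze window (uniform-placement assumption)."""
    return min(1.0, freeze_weeks / interval_weeks)

for label, interval in [("weekly", 1), ("biweekly", 2),
                        ("monthly", 52 / 12), ("yearly", 52)]:
    print(f"{label:>8}: {detection_chance(interval):.1%}")

# weekly: 100.0%, biweekly: 50.0%, monthly: 23.1%, yearly: 1.9%
```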

I’ve encountered situations where this dynamic led to some bizarre consequences. For instance, when an issue was discovered during a freeze period in one of my assignments, the first thing the team did was check whether it occurred in production as well. If it did, it was not regarded as a regression: since the bug existed previously, it did not need to be fixed and no break of the freeze period was warranted. This meant that once a bug made it to production, as long as no one complained too loudly about it, it would never be fixed. And if a bug did need to be fixed, instead of being happy that the bug was found and fixed before it reached production, there was disappointment about having broken the freeze.

So does a freeze period make any sense at all? That depends on what you expect from it. A scope freeze makes sense (but is already achieved if you enforce a fixed sprint backlog); a code freeze much less so, if at all. Basically the evaluation that needs to be done on any defect is the same no matter when it is detected (a small sketch of this triage follows the list):

  • Can we still fix this in time for the release?
  • Can we still verify it in time for the release?
  • Which additional tests do we need to re-execute as a result of this fix, and can that be done in time for the release?
  • And if the answer to any of these questions is no, can it wait until the next release or is it a showstopper?
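
As a minimal sketch of this triage (my own illustration; the field names and the three outcomes are hypothetical, not taken from any particular process or tool), the decision logic might look like this:

```python
from dataclasses import dataclass

@dataclass
class Defect:
    fixable_in_time: bool      # Can we still fix this in time for the release?
    verifiable_in_time: bool   # Can we still verify the fix in time?
    retestable_in_time: bool   # Can the additional regression tests be re-run in time?
    showstopper: bool          # Too severe to ship without the fix?

def triage(d: Defect) -> str:
    """The same questions apply whether the defect is found six weeks
    or two days before the release."""
    if d.fixable_in_time and d.verifiable_in_time and d.retestable_in_time:
        return "fix and verify before the release"
    if not d.showstopper:
        return "defer to the next release"
    return "postpone the release"

# Example: found two days before release, too big to verify in time,
# but not severe enough to block shipping.
print(triage(Defect(True, False, False, showstopper=False)))
# -> defer to the next release
```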


It is clear that you have more room to fix and verify bigger issues 6 weeks before the release than 2 days before it. But then again, your highest-risk backlog items should always be done first, so the bigger issues should come out early in your project and not right before the production release. If that happens, you have problems in the area of backlog prioritization and risk assessment. A mandatory freeze period is not going to address that. A full parallel production run may be very necessary in some cases, like in industrial settings or when replacing your ERP system. But this is not necessarily the same as a freeze, as you will want to verify your fixes. My conclusion is that a freeze period is theoretically a nice idea, but there are so many gaps in its concept and concessions to make in its implementation that its practical usefulness is close to zero.


Thursday, August 21, 2014

Governance in Agile Projects

Project governance is a reality in many large software development organizations. Whether you like it or not, there are many valid reasons for software development needing to comply with governance covering a variety of topics, depending on the context in which the project is executed. Good governance covers and mitigates the risks that projects may pose to the wider organization, without imposing too much of a cost on the project in terms of speed of delivery and budget. These risks can be legal, financial, security, reputational, operational or health and safety related, and this list does not pretend to be complete.

For instance, there will be a slew of security and legal requirements that need to be fulfilled if you are developing a personal banking site for a financial institution. A system to manage and consult medical information will be subject to privacy laws. In an industrial setting there will be environmental and safety requirements that may have an impact on IT systems. So how do you deal with governance requirements in an Agile project, where you just want to be as productive as you can without being hindered by apparently unnecessary rules that seem to complicate things and slow you down?

Well, first of all, too much of a good thing is not a good thing, and too much governance, like too much government, tends to have a range of unintended consequences. So when deciding on the specifics of the governance your projects need to comply with, it is important that for each of the rules and standards you intend to implement, it is perfectly clear which problem you are trying to solve or which risk you intend to mitigate. Too often you hear statements like ‘All our projects need to comply with rules so-and-so because that is just what we decided many projects ago, based on our experience with those projects, and it has been the way we do things around here ever since’.

Aha, is that so? Well, I’ve got a few questions then:

  • The projects you refer to, what kind of projects were they?
  • Do you still do the same kind of projects in terms of technology, functionality, organization, scale and budget?
  • Are the risks and issues you identified back then still relevant today?
  • Are the mitigation options still valid?
  • What about new technology introduced since then: does it not allow some of these risks to be addressed differently?
  • Are the people still the same, or have most, if not all, been replaced in the meantime?
  • For external customer facing projects, are the market expectations still the same, particularly in terms of quality and speed to market?
  • Is your market position still the same?
  • Have new regulations come into effect, or existing ones changed or abandoned?

The answers to these questions will likely lead you to revise existing governance frameworks and specific rules. Rules for which the motivation and relevance are clear are much easier to comply with.

Once you have a pragmatic and reasonable set of rules for your governance framework, you must make sure it is in fact adhered to. The best way to do this is to explicitly include it in your definition of done, and if some rules apply to certain user stories only, they need to be reflected in the individual acceptance criteria for those user stories. It is of course important that the Product Owner puts sufficient emphasis on this when accepting new functionality as Done. It is an area where it can be very tempting to skip a few rules in order to be quicker to market, but this is a risky strategy. In the type of organization where governance becomes necessary, there will be steering committees, program boards and release management, and it is up to them to make sure that the team members, the Product Owner and the Scrum Master are aware of the need for this, while at the same time avoiding becoming just another administrative hurdle on the way to production. The latter will occur when the motivation for certain rules is unclear, which leads to rubber-stamping of certain aspects of governance compliance.
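
As a minimal sketch of what folding governance into a definition of done could look like (my own illustration; every checklist item and acceptance criterion below is hypothetical, and real rules would come from your own legal, security and compliance stakeholders), team-wide rules live in the shared checklist, while story-specific rules travel as acceptance criteria with the stories they apply to:

```python
# Team-wide definition of done, including items that exist purely to
# satisfy governance rules (hypothetical examples).
TEAM_DEFINITION_OF_DONE = [
    "code reviewed",
    "automated regression tests pass",
    "security static-analysis scan clean",   # governance: security
    "no personal data written to logs",      # governance: privacy
]

def is_done(team_checklist: dict, story_criteria: dict) -> bool:
    """A story is Done only when every team-wide item and every
    story-specific acceptance criterion is satisfied."""
    return (all(team_checklist.get(item, False) for item in TEAM_DEFINITION_OF_DONE)
            and all(story_criteria.values()))

# A story touching medical records carries extra, story-specific criteria:
story = {
    "consult screen shows an audit trail": True,
    "access restricted to the treating physician": False,
}
team = {item: True for item in TEAM_DEFINITION_OF_DONE}
print(is_done(team, story))  # False: the Product Owner should not accept this as Done
```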