Feature toggle life time best practices
Feature toggles provides some great opportunities when it comes to decreasing time-to-market while keeping control of possible issues impacting your end users. In the blog-post “What are the feature toggle best practices?” we looked into best practices while working with feature toggles. Feature toggle life-time is still a challenge. Toggles are easy to create, often forgotten about thereafter, right? After all, usually feature toggles are to be considered technical debt in your code. Our experience is that feature toggles tend to life for too long.
Some feature toggle lifetime statistics first
What is the range of lifetimes of a feature flag? This was a question ignited by Kent Beck on twitter, and triggered our curiosity here at Unleash.
Fortunately, Unleash has been in use by many applications over the last few years and we have gathered statistics from 130 applications. Here is some of what we found; On average an application uses 9 toggles, but there is a huge spread. Probably because an application can either be a small micro-service or a huge monolith. We went in and looked at all toggles used by these applications, and plotted when they where marked as “archived”, which is an active act of the user when the toggle is not needed anymore.
Histogram showing the amount of toggles that falls inside each bucket. You can see from the chart that most toggles only lives for 5 days.
From the histogram we can see that about 50 percent of the toggles are archived within the 50 first days. INF is used to collect all toggles active for more than 425 days, and 11% falls into this bucket. We suspect these toggles are mostly kill switches. The rest of the toggles, about 39%, seems to live anywhere between 50 days 425 days. We have not investigating whether these toggles have been in active use, or simply been forgotten about.
The second question Beck asked was the change frequency of feature toggles. Luckily, we have metrics for that too. What we see from our numbers is that most toggles is changed 5 times in its life span (median). Below is a histogram over the change frequency of the inspected feature toggles.
Long lived vs. Short lived feature toggles lifetime
The expected feature toggle lifetime typically depends on its purpose for being created. Our experience is that there are three broad categories of feature toggles:
- Release toggles
- Experiment toggles
- Operational toggles
These categories also indicates the expected feature toggle lifetime.
In this blog post by Peter Hodgson, a fourth category, «Permission feature toggles» are also introduced. This category is to enable feature to certain set of users, e.g. Alpha users, premium features to VIP customers and similar. We in unleash-hosted will argue that this is not a good practice to use feature toggles for your up-sell purposes. When you are providing premium features to your premium customers, you need to make sure you can easily track this in your CRM system. Obviously you could integrate your favourite feature toggle management system with your back office systems. Then again, we believe you should aim for best-of-breed systems to operate your business
The release toggles are created in order to verify that the newly created feature works as expected in the live production environment. The business reason to utilize release toggles are to decrease time spent on some of the testing and validation of the developed feature upfront release to production. By doing this, the risk of issues increase. To balance this, release toggles are introduced. A best practice is to use a Gradual roll-out pattern.
Gradual rollout pattern
The gradual roll-out pattern is to enable the newly deployed feature for a small number of users. Often this pattern is referred to as Canary releases. At unleash-hosted we usually refer to this pattern as Gradual roll-out pattern, as we think this is more descriptive. First, the newly developed feature is enabled only for the developer, or the developer team. This is easily achieved by using the userWithId activation strategy in unleash-hosted. This allows the development team to verify the deployed feature without the risk of impacting any end users.
Second, the team might enable the feature for some or all employees of the company. This will increase the number of users facing the new feature, while limiting the impact of any wrongdoing. This is done by using the <Riktig navn> activation strategy in unleash-hosted. The last step is to enable the new feature for a small percentage of the end users. Depending on the traffic, this can be anywhere between 1% – 25% of the users. The purpose of this step is to validate the release by exposing it only to a small number of end users. If any issues are discovered, the team has an easy way back. This step is done by a gradual roll-out strategy in unleash-hosted.
The product owner can decide to introduce the new feature to the Alpha/Beta or users in the Early adopter program. From a product management point of view, this is a great way to engage with your most valued customers, providing them with new features before it is generally available.
General availability pattern
The purpose of the general availability pattern is to decouple code deployment to production from general availability to customers. A main reason for using this pattern, is to decrease the use of feature branches. In what is a feature toggle service blog post we have walked through why using feature branches are bad for your time-to-market.
The general availability pattern wraps the new features not yet available to the end users using a feature toggle. This allows the development teams to continue to put the new, and sometimes yet unfinished features, to production while the product owner is in control of when it will be released to the end users. The business reasons for using the general availability pattern can be many. It might be connected to contractual reasons. Some customers have different requirements on how often they are willing to take new updates in the code.
Planned marketing campaigns are other reasons why the general availability pattern are used. In this situation the marketing campaign of the new features requires the feature to be general available at a specific date and time to get the maximum ROI of the campaign. Using the general availability pattern allows the product manager to enable the new functionality for the public at the right time for the marketing campaign. One example is if the new features are introduced to the public live on stage at a conference.
Release toggles and expected lifetime
It’s easy to understand that release feature toggles usually have a short expected lifetime. Usually these feature toggles lives for days or weeks. Sometimes they might live for some months, even though there should be strong reasons for them to live this long. When the new feature is out and general available, the team should plan for cleaning up the feature toggles as part of the technical debt management. This step is usually what we find most teams tend to skip, due to the need or desire to quickly move on to the next part to be developed.
The purpose of experiment toggles is to allow rapid learning and introduce more fact-based decision making into the product development process. The feature toggles are in this pattern enabling different variants of the feature to different segments of the customer base. By applying metrics to the experiments, the product owner or product manager are able to know what variant that gives the best business outcome.
The best known pattern is the A/B/n testing. In this pattern at least two variants (A and B) of the same feature is developed and put into production in parallel. The next step is now to decide how much of the traffic that should be forwarded to each of the variants. Typically this would be a 50/50 split, although different configurations might be used, depending on the context of the experiment. To achieve this, the variant support in unleash-hosted is being used in combination with a suitable activation strategy. In some situations it might be needed to run more than two variants in parallel part of the experiment. This is easily achieved by adding more variants to the feature toggle.
Expected feature toggle lifetime
The expected lifetime of the experiment toggles obviously depends on the type of experiment and the amount of traffic available. To get a statistically valid result of the experiment, the dataset (e.g. the user segments and the metrics in the experiments) are of most importance. We will not dive into these details in this blog post. From our experience, the experiment toggles usually lives for weeks. Sometimes they live for months, but we do advise against them to live for multiple months. The reason for this is that usually you should have a clear direction after running the experiment in a month or two. Also, it is our experience that it is quite complex and time consuming running multiple experiments in parallel over a long period of time. Then it is better to decrease complexity, move onto the next experiment. If the decision proved to be wrong – pick it up again and improve.
The purpose of operational toggles is to introduce quick safety mechanism to increase control in situations where not all parts of the system might be under the control of the DevOps team. The best known operational toggle pattern is «Kill switches».
The Kill switch feature toggle pattern is feature toggles that wraps in less important or flaky parts of your system. The reason for doing this are usually one of two. The first reason is that this allows the DevOps team to disable these less important parts of the system in periods when there are unusually high traffic. The reason to disable part of the system, is the fact that in such unusual times, it is better to keep parts of the business up compared to close it all down.
The other scenario where Kill switches are a typical pattern, is when the system is dependent on 3rd party integrations. In this scenario, the DevOps team usually want an easy way to disable this integration while having the business as undisrupted as possible. From our experience, 3rd party integrations often is challenging to work with since parts of the system is outside the control of the development team. Read more about kill switches and best practices.
Expected feature toggle lifetime
It is in the operational toggle category where we see the largest variance in the expected lifetime of the feature toggles. As with the Release category, they are usually introduced as part of a release of a new feature and they should be removed from the code as soon as the DevOps team gets confident that there is no issues. Still, quite often we do see that the DevOps team decides to keep some of the Operational toggles for a long period of time, sometimes even permanent. From our experience, this decision makes sense.
As for most of the topics in product development, the question to «what is the expected lifetime of a feature toggle» is «it depends». At unleash-hosted, we strongly believe what is most important, is that the development team have a good understanding of the purpose of the feature toggle. Then the team needs to take a conscious decision on how to handle the feature toggle as part of the technical debt management. What is clear – feature toggles tend to live for too long.