agmatei
on 30 June 2024
Managed Apps on Public Cloud: Why Operations Matter, Part II
In the first part of this blog journey (I’d call it a post, but it’s actually two posts) we explored what operational excellence looks like in public cloud deployments. And while I do not want to spoil it for you, the main takeaway was that it is not easy and can become resource-intensive. With this in mind, you might be wondering what you can do to achieve excellence without focusing all your resources on operations. You may be asking yourself questions like “Am I still able to innovate?” or “Do I have enough resources to cover all of this?” Worry not, friend, for I am here to guide you through this maze.
Operational management options
To cover your environment’s operations, you’ve got two options: you can do it yourself, or you can partner with someone who can do it for you. Regardless of your choice, the operations are the same. But depending on where you are in your journey, and what your main business scope is, some options could be more advantageous than others. Let’s look at what each choice would mean for your business.
Key Requirements
I’ve mentioned above that the operations themselves are the same regardless of your choice. On top of that, even the requirements to achieve the operational excellence I described in Part I are the same, whether you choose to try your hand at self-management or opt for a managed service. And while it would be impossible to map out exactly what you need in order to operate your public cloud clusters at full efficiency, there are a few key requirements without which no public cloud deployment could operate. These are:
People
Perhaps the most essential part of any project is people. This is especially true in the case of public cloud deployments. Unfortunately, in this scenario, the requirement is not for any type of person, but for seasoned software engineers. It is difficult to estimate precisely how many members a team requires, but an industry rule of thumb is that one engineer should usually focus on operating fewer than 100 nodes or clusters.
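To make that rule of thumb concrete, here is a minimal sketch that turns it into a rough lower bound on team size. The function name `min_engineers` and the default of 99 nodes per engineer (my reading of "under 100") are assumptions for illustration. The sketch ignores on-call rotations, holidays, and seniority mix, so a real team would likely need to be larger.

```python
def min_engineers(node_count: int, max_nodes_per_engineer: int = 99) -> int:
    """Rough lower bound on operations headcount.

    Assumes each engineer covers at most `max_nodes_per_engineer` nodes.
    The default of 99 is an interpretation of the rule of thumb,
    not an official figure.
    """
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    if max_nodes_per_engineer <= 0:
        raise ValueError("max_nodes_per_engineer must be positive")
    # Integer ceiling division: any partially filled "slot" still needs a person.
    return (node_count + max_nodes_per_engineer - 1) // max_nodes_per_engineer


# Example: a hypothetical 250-node estate.
print(min_engineers(250))  # 250 nodes at up to 99 each needs 3 engineers
```

Treat the result as a floor for planning conversations, not as a staffing plan.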
Attracting and retaining people in a team can be challenging. It is no secret that as an industry, we are currently battling a major software engineering workforce shortage, and so the market is quite fierce. Engineers tend to be motivated by purpose, complexity, scope, and, of course, competitive salaries – and it is all well deserved. Good engineers can make the difference between success and failure, and can therefore make or break entire businesses. In short, without people, there are no operations. Without good people, there are no good operations.
Time
Automation plays a key role in management, and operations are no exception. However, when it comes to monitoring, alerting, and incident recovery, there is only so much automation one can do. For various reasons, a big part of operations will always remain manual, requiring large amounts of time from engineers. Ideally, within a team, the senior members will focus on innovation and troubleshooting the most difficult incidents, while the junior and middle members will cover lighter operational tasks. This, however, changes with availability, which makes time a scarce yet essential resource for achieving operational excellence.
Knowledge
Let’s suppose we’ve got a team of people, and they have enough time to cover operations. And so we send them to work, and they come back screaming – they realise how dynamic and volatile public cloud operations can be, and how much they need to focus on not only building but also maintaining their operational skills. Knowledge is to operations what natural pearls are to jewellery: scarce, incredibly valuable, hard to obtain, and harder to maintain. Training an engineering team must be a continuous and evergreen process, in which members are constantly learning new and innovative ways to perform their tasks, whilst also continuously challenging themselves to improve and grow their seniority. Only by staying up to date with market trends and requirements can a team ensure that an environment operates at full efficiency and produces reliable and accurate results. Knowledge must be one of the key values of any operational team – I’d even dare to say any team at all.
Money
I’m sure you’ll agree that the three items above are rather intuitive. You’ll say “I knew that – anyone with half a brain knows that you need skilled people who have time to do what you want them to do”. And I will agree with you. However, I’ve listed them so explicitly to set up the next point: all of the above require significant financial investment from any company that sets out to build an operational team.
Hiring and maintaining talent costs a lot of money. Replacing talent costs even more. Training and re-training a team adds further costs, and the time spent on operations will always incur an opportunity cost (which, in the case of senior engineers, is often dizzyingly high). What’s more, these costs are often unpredictable. It is difficult to be sure when the required headcount will be met, what training each engineer will need, or how scarce time will be divided during periods of high uncertainty or innovation. Therefore, strong financial forecasting is necessary to ensure the smooth development of operations.
Taking the DIY approach
Now let’s look at what options you’ve got for applying those resources towards the operation of a public cloud app deployment. Taking matters into your own hands is always admirable because it is an act of courage. Doing it all yourself means that you assume full responsibility for whatever happens. This option essentially entails building a team of operational experts that will manage your public cloud application environment from its conception until its decommissioning. As mentioned above, you will need to nurture and grow this team and ensure that everyone is well-trained and well acquainted with the fast-moving dynamics of the public cloud app ecosystem.
Requirements for success
Succeeding at the DIY approach is one thing. However, there are several factors that will make you more likely to achieve excellence and truly thrive. The scenario in which we would actually recommend that you take the DIY approach is if you are a tech-first company (meaning, your main scope is related to software or hardware). Being tech-first would make you an attractive workplace for top engineering professionals from an ideological perspective, more than a remunerative one, and it would also empower you to use the energy of all your teams to maintain technical correctness. In addition to that, top professionals tend to always gravitate towards intellectual challenges and autonomy for innovation.
Worry not if you are not tech-first, though. You can still try your hand at managing your public cloud clusters all by yourself, or with the help of your in-house IT team. Let’s look at the advantages and disadvantages of the self-managed route.
Advantages
- Full authority – when you do it yourself, you choose what to do and when. You will have full control over who does what, how and when your resources are procured and allocated, how big your team is, and what training they have.
- Freedom of practice – self-managed environments also involve full freedom of practice. You can choose what monitoring and alerting protocols to employ, which incident recovery methods you adopt, and even whether or not you want to be compliant with certain standards.
Disadvantages
- Unpredictable costs – you will have to be prepared to deal with all the unpredictabilities that come with hiring, training, and maintaining an in-house operational team. Remember, markets are scarce and incredibly dynamic, and situations where engineers come, train, and immediately leave are not rare.
- Resource supply difficulties – doing it all yourself means you’ve got no help in procuring your resources. You alone will have to make sure that your engineers receive the adequate training they need, that your tools and ecosystems are up to date, and that your talent is, indeed, talented.
- Accountability – this is the principal challenge when it comes to choosing self-management. If something goes fully wrong with your environments, you will be the only one responsible – because you alone have authority over it. You may find yourself in a situation where you have no support and really need it. That is why, in situations where self-management is chosen, I recommend opting for at least an Enterprise Support option for your open source ecosystem. This can ensure that you have a point of contact should anything terrible happen.
Choosing a Managed Service Provider
Opting for a managed service provider (MSP) is rather self-explanatory in this context. Instead of taking care of your operations by yourself, you choose a trusted company and hire their services at a fixed cost. In return, they operate your environments to your specifications, for as long as you need them to.
During my tenure as product manager for managed services, I have been constantly baffled by the occasional stigma associated with selecting a managed service for open source ecosystems. There seems to be a fear that this choice is an indirect declaration of lack of skill. If this is a worry you have, allow me to dispel it for you: it is a very strong declaration of the opposite. Choosing a managed service liberates your engineering resources to focus on innovation, which actually honours their skills and training. The truth is that operations, despite their essential indirect presence in an innovative project, have a very low direct contribution to innovation itself. Keeping the lights on won’t help you create the next big thing – but you certainly won’t be able to do it at all in the dark. So, if you’ve got a talented engineer – and trust me, I know, they’re scarce – then I find that allowing them to build directly towards your innovation and competitive edge is a gesture of grace, respect, and corporate maturity.
But managed services are certainly not for everyone. Like with self-management, there are upsides and downsides:
Advantages
- Resource assurance – the MSP will take care of having the right team of people, with the right certifications and training to take perfect care of your operations. And when it comes to time, the MSP’s primary focus is on operational management – therefore, its engineers will dedicate most, if not all, of their time to refining their operational skills. It is worth asking potential MSPs where their strengths and weaknesses lie, as no reputable MSP would ever operate an environment without having sufficient expertise to thrive at it.
- Predictable costs – I’ve said it and I will keep saying it: it is almost impossible to prove that a managed service is cheaper than self-management. Have there been empirical cases where that was true? Yes. Have there been empirical cases where that was not true? Of course. However, the key financial advantage of choosing a managed service is predictability. A good MSP will offer a clear price, usually tied to a single metric (Canonical charges per managed node per year, with slight differences depending on what product is managed on that node, as well as what kind of node is deployed). This will rid you of worrying about surprises – if one of the MSP’s engineers decides to leave the company, it will not affect you, nor your costs, nor your budgets.
- Shared accountability – if something goes wrong in your environment while under the coverage of an MSP, the most you will have to do is raise a ticket. And even that may not always be necessary, as reputable MSPs have good monitoring and alerting systems that allow them to find issues before they start affecting you. Moreover, the MSP assumes responsibility for the wellbeing of your public cloud clusters, so if your business is significantly damaged by bad operations, then the MSP is liable, and you are insured.
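The predictability point in the list above can be sketched in a few lines. This is only an illustration: every figure below is a made-up placeholder rather than a real price or salary, and the names `SelfManagedScenario`, `managed_annual_cost`, and `cost_spread` are hypothetical. The aim is to show that a per-node price yields one number for a given deployment, while self-managed costs move with events such as staff turnover. It says nothing about which option is cheaper.

```python
from dataclasses import dataclass


@dataclass
class SelfManagedScenario:
    """One possible year for an in-house team (all inputs are assumptions)."""
    engineers: int
    cost_per_engineer: int       # salary plus overhead
    training_per_engineer: int   # courses, certifications
    replacements: int            # engineers who leave and must be replaced
    replacement_cost: int        # recruiting and onboarding, per replacement

    def annual_cost(self) -> int:
        staffing = self.engineers * (self.cost_per_engineer + self.training_per_engineer)
        return staffing + self.replacements * self.replacement_cost


def managed_annual_cost(nodes: int, price_per_node_per_year: int) -> int:
    """Per-node pricing: the bill depends only on the size of the deployment."""
    return nodes * price_per_node_per_year


def cost_spread(scenarios: list[SelfManagedScenario]) -> int:
    """Gap between the most and least expensive scenario, i.e. the budget uncertainty."""
    costs = [s.annual_cost() for s in scenarios]
    return max(costs) - min(costs)


# Placeholder figures only, chosen to keep the arithmetic easy to follow.
calm_year = SelfManagedScenario(3, 100_000, 5_000, 0, 50_000)
rough_year = SelfManagedScenario(3, 100_000, 10_000, 2, 50_000)

print(managed_annual_cost(nodes=200, price_per_node_per_year=1_000))
print(calm_year.annual_cost(), rough_year.annual_cost())
print(cost_spread([calm_year, rough_year]))
```

With real numbers from your own finance team and your MSP quotes, the spread gives you a rough sense of how much budget uncertainty the fixed price removes.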
Disadvantages
- Upfront costs – it may often be intimidating to look at large MSP price lists. But bear in mind, there are smaller MSPs that can cover your operations without impressive prices. As long as your entire ecosystem’s operations are covered (minus the public cloud containers, if you choose to use them, because they are usually managed by the public cloud providers themselves) in a way that satisfies your business needs, then you can consider the provider for a long-term relationship. Nevertheless, you should expect an upfront operational management price for your entire cluster – with the price scaling alongside your deployment.
- Less control over protocols & methods – it is unlikely that an MSP will let you have any say over how compliant their team is, and how they approach incident recovery. Certainly, you will have your requirements, which you must lay out clearly in the pre-sales discussions to make sure they are met. However, beyond that, the MSP has full freedom to be as stringent (or lenient, though this is a rare and bad practice) with their protocols as they wish. Oftentimes, MSPs will choose to be compliant with many global standards, which can occasionally (but rarely) restrict the scope of potential projects.
- Less freedom to tinker – MSPs require full control of your environment. If your operations were a car, then the MSP would be the driver and the service person all at once. Imagine if, while driving, you were to put your hand on the steering wheel and steer strongly – that would more than likely cause a crash. This is why most MSPs do not allow their customers to tinker with the environments that they manage, and they usually carry out extensive validation procedures before even beginning the operational management process. This means that the environment you sign for is the environment that will be managed until the end of the contract, with little to no freedom of change.
Considerations
If you’re thinking of choosing an MSP, it is important to consider a couple of variables. Are they large enough to cover your environment? Do they have enough experience? Are they flexible enough? Do they tie you down? All of these questions and more would require answers before you opt for an MSP. I explore this in more detail, with a high focus on the current and highly interesting topic of AI, in my whitepaper, An Executive Guide to Managed AI Infrastructure. I’ll also go over these questions in a future post.
Conclusions
Let’s tell it like it is: I’m likely biased. My entire career revolves around helping enterprises like yours achieve operational excellence through managed services. Are managed services a fully assured way to achieve your business goals? No. But can they get you closer? Absolutely. That does not mean you can’t achieve the same success without opting for a managed service, but it can mean that it would take more money, more time, and overall more effort. That is why I strongly believe in the power of operational management.
If you’re interested in the services Canonical offers in the field, check out our webpage: https://ubuntu.com/managed
Until next time, stay well!
Adrian