When should you create another Kubernetes cluster?

Mark McCracken
7 min read · Aug 13, 2021

If you work in even a medium sized organisation, you’ve probably heard the term “kubernetes cluster” thrown around. Invariably, you’ll have more than one. But at some point, someone will come along, likely a developer with a need, or a new project with big ideas, and boldly proclaim “We need our own cluster for this”. How should you reasonably decide whether to grant their wish, or force them to use an existing cluster?

  • Why are we on kubernetes?
  • What does the natural evolution look like in enterprises?
  • What happens at first when someone asks for a new cluster? When do we know there’s a problem?
  • How do we get ready to solve this problem?
  • How do we then decide whether to grant a new cluster?

Why are we on kubernetes?

First of all, it's good to remember one of the reasons for wanting kubernetes. There are many reasons, but one I want to call out is optimising compute resources. In the same way the compute landscape shifted from physical machines to virtualisation to improve utilisation, kubernetes offers another step up when sharing resources. Although we can autoscale compute instances, we're scaling in units of entire virtual machines, which come with their own overhead. With kubernetes, we can scale in increments of the running application, regardless of the underlying compute infrastructure: no extra linux kernels to boot, just the extra resources our application actually claims.

In effect, we've moved from a very slow process of provisioning physical compute in fixed hardware sizes to virtualised compute that we can create more quickly, in more flexible configurations. Utilisation naturally jumps. But still, I've seen plenty of virtual compute instances with very low utilisation. Moving to kubernetes, if done correctly, we can see huge increases again: provided our workloads autoscale and our clusters autoscale too, we can achieve the dream of near-full compute utilisation, meaning we pay only for the compute resources our applications actually need, almost all of the time.
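
As a rough sketch of what workload autoscaling looks like in practice (the names here are invented for illustration, not taken from any real deployment), a HorizontalPodAutoscaler watches a Deployment and adds or removes pods based on observed CPU utilisation:

```yaml
# Illustrative sketch: scale a hypothetical "checkout" Deployment between
# 2 and 10 replicas, targeting ~70% average CPU utilisation across its pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout
  namespace: shop
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Paired with a cluster autoscaler on the node pools, the pods scale to the load and the nodes scale to the pods, which is where the utilisation gains actually come from.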

What does the natural evolution look like in enterprises?

Someone comes along and says enthusiastically, “We should use kubernetes!”. If you’re in a small org, this is likely an overeager developer wanting to learn something new, and by convenience, it’s also solving a problem. If you’re in a large org, this is likely someone from the Architecture team who has decided the way forward, and mandated this is how things will be done.

But regardless of how it comes about, it will almost certainly be an initial deployment to get things off the ground, some useful application to demonstrate that kubernetes really does what it promises, and someone was right to pick it. The small org folks call this “the new platform”, the large org folks call it “the PoC”, but they’re exactly the same really.

Fast forward 2–4 months (obviously depending on the level of red tape and bureaucracy, which depends on org size; add another few months if you're a multinational corporation), and you've got the first app up and running. Everyone claps loudly. Chatter begins about the next big app that deserves to get the special kubernetes treatment.

What happens at first when someone asks for a new cluster?

Now at this point, we’ve just spent several months working hard to understand what kubernetes even is. We’ve learned our brains out, and we just want the next one to not be so painful.

One of two things happens: nobody can be bothered to work out how to separate access control, so they say "just stick it on the same cluster", or else someone from security pipes up and says "No, you can't just do that!". You try to ignore this person. Sadly it doesn't work. So you deploy a new cluster, exactly the same as the last one, but for application number 2. No points for predicting which of these scenarios you land in; it will be decided by org size. If you aren't sure how big your org is, whether or not you can get away with option one is basically the answer.

How do we know there’s a problem?

But even the small orgs catch up to this state eventually. One day someone from security comes along and says "hey, fix this!", or our previous smart-cookie developer has moved up the ladder, is now getting other people on board, and thinks "it might be a good idea not to give everyone full access to delete everything I've built".

The big orgs decide they could be saving money, the little orgs decide to protect what precious little they can afford. It’s time to lock things down.

How do we get ready to solve this problem?

We need everyone to access the same cluster, with appropriate role based access control (RBAC), so they can only perform the actions they're allowed to. Much, much easier said than done. Someone will insist on design approval, a deployment process, governance, all that fun stuff. But kubernetes has a good RBAC model, so this shouldn't be a problem.
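
As a minimal sketch of what that looks like (the namespace and group names are made up for illustration), a namespaced Role plus a RoleBinding gives one team the ability to manage workloads in its own namespace and nothing else:

```yaml
# Illustrative sketch: the hypothetical "team-a" group can manage common
# workload resources inside the "team-a" namespace only.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-developer
  namespace: team-a
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "services", "configmaps", "deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-developer
  namespace: team-a
subjects:
  - kind: Group
    name: team-a
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-a-developer
  apiGroup: rbac.authorization.k8s.io
```

Cluster-wide permissions stay in ClusterRoles held by the platform team, so no single developer has total control.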

Next, some applications shouldn't be allowed to talk to others. Enter network policies. And some applications require TLS for their communication, certificates for external facing applications, internal cluster monitoring, and so on, and someone has the balls to utter the word "istio". At this point your security team is in full-on panic mode, because you've already got a 6-month head start and deployed loads of stuff before they can even pronounce "cuber-neat-eyes", so they demand to be involved in every meeting in your calendar and add some more of their own.
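
For the network policies piece, a common pattern (sketched here with made-up namespace and label names) is to deny all ingress in a namespace by default and then explicitly allow the flows you actually want:

```yaml
# Illustrative sketch: default-deny all ingress traffic in "team-a", then
# allow only pods labelled app=frontend to reach pods labelled app=api.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: team-a
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
  policyTypes:
    - Ingress
```

Bear in mind that network policies only take effect if the cluster's network plugin actually enforces them.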

Unfortunately, there's no shortcut. We need to learn how to segregate access control effectively, understand the deployment process, and prevent applications from doing things they weren't designed for.
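
On that last point, one option (a sketch, assuming a cluster recent enough to ship the built-in Pod Security admission controller; older clusters used PodSecurityPolicies or third-party policy engines for the same job) is to label each namespace with the security level its pods must meet:

```yaml
# Illustrative sketch: pods in this namespace must satisfy the "restricted"
# Pod Security Standard (no privileged containers, no host namespaces, etc.).
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
```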

Six months later, you should have an environment where a developer can deploy resources without having total control over everything. Which brings us to the big question.

How do we then decide whether to grant a new cluster?

What's so bad about just giving out a new cluster? Well, after your security team has had their input and added runtime monitoring, and your ops team has had theirs and added tracing and observability tooling, your empty clusters are suddenly almost full of resources that are monitoring… nothing! I have seen clusters with virtually zero user resources running 3 worker nodes at close to full capacity! Not the utilisation dream we had envisioned. So large, shared clusters that amortise these components are the ideal situation, and google will recommend the same thing.

What are common reasons for requesting a new cluster?

  • Regulatory reasons — kubernetes does not work well over long geographical distances, so clusters are mostly constrained to a single geographical region (I think this is due to the underlying etcd database). If you have requirements that mean certain data must remain in certain geographical locations, then a new cluster is pretty much a must-have.
  • Different environments — anyone who has watched any tutorials on kubernetes will have seen the example "you can have one namespace for dev, and one for prod". In my opinion, this is total nonsense, complete and utter nonsense. I guess they do it because it's an easy example for understanding namespacing, but it's still nonsense. If dev and production instances are on the same cluster, how do you safely test and upgrade the underlying infrastructure? Get separate clusters per environment.
  • "We don't want to share with someone else" — tough! If we've got the segregation and access controls in place, there's no reason we can't provision compute in kubernetes that's genuinely isolated from other users. We can even reserve certain hardware for use by a single namespace (see the sketch after this list), so this isn't a good enough reason. If you go so far as to deploy a service mesh, you'll have all of the controls of a regular compute instance architecture and then some.
  • Cloud provider redundancy — this is a complex topic. You can run workloads across multiple cloud providers, and any provider worth their salt will have a kubernetes offering you can use, so this can make your workloads tolerant of a provider outage, but it's something I'm extremely skeptical of. Most cloud providers offer an SLA in the range of 99.9% uptime, so for multiple providers to be worth it, the cloud provider needs to be your most likely source of downtime, which, unless you have an extremely mature ops team, it probably isn't. However, if you are a massive multinational corporation with material business risk in the case of downtime, then this might be a genuine reason to start spreading critical workloads across cloud providers, but do not underestimate the complexity of creating and maintaining this kind of configuration.
  • "We have highly sensitive data" — if you're storing highly sensitive data in your cluster (beyond PII, more like the kind of information that, in the wrong hands, could predict and manipulate your future stock price), then you might have a legitimate reason to provide a cluster separated from other workloads. But almost exactly the same security controls should go in place as for normal clusters. These controls are not especially complex to implement, and even the toughest restrictions won't take too long to apply to normal workloads either.
  • PCI DSS — processing payment data? Expect extra security controls and an audit. This audit will not take 5 minutes, and if your payment applications are lumped in with other apps, expect more questions and pain. Separating the workloads that process payment information might not be the worst idea. You may be asked to implement all kinds of logging around firewalls and network requests, which might otherwise be expensive in a shared cluster.
  • "This is just a quick PoC" — absolutely not; we've almost certainly got another cluster lying around somewhere that you can use for development purposes, with the correct security controls already in place.
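
On the point above about reserving hardware for a single namespace: a sketch of one common way to do it (node, namespace and label names are invented for illustration) is to taint and label a node pool, then have only that team's workloads tolerate the taint and select those nodes:

```yaml
# Illustrative sketch: the node pool has already been tainted and labelled,
# e.g. with:
#   kubectl taint nodes <node> dedicated=team-a:NoSchedule
#   kubectl label nodes <node> dedicated=team-a
# Only pods that tolerate the taint and select the label will land there.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: special-workload
  namespace: team-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: special-workload
  template:
    metadata:
      labels:
        app: special-workload
    spec:
      nodeSelector:
        dedicated: team-a
      tolerations:
        - key: dedicated
          operator: Equal
          value: team-a
          effect: NoSchedule
      containers:
        - name: worker
          image: example.com/special-workload:latest # hypothetical image
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
```

Other teams' pods simply can't schedule onto those nodes, so you get dedicated hardware without a dedicated cluster.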

Aside from the reasons listed above, almost all resources should run on shared infrastructure rather than dedicated clusters. Otherwise, you're not making the most of the shared operating model of clusters, and you're unnecessarily adding complexity for the operators who have to monitor and support workloads scattered across different places.
