The Requirements
They needed help ensuring the security of their workloads and the availability and reliability of their applications in their multi-cloud environment. The detailed requirements included:
- Managing a mesh of 70-80 micro-services of an application in a highly resilient manner
- Seamless and continuous delivery of frequent code changes
- Keeping the lights on round the clock through a site reliability engineering approach
- Implementing a Zero Trust security model to make sure that all the endpoints are protected
- Optimizing the overall website performance
Challenges
In the current state, all the applications were running across different public and private cloud providers, with CI/CD managed by the respective providers. Each application team had its own flavor of platforms and tools. In such a siloed setup, there were potential blockers that could lead to availability and reliability issues. These included:
- Challenges in endpoint protection
- Error-prone manual processes in new software releases
- Lack of visibility into key performance indicators
The Solution
We worked with the customer’s business team during the discovery phase to understand their needs at a granular level and found that their:
- Online casino website modernization involved 10+ Dockerized Java microservices running on a cluster with a database. The games were running on Apache web server and AWS S3 with Cloudflare, while the frontend services for streaming the games were on redis.3.
- Back office application modernization involved 69+ containerized microservices and 4+ non-containerized applications. All the services use .NET Core and .NET Framework (legacy) as the runtime.
We offered 24x7 availability of reliability engineers who:
- can take care of kernel hardening at the SSH, OS, network, services, and cluster levels
- come with hands-on experience with various CNI plugins (e.g. Calico, Antrea) and Ingress controllers (e.g. Contour, NGINX), and offer a smoother communication experience at the Pod and application levels
- can build KPI dashboards using TO/Grafana to create visibility
- can periodically execute mock drills through Litmus chaos tests to recreate fail-over scenarios so that potential weak spots and outages can be identified
To ensure a unified experience, consistent management and monitoring of the infrastructure, and high application availability and reliability, we did the following:
- Deployed Tanzu Mission Control to create Tanzu Kubernetes Grid (TKG) clusters in AWS/Azure cloud so that applications can be containerized and easily run across multiple cloud environments
- Built a cloud-agnostic multi-cloud platform, where changing the cloud provider is handled as a simple plug-out/plug-in. It offered a unified experience:
  - for cluster operators, to manage Kubernetes clusters at scale using Tanzu Mission Control (TMC)
  - for developers, with self-service access to Kubernetes in their chosen environment and with security and policy guardrails in place, leading to better engineering velocity
- Mitigated the container runtime dependency with a container-in-container solution
- Upgraded the TMC console to provide one-click cluster upgrades for major Kubernetes releases
- Used open source tools instead of cloud provider managed services for CI/CD
- Facilitated continuous delivery of frequent code changes and enabled one-click attachment or detachment of Kubernetes clusters on TMC
- Added self-healing capabilities, including Horizontal Pod Autoscaler (HPA), Pod Disruption Budget (PDB), and Cluster Autoscaler, to bring high resiliency to the infrastructure. This is one of the key SRE attributes.
- Offered increased security through TMC, which enables a Zero Trust model by adding the native capability to roll out conformance tests such as CIS benchmarking at node and cluster levels
- Implemented observability and compatibility by using community tools to perform distributed tracing of a large number of microservices
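To illustrate how the self-healing pieces mentioned above behave, here is a minimal Python sketch of two calculations: the replica count an HPA would aim for, following the scaling formula in the Kubernetes documentation, and the number of voluntary disruptions a PDB with a `minAvailable` setting would permit. The function names, default replica bounds, and sample numbers are assumptions made for illustration and are not taken from the client's configuration.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA scaling rule:
    desired = ceil(current_replicas * current_metric / target_metric).

    min_replicas, max_replicas and the sample values used below are
    hypothetical; tolerance mirrors the Kubernetes default of 10%.
    """
    ratio = current_metric / target_metric
    # Within the tolerance band the autoscaler leaves the workload alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp the result to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


def pdb_allowed_disruptions(healthy_pods, min_available):
    """Voluntary evictions a PDB with minAvailable would currently allow."""
    return max(0, healthy_pods - min_available)


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target.
    print(hpa_desired_replicas(4, 90, 60))
    # 5 healthy pods with minAvailable=4.
    print(pdb_allowed_disruptions(5, 4))
```

In this sketch, a node drain during a cluster upgrade could evict only as many pods as `pdb_allowed_disruptions` returns, while the HPA calculation adds or removes replicas as load changes.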
Our Approach
As part of our proactive site reliability engineering (PSRE), we keep tabs on symptoms rather than waiting for an actual impact to happen in the system. We achieved this for our client’s microservice-based environment by identifying the right SLIs (Service Level Indicators) to track symptoms. We then gauged them against a set SLO (Service Level Objective). As a result, we ensured that operations run within the agreed SLA (Service Level Agreement), along with a better user experience.
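The relationship between an SLI and an SLO can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a request-based availability SLI, and the `SloReport` type, the `evaluate_availability` function, the 99.9% target, and the request counts are all hypothetical rather than the client's actual indicators or objectives.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float              # measured fraction of good requests
    slo_target: float       # objective the SLI is gauged against
    allowed_failures: float # error budget for this window, in requests
    budget_consumed: float  # fraction of the error budget already used
    slo_met: bool


def evaluate_availability(good_requests, total_requests, slo_target):
    """Gauge a request-based availability SLI against an SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")

    sli = good_requests / total_requests
    failures = total_requests - good_requests
    # The error budget is the share of requests the SLO allows to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    budget_consumed = failures / allowed_failures
    return SloReport(sli, slo_target, allowed_failures,
                     budget_consumed, sli >= slo_target)


if __name__ == "__main__":
    # Hypothetical window: 100,000 requests, 50 failed, 99.9% target.
    report = evaluate_availability(99_950, 100_000, 0.999)
    print(f"SLI={report.sli:.4%} met={report.slo_met} "
          f"budget used={report.budget_consumed:.0%}")
```

Tracking how quickly the error budget is consumed is one way to react to symptoms early, before the SLO itself is breached.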
Here is an overview diagram of the unified platform approach. The platform comes with built-in SRE best practices to meet the SLA, SLO, and SLI.
Business Benefits
With our solution, the client can now ensure:
- A unified experience for all the application teams
- An SRE-based solution that helps in the application modernization journey through out-of-the-box policies that can increase control and security of Kubernetes clusters
- Operations that scale in a highly resilient manner, using Kubernetes clusters that come with high elasticity
- Automated deployment of software releases on desired cloud environment
- Improved efficiency of developers, operators, and Site Reliability Engineers with best-of-breed tools for CI/CD, observability & logging
- A consistent, upstream-aligned, automated multi-cluster operation across SDDC, public cloud, and Edge environments that is ready for end-user workloads and ecosystem integrations, through TKG
- Advanced security benchmarking to boost the security of Kubernetes clusters and apps
- Ease of managing access, applying security policies, and inspecting clusters for security and configuration risks