The Requirements

The client needed help ensuring the security of their workloads and the availability and reliability of their applications across their multi-cloud environment. The detailed requirements included:

Managing a mesh of 70-80 microservices of an application in a highly resilient manner
Seamless and continuous delivery of frequent code changes
Keeping the lights on around the clock through a site reliability engineering (SRE) approach
Implementing a Zero Trust model to make sure that all the endpoints are protected
Optimizing the overall website performance

Challenges

In their current state, the applications were running across different public and private cloud providers, with CI/CD managed by the respective providers. Each application team had its own flavor of platforms and tools. This siloed setup created potential blockers that could lead to availability and reliability issues, including:

Challenges in endpoint protection
Error-prone manual processes in new software releases
Lack of visibility into key performance indicators

The Solution

We worked with the customer’s business team during the discovery phase to understand their needs at a granular level and found that their:

Online casino website modernization involved 10+ Dockerized microservices in Java running on a cluster with a database. The games were running on Apache web server and AWS S3 with Cloudflare, while the frontend services for streaming the games were on redis.3.
Back office application modernization involved 69+ containerized microservices and 4+ non-containerized applications. All the services use .NET Core and .NET Framework (legacy) as the runtime.

We offered 24x7 availability of reliability engineers who:

can take care of kernel hardening at the SSH, OS, network, services, and cluster levels
come with hands-on experience with various CNI plugins (e.g., Calico, Antrea) and ingress controllers (e.g., Contour, NGINX), enabling smoother communication at the Pod and application levels
can build KPI dashboards using TO/Grafana to create visibility
can periodically run mock drills with LitmusChaos tests to recreate failover scenarios so that potential weak spots and outages can be identified

To ensure a unified experience, consistent management and monitoring of the infrastructure, and high application availability and reliability, we did the following:

Deployed Tanzu Mission Control (TMC) to create Tanzu Kubernetes Grid (TKG) clusters in AWS/Azure cloud so that applications can be containerized and easily run across multiple cloud environments
Built a cloud-agnostic multi-cloud platform in which changing cloud providers is handled as a simple plug-out/plug-in. It offered a unified experience:
- for cluster operators, to manage Kubernetes clusters at scale using TMC
- for developers, with self-service access to Kubernetes in their chosen environment with security and policy guardrails in place, leading to better engineering velocity
Mitigated container runtime dependency with a container-in-container solution
Upgraded the TMC console to provide one-click cluster upgrades for major Kubernetes releases
Used open source tools instead of cloud provider managed services for CI/CD
Facilitated continuous delivery of frequent code changes and enabled one-click attachment or detachment of Kubernetes clusters on TMC
Added self-healing capabilities, including the Horizontal Pod Autoscaler (HPA), PodDisruptionBudgets (PDB), and Cluster Autoscaler, to make the infrastructure highly resilient; this is one of the key SRE attributes
Offered increased security through TMC, which enables a Zero Trust model by adding native capability to roll out conformance tests such as CIS benchmarking at the node and cluster levels
Implemented observability and compatibility by using community tools to perform distributed tracing of a large number of microservices
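
Of the self-healing capabilities listed above, the Horizontal Pod Autoscaler (HPA) is the simplest to illustrate. The sketch below is a simplified version of the scaling rule described in the Kubernetes documentation: the desired replica count is the current count multiplied by the ratio of the observed metric to the target metric, rounded up, with no change when the ratio is within a tolerance (10% by default). The function name, the replica bounds, and the sample numbers are illustrative assumptions, not values from the client's clusters.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA scaling rule (hypothetical helper, not a Kubernetes API)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Close enough to the target: keep the current replica count.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    # Scale proportionally to the metric ratio, rounding up.
    desired = math.ceil(current_replicas * ratio)

    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Assumed example: 4 Pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

In this assumed example the ratio is 1.5, so the rule asks for 6 Pods. A PodDisruptionBudget and the Cluster Autoscaler act at different layers (voluntary evictions and node capacity, respectively) and are not modeled here.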

Our Approach

As part of our proactive site reliability engineering (PSRE), we keep tabs on symptoms rather than waiting for an actual impact on the system. We achieved this for our client’s microservice-based environment by identifying the right SLIs (Service Level Indicators) to track symptoms and then gauging them against set SLOs (Service Level Objectives). As a result, we ensured that operations ran within the agreed SLA (Service Level Agreement) and delivered a better user experience.
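
As a rough illustration of gauging an SLI against an SLO, the sketch below computes an availability SLI from request counts and the share of the error budget that has been consumed. The request counts, the 99.9% objective, and all names are hypothetical; the case study does not state the actual indicators or targets that were used.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float              # measured fraction of good requests
    slo: float              # target fraction of good requests
    budget_consumed: float  # share of the error budget used (1.0 = exhausted)

    @property
    def within_slo(self) -> bool:
        return self.sli >= self.slo


def evaluate_availability(good: int, total: int, slo: float) -> SloReport:
    """Compare an availability SLI with its SLO (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")

    sli = good / total
    allowed_bad = (1.0 - slo) * total  # failures the SLO tolerates
    actual_bad = total - good          # failures actually observed
    return SloReport(sli=sli, slo=slo, budget_consumed=actual_bad / allowed_bad)


if __name__ == "__main__":
    # Assumed example: 1,000,000 requests, 500 failures, 99.9% objective.
    report = evaluate_availability(good=999_500, total=1_000_000, slo=0.999)
    print(report.within_slo, round(report.budget_consumed, 3))
```

Tracking the consumed budget rather than only the pass/fail result is one way to act on symptoms early: a team could, for example, raise an alert when a chosen share of the budget is used up, well before the SLO itself is breached.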

Here is an overview diagram of the unified platform approach. The platform comes with built-in SRE best practices to meet the SLA, SLO, and SLI.

Business Benefits

With our solution, the client now has:

A unified experience for all the application teams
An SRE-based solution that supports the application modernization journey through out-of-the-box policies that increase the control and security of Kubernetes clusters
The ability for operators to scale operations in a highly resilient manner, using highly elastic Kubernetes clusters
Automated deployment of software releases to the desired cloud environment
Improved efficiency for developers, operators, and site reliability engineers, with best-of-breed tools for CI/CD, observability, and logging
Consistent, upstream-aligned, automated multi-cluster operation across SDDC, public cloud, and edge environments, ready for end-user workloads and ecosystem integrations, through TKG
Advanced security benchmarking to boost the security of Kubernetes clusters and apps
Easier management of access, application of security policies, and inspection of clusters for security and configuration risks

