Srijan | Case Study

Enhancing Gaming Platform Performance: A Focus on Multi-Cloud Security & Resiliency

Written by Team Srijan | Oct 3, 2022 6:01:49 AM

The Requirements

The client needed help ensuring the security of its workloads and the availability and reliability of its applications in a multi-cloud environment. The detailed requirements included:

  • Managing a mesh of 70-80 microservices of an application in a highly resilient manner
  • Seamless and continuous delivery of frequent code changes
  • Keeping the lights on round the clock through a site reliability engineering approach
  • Implementing a Zero Trust security model to make sure that all the endpoints are protected
  • Optimizing the overall website performance

Challenges

In the current transformed state, all the applications were running across different public and private cloud providers where the CI/CD was managed by the respective providers. Each application team had its own flavor of platforms and tools. In such a siloed setup, there were potential blockers that could lead to availability and reliability issues. The issues include:

  • challenges in endpoint protection
  • error-prone manual processes in new software releases
  • lack of visibility into key performance indicators

The Solution

We worked with the customer’s business team during the discovery phase to understand their needs at a granular level and found that their:

  • Online casino website modernization involved 10+ dockerized microservices in Java running on a cluster with a database. While the games were running on Apache web server and AWS S3 with Cloudflare, their frontend services for streaming the games were on redis.3.
  • Back office application modernization involved 69+ containerized microservices and 4+ non-containerized applications. All the services use .NET Core and .NET Framework (Legacy) as the runtime.

We offered 24x7 availability of reliability engineers who:

  • can take care of kernel hardening at the SSH, OS, network, service, and cluster levels
  • come with hands-on experience with various CNI plugins (e.g. Calico, Antrea) and Ingress controllers (e.g. Contour, NGINX), and can offer a smoother communication experience at the Pod and application levels
  • can build KPI metrics dashboards using TO/Grafana to create visibility
  • can periodically execute mock drills through Litmus Chaos tests to recreate failover scenarios so that potential weak spots and outages can be identified

To ensure a unified experience, management and monitoring of the infrastructure, and high application availability and reliability, we did the following:

  • Deployed Tanzu Mission Control to create Tanzu Kubernetes Grid (TKG) clusters in AWS/Azure cloud so that applications can be containerized and easily run across multiple cloud environments
  • Built a cloud-agnostic multi-cloud platform, where changing the cloud provider is handled as a simple plug-out/plug-in. It offered a unified experience:
    • for cluster operators, to manage Kubernetes clusters at scale using Tanzu Mission Control (TMC)
    • for developers, with self-service access to Kubernetes in their chosen environment with security and policy guardrails in place, leading to better engineering velocity
  • Mitigated container runtime dependency with a container-in-container solution
  • Upgraded the TMC console to provide one-click cluster upgrade capabilities for major Kubernetes releases
  • Used open source tools instead of cloud provider managed services for CI/CD
  • Facilitated continuous delivery of frequent code changes and enabled one-click attachment or detachment of Kubernetes clusters on TMC
  • Added self-healing capabilities, including the Horizontal Pod Autoscaler (HPA), Pod Disruption Budgets (PDB), and Cluster Autoscaler, to bring high resiliency to the infrastructure, which is one of the key SRE attributes
  • Offered increased security through TMC, which supports a Zero Trust security model by adding native capability to roll out conformance tests, such as CIS benchmarking, at the node and cluster levels
  • Implemented observability and compatibility by using community tools to perform distributed tracing across a large number of microservices
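
The self-healing behavior listed above can be illustrated with a minimal Python sketch. It assumes the replica calculation described in the Kubernetes Horizontal Pod Autoscaler documentation (desired = ceil(current * currentMetric / targetMetric), with a default tolerance of 10%) and a deliberately simplified view of a Pod Disruption Budget's minAvailable check. The function names, metric values, and replica limits are hypothetical and are not taken from the client's configuration.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA scaling rule (illustrative values, not the client's)."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp the result to the configured lower and upper bounds.
    return max(min_replicas, min(max_replicas, desired))


def pdb_allows_eviction(healthy_pods, min_available):
    """Simplified PDB check: a voluntary eviction must leave min_available healthy pods."""
    return healthy_pods - 1 >= min_available


if __name__ == "__main__":
    # Hypothetical load: 4 replicas at 90% average CPU against a 60% target.
    print(desired_replicas(4, current_metric=90, target_metric=60))
    # Hypothetical budget: 3 healthy pods, at least 3 must stay available.
    print(pdb_allows_eviction(healthy_pods=3, min_available=3))
```

The HPA adds or removes pods as load changes, the PDB keeps node drains and upgrades from removing too many pods at once, and the Cluster Autoscaler (not modeled here) adds nodes when pods cannot be scheduled.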

Our Approach

As part of our proactive site reliability engineering (PSRE), we keep tabs on symptoms rather than waiting for an actual impact on the system. We achieved this for our client’s microservice-based environment by identifying the right SLIs (Service Level Indicators) to track symptoms. We then gauged them against a set SLO (Service Level Objective). As a result, we ensured that operations run within the agreed SLA (Service Level Agreement), along with a better user experience.
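
As a rough illustration of gauging an SLI against an SLO, the following Python sketch computes an availability SLI and the share of the error budget that remains. The request counts, the 99.9% objective, and the 25% warning threshold are hypothetical assumptions chosen for the example; they are not figures from this engagement.

```python
def availability_sli(good_events, total_events):
    """SLI: fraction of events that met the success criterion."""
    if total_events == 0:
        return 1.0  # no traffic, so nothing has failed
    return good_events / total_events


def error_budget_remaining(good_events, total_events, slo):
    """Fraction of the error budget still unspent (negative when overspent)."""
    if total_events == 0:
        return 1.0
    actual_failures = total_events - good_events
    allowed_failures = (1.0 - slo) * total_events
    if allowed_failures <= 0:
        # A 100% objective leaves no budget: any failure exhausts it.
        return 1.0 if actual_failures == 0 else 0.0
    return 1.0 - actual_failures / allowed_failures


def needs_attention(good_events, total_events, slo, warn_below=0.25):
    """Flag a symptom early, before the SLO itself is breached."""
    return error_budget_remaining(good_events, total_events, slo) < warn_below


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
    print(availability_sli(999_750, 1_000_000))
    print(error_budget_remaining(999_750, 1_000_000, slo=0.999))
    print(needs_attention(999_750, 1_000_000, slo=0.999))
```

Alerting on the remaining budget rather than on a breach is one way to act on symptoms before users feel an impact.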

Here is an overview diagram of the approach of the unified platform. The platform comes with built-in SRE best practices to meet the SLA, SLO, and SLI.

 

Business Benefits

With our solution the client can now ensure:

  • A unified experience for all the application teams
  • An SRE-based solution that helps in the application modernization journey through out-of-the-box policies that increase control and security of Kubernetes clusters
  • Operators can scale operations in a highly resilient manner, using Kubernetes clusters that come with high elasticity
  • Automated deployment of software releases on desired cloud environment
  • Improved efficiency of developers, operators, and Site Reliability Engineers with best-of-breed tools for CI/CD, observability, and logging
  • A consistent, upstream-aligned, automated multi-cluster operation across SDDC, public cloud, and edge environments that is ready for end-user workloads and ecosystem integrations, through TKG
  • Advanced security benchmarking to boost the security of Kubernetes clusters and apps
  • Ease of managing access, applying security policies, and inspecting clusters for security and configuration risks