- Posted by Nikola Markovic
- On October 15, 2019
- google cloud platform, multiplayer game hosting, public cloud
How SuperAdmins helped the ‘Awakening of Heroes’ project reach the desired levels of scalability, stability and security in its app infrastructure
Project Details and Case Study Overview
Client: Awakening of Heroes – https://www.awakeningofheroes.com/
Awakening of Heroes is a free-to-play 5v5 MOBA (Multiplayer Online Battle Arena) game featuring RPG and battle royale elements available for iOS and Android devices.
- Achieving quick yet reliable resource scalability on an hourly basis
- Attaining server stability and data security
- Achieving high coverage levels within target regions in order to reduce latency to a minimum (even though Google Cloud doesn’t provide services in several key regions)
- Protecting the game from DDoS attacks
- Automating deployment of new app versions
- Further ongoing infrastructure improvements and maintenance
Challenge 1: Reliable Resource Scalability and Building the Game Room
Being a multiplayer online game, Awakening of Heroes is a complex project that includes numerous components and many unknown variables, and it requires a substantial amount of resources. In order to support thousands of users at once without any hiccups, the game must feature a dynamically scalable infrastructure, while the entire project must be approached systematically and with great attention to detail.
Our first task was to establish thorough communication with the team developing the video game and to understand the following:
- What all the moving parts of their project are and how they work
- Which components we need to focus on the most
- What our client’s overarching goal is
Once we had all the necessary input, our team at SuperAdmins was able to come up with a workflow strategy and provide a concrete and actionable plan on how to properly execute all the tasks and deploy tangible solutions that would improve the entire project and move it onto the cloud.
We came to the conclusion that the client needed help with one particular aspect of their system in order to quickly launch their project and release it to the global market.
The aspect in question was developing the Game Room for Awakening of Heroes.
Building the Game Room
Game Room is the component that physically hosts the group of AoH players participating in a single gaming session. Our client chose Google Cloud as their cloud platform, and our task was to build servers in 3 different locations around the globe.
In order for the video game to run smoothly and withstand estimated workloads, the servers required a stable and scalable infrastructure. Since the players come from all corners of the world, they need to be grouped according to their location and placed on the appropriate server.
From the infrastructure standpoint, our task was to build a scalable environment that could be expanded later on in terms of adding more servers when and where necessary. We needed to find a way to execute this task in a quick, simple yet efficient way and without any backend issues so the players wouldn’t experience any hiccups while playing the game.
Our solution was to build a dynamic server infrastructure that makes seamless scalability possible and allows the platform to integrate new players at any given moment.
This was achieved by automatically tracking the number of players and the load, and by adding virtual machines and CPUs when the average load reaches certain thresholds. We also make sure the database can handle the increasing workload.
These project aspects were especially vital for tackling load peaks, which are quite common in multiplayer games and depend on the time of day, location, promotion of the game, etc. When these load peaks happen, the number of players can reach unexpected heights, which means the game’s infrastructure must be ready to withstand sudden surges in terms of the number of active players without any downtime. Load drops are also very common, so we had to build a system that is able to automatically shut down unnecessary server instances when they are no longer needed.
This resulted in a substantial boost in cost-efficiency. Through using a dynamically scalable infrastructure, our client is now able to use only the servers that are necessary at that particular moment, and therefore optimize the amount of money spent on resources.
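The scale-up and scale-down behaviour described above can be illustrated with a minimal sketch. It is not the mechanism built for AoH, which the case study does not detail. The function name, the thresholds and the per-instance player capacity are all assumptions made for the example.

```python
import math


def desired_instances(active_players, players_per_instance, avg_cpu_load,
                      current, min_instances=1, max_instances=20,
                      scale_up_load=0.75, scale_down_load=0.30):
    """Return how many game-server instances should be running.

    All thresholds and limits are illustrative assumptions.
    """
    # Capacity needed just to seat the players who are online right now.
    needed_for_players = math.ceil(active_players / players_per_instance)

    # Step one instance up or down depending on the average load.
    target = current
    if avg_cpu_load >= scale_up_load:
        target = current + 1
    elif avg_cpu_load <= scale_down_load:
        target = current - 1

    # Never drop below what the current player count requires,
    # and stay within the configured floor and ceiling.
    target = max(target, needed_for_players)
    return max(min_instances, min(max_instances, target))
```

Called periodically with fresh metrics, a function like this adds capacity during a load peak and releases one instance at a time when the load drops, which is where the cost saving comes from.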
The Importance of Load Balancer
The game itself has several vital components, one of which is the control server that decides which player is placed on which server. The load balancer sits in front of the game servers. Its purpose is to balance traffic across them and to expose a single endpoint for all of them, which is crucial for the aforementioned scalability.
The load balancer also regulates the usage of servers according to the current number of players. As the number of virtual machines depends directly on the number of players, we came to the conclusion that the load balancer component should be placed on the control server.
We deployed Google Cloud Platform load balancers, but the instances in the autoscaling group are not added by GCP. Instead, we created a separate mechanism that adds instances to and removes them from the autoscaling group. This means that scaling is now handled at the app level.
Challenge 2: Server Stability and Data Security
Properly executed server monitoring was key for both server stability and data stability and is closely related to server scalability. In a web hosting environment, server monitoring is a rather simple task. The multiplayer gaming landscape makes the process more involved and requires internal monitoring within the game itself.
For example, you can have a scenario in which the servers are working properly and there are no malfunctions reported via traditional server scanning, but the players cannot log in. This is where the in-game stability monitoring comes into play.
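A minimal sketch of that idea: a probe that reports a problem when the host looks fine but a login attempt fails. The two check functions are hypothetical placeholders passed in by the caller. The case study does not describe how the real checks were implemented.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    host_ok: bool
    login_ok: bool

    @property
    def status(self) -> str:
        if not self.host_ok:
            return "host_down"
        if not self.login_ok:
            # The machine is up, but players cannot get into the game.
            return "game_degraded"
        return "healthy"


def _safe(check: Callable[[], bool]) -> bool:
    """Treat any exception raised by a check as a failed check."""
    try:
        return bool(check())
    except Exception:
        return False


def run_probe(check_host: Callable[[], bool],
              try_login: Callable[[], bool]) -> ProbeResult:
    """Run the host-level check, then the in-game login check."""
    host_ok = _safe(check_host)
    # Skip the login attempt when the host itself is unreachable.
    login_ok = _safe(try_login) if host_ok else False
    return ProbeResult(host_ok, login_ok)
```

For example, `run_probe(lambda: True, lambda: False).status` evaluates to `"game_degraded"`, the case that traditional server scanning would miss.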
In order to reach server stability, SuperAdmins and AoH teams had to work together in the following setup:
- AoH team makes sure the relevant data is accessible
- SA team gathers and monitors the data
Data Security and Stability
Since certain users are highly motivated to hack into the game so they can earn virtual money or progress by cheating, it was vital for us to keep such breaches at bay at all times.
There are 2 main aspects of Data Security and Stability in this project:
- Communication between the AoH app, user’s device and the servers
- Database security within the infrastructure
The communication between the app, the user’s device, and the server it connects to needed to be as stable and secure as possible. To that end, the SuperAdmins team analyzed the traffic and streamlined troubleshooting. The data travelling from the load balancer to the servers uses plain HTTP (so as not to hinder troubleshooting or add latency), which leaves a significant security gap and makes this hop susceptible to breaches. This is where we came in to make sure the data in transit cannot be intercepted or accessed.
The data within the infrastructure database (which stores info on our client), on the other hand, had to be encrypted. In the gaming environment, the user’s data isn’t too exposed and is stored within the app/device, while from the infrastructure standpoint, the data is stored and secured in a centralized database that cannot be reached from the Internet. The only entity that has access to this centralized database is the server itself.
The SuperAdmins team made sure that:
- The data flow from servers to users is secured
- The data within the centralized database is securely stored
- Regular backups are taking place
- Other security-related tasks are performed regularly and effectively
Once all the above-mentioned aspects of server and data stability and security were in place, we performed internal security tests using potent breaching tools and came to the conclusion that the system was safe, stable and secure.
Challenge 3: Regional Coverage and Curbing the Latency
As multiplayer video games typically cater to users from various regions of the world, players need to be grouped according to their location in order to keep latency low. With this in mind, each region should have a dedicated server used only for local players (which is why most similar apps ask users for location information). Mobile multiplayer games are more sensitive to latency than desktop games, as players connect via WiFi, 3G, 4G, etc.
Our job was to find the right Google Cloud service package and thereby achieve the optimal level of scalability.
With Google Cloud, we were able to cover servers for 3 different regions.
The main issue here was that Google Cloud doesn’t cover certain regions that are home to part of our target audience (Russia, for example). To reduce latency as much as possible in those regions, we decided to rent local infrastructure beyond Google Cloud’s coverage, which means this project involved 3rd party server providers in addition to Google Cloud.
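One simple way to picture region assignment is to send each player to the region with the lowest measured round-trip time, whether that region runs on Google Cloud or on a 3rd party provider. The sketch below is only an illustration. The region names and the 150 ms limit are made up, and the case study does not say how AoH actually groups its players.

```python
def pick_region(latencies_ms, max_acceptable_ms=150):
    """Pick the region with the lowest measured latency.

    latencies_ms maps a region name to a round-trip time in milliseconds.
    Returns (region, within_limit), where within_limit is False when even
    the best region is slower than max_acceptable_ms.
    """
    if not latencies_ms:
        raise ValueError("no latency measurements available")
    region, rtt = min(latencies_ms.items(), key=lambda item: item[1])
    return region, rtt <= max_acceptable_ms


# Hypothetical measurements taken from a player's device.
measurements = {"gcp-region-a": 180, "gcp-region-b": 210, "partner-region-c": 45}
best, ok = pick_region(measurements)
```

In this made-up case the 3rd party region wins, which mirrors the reasoning above: without it, the player's best option would exceed the latency limit.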
Challenge 4: Protection from DDoS Attacks
The load balancer also acts as a protection layer against DDoS (distributed denial-of-service) attacks. As the load balancer is offered as a managed “service” by the cloud provider, it is automatically protected against certain DDoS attacks, and its deployment directly influences the stability and security of the servers.
As the popularity of Awakening of Heroes rises, we expect more players to join our community, which will also raise the risk of attacks. In that case, we plan to add sturdier 3rd party DDoS protection services on top of the protective layers that come with the cloud service provider. These proprietary DDoS protection platforms have stronger layers of security that are able to detect attacks in their earliest phases and further protect the game from security breaches.
Challenge 5: Reliable and Automated Deployment of New App Versions
Having multiple regions in which the new app version deployment needs to take place is somewhat of a challenge and is closely related to the version of the app located on users’ devices. There are numerous components to take into account, while the release process consists of several phases which do not always happen consecutively.
For example, whenever a change on the API level is not backwards compatible with the app version installed on users’ devices, it affects certain functionalities within the gameplay itself. This is where release updates come into play.
This process involves several stages that need to take place simultaneously, with some of these phases having higher priority than others.
To tackle this issue properly on the server and infrastructure level, our team adopted the combined practices of continuous integration and continuous delivery/deployment, or CI/CD. In other words, we perform a Git push to the testing branch, and when all the automated tests pass and the build phase succeeds, we deploy the new app version.
The best practice, and the one we opted for, is to deploy first in the region with the lowest number of players. We do a Git push to the master branch so that the CI/CD pipeline detects the new release and the deployment within the auto-scaling group happens automatically.
This is most commonly done via the blue-green deployment strategy during which one server group is removed from the auto-scaling group and the new servers containing the updated app version are added. Internal tests take place and when all the moving parts show no friction, the new server group assumes the primary role. In this dynamically scalable environment, additional servers can always be added later on if and when necessary.
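The rollout described above can be sketched as follows: regions are handled from the least to the most populated, and in each region the new ("green") servers replace the old ("blue") ones only after passing a test. The class, the function names and the stop-on-first-failure rule are assumptions for this example, not a description of the actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    players: int
    live_servers: list
    live_version: str


def blue_green_swap(region, green_servers, new_version, smoke_test):
    """Promote the green servers only if every one passes the smoke test."""
    if not green_servers or not all(smoke_test(s) for s in green_servers):
        return False  # blue stays live, nothing has changed
    region.live_servers = list(green_servers)
    region.live_version = new_version
    return True


def rollout(regions, build_green, new_version, smoke_test):
    """Deploy region by region, least populated first; stop on first failure."""
    updated = []
    for region in sorted(regions, key=lambda r: r.players):
        green = build_green(region)  # hypothetical hook that creates servers
        if not blue_green_swap(region, green, new_version, smoke_test):
            break
        updated.append(region.name)
    return updated
```

Starting with the smallest region limits the number of players affected if the new version misbehaves, and a failed smoke test leaves the old server group serving traffic.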
Challenge 6: The Maintenance Phase
This phase has a workflow similar to the release stage. The main goal is to reach the highest availability for each app, which is done by having multiple servers behind the load balancer. Whenever a maintenance task needs to be performed within the main infrastructure, we remove that particular server from the load balancer, perform the maintenance procedure on it (updates, patches, etc.), and then put the server back into the load balancer.
Important note: In any cloud-based infrastructure, each resource (especially the server instances in the auto-scaling group on which the app runs) has to be designed so that it can be shut down at any given moment and make room for the new server instance that will replace it. Within the scalable system that we created for AoH, this process is done without any hiccups in a so-called “destroy and forget” manner.
Achieving no or minimal downtime is crucial during new app version deployment and maintenance. As Awakening of Heroes is available on Google Play and Apple’s App Store, one potential pitfall is that we cannot control the update time. When we send the new game version to Google or Apple, we don’t know if it will be published within the next 5 minutes or 5 hours.
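As a rough sketch of this one-server-at-a-time procedure, the function below takes each server out of rotation, runs a maintenance step on it, and puts it back only if the step succeeded. The set standing in for the load balancer pool, the `maintain` callback and the minimum-capacity rule are all assumptions for the example.

```python
def rolling_maintenance(in_rotation, maintain, min_in_rotation=1):
    """Maintain servers one by one while the rest keep serving players.

    in_rotation is a set of server names behind the load balancer and is
    modified in place. maintain(server) returns True on success.
    """
    results = {}
    for server in sorted(in_rotation):
        # Refuse to drain a server if too few would remain in rotation.
        if len(in_rotation) - 1 < min_in_rotation:
            raise RuntimeError("not enough capacity to drain a server")
        in_rotation.discard(server)  # remove from the load balancer
        try:
            results[server] = bool(maintain(server))
        except Exception:
            results[server] = False
        if results[server]:
            in_rotation.add(server)  # healthy again, put it back
    return results
```

A server whose maintenance fails stays out of rotation, so players are never routed to a half-patched machine.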
We kept downtime to a minimum in this phase by running both the old and the new version at the same time on parallel infrastructures, which is where DNS comes into play along with traffic routing. When a new version is published, one target URL is defined on the load balancer, and it automatically sends the players with the new version to the other set of servers.
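The routing decision can be pictured as a comparison between the version a client reports and the newly released version. This is a sketch under the assumption that clients report a dotted numeric version string. The pool names are placeholders, and the real routing rule on the load balancer is not described in this case study.

```python
def parse_version(version):
    """Turn a dotted version string such as '1.10.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def pick_backend(client_version, new_version, old_pool, new_pool):
    """Send clients on the new version (or later) to the new server set."""
    if parse_version(client_version) >= parse_version(new_version):
        return new_pool
    return old_pool
```

Comparing parsed tuples rather than raw strings matters here: as text, "1.9.3" sorts after "1.10.0", which would send players with the old build to the new servers.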
The most tangible impact lies in the cost-saving aspect and the overall revenue of the project. This was achieved through the infrastructure that is able to support the initial number of players but also expands as the number of players grows and the popularity of the game increases, which eliminates the need for substantial capital investments upfront.
With infrastructures built on physical servers, the investment diagram looks something like this:
initial investment → number of users exceeds server capacities → additional investments
This type of incremental expenditure is not optimized in terms of how much one pays for the infrastructure compared with the overall payoff. Additionally, it can lead to considerable losses should the number of users go down, as on-prem servers are not able to scale back.
Cloud-based infrastructures, on the other hand (like the one we built for Awakening of Heroes), are capable of scaling both up and down. This means the AoH infrastructure can now seamlessly grow along with the number of players, while each growth increment and its cost is easily predicted and optimized.
This automatically affects the revenue as well. As the game is free-to-play, the AoH team counts on the conversion rate and the average revenue per user. With their current infrastructure setup, they can calculate the average revenue per user they need to target in order to achieve profitable growth. They can also determine the minimum value of this metric that still allows further growth, and set their shop prices accordingly.
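The minimum value of that metric can be illustrated with simple arithmetic: the costs for a period divided by the number of active users in the same period. The sketch below uses invented figures and a deliberately simplified cost model. It does not reflect AoH's actual numbers.

```python
def break_even_arpu(monthly_infra_cost, monthly_active_users, other_costs=0.0):
    """Average revenue per user needed to cover the month's costs."""
    if monthly_active_users <= 0:
        raise ValueError("monthly_active_users must be positive")
    return (monthly_infra_cost + other_costs) / monthly_active_users


# Invented example: 12,000 in infrastructure costs and 40,000 active users.
minimum_arpu = break_even_arpu(12_000, 40_000)
```

Because cloud costs follow the player count, this figure can be recalculated for each growth step instead of being fixed by an upfront hardware purchase.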