3 September 2024

Top 3 Common Load Testing Issues

There are a number of issues we often see across our load testing engagements, but after a brief survey of our load tests, we’ve narrowed it down to just three. All of these issues, like most load testing issues, are scalability issues. We rank them by severity, based on how badly users are impacted and how many users are affected. While almost every issue in this space impacts users to some degree, these three are common and severe enough that we wanted to draw attention to them.

Each of these is a category of scalability issue rather than a single, specific problem. There are of course bigger stories and finer details to be found, but we’re going to keep it simple and focus on the core of each one. Unlike security, where a vulnerability often points directly at its fix, these issues don’t hand you much in the way of obvious, actionable solutions!

Most Common Load Testing Issue: Database Scalability

This is a hugely important issue that crops up frequently. We believe that because teams are often focused on core functionality and container scalability, performant queries aren’t often considered. We build something that works and we’re happy to move on; it’s uncommon for anyone to stop and ask, “how well will this hold up in the long term?”.

Whether relational, non-relational, or something else entirely, pretty much every application has a database to store its information. These databases serve different purposes: some back caching and queuing systems, others store user or player data. Naturally, the more users you have, the more queries hit that data and the more of it is being stored. One user is one state; jump to ten thousand and the load multiplies dramatically. It’s not just ten thousand states but ten thousand streams of write operations: quest completions, registrations, inventory updates, everything! This also extends to any matchmaking system, which is creating tickets for each instance, ensuring the right players end up in the same match, and making sure the queue isn’t getting backed up, and all of it is talking to your database. These queries are constant and intensive, so one very common issue we find in our load testing is database scalability.

These failures show up as database-related errors as the database struggles to handle the volume of queries and data transfer. They can also be the result of an incorrectly sized connection pool, or a more general memory and CPU shortage in the database cluster.
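To make the connection pool point concrete, here is a minimal sketch of how a pool might be tuned in Python with SQLAlchemy. The connection string, pool sizes, and timeout are placeholder assumptions for illustration; the right values depend entirely on your own database cluster.

```python
# Minimal sketch of connection-pool tuning with SQLAlchemy (all values illustrative only).
from sqlalchemy import create_engine

# A pool that is too small makes requests queue up (or time out) under load,
# even when the database itself still has CPU and memory to spare.
engine = create_engine(
    "postgresql+psycopg2://user:password@db-host:5432/game",  # placeholder DSN
    pool_size=20,        # steady-state connections kept open
    max_overflow=30,     # extra connections allowed during spikes
    pool_timeout=5,      # seconds to wait for a free connection before failing
    pool_pre_ping=True,  # detect and replace dead connections before use
)

with engine.connect() as conn:
    conn.exec_driver_sql("SELECT 1")  # simple health check
```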

Solutions for Database Scalability

We typically advise a twofold solution. First, scale out your database cluster by adding instances, which lets you handle more traffic and redirect incoming load without causing problems. Second, scale up the CPU and memory capacity of one or several of those databases. As for the connection pool, more memory and CPU won’t solve it; an additional instance is required. Scaling up a single machine is usually the first solution people reach for, but it can also concentrate the overload in one place, which is why we call for a larger cluster of databases. That, in turn, indirectly helps solve the memory and CPU issue.
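As a rough illustration of the “bigger cluster” idea, here is a hedged sketch of splitting traffic so writes stay on a primary while reads go to a replica. The host names, table names, and helper functions are assumptions made up for this example, not a prescription for your setup.

```python
# Illustrative read/write split across a primary and a read replica (names are placeholders).
from sqlalchemy import create_engine, text

primary = create_engine("postgresql+psycopg2://user:password@db-primary:5432/game")
replica = create_engine("postgresql+psycopg2://user:password@db-replica:5432/game")

def save_quest_completion(player_id: int, quest_id: int) -> None:
    # Writes always go to the primary.
    with primary.begin() as conn:
        conn.execute(
            text("INSERT INTO quest_log (player_id, quest_id) VALUES (:p, :q)"),
            {"p": player_id, "q": quest_id},
        )

def load_inventory(player_id: int) -> list:
    # Read-only traffic can be spread across one or more replicas.
    with replica.connect() as conn:
        rows = conn.execute(
            text("SELECT item_id, quantity FROM inventory WHERE player_id = :p"),
            {"p": player_id},
        )
        return rows.fetchall()
```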

In the long term, and for proper efficiency, we would advise your team to optimize the SQL queries going to the database. Ensure there aren’t too many queries firing at once, and cache certain data so there doesn’t have to be constant back-and-forth. There’s a lot the Cyrex team would say can be done at the code level, such as ensuring you have the right indexes in your databases to speed up lookups. These, to us, are the more interesting, longer-lasting, and sustainable solutions. Optimal and superior overall, but we understand this kind of solution takes time and an investment of skill. It’s a difficult solution to implement when you’re only preparing for a spike of users next week with your next update!
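A minimal sketch of both ideas, indexing the column you filter on and caching hot reads for a short period, is below. The table, column, and cache lifetime are hypothetical examples, and a production system would likely use a shared cache rather than an in-process dictionary.

```python
# Illustrative only: an index on the filtered column, plus a short-lived cache
# so the same lookup doesn't hit the database on every request.
import time
from sqlalchemy import text

def ensure_index(conn) -> None:
    # One-off migration: index the column used in the WHERE clause below (names are placeholders).
    conn.execute(text("CREATE INDEX IF NOT EXISTS idx_inventory_player ON inventory (player_id)"))

_cache: dict[int, tuple[float, list]] = {}
CACHE_TTL = 30  # seconds; slightly stale but cheap beats fresh but slow for hot reads

def cached_inventory(conn, player_id: int) -> list:
    now = time.monotonic()
    hit = _cache.get(player_id)
    if hit and now - hit[0] < CACHE_TTL:
        return hit[1]  # served from memory, no database round trip
    rows = conn.execute(
        text("SELECT item_id, quantity FROM inventory WHERE player_id = :p"),
        {"p": player_id},
    ).fetchall()
    _cache[player_id] = (now, rows)
    return rows
```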

Critical Load Testing Issue: Server Container Scalability

We understand that this is a broad term! What we’re referring to in this piece is the core machine: whether virtual or bare metal, the piece that runs your OS, your APIs, and the core functionality of your server system.

This, naturally, is crucial when it comes to scalability, mostly around memory and CPU in this case. Wherever you’re running pieces of your code, each piece needs a certain amount of memory and CPU: to do its calculations and to hold certain data. The issue we see is those limits being reached during our load testing.
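As a rough sketch of what “limits being reached” looks like in practice, here is a small Python loop that watches CPU and memory on a machine while a load test runs. The thresholds are arbitrary examples, and in a real setup this signal would come from your monitoring stack rather than a script.

```python
# Rough sketch: watch CPU and memory on a server container during a load test.
# Thresholds are arbitrary examples, not recommendations.
import time
import psutil

CPU_LIMIT_PCT = 85
MEM_LIMIT_PCT = 90

def watch(interval_s: float = 5.0) -> None:
    while True:
        cpu = psutil.cpu_percent(interval=1)        # CPU usage over the last second
        mem = psutil.virtual_memory().percent       # current memory usage
        if cpu > CPU_LIMIT_PCT or mem > MEM_LIMIT_PCT:
            print(f"WARNING: cpu={cpu:.0f}% mem={mem:.0f}% - container is near its ceiling")
        else:
            print(f"cpu={cpu:.0f}% mem={mem:.0f}%")
        time.sleep(interval_s)

if __name__ == "__main__":
    watch()
```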

Solutions for Server Container Scalability

The solutions are similar to those for database scalability: more instances, more containers, a gradual increase of your capacity. The limit is something that will always be reached eventually; the big question is when you need to push it higher.

That’s a judgement for your team to make. If you’re talking about a one-off launch spike, or a single API with a heavy call load, it’s not really worth it. In the latter case, you’d be better off looking at the efficiency of that API and asking why it isn’t performant.

Where it is needed, we advise a dual solution: more infrastructure, plus optimization of your underlying code to make it more performant.

Important Note for Cloud Server Users

We must also impress the importance of this point if you’re using a cloud subscription for your server containers. There are limits, naturally, on what cloud infrastructure you have access to, and a request to raise those limits can take days to be filled by the cloud provider. If you find yourself at a hard limit in the middle of a rush, you won’t find more cloud infrastructure available quickly. While there are premium subscriptions that might get faster responses from support, in general you’re still talking hours to a day at least. This is important to consider when looking at your release schedule: cloud infrastructure takes time to scale.

Load testing, like the service we offer, means you won’t get caught out on launch day, sat waiting for days for your cloud capacity to increase!

Often Missed Load Testing Issue: Misconfigurations Regarding Gateways

For those who don’t know, a gateway is a portal that networking calls pass through. In any server process, you’ve got a ton of these gateways. On one end, you’ve got your server doing calculations and sending responses; on the other, your users. In between, usually, you’ve got layers of gateways that traffic passes through.

So, for example, you go onto Amazon. You have gateways guiding you to the right region, then the right area within said region, then further into private customer gateways, then into a specific cluster of machines, and so on! Included in there are load balancers to keep users and players in the right place without any crashes or delays.  

Each gateway is a server instance that routes traffic based on load, region, and half a dozen more qualifiers! The average user probably passes through a dozen gateways in a large-scale operation. So, if you’re building gateways, it’s good to limit their number as much as possible. Each gateway is a hub, and each hub is a potential delay for the user’s connection. It’s also another element to maintain, manage, and potentially misconfigure!

While gateways can suffer from the same CPU and memory load issues, the major issue we see is misconfiguration. Something as simple as the user connection limit being set to the wrong, low value can see the gateway shutting users out. This often leads to scaling up, in the belief that it’s a memory and CPU problem, but the gateway will still cut off after, say, 1,000 connections, because that’s the value it has been misconfigured with!
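To show how this differs from a resource problem, here is a toy sketch of a gateway whose connection cap comes purely from configuration. The port, limit, and proxy behaviour are invented for illustration; real gateways expose this as a config setting rather than a constant in code.

```python
# Toy gateway sketch: the connection cap is a configured value, not a function of CPU or memory.
# If MAX_CONNECTIONS is misconfigured (say 1,000 instead of 10,000), connection 1,001 is
# refused no matter how much headroom the machine has.
import asyncio

MAX_CONNECTIONS = 1000  # misconfigured value; the intended limit might be 10000
active = 0

async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    global active
    if active >= MAX_CONNECTIONS:
        writer.write(b"503 gateway at configured connection limit\n")
        await writer.drain()
        writer.close()
        return
    active += 1
    try:
        data = await reader.readline()          # pretend to proxy the request onward
        writer.write(b"forwarded: " + data)
        await writer.drain()
    finally:
        active -= 1
        writer.close()

async def main() -> None:
    server = await asyncio.start_server(handle, "0.0.0.0", 8080)
    async with server:
        await server.serve_forever()

asyncio.run(main())
```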

Misconfiguration Limitations and Solutions

Of course, these limits can be useful in many cases as a failsafe. However, we’ve often seen load tests fail because a misconfigured gateway was set to 1,000 users instead of 10,000. Once that limit is hit, all active connections through that gateway will crash and be abandoned by the system.

The solution is easy, of course! Simply change the value on the gateway! A limit should still be in place, but it should be a predetermined, acceptable upper value.

Load Testing Issues and Solutions with Cyrex

A load testing issue is always a problem! No matter where it sits in your system, users will be affected, and that is a problem!

With live-ops games and the increasingly online, update-focused world of gaming, consistent and reliable service has never been more important. A poor launch or update can result in a huge drop in audience, and that can significantly impact your longevity.

With Cyrex load testing services and our proprietary Cyrex Swarm technology, you will know exactly where your system struggles and how to fix it!  

Get in touch with the Cyrex team today! Your gold standard in all things load testing, digital security, and software development.