Unreal Server Scaling Exercise

Justin Nearing

Pax Dei looks dope:

Their wiki has a Tech section:

They have a section on Sharding and servers and stuff:

Sharded Zoned

We estimate we’ll start with a maximum population in a single shard of around 7,000 players. A single shard is run wholly within one physical cloud hosting center. We will run shards in different availability regions worldwide, e.g., North America and Europe.

Here’s a fun thought exercise: How would I approach the infrastructure requirements for this game if I was in charge of doing just that?

I have no affiliation with Pax Dei and no inside knowledge; it’s just fun to think about.


Concurrent Users

Need to scale synchronous multiplayer servers to max load of 7k concurrent users.

It's probably rare that you get 7k concurrent users. The closest you get is on launch day, when your server code is at its least hardened. Aaaand probably getting hit hardest.

That’s what makes launches so fun!

But, 7k gives us a nice number to start playing with as a loose theoretical maximum.

I'm not a multiplayer networking expert (I'd love to learn), but let's assume 1 action per second per player.

That means ~7k synchronous actions per second per shard at max load.

So if we had a very naive server that could only handle 1 request per second, you'd need 7k servers per shard?
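To make that back-of-the-envelope math easy to play with, here is a minimal sketch. The function name and the 500 actions per second figure are made up for illustration; the 7k players and 1 action per second per player are the assumptions from above.

```python
import math

def servers_needed(players: int,
                   actions_per_player_per_sec: float,
                   actions_per_server_per_sec: float) -> int:
    """Servers required to absorb peak load, with no headroom added."""
    total_actions_per_sec = players * actions_per_player_per_sec
    # Round up: a fraction of a server still means one more server.
    return math.ceil(total_actions_per_sec / actions_per_server_per_sec)

# The very naive server from the text: 1 action per second each.
naive = servers_needed(7000, 1, 1)

# A hypothetical server that handles 500 actions per second.
less_naive = servers_needed(7000, 1, 500)
```

The interesting number is the per-server capacity, which is exactly what a load test would have to measure.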

Running a shard in a single AZ also makes me think there's a Global Load Balancer (GLB) per shard. Each zone has its own load balancer.

Unique LBs mean unique IP addresses, and probably DNS URLs to go along with it.

If the client has a shard selection flow, then you’re going to have URLs to the different endpoints embedded in the client.

You could also have a single master server that sends a list of valid endpoints. But at the end of the day the client is going to have a URL with some kind of endpoint. Here’s the thing: if some nefarious entity is distributing hacked clients, and they just go ahead and change the URL… uh oh. Certificate pinning can help prevent this issue, but it’s a pain. Anything to do with SSL is a pain, even with Let's Encrypt.
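For a feel of what pinning boils down to, here is a minimal sketch that compares a SHA-256 fingerprint of the server's certificate against a list baked into the client. The function names are hypothetical, and pinning the whole certificate (rather than just its public key) is a simplifying assumption. In Python the DER bytes would typically come from `ssl.SSLSocket.getpeercert(binary_form=True)`.

```python
import hashlib
import hmac
from typing import Iterable

def cert_fingerprint(der_cert: bytes) -> str:
    """Hex SHA-256 digest of a DER-encoded certificate."""
    return hashlib.sha256(der_cert).hexdigest()

def is_pinned(der_cert: bytes, pinned_fingerprints: Iterable[str]) -> bool:
    """True if the presented certificate matches one of the shipped pins."""
    presented = cert_fingerprint(der_cert)
    # compare_digest avoids leaking how many leading characters matched.
    return any(hmac.compare_digest(presented, pin) for pin in pinned_fingerprints)
```

Worth noting: anyone who can rewrite the URL in a hacked client can rewrite the pin too, so this mostly protects an unmodified client from being intercepted or redirected. It is also part of why it's a pain: every certificate rotation means shipping new pins.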

It probably means a dedicated DB/caching layer per shard. 7k players means a lot of actors to keep track of.

You'd probably also want a global DB for things like user management? You could just stick it all in the sharded servers though. My initial thought is you want a global DB for all cross-shard state. In theory this would get hammered a lot less than the shard servers.
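One way to picture that split is a tiny routing function that decides which store owns a given kind of state. Everything here is hypothetical: the kinds of state, the store names, and the shard id format are invented for illustration, not anything Pax Dei has described.

```python
# Hypothetical examples of state that has to be visible across shards.
GLOBAL_KINDS = {"account", "entitlement"}

def store_for(kind: str, shard_id: str) -> str:
    """Name of the datastore that should own this kind of state."""
    if kind in GLOBAL_KINDS:
        # Cross-shard state lives in one global DB, which should see
        # far less traffic than the per-shard stores.
        return "global-db"
    # Everything else (actors, inventories, world state) stays with the shard.
    return f"shard-db-{shard_id}"
```

With this sketch, `store_for("account", "eu-1")` routes to the global DB while `store_for("inventory", "eu-1")` stays on the EU shard's own DB.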

Unreal Server Management

Pax Dei uses Unreal, so let us assume Unreal Multiplayer servers.

Let’s do some shallow research on Unreal server management.

Unreal Networking Server

Servers are grouped by world region (zones)

Thankfully it appears you can run dedicated servers as containers:

There's a whole thing for running Unreal container servers:

Funnily enough, this is the first “good documentation” I've come across in the Unreal ecosystem.

It looks like Unreal servers use UDP.
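UDP matters for the infrastructure because there is no connection to accept: datagrams just show up. Here is a minimal loopback sketch using plain sockets. It has nothing to do with Unreal's actual wire protocol, and the function name is made up; it only shows the fire-and-forget shape of the traffic a load balancer would need to pass through.

```python
import socket

def udp_round_trip(payload: bytes) -> bytes:
    """Send one datagram to a local echo socket and return the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as server, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.settimeout(2.0)
        client.settimeout(2.0)

        # No connect/accept handshake: the client just sends.
        client.sendto(payload, server.getsockname())

        # The "server" learns who the client is from the datagram itself.
        data, client_addr = server.recvfrom(2048)
        server.sendto(data, client_addr)

        reply, _ = client.recvfrom(2048)
        return reply
```

In practice this means the load balancers and Kubernetes services have to be configured for UDP explicitly, since most defaults assume TCP.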

Onboarding with Dedicated Unreal Servers (DUS)

  • Manually build a “Hello World” dedicated Unreal server from the Editor.
  • Manually build a “Hello World” containerized DUS (cDUS).
  • Manually set up a GKE cluster using the cDUS you created.
    • This costs money. Delete it once it's working.
    • It doesn’t have to “work”; the point is to get your brain working again with GKE LBs, services, pods, etc.
    • This can all be done in the console.
  • Repeat using automation for each step.
  • Casually automate the entire system, nephew.
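As a first taste of the "repeat using automation" step, here is a sketch that generates a Kubernetes Deployment and a LoadBalancer Service for the containerized server as JSON. The image name, the resource names, and port 7777 are all assumptions for illustration; the one detail that matters is that the ports are declared as UDP.

```python
import json

def game_server_manifests(image: str, port: int = 7777, replicas: int = 1) -> list:
    """Deployment + Service for one containerized dedicated server."""
    name = "unreal-dedicated-server"  # hypothetical resource name
    labels = {"app": name}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{
                    "name": "server",
                    "image": image,
                    "ports": [{"containerPort": port, "protocol": "UDP"}],
                }]},
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "LoadBalancer",
            "selector": labels,
            # Kubernetes defaults to TCP, so UDP has to be spelled out.
            "ports": [{"port": port, "targetPort": port, "protocol": "UDP"}],
        },
    }
    return [deployment, service]

def to_kubectl_json(image: str) -> str:
    """Wrap the manifests in a v1 List so they can be applied in one go."""
    bundle = {"apiVersion": "v1", "kind": "List",
              "items": game_server_manifests(image)}
    return json.dumps(bundle, indent=2)
```

The output of `to_kubectl_json("example.invalid/cdus:hello-world")` should be something you can pipe into `kubectl apply -f -` against the test cluster, and later into `kubectl delete -f -` so it stops costing money.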

Load Testing

Load test against a single server. How many agents can we connect to a single prod-like server before latency becomes an issue?

  • Build proc for server container
  • Set up load test env
  • Set up DDoS crew - some service that spews UDP noise at our load test server.
    • Naively you could just throw noise
    • Could set up some kind of NPC-as-player service. Build a special client that allows NPCs to make random actions. But this is probably unnecessary / difficult.
    • Could log a human playtest and replicate across DDoS agents?
    • Probably want to connect agents of various latencies, preferably having those agents track latency over time.
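The "how many agents before latency becomes an issue" question can be framed as a ramp: keep adding agents until the 95th percentile round trip blows a budget. Here is a minimal sketch of that loop. The `measure` callable is a stand-in for whatever actually drives the agents and collects their latencies, and `fake_measure` is an invented linear model, not real data.

```python
import math
from typing import Callable, Sequence

def percentile(samples_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of latency samples, in milliseconds."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def max_agents_within_budget(measure: Callable[[int], Sequence[float]],
                             budget_ms: float,
                             start: int = 50,
                             step: int = 50,
                             limit: int = 2000) -> int:
    """Ramp the agent count; return the last count whose p95 met the budget."""
    last_ok = 0
    for n_agents in range(start, limit + 1, step):
        if percentile(measure(n_agents), 95) > budget_ms:
            break  # latency became an issue; stop ramping
        last_ok = n_agents
    return last_ok

def fake_measure(n_agents: int) -> list:
    """Invented model: 20 ms base RTT, +0.1 ms per agent, a little jitter."""
    return [20.0 + 0.1 * n_agents + jitter for jitter in (0.0, 1.0, 2.0, 3.0, 4.0)]
```

With the fake model and a 60 ms budget, the ramp stops at 350 agents. Swapping `fake_measure` for something that launches real agents against the load test server is where the actual work is.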

Haven’t watched it yet, just pinning for later