Service Levels

Tags
Engineering, SRE, blog
Owner
Justin Nearing
💡
This is a WIP. It started as part of the Unreal Server Scaling Exercise but can/should be spun out into its own thing.

Indicators

  • The goal is to find users/latency levels. You should be able to easily determine at what latency the game starts to “feel bad”.
    • This “feels bad” number is an Indicator, a Service Level Indicator.
    • This is the best way to talk to Engineers: their eyes will light up if you can translate “feels bad” into a number. Engineers can make numbers more better. They cannot conceptualize “feels bad”, unless they talk to the guys in the IT department.
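
To make “translate feels bad into a number” concrete, here is a minimal Python sketch. The 120 ms threshold, the sample values, and the `latency_sli` helper are all made up for illustration; your real threshold comes from playtesting your own game.

```python
# Hypothetical threshold: the latency (in ms) where playtesters say the game "feels bad".
# 120 ms is an assumption for illustration, not a recommendation.
FEELS_BAD_MS = 120.0


def latency_sli(samples_ms, threshold_ms=FEELS_BAD_MS):
    """Return the fraction of latency samples at or under the threshold."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    good = sum(1 for sample in samples_ms if sample <= threshold_ms)
    return good / len(samples_ms)


# Made-up latency samples in milliseconds.
samples = [35.0, 48.0, 60.0, 95.0, 110.0, 130.0, 250.0, 42.0, 77.0, 101.0]
sli = latency_sli(samples)
print(f"{sli:.0%} of samples were at or under {FEELS_BAD_MS:.0f} ms")
```

The output is a single number an engineer can try to improve, which is the whole point of an SLI.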

Objectives

  • With that Service Level Indicator (SLI), you can form an SLO.
    • The O stands for Objective.
  • From an SLO perspective, you want to say something like “we provide latency better than our ‘feels bad’ number for a reasonable share of players.”
    • At all times, some players will exceed your “feels bad” number. This is unavoidable.
    • The 99th percentile is your players with the worst performance: weirdos connecting over airport wifi through their phone’s hotspot on a literal potato.
    • You need to have meaningful conversations about how much engineering effort you are willing to dedicate to getting these players under the “feels bad” number.
    • Your target might be the 95th percentile, or the 80th. This is a tough balancing act: the closer to 99 you go, the more expensive improvements become.
  • This means you need to track client latency. Clients can report it through the server to a telemetry service on the backend.
    • I’ve worked with BigQuery in the past as a place to cram this stuff. BigQuery is basically a big-ass SQL DB where you pay per query.
    • That pay-per-query model can be a sticking point, so you might consider building a cache in front of it for high-access data.
    • Buuut that costs money to maintain as well, so you have to decide what data to cache and for how long.
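
The percentile conversation above can be sketched in a few lines of Python. This is a rough illustration, not a telemetry pipeline: the player latencies and the 120 ms “feels bad” threshold are invented, and `percentile` uses the simple nearest-rank method (other methods interpolate and give slightly different numbers).

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_slo(latencies_ms, pct, feels_bad_ms):
    """True if the chosen percentile of players is at or under the 'feels bad' latency."""
    return percentile(latencies_ms, pct) <= feels_bad_ms


# Made-up per-player latencies in ms; the last few are the airport-wifi crowd.
player_latency_ms = [30, 35, 40, 42, 45, 50, 55, 60, 65, 70,
                     75, 80, 85, 90, 95, 100, 110, 140, 300, 900]
FEELS_BAD_MS = 120  # hypothetical threshold

for pct in (80, 95, 99):
    value = percentile(player_latency_ms, pct)
    verdict = "OK" if meets_slo(player_latency_ms, pct, FEELS_BAD_MS) else "MISSED"
    print(f"p{pct}: {value} ms -> {verdict}")
```

With this sample data the 80th percentile objective is met while the 95th and 99th are not, which is exactly the tradeoff to argue about: how much effort is it worth to drag those last few players under the line?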

Agreements

  • Once you have your SLI and have discussed your SLO, you can get to your Service Level Agreement (SLA).
  • An SLA is a contract you sign with yourself and your customers.
    • It defines how much time in the “feels bad” zone you and your players will tolerate.
    • 💡
      The gut response is zero tolerance. That would be prohibitively expensive, even if it were possible.

      AWS goes down sometimes. So do GCP and every other service provider on your critical infrastructure. So right off the bat you probably aren’t hitting 99.999% uptime.

      But the much bigger point is opportunity cost.

      Building new and exciting features for your players introduces instability to your service. Hardening these features before launch means shipping slower.

      You gotta be shipping content to keep those KPIs up.

      This defines the central tension between stability and business nimbleness.

      SLAs give you a hard number to define this tension.

      This is the way to think of SLAs:

      How much time in the “feels bad” zone would you accept if it meant launching a feature a month early?

      One hour of bad vibes for getting a major feature out a month early? Probably a good deal.

      Engineers, this is how you talk to Business people. You will see their eyes glaze over in real time if you start explaining why something is the way it is.

      Don’t explain why. Convert the why into a choice that can be made. Give them the two sides of a deal.
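
To put numbers on the deal, here is a small Python sketch that converts an availability target into how many “feels bad” minutes it allows. The 30-day window and the list of targets are assumptions for illustration, and `allowed_bad_minutes` is a hypothetical helper, not part of any SLA tooling.

```python
def allowed_bad_minutes(slo_percent, window_days=30):
    """Minutes of 'feels bad' time a target allows over the window."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


# Illustrative targets; the 30-day window is an assumption.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_bad_minutes(target)
    print(f"{target}% over 30 days allows about {minutes:.2f} bad minutes")
```

Under these assumptions, 99.999% leaves well under one minute per month, while 99% leaves a bit over seven hours. That is the kind of number you can put in front of a business person as one side of the deal.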