99 Problems And You Can Only Build One

Tags: Project Management, Blog, Engineering, Server Infrastructure
Owner: Justin Nearing
This is the second in a three-part series, 🔥So Your Technical Debt Has Gone To Collections. In the first part, we discussed the inciting incident and the initial stages of rebuilding an entire infrastructure stack from scratch. In this part, we discuss the solutions available to us and why we chose the one we did.

With our wide-angle lens, we identified three potential solutions to our “lose control of every server at the company” problem:

  • Workaround Option
  • YOLO into Kubernetes
  • Build something from scratch

Workaround Option

In the year leading up to this deprecation event, we had worked on a sort of “insurance policy”-

Built specifically for this kind of issue.

Our Infrastructure-as-a-Service (IaaS) vendor recommended we upgrade from our current, technically deprecated version to the latest supported version.

⁉️
That rando Ruby lib causing us to lose control of our entire server stack was from a looong-deprecated IaaS API version. They were like “look, if you don’t upgrade, something’s eventually gonna break.” The problem was that upgrading straight-up removed critical features required for us to use the platform… which is why we hadn’t upgraded in the first place.

Given the risk that losing control of our infrastructure posed, we had decided to spend several months architecting an upgrade path, just in case we ever found ourselves about to lose control of our servers.

The only problem was it was a brutal mess.

Hacks on hacks, load-bearing hacks, bandaids held together with duct tape.

But the real deal breaker was we’d end up with multiple sources of truth.

If there is one inviolable rule of software design, it's the sanctity of a single source of truth.

You make a change in one place, and that is the only place where that value can be changed; all else is chaos.

YOLO Into Containers

We’re real DevOps up in here, which means sticking things into Kubernetes- regardless of use case- is our moral and ethical prerogative. 🧌

For half a decade our team had been wanting to go to containers, using Kubernetes to orchestrate those containers over a limitless set of products and server environments.

Green fields of pure ephemeral servers! Castles of pure YAML! The promised land.

Obviously this is the perfect opportunity!

Why not containers-into-space now?

The reason we couldn’t take this path came down to issues outside our team’s control.

Significant challenges at the application level prevented our application from being easily containerized.

These weren’t insurmountable issues, but they were far enough out of our control that they might as well have been.

This hurt, because we knew that any other solution would just buy the time we needed to start a full containerization project- a project we immediately began working on after this deprecation crisis.

Build From Scratch

The option we eventually chose was to build a “long-term short-term solution”:

Replicate the entire infrastructure stack on GCP natively.

5 months to architect, implement, and migrate a feature-complete alternative to our IaaS platform.

Requirements:

  • Done quickly
  • Extremely flexible
  • Minimal resources required from other teams

We had nothing existing to leverage, nothing even designed.

The only thing we had clear consensus on was which tools we wanted to try.

Hoo-boy!

Lessons

  • Never violate a single source of truth
  • Focus your solution on the things within your control
  • The harder the problem, the more bespoke the solution

The Power of Proof-Of-Concept

Given three equally bad options, how do you choose?

Show, don’t tell.

We gave ourselves a week to set up a POC for our bespoke infrastructure solution.

If it didn't work out, we'd go with the hacky insurance policy.

Thankfully, the POC was successful- our application was running with Terraform (an infrastructure-as-code tool) and Ansible (a provisioning tool that bootstraps an empty VM into a server taking traffic).

The prototype required some mental fill-in-the-gaps, but it gave a reasonable approximation of the finished product.
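The post doesn't include the actual code, but a minimal sketch of the Ansible half of that POC might look something like this. Everything here is invented for illustration- the host group, package, and file names are assumptions, not the real stack:

```yaml
# Hypothetical playbook: bootstrap an empty VM into a server taking traffic.
- name: Bootstrap app server
  hosts: app_servers
  become: true
  tasks:
    - name: Install the web server
      ansible.builtin.apt:
        name: nginx
        state: present
        update_cache: true

    - name: Deploy the application config
      ansible.builtin.template:
        src: app.conf.j2
        dest: /etc/nginx/conf.d/app.conf
      notify: Reload nginx

  handlers:
    - name: Reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded
```

Terraform's job in the POC was the step before this: creating the empty VM that a playbook like this then provisions over SSH.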

I cannot overstate how persuasive a working demo is.

For humans, seeing is believing- and we were showing that the risky path of building a bespoke solution would work.

The success of this POC changed the entire culture of our team overnight.

We all started showing, not telling.

If we had an idea, we would just do it.

If it worked, we adopted it.

If it didn’t, we just left it to “think about this later.”

This model worked really well for the situation we were in.

We didn’t have time to make detailed design decisions.

If something had a reasonable chance of solving a problem, go prove it.

In many cases, critical design decisions were made in 15 minutes or less.

I’m not sure how well this model scales, and to our benefit we had a very senior team who could drive POCs on short timelines, but it was absolutely a core component of our ultimate success.

Why choose Ansible?

It uses SSH, we know SSH, and there’s enough Google results for “ansible get first item in array” for it to be viable.
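For the record, that search leads you to Jinja2 filters. A self-contained toy playbook showing the answer (the variable name and values are made up):

```yaml
# Illustrative only: "first" is a standard Jinja2 filter;
# "{{ backends[0] }}" would work just as well.
- name: Pick the first backend from a list
  hosts: localhost
  gather_facts: false
  vars:
    backends:
      - 10.0.0.11
      - 10.0.0.12
  tasks:
    - name: Show the first item
      ansible.builtin.debug:
        msg: "{{ backends | first }}"
```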

Literally a 5-minute conversation, then just start hacking together a test.

The result spoke for itself.

Lessons

  • Show, Don’t Tell
    • Log spew on a terminal is much more convincing than a 10-page TDD (technical design document)
  • POCs can significantly speed up the development process
    • Especially on a team with clear deliverables
    • It’s easier to convert a POC into the “full solution” than a TDD
  • The tougher the constraints, the clearer your design
    • Pass/fail design decisions are binary, and therefore easier to make
    • Allows devs to jump right into a POC to validate

The Story Continues

We came together as a team and found a viable solution to our impossible problem.

Now we just have to do it.

Aaaand do it to product teams we had little political capital with:

🎁Gifts At Gunpoint