Game Days at Elli
Let the Games Begin
In this post, we’ll explore Game Days — what they are, why our software engineering department dedicates two days per year to them, and what benefits they bring. By the end, you’ll understand the concept well enough to start your own journey. Note that this isn’t a detailed tutorial; since organisations vary, you’ll need to tailor your approach.
Journey
In what follows, we share more about our experience of hosting Game Days.
Game Days
Game Days are exercises where engineers fix simulated incidents, enhancing their skills, improving our systems, and refining our processes.
Next, we will walk you through our motivation and the key considerations when designing incidents and organising Game Days, including the experiments we conduct, how we structure the day, and how team composition can be adjusted. Finally, we’ll summarise the benefits that running Game Days brings to our engineers and to Elli.
Motivation
At Elli, we’re dedicated to developing robust EV charging infrastructure. Our commitment to reliability constantly drives us to enhance our systems, processes, and skills. Game Days have been a key part of this journey since before our services went live. Having completed three iterations, and with our team and services expanding over the past four years, we’re preparing for the next Game Days.
You build it, you run it
To understand how Game Days fit us, it’s important to share our modus operandi. At Elli, we embrace the “You Build It, You Run It” mentality. Every backend engineer:
- Writes application code
- Writes infrastructure code
- Creates pipelines
- Works on observability
- Performs database migrations
- Fixes security vulnerabilities
- Is on-call during business hours
With great ownership comes great responsibility
What Werner Vogels, Amazon’s CTO, shared in 2006 offers more insight into this mentality, and we have adapted it to our needs.
Each service has a team associated with it, and that team is completely responsible for the service — from scoping out the functionality, to architecting it, to building it, and operating it.
There is another lesson here:
Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.
— Werner Vogels, Chief Technology Officer at Amazon.
Effectively, Game Days aim to improve the quality of ownership.
Incident Design
In designing incidents, we focus on three aspects: likelihood, learnings, and setup. We’ll discuss each and provide an example and an anti-example.
- Likelihood: What’s the chance of the incident occurring in production?
- Learnings: What key insights can be gained? Ideally, the incident should illuminate one or two technical or procedural areas.
- Setup: How easily can we replicate the incident within time constraints, ensuring it functions as intended during a Game Day?
Example: Consider a query endpoint with high latencies due to a missing database index. This scenario becomes likely as the table grows, especially if query performance is not monitored.
The learning point is understanding indexes and how to introduce them. Setting up this incident is straightforward: remove the index and make requests to the endpoint, as the sketch below illustrates.
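For illustration, here is a minimal sketch of the facilitator setup and the eventual fix. It assumes a PostgreSQL database with a hypothetical charging_sessions table queried by station_id; the names and schema are illustrative, not Elli’s actual setup.

```python
# Minimal sketch: simulate and fix a "missing index" incident.
# Assumes a PostgreSQL database with a hypothetical `charging_sessions`
# table containing a `station_id` column; names are illustrative only.
import time

import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=gameday user=gameday")
conn.autocommit = True


def timed_lookup(cur, station_id="station-42"):
    """Run the endpoint's query and return its duration in seconds."""
    start = time.monotonic()
    cur.execute(
        "SELECT * FROM charging_sessions WHERE station_id = %s",
        (station_id,),
    )
    cur.fetchall()
    return time.monotonic() - start


with conn.cursor() as cur:
    # Facilitator setup: drop the index so the query falls back to a
    # sequential scan and latency grows with the table size.
    cur.execute("DROP INDEX IF EXISTS charging_sessions_station_id_idx")
    print(f"Without index: {timed_lookup(cur):.2f}s")

    # Expected fix: responders inspect the query plan (e.g. EXPLAIN ANALYZE)
    # and recreate the index on the filtered column.
    cur.execute(
        "CREATE INDEX charging_sessions_station_id_idx "
        "ON charging_sessions (station_id)"
    )
    print(f"With index: {timed_lookup(cur):.2f}s")
```

During the Game Day itself, only the setup part runs beforehand; spotting the sequential scan and adding the index is the responders’ job.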
Anti-example: A malicious actor gains access to our Kubernetes nodes to mine cryptocurrency. While plausible, it’s improbable when using managed services and following security best practices. The setup is also complex, requiring the removal of security features and other modifications. On a side note, one valuable practice we adopted regarding security is the Elevation of Privilege game, but that’s a topic for another post.
For effective incident design, consult experienced team members across different domains and analyse old post-mortems for hidden lessons. Initially, stage simpler incidents that target specific domain knowledge and learning objectives.
Scheduling
Scheduling depends on the desired hosting approach and the organisation’s constraints. In our scenario, we allocated two days for all engineering teams to create a festive and educational atmosphere. We experimented with two different formats.
- Two full consecutive days: All engineers engaged in solving incidents continuously over two days.
- Two split days over a month: We divided the event into two separate days, weeks apart. On the first day, we focused on disaster recovery (DR) practices; on the second, on incident handling. This approach allowed us to use the DR session to create the environments we would later “break” during incident handling, and we used the weeks in between to finalise them.
For incident handling, we structured the day as follows:
- 5–6 incidents per team per day
- 1 hour per incident
- 20-minute buffer
- 1-hour lunch break
- 30-minute intro explaining the whole process
- 30 minutes for wrap-up and a feedback form
For DR, we allocated the entire day to the practice, allowing teams to prioritise work based on their services’ criticality. It’s important to note that teams owning multiple services focused only on the most critical ones due to time constraints.
Team Composition
We experimented with two approaches: keeping the original teams for Disaster Recovery (DR) and mixing teams during incident handling. Each comes with distinct advantages.
Original Team: Established teams are effective in their domains. This rationale supports using original teams for DR tasks; they navigate the complex aspects of their services with expertise. Furthermore, the teams can independently manage DR practices and create follow-up stories based on identified improvements. The knowledge acquired remains within the team. However, we prefer not to use this setup for incident handling: because the team knows its services so well, a simple incident is too straightforward and lacks challenge.
Mixed Team: The mixed-team setup was successful, supported by the similarity of our tech stack across the organisation. It allowed engineers from different domains to collaborate on incident resolution, expanding their internal network and their understanding of other domains and technologies (e.g., different frameworks). Nonetheless, the productivity of mixed teams may decline if their tech stacks are vastly different.
By integrating team members with varying levels of seniority, we ensured a balanced distribution of experience. Each team included an Incident Response Team member to support the incident response process and a mix of seasoned engineers who could support newer members. We recommend a collaborative approach for each incident, similar to a mob programming session, with about five people working on the same issue.
Facilitators
A few engineers volunteered during the planning days to become facilitators. This role involves designing an incident and running it during the Game Days. Facilitators are responsible for preparing and hosting the incident and for guiding engineers as they work to resolve it throughout the day. In our case, they ran each incident five to six times.
Payoff and Results
In the following sections, we detail our outcomes.
Upskilling Engineers
To ensure our engineers can effectively manage real incidents under stress, we practise during Game Days.
These simulations require engineers to understand the incident’s impact, triage it, communicate with stakeholders, resolve the issue, and write a post-mortem. In doing so, engineers develop and sharpen crucial skills and familiarise themselves with critical but rarely handled components of our services, areas that are not touched often because of their reliability.
Systems Improvement
Looking at our backlog from 2021 and 2022, we have created and worked on more than 20 stories. We have made significant improvements in the following areas:
- Streamlined and updated our documentation for enhanced clarity, comprehensiveness, and accessibility.
- Improved our Terraform setup, coming closer to bringing up services with a single terraform apply.
- Enhanced backup monitoring and refined our recovery practices for better reliability.
- Removed outdated or unnecessary code and configurations, streamlining our systems.
- Updated our logging to ensure it is useful and removed what is not needed.
Feedback
After each Game Day, we collected feedback from our engineers. Here, we share a summary of the feedback we received from our latest Game Days.
The Good
- Opportunity to explore and understand our systems and tech stack.
- Cross-team collaboration and networking.
- A hands-on approach that expands our thinking when facing incidents and, in turn, builds confidence.
The Bad
- Difficulty identifying the impact of an incident.
- Writing the post-mortem and stakeholder communication was cumbersome.
- Incident handling was challenging.
Wrap-up
In wrapping up, Game Days at Elli have contributed to our continuous quest for excellence. By embracing the “You Build It, You Run It” philosophy and reinforcing it through Game Days, we improve the quality of our ownership: we upskill our engineers and boost the reliability and robustness of our EV charging infrastructure. These events foster an environment of learning, collaboration, and continuous improvement, significantly contributing to our culture. The positive results and feedback we’ve gathered affirm the value of the practice.
We would love to hear from you.
- Do you implement Game Days in your organisation? How do they look?
- If not, what is preventing you from doing so?
- Do you have questions or need further insights? We are happy to share our experiences. Leave us a comment with your questions.
Learn More
If you are interested in finding out more about how we work, please subscribe to the Elli Medium blog and visit our company’s website at elli.eco! See you next time!
About the author
Thanos Amoutzias is a Software Engineer who develops Elli’s Charging Station Management System and drives SRE topics. He is passionate about building reliable services and delivering impactful products. You can find him on LinkedIn and in the 🏔️.
Credits: Thanks to all my colleagues who reviewed and gave feedback on the article!