Building Resilience into SSO Elevator 3.1.0: Lessons from an AWS Outage

Date: October 28, 2025

When Everything Goes Down

On October 19, 2025, AWS experienced a widespread disruption in the US East (N. Virginia) Region (us-east-1). The incident was tied to DNS resolution issues affecting DynamoDB service endpoints. While AWS mitigated the core issue within a few hours, degradation persisted throughout the day, with operations returning to normal that evening. The real story was what happened in other Regions that appeared mostly operational.

This is where things got interesting for our customers using SSO Elevator, FivexL’s open-source tool for temporary elevated access to AWS accounts. We started receiving reports of SSO Elevator failures from customers whose AWS IAM Identity Center (formerly AWS SSO) was deployed in Regions outside us-east-1 - Regions that were supposedly healthy. Their Identity Center was working fine, yet SSO Elevator was failing. How could this be?

The answer lies in a fundamental reality of distributed systems: nothing is perfect, and seemingly isolated services often have hidden dependencies on centralized infrastructure. In AWS’s case, many “regional” services depend on “global” services that are actually hosted in us-east-1. When us-east-1 struggles, these hidden dependencies can bring down functionality in otherwise healthy Regions.

This incident became a valuable learning opportunity and drove us to release SSO Elevator 3.1.0 with significantly improved resilience. More importantly, it reinforced a critical principle: we must engineer our systems expecting things to fail, not hoping they won’t.

The Investigation: Finding the Hidden Dependency

As part of our post-incident analysis and lessons learned process, we dove deep into understanding why SSO Elevator failed even when Identity Center itself was operational. The investigation quickly revealed the culprit: the AWS Organizations API.

SSO Elevator relies on the AWS Organizations API to retrieve the list of AWS accounts in your organization. This list populates the dropdown in the Slack form where users select which account they need temporary access to. Simple enough, right?

Here’s the problem: the AWS Organizations API uses a global endpoint in us-east-1. During the incident, even after initial mitigation, the Organizations API continued returning errors as us-east-1 experienced ongoing degradation. This meant that SSO Elevator couldn’t fetch the account list, causing the access request flow to fail completely - regardless of which Region hosted the customer’s Identity Center.

┌─────────────────────────────────────────────────────────────┐
│                     SSO Elevator (v3.0)                     │
│                  (Any AWS Region, e.g., eu-west-1)          │
└────────────────────────┬────────────────────────────────────┘
                         │
                         │ Get account list
                         │
                         ▼
            ┌────────────────────────────┐
            │   AWS Organizations API    │
            │  (global endpoint in       │
            │      us-east-1)            │
            └────────────────────────────┘
                         │
                         │ When us-east-1 is degraded
                         │
                         ▼
                    ╔═══════════╗
                    ║  FAILURE  ║  ◄── Single Point of Failure
                    ╚═══════════╝
                         │
                         ▼
              SSO Elevator stops working
             (even if Identity Center is healthy)
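
To make the failure mode concrete, here is a minimal sketch of an account lookup with no fallback, roughly the shape of the v3.0 flow described above. It is an illustration, not SSO Elevator's actual code: the function name is hypothetical, and `org_client` is assumed to behave like `boto3.client("organizations")`, whose `list_accounts` paginator returns pages with an `Accounts` list.

```python
from typing import Any


def list_accounts_without_fallback(org_client: Any) -> list[dict[str, str]]:
    """Return [{"id": ..., "name": ...}] for every account in the organization.

    `org_client` is assumed to behave like boto3.client("organizations").
    Any exception raised by the API propagates to the caller, so a degraded
    Organizations endpoint means the caller gets no account list at all.
    """
    accounts = []
    paginator = org_client.get_paginator("list_accounts")
    for page in paginator.paginate():
        for account in page["Accounts"]:
            accounts.append({"id": account["Id"], "name": account["Name"]})
    return accounts
```

With nothing between the API call and the Slack form, every error from the global endpoint becomes a user-facing failure.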

This was our hidden single point of failure. A service hosted in one Region was capable of breaking functionality worldwide. Understanding this dependency was crucial, but the real question was: how do we fix it?

The Solution Journey: From DynamoDB to S3

Once we identified the problem, the solution seemed straightforward: we needed to cache the list of accounts so SSO Elevator could continue operating even when the Organizations API was unavailable.

Our first instinct was to use DynamoDB with its built-in Time-To-Live (TTL) feature. It looked like an attractive serverless option - no infrastructure to manage, automatic cache expiration, and it would integrate nicely with our Lambda-based architecture.

But then we paused and asked ourselves a critical question: “Are we actually solving the problem, or are we just adding another dependency?”

By introducing DynamoDB, we would be adding a new service dependency while trying to reduce our vulnerability to service failures. Sure, DynamoDB is highly available, but it’s still another thing that could go wrong. We needed to think differently.

We took another look at the services SSO Elevator was already using and realized the answer was right in front of us: S3. We were already using S3 for storing audit logs of all access grants and revocations. Why not leverage it for caching as well? S3 is one of AWS’s most durable and available services: S3 Standard is designed for 99.999999999% (11 nines) durability and 99.99% availability, and it provides strong read-after-write consistency. More importantly, we weren’t adding a new dependency - we were maximizing the utility of existing infrastructure.

The Technical Implementation: Cache as a Safety Net

Here’s where our approach differs from typical caching strategies. Most caching implementations focus on performance - reduce latency, decrease API calls, save costs. Our implementation focuses entirely on resilience. The cache isn’t there to make things faster; it’s there to keep things running when the primary data source fails.

The implementation follows a parallel execution pattern:

  1. When a user opens the Slack form to request access, SSO Elevator simultaneously:

    • Calls the AWS Organizations API to get the current list of accounts
    • Retrieves the cached list from S3
  2. If the Organizations API call succeeds:

    • Use the fresh list from the API
    • Compare it with the cached version in S3
    • If they differ, update the S3 cache with the new list
    • Return the list to populate the Slack form
  3. If the Organizations API call fails:

    • Fall back to the cached list from S3
    • Return the cached list to populate the Slack form
    • Log the failure for monitoring
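
The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code shipped in 3.1.0: the function name and the three injected callables (`fetch_from_api`, `load_cache`, `save_cache`) are hypothetical stand-ins for the Organizations call and the S3 read and write.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

logger = logging.getLogger(__name__)

Accounts = list[dict[str, str]]


def get_accounts_with_fallback(
    fetch_from_api: Callable[[], Accounts],
    load_cache: Callable[[], Optional[Accounts]],
    save_cache: Callable[[Accounts], None],
) -> Accounts:
    """Prefer fresh API data; fall back to the cache if the API call fails."""
    # Step 1: start both lookups at the same time. Leaving the `with`
    # block waits until both futures have finished.
    with ThreadPoolExecutor(max_workers=2) as pool:
        api_future = pool.submit(fetch_from_api)
        cache_future = pool.submit(load_cache)

    try:
        cached = cache_future.result()
    except Exception:
        # A broken cache must not break the happy path.
        logger.warning("Could not read account cache", exc_info=True)
        cached = None

    try:
        fresh = api_future.result()
    except Exception:
        # Step 3: the API failed, so serve the cached list if there is one.
        logger.exception("Organizations API call failed, using cached list")
        if cached is None:
            raise  # nothing to fall back to
        return cached

    # Step 2: the API succeeded, refresh the cache only when the list changed.
    if fresh != cached:
        try:
            save_cache(fresh)
        except Exception:
            logger.warning("Could not update account cache", exc_info=True)
    return fresh
```

Note that the fallback only helps if an earlier successful call has already written a cache. On a cold start with a failing API, the sketch re-raises the original error.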

Architecture with S3 Caching (v3.1.0):

                    ┌─────────────────────────────┐
                    │   User Opens Slack Form     │
                    └──────────────┬──────────────┘
                                   │
                                   ▼
                    ┌─────────────────────────────┐
                    │    SSO Elevator (v3.1.0)    │
                    │   (Any AWS Region)          │
                    └──────────────┬──────────────┘
                                   │
                        Parallel Execution
                    ┌──────────────┴──────────────┐
                    │                             │
                    ▼                             ▼
       ┌────────────────────────┐    ┌────────────────────────┐
       │  AWS Organizations API │    │   S3 Cache Bucket      │
       │  (global endpoint in   │    │   (Same Region as      │
       │      us-east-1)        │    │    SSO Elevator)       │
       └────────────┬───────────┘    └────────────┬───────────┘
                    │                             │
                    └──────────┬──────────────────┘
                               │
                               ▼
                    ┌──────────────────────────┐
                    │   Decision Logic:        │
                    │                          │
                    │   ✓ API Success?         │
                    │     → Use fresh data     │
                    │     → Compare with cache │
                    │     → Update if changed  │
                    │                          │
                    │   ✗ API Failed?          │
                    │     → Use cached data    │
                    │     → Log failure        │
                    └────────────┬─────────────┘
                                 │
                                 ▼
                    ┌──────────────────────────┐
                    │  Return Account List to  │
                    │      Slack Form          │
                    └──────────────────────────┘
                                 │
                                 ▼
            Continues to operate during common us-east-1 degradation
            (assumes valid cache exists from prior successful call)

This approach provides several benefits:

No Added Latency on the Common Path: Because we execute the API call and cache retrieval in parallel, form load time is bounded by the slower of the two calls rather than their sum. A same-Region S3 read usually finishes before the Organizations API responds, so users don’t wait longer for the form to load in the happy path.

Self-Healing Cache: By comparing the API response with the cached version, we ensure the cache stays fresh automatically. When accounts are added or removed from your organization, the cache updates on the next successful API call.

Fail-Safe Operation: During an AWS service disruption affecting us-east-1, SSO Elevator continues working with the cached account list. The list might be slightly stale, but the system remains operational. This is a reasonable trade-off for resilience.

Minimal Operational Overhead: No TTL to tune, no cache invalidation logic to debug, no additional service to monitor. The cache is just a file in S3 that updates itself when needed.
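
As an illustration of "just a file in S3 that updates itself", the sketch below reads a JSON object, compares it with the fresh list regardless of ordering, and rewrites it only on change. The helper names and the JSON layout are assumptions for this example, not SSO Elevator's actual format. `s3_client` is assumed to behave like `boto3.client("s3")`, using its `get_object` and `put_object` calls and its `NoSuchKey` exception.

```python
import json
from typing import Any, Optional

Accounts = list[dict[str, str]]


def canonical_json(accounts: Accounts) -> str:
    """Serialize accounts so that ordering differences do not matter."""
    ordered = sorted(accounts, key=lambda account: account["id"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))


def load_cached_accounts(s3_client: Any, bucket: str, key: str) -> Optional[Accounts]:
    """Return the cached list, or None if no cache object exists yet."""
    try:
        response = s3_client.get_object(Bucket=bucket, Key=key)
    except s3_client.exceptions.NoSuchKey:
        return None
    return json.loads(response["Body"].read())


def refresh_cache_if_changed(
    s3_client: Any,
    bucket: str,
    key: str,
    fresh: Accounts,
    cached: Optional[Accounts],
) -> bool:
    """Write the fresh list to S3 only when it differs from the cache.

    Returns True if the cache object was (re)written.
    """
    if cached is not None and canonical_json(fresh) == canonical_json(cached):
        return False
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=canonical_json(fresh).encode("utf-8"),
        ContentType="application/json",
    )
    return True
```

Comparing a canonical serialization avoids rewriting the object when the API merely returns the same accounts in a different order.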

Enabling Better Resilience Across Regions

This improvement enables customers running SSO Elevator in any Region to better withstand us-east-1 degradation. While AWS IAM Identity Center organization instances can only exist in a single Region per AWS Organization (and moving requires deleting and recreating the instance), customers can deploy multiple account instances for specific use cases. With SSO Elevator 3.1.0’s improved resilience, these deployments can continue functioning during us-east-1 incidents that affect the Organizations API but not the customer’s chosen Region.

What’s New in SSO Elevator 3.1.0

Beyond the resilience improvements described above, version 3.1.0 includes several other enhancements based on customer feedback and our own operational experience. The release maintains backward compatibility, so existing deployments can upgrade without configuration changes.

Key improvements:

  • Intelligent account list caching with S3 fallback - the core resilience feature described in this post
  • Improved error handling and logging for better visibility during issues
  • Enhanced monitoring metrics to track cache hits, API failures, and fallback operations
  • Documentation updates including guidance on resilience strategies

You can review the complete changes in the GitHub repository.

Lessons Learned: Engineering for Failure

This incident and our response to it reinforced several important principles that apply far beyond SSO Elevator:

Nothing Is Perfect: Even AWS, with all its resources and expertise, experiences failures. If AWS can have outages, your services certainly can too. Accepting this reality is the first step toward building resilient systems.

Hidden Dependencies Are Everywhere: The dependency on us-east-1 wasn’t obvious from our architecture diagrams. We called the AWS Organizations API directly, but its reliance on a single Region was a transitive dependency - something we needed in order to function, yet never explicitly chose or thought about much. These are the most dangerous kind because they’re easy to overlook during design and planning.

Map Your Dependencies: Take time to document not just what services you use, but what services they use. What happens if IAM is down? What about STS? Organizations? CloudTrail? Service Catalog? Many AWS services depend on these foundational services, even if you don’t call them directly.
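
One lightweight way to do this is to write the dependency map down as data and ask what a single failure takes out. The sketch below walks the map in reverse, breadth-first. The component names are hypothetical and only loosely modeled on the incident in this post; a real map would come from your own architecture.

```python
from collections import deque

# Hypothetical map for illustration: component -> things it depends on.
DEPENDS_ON = {
    "slack-access-form": ["account-list"],
    "account-list": ["organizations-api"],
    "organizations-api": ["us-east-1"],
    "audit-log": ["s3-regional"],
}


def impacted_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every component that directly or transitively needs `failed`."""
    # Invert the edges: dependency -> components that rely on it.
    dependents: dict[str, set[str]] = {}
    for component, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, set()).add(component)

    impacted: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted
```

For the sample map, `impacted_by("us-east-1", DEPENDS_ON)` returns the Organizations API, the account list, and the Slack form, while the audit log is unaffected.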

Run Failure Scenarios: Don’t wait for real incidents to discover your failure modes. AWS provides tools like AWS Fault Injection Service (formerly AWS Fault Injection Simulator) that let you deliberately inject failures and see how your system responds. Even simpler, you can test IAM/STS failure modes by temporarily blocking those services at the SCP level.
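
As a sketch of the SCP approach, the helper below builds a deny-only policy document for a list of actions. The function name and the `Sid` value are hypothetical. Keep in mind that SCPs do not restrict the management account, so this kind of test only works for workloads running in a member account, and it belongs in a sandbox rather than production.

```python
import json


def build_deny_scp(actions: list[str]) -> dict:
    """Build an SCP document that denies the given actions on all resources."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "TemporaryFailureInjection",  # hypothetical label
                "Effect": "Deny",
                "Action": list(actions),
                "Resource": "*",
            }
        ],
    }


# Example: simulate the Organizations API being unavailable to a workload.
policy_json = json.dumps(build_deny_scp(["organizations:ListAccounts"]), indent=2)
```

The same shape works for IAM or STS actions. Attach the policy to a test OU or account for the duration of the exercise, observe how the system behaves, then detach it.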

Prioritize Your Improvements: You can’t make everything perfectly resilient. Use risk analysis to identify what matters most. For SSO Elevator, access during incidents is critical - if infrastructure is failing, operators need to fix it, which often requires elevated access. That made this improvement a priority.

Leverage What You Have: Our switch from DynamoDB to S3 for caching wasn’t just about avoiding a new dependency. It was about deeply understanding our existing architecture and maximizing the value of services we already relied on. Sometimes the best solution is already in your stack.

Take Action: Assess Your Own Resilience

I encourage you to take some time this week to assess the resilience of your own critical systems:

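  1. List your service dependencies - Not just the services you call directly, but their dependencies too. AWS’s “Service endpoints and quotas” documentation pages are a good starting point: they show where each service’s endpoints live, including services that offer only a single global endpoint.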

  2. Model failure scenarios - For each dependency, ask: “What happens if this service fails?” Walk through the failure cascade. You might be surprised by what you discover.

  3. Test your assumptions - Use AWS Fault Injection Service or manual testing (like temporary SCP policies) to verify how your systems behave during failures.

  4. Prioritize improvements - Focus on what matters most to your business and users. You can’t address every potential failure mode.

  5. Share your lessons - If you discover interesting failure modes or build clever solutions, share them with the community. That’s how we all get better.

Conclusion

AWS outages are rare, but they happen. The real test of engineering maturity isn’t avoiding failures - it’s how gracefully your systems degrade when failures inevitably occur.

SSO Elevator 3.1.0 represents our commitment to building tools that remain operational when you need them most. By identifying hidden dependencies, implementing intelligent caching strategies, and focusing on resilience over performance, we’ve made SSO Elevator more reliable for customers deploying across different Regions.

We hope this post provides useful insights for your own resilience planning. As always, SSO Elevator is open source, and we welcome contributions, feedback, and bug reports on our GitHub repository. If you’re using SSO Elevator or considering it for your organization, we’d love to hear about your experience.

Stay resilient, and may your on-call be quiet.


Andrey Devyatkin

Principal Cloud Engineering Consultant, Co-Founder of FivexL
