Incident Postmortem - 1Password Cloud Services Degraded
- Date of Incident: 2025-10-20
- Time of Incident (UTC): 07:26:00 - 20:55:00
- Service(s) Affected: 1Password.com website, Sign in, Access to passwords and other items
- Impact Duration: 13 hours, 29 minutes
Summary
On October 20, 2025, at 07:26:00 UTC, 1Password.com experienced intermittent latency, authentication failures, and degraded service availability due to a major AWS outage in the us-east-1 region. This was not a security incident, and no customer data was affected.
As a result, the 1Password server-side application experienced degraded performance and intermittent failures, affecting up to 50% of traffic in the US region. Full service restoration occurred in conjunction with AWS’s final mitigations around 18:30 UTC.
Impact on Customers
All US customers accessing 1Password cloud services experienced intermittent latency, authentication failures, and degraded availability on 1Password.com.
- File Share: Sharing of passwords via links could intermittently fail
- Login: Users logging into vaults experienced timeout errors and slow responses
- Web Access: Users accessing their vault through the web interface experienced timeout errors and slow responses
- API Access: CLI users and API requests received timeout errors and slow responses
What Happened?
At 07:11:00 UTC, AWS began experiencing DNS resolution failures in the us-east-1 region, initially affecting DynamoDB and rapidly cascading to multiple AWS services. 1Password monitoring detected the impact at 07:26:00 UTC, when alerts fired for an inability to scale up clusters, and an incident was declared.
1Password immediately deployed mitigations inside our infrastructure to ensure there was adequate compute capacity to serve our US-based users, which included pausing deployments and scaling down any services not critical to key user functionality.
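The scale-down mitigation described above can be sketched as a simple capacity plan. The deployment names, replica counts, and criticality flags below are hypothetical illustrations, not 1Password's actual services:

```python
from dataclasses import dataclass

@dataclass
class Deployment:
    name: str
    replicas: int
    critical: bool  # does this serve key user functionality?

def plan_scale_down(deployments: list[Deployment]) -> dict[str, int]:
    """Return target replica counts: keep critical services as-is,
    scale non-critical services to zero to reclaim compute capacity."""
    return {d.name: (d.replicas if d.critical else 0) for d in deployments}

# Illustrative fleet, not 1Password's real topology
fleet = [
    Deployment("auth-api", 12, critical=True),
    Deployment("vault-api", 20, critical=True),
    Deployment("report-batch", 6, critical=False),
]
print(plan_scale_down(fleet))
# {'auth-api': 12, 'vault-api': 20, 'report-batch': 0}
```

In practice the resulting targets would be applied through the cluster's orchestration tooling; the point of the sketch is that non-critical capacity is released rather than competing with user-facing services when new compute cannot be provisioned.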
Timeline of Events (UTC):
- 06:55:05 - 1Password monitoring triggers a warning for unavailable pods in a deployment (caused by an inability to obtain AWS IAM credentials)
- 07:03:06 - 1Password monitoring alerts on 5xx errors at the auth start endpoint (caused by an inability to obtain AWS IAM credentials) and pages the authentication team, but the alert recovers within minutes
- 07:26:00 - 1Password monitoring alerts on an inability to scale clusters; engineers begin investigating and an incident is declared
- 07:26:41 - AWS confirms elevated error rates across multiple services
- 07:49:06 - 1Password monitoring alerts on 5xx errors at the auth start endpoint (caused by an inability to obtain AWS IAM credentials)
- 07:51:09 - AWS identifies DNS as the root cause and begins mitigation
- 08:02:13 - 1Password suspends auto-scaling tooling to retain existing capacity
- 09:27:33 - AWS reports significant signs of recovery
- 10:35:37 - AWS declares the DNS issue fully mitigated; services begin recovering
- 14:14:00-15:43:00 - AWS announces full recovery across all services while continuing to throttle EC2 launches
- 16:42:49 - 1Password tooling and users begin reporting 503 errors and an inability to log in due to the volume of traffic
- 16:50:00 - 1Password services are restarted to reset and flush connections, prioritizing post-recovery traffic
- 20:53:00 - AWS resolves their incident
- 20:55:00 - 1Password engineers overscale deployments for stability and overnight observation
- Oct 21, 2025 - Incident is resolved after confirmation of complete upstream recovery
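The 08:02 and 20:55 entries above (suspending auto-scaling, then overscaling deployments) both amount to pinning replica counts by hand rather than trusting the autoscaler while the upstream provider is unstable. A minimal sketch of the overscaling arithmetic, with a hypothetical headroom factor and cap:

```python
import math

def overscale(baseline_replicas: int, headroom: float = 1.5, cap: int = 100) -> int:
    """Pin a deployment above its normal replica count so recovery-traffic
    spikes are absorbed without waiting on scale-up (EC2 launches were
    throttled during AWS's recovery). Headroom and cap are illustrative."""
    return min(cap, math.ceil(baseline_replicas * headroom))

print(overscale(20))   # 30: 1.5x headroom over a 20-replica baseline
print(overscale(90))   # 100: capped to bound cost and cluster pressure
```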
How Was It Resolved?
- Mitigation Steps: 1Password paused deployments and automated management of cluster capacity to ensure enough capacity remained available to serve users through peak access times. As demand outstripped available capacity, 1Password engineers reset the circuit breaker to allow additional connections to the service.
- Resolution Steps: AWS announced system restoration and a reduction in throttling of EC2 API calls. To ensure sufficient capacity for peak traffic, 1Password engineers updated the required number of pods for core services the following business day and resumed automated management of cluster capacity, then verified the health of systems, deployments, and service auto-scaling.
- Verification of Resolution: Engineers observed monitoring systems and cluster management tooling logs to ensure system health.
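The circuit-breaker reset mentioned in the mitigation steps can be illustrated with a minimal breaker. The class, threshold, and reset semantics here are a generic sketch of the pattern, not 1Password's actual implementation:

```python
class CircuitBreaker:
    """Generic circuit-breaker sketch; threshold is illustrative."""

    def __init__(self, failure_threshold: int = 5):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.open = False

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.open = True  # trip: stop admitting new connections

    def allow_request(self) -> bool:
        return not self.open

    def reset(self) -> None:
        # Manual operator reset, as in the mitigation above: clear the
        # failure count and admit connections once upstream has recovered.
        self.failures = 0
        self.open = False

cb = CircuitBreaker(failure_threshold=3)
for _ in range(3):
    cb.record_failure()
print(cb.allow_request())  # False: breaker tripped under sustained failures
cb.reset()
print(cb.allow_request())  # True: connections admitted again
```

A breaker like this protects a recovering backend from a thundering herd, but it also has to be reopened deliberately once capacity returns, which is the manual step the mitigation describes.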
Root Cause Analysis
The root cause was external to 1Password: DNS resolution failures in AWS's us-east-1 region, initially affecting DynamoDB and cascading to other AWS services, prevented 1Password workloads from obtaining AWS IAM credentials and from scaling cluster capacity.
What We Are Doing to Prevent Future Incidents
- Improve Incident Response: Create additional backup protocols for when our incident response tooling is unavailable.
- Improve multi-service outage response: Create strong break-glass runbooks in the event of a multi-service cloud provider outage.
Next Steps and Communication
No action is required from our customers at this time.
We are committed to providing a reliable and stable service, and we are taking the necessary steps to learn from this event and prevent it from happening again. Thank you for your understanding.
Sincerely,
The 1Password Team