How I Took Down Prod With a 400ms Migration (And The Playbook I Use Now)

Source: DEV Community
The 3 AM ALTER TABLE That Ruined My Weekend

It was a regular Tuesday deploy. The Jira ticket was straightforward: just add a foreign key constraint linking orders to customers. The table had 50 million rows, but I had explicitly tested it against a staging dump. It took exactly 400 milliseconds.

"It works locally and on staging. Ship it," I confidently told my team lead.

What I totally missed was the underlying database locking mechanism. The ALTER TABLE needed an ACCESS EXCLUSIVE lock on both tables, orders and customers. But at that exact moment in production, a heavy business analytics dashboard was running a long SELECT query, holding an ACCESS SHARE lock on customers. So my simple 400ms migration just sat there in the queue, waiting for its turn.

And then the disaster happened. The moment my ACCESS EXCLUSIVE lock request entered the queue, every single subsequent query, every SELECT, INSERT, and UPDATE against
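
The pile-up described above is easier to see in a toy model. What follows is a minimal Python sketch of a single table's lock queue, not PostgreSQL's actual implementation. It assumes only three lock modes (ACCESS SHARE for plain SELECTs, ROW EXCLUSIVE for writes, ACCESS EXCLUSIVE for the migration) and a strict first-come, first-served queue. The class `TableLockQueue`, the function `run_scenario`, and the session names (`dashboard`, `migration`, `app_select`, `app_update`) are all made up for illustration.

```python
from collections import deque


def conflicts(mode_a, mode_b):
    # Simplified rule for the three modes modeled here: a pair conflicts
    # only when one side is ACCESS EXCLUSIVE.
    return "ACCESS EXCLUSIVE" in (mode_a, mode_b)


class TableLockQueue:
    """Toy model of one table's lock queue (hypothetical, not PostgreSQL code)."""

    def __init__(self):
        self.granted = []       # (session, mode) pairs currently holding a lock
        self.waiting = deque()  # (session, mode) pairs queued in arrival order

    def request(self, session, mode):
        """Grant the lock if nothing held or queued ahead conflicts; else queue."""
        ahead = self.granted + list(self.waiting)
        if any(conflicts(mode, other) for _, other in ahead):
            self.waiting.append((session, mode))
            return False
        self.granted.append((session, mode))
        return True

    def release(self, session):
        """Drop the session's lock, then wake waiters from the front of the queue."""
        self.granted = [(s, m) for s, m in self.granted if s != session]
        while self.waiting:
            _, mode = self.waiting[0]
            if any(conflicts(mode, held) for _, held in self.granted):
                break
            self.granted.append(self.waiting.popleft())

    def waiters(self):
        """Names of the sessions still queued, in arrival order."""
        return [s for s, _ in self.waiting]


def run_scenario():
    customers = TableLockQueue()
    customers.request("dashboard", "ACCESS SHARE")      # long analytics SELECT
    customers.request("migration", "ACCESS EXCLUSIVE")  # the ALTER TABLE
    customers.request("app_select", "ACCESS SHARE")     # ordinary traffic
    customers.request("app_update", "ROW EXCLUSIVE")
    stuck = customers.waiters()
    customers.release("dashboard")   # the dashboard query finally finishes
    after_dashboard = customers.waiters()
    customers.release("migration")   # the migration commits
    after_migration = customers.waiters()
    return stuck, after_dashboard, after_migration


if __name__ == "__main__":
    labels = ("while dashboard runs", "after dashboard", "after migration")
    for label, names in zip(labels, run_scenario()):
        print(label, "->", names)
```

In this model the two application queries would have been granted immediately if the migration were not queued ahead of them, since neither conflicts with the dashboard's ACCESS SHARE lock. It is the waiting ACCESS EXCLUSIVE request that blocks them, and they stay blocked until the migration has both acquired and released its lock.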