Rodrigo Rosenfeld Rosas
Upgrading 200 GB Postgres within 10 minutes in Heroku
I joined a new project 4 months ago. While investigating a slow query I noticed that even after creating a new covering index the query planner would refuse to use it. We were running Postgres 15 and I decided to try Postgres 17 and confirmed it would use the covering index as expected which significantly improved the query performance.
The follower database upgrade issue
Next step was to push for the Postgres 15 to 17 upgrade, which seemed pretty straightforward according to this post from Heroku. So I gave it a try last Sunday in our staging database and noticed 2 things:
- The sign-in page downtime was about 3 minutes.
- Other parts of the application were down for much longer.
After inspecting the logs I noticed that parts of our application required our follower database to be available for increasing our read throughput, which is a common practice supported by Heroku. What that article from Heroku fails to clearly explain is that the prepare phase is only spawned for the leader database. That’s why the sign-in page would be down for only 3 minutes but it took 20-30 minutes for our staging environment to be available again.
Once the leader upgrade is completed, the followers upgrade process start, but the follower becomes immediately unavailable once the leader upgrade completed until the follower upgrade completes. Our staging database is pretty small, but our production one is almost 200 GB, so I anticipated it could be down for one hour or more until the follower database upgrade was complete.
During all that time the application would fail with 500 due to being unable to connect to the follower read-only database.
The plan for upgrading the production database
Once well understood how the Postgres upgrade works in Heroku, it was time to update the upgrade plan to take the follower into account. We have significantly less traffic on weekends so I upgraded the production database over this weekend using the following plan successfully:
- Yesterday (Saturday) I ran the
heroku pg:upgrade:prepareto start the prepare phase. This spawns a new follower PG 17 database and schedules the upgrade to happen automatically in our next maintenance window, which would be next Tuesday. - Then I pointed our read-only environment variable to our leader so that the app would no longer use our follower database until the upgrade was completed.
- Today (Sunday) I paused all background jobs queues, switched on the
maintenance mode on for the application and ran the
heroku pg:upgrade:runcommand and watched the progress withheroku pg:upgrade:wait. - 9 minutes later and the new database was upgraded and available, so I turned off the maintenance mode and after confirming the live app was working fine I unpaused all background job queues.
- I watched
heroku pg:upgrade:waitwhile the follower database was being upgraded. Once the application completed about an hour later I restored the read-only database URI to point to the follower again and monitored it for a few minutes to make sure the application was working fine.
Conclusion
I used to host my page on Heroku when I first created it and only moved from Heroku when the free plan wasn’t available anymore. But I never really dug in Heroku’s infrastructure before like I had to do once I joined this project.
I must say I’m positively surprised with the experience of using Heroku for hosting a big project like this one, especially with the experience of upgrading Postgres. I just feel their articles failed to better explain how the upgrade process works, especially regarding followers upgrade and timing. This article intends to help people upgrading the leader and their followers to Postgres 17 seamlessly on Heroku until they update their articles.
Heroku positively surprised me in many other areas, including monitoring and metrics. However I miss more dyno types from their plans. They offer a 1 GB RAM dyno for $ 50 / month, then 2.5 GB for $ 250 / month (5 times the price of the previous tier) and the next tier would allow for 30 GB at $ 500 / month. Why don’t we see plans with 4 or 8 GB for $ 150/200 monthly, for example? Anyway, it is what it is and we have to adapt if we want to stay in the platform, but overall it’s a great platform for companies with a small team, like ours.
Well, this client was acquired by a bigger company a few weeks ago, so they might decide to switch to some Kubernetes setup in the future to save some money if they already have some operations team in place. While it doesn’t happen I’m going to enjoy Heroku’s infrastructure and great tools, including the automated Postgres upgrade.
And yes, we’re no longer getting request timeout errors due to the bad query plan from PG 15. After upgrading to PG 17 those request timeout errors were gone and the queries now complete within a few milliseconds. I highly recommend upgrading to newer Postgres versions. PG 18 isn’t available on Heroku yet, but I’ll keep an eye on it.