Rodrigo Rosenfeld Rosas
Seamless Postgres Indexing in Rails: The Case for Delayed Migrations
Adding an index to an existing table column might seem straightforward. However, PostgreSQL requires a SHARE lock to create a standard index. This mode conflicts with common write operations like INSERT, UPDATE, and DELETE, which use the ROW EXCLUSIVE lock mode. These articles explain the issue in depth:
- Postgres Schema Migration without Downtime Best Practice
- How to Use Postgres CREATE INDEX CONCURRENTLY
Creating the index concurrently allows those write operations to happen while the index is being created, which can take quite a few minutes in huge tables. If those operations were blocked during that time the application would experience many request timeouts and the application would become unresponsive until the index creation is completed.
I am currently working on a Rails application hosted on Heroku. This application makes use of the strong_migrations gem, which sets a lock timeout of 10 seconds by default. Under heavy load, it can take a long time to acquire the lock necessary to complete index creation.
When that happens the whole deployment fails because the db:migrate task failed, leaving the created index in an invalid state. It gets worse because indices can’t be created concurrently in a transaction, which means the migration must call disable_ddl_transaction!. Any changes already made to the database are not rolled back if the migration fails.
Even if the only operation in the migration is the index creation, the migration can’t be retried until we remove the invalid index first if the migration specifies the index name.
Other issues caused by concurrent index creation
The strong_migrations gem can automatically get rid of invalid indices,
through the StrongMigrations.remove_invalid_indexes = true
setting.
That fixes part of the problem with retrying failed migrations.
Since indices can’t be created concurrently within a transaction, we
should create indices in dedicated migrations.
There’s no guarantee we’ll get the lock before the lock timeout. If the migration fails, the whole deployment will fail and it will prevent us from deploying other changes as well until the migrations are applied successfully. We want to prevent the situation where a single index creation migration could block all deployments until off hours when the application is less busy and the lock can be acquired within 10 seconds.
Delayed migrations to the rescue
In order to fix those issues we decided to handle index creations in delayed migrations. They work just like regular migrations except that they are not executed during the release phase. We run those migrations after a successful deployment instead. These migrations are stored in a separate location and new Rake tasks have been created to run them.
It changes the way we implement some changes too. Instead of deploying database changes alongside with code changes we first prepare the database before deploying the new code. Once the indices are created we can merge the new code using the index.
I’ve implemented a robust set of Rake tasks to handle this automated process. Here is the implementation:
| 1 | # frozen_string_literal: true |
| 2 | |
| 3 | namespace :delayed_migrations do |
| 4 | helper = Module.new do |
| 5 | module_function |
| 6 | |
| 7 | def with_delayed_migrations(&block) |
| 8 | ActiveRecord::Base.connection.execute("SET lock_timeout = '1h'") |
| 9 | pool = ActiveRecord::Base.connection_pool |
| 10 | path = Rails.configuration.app.delayed_migrations_path |
| 11 | context = ActiveRecord::MigrationContext.new( |
| 12 | path, pool.schema_migration, pool.internal_metadata |
| 13 | ) |
| 14 | block.call(context) |
| 15 | Rake::Task["db:schema:dump"].invoke if ActiveRecord.dump_schema_after_migration |
| 16 | end |
| 17 | end |
| 18 | |
| 19 | desc "Run all pending delayed migrations" |
| 20 | task run: :environment do |
| 21 | helper.with_delayed_migrations(&:migrate) |
| 22 | end |
| 23 | |
| 24 | desc "Run a specific delayed migration (e.g. rails delayed_migrations:up[20231027123456])" |
| 25 | task :up, [:version] => :environment do |_, args| |
| 26 | version = args[:version]&.to_i |
| 27 | raise "Version is required. Usage: rails delayed_migrations:up[VERSION]" if version.nil? |
| 28 | |
| 29 | helper.with_delayed_migrations { it.run(:up, version) } |
| 30 | end |
| 31 | |
| 32 | desc "Rollback a specific delayed migration (e.g. rails delayed_migrations:down[20231027123456])" |
| 33 | task :down, [:version] => :environment do |_, args| |
| 34 | version = args[:version]&.to_i |
| 35 | raise "Version is required. Usage: rails delayed_migrations:down[VERSION]" if version.nil? |
| 36 | |
| 37 | helper.with_delayed_migrations { it.run(:down, version) } |
| 38 | end |
| 39 | |
| 40 | desc "Rollback the last delayed migration" |
| 41 | task rollback: :environment do |
| 42 | helper.with_delayed_migrations(&:rollback) |
| 43 | end |
| 44 | |
| 45 | task extend_migrate_status: :environment do |
| 46 | next unless Rake.application.top_level_tasks == ["db:migrate:status"] |
| 47 | |
| 48 | migrations_path = ActiveRecord::Tasks::DatabaseTasks.migrations_paths |
| 49 | delayed_path = Rails.configuration.app.delayed_migrations_path |
| 50 | migrations_path << delayed_path if migrations_path.exclude?(delayed_path) |
| 51 | end |
| 52 | Rake::Task["db:migrate:status"].enhance(["delayed_migrations:extend_migrate_status"]) |
| 53 | end |
It’s now a matter of generating a new migration, adding the index creation logic,
moving the migration to db/delayed_migrations, and running
bin/rails delayed_migrations:run.
We can run this command through heroku run bash or using a background job scheduled after each release, for example. If the command times out it’s retried later on. Once it’s run successfully we can proceed with the code changes that use the new index.
Final thoughts
This is a very common problem with applications running in production that I’m surprised I don’t see an out-of-the-box solution from Rails or any other gem. Maybe I’m missing something so please let me know in the article comments if there’s a built-in solution to this problem that I’m not aware of.