Once PostgreSQL 10 was released I wanted to upgrade our 9.6 cluster to the newest version. However, it would require a lot of coordination effort to get a maintenance window to perform the migration the way I was used to: put the application in maintenance mode, take a fresh dump, restore it into the new cluster and switch maintenance mode off.
That meant the application wouldn't be available for an hour or so, maybe more. After reading once more about pglogical, I decided to finally give it a try, and it allowed me to switch from 9.6 to 10 in just a few seconds.
pglogical implements logical replication, which allows replicating between different PostgreSQL versions, something that is not possible with the binary replication mechanism provided by PostgreSQL itself. PG 10 does add some native support for logical replication, but since we want to replicate from 9.6, we need to resort to an external extension.
pglogical requires that every replicated table have a primary key. It doesn't need to be a single column, but a primary key must exist. Superuser access to both databases must also be provided for the replication agents. DDL is not replicated, and neither are cascaded truncates. Nothing fancy, after all; it should allow us to replicate most databases.
You should pay special attention to the primary key requirement though, especially if you're using the ActiveRecord Ruby gem to manage database migrations in older databases, as the schema_migrations table didn't have a primary key in the earlier days. If that's your case:
alter table schema_migrations add primary key (version);
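If you're not sure whether all your tables comply, a query along these lines (my own addition, not something pglogical provides) should list the tables without a primary key:

select tbl.table_schema, tbl.table_name
  from information_schema.tables tbl
 where tbl.table_type = 'BASE TABLE'
   and tbl.table_schema not in ('pg_catalog', 'information_schema')
   and not exists (
     select 1
       from information_schema.table_constraints tc
      where tc.constraint_type = 'PRIMARY KEY'
        and tc.table_schema = tbl.table_schema
        and tc.table_name = tbl.table_name
   );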
The idea is to install a PostgreSQL package with support for the pglogical extension, create the new PG 10 cluster and restore only the schema into it. The current cluster is then stopped and restarted using the pglogical-enabled PostgreSQL build. The clusters must be reachable to each other through TCP/IP: the provider (the 9.6 database being upgraded) needs to know the IP and port of the subscriber (the new PG 10 database) and vice versa. The pglogical extension is created in both databases, postgresql.conf and pg_hba.conf are changed to enable logical replication, and both databases are restarted. Finally, a few pglogical statements are issued to create the provider, subscriber and subscription, which starts the replication.

Once the replication has caught up, you may change the port in the new cluster to match the old one, stop the old cluster and restart the new one. It's also a good idea to restart the applications, especially if you're using custom types such as row types: they will most likely have different OIDs in the new cluster, and if you have registered those row types things won't work as expected until you reboot the application. This would be the case if you're using DB.register_row_type with the Sequel Ruby gem, for example.
The final switch can happen in just a few seconds, which means minimal downtime.
We use Docker to run PostgreSQL on our servers (besides the apps), so this article also uses it to demonstrate how the process works, but it should be easy to adapt the instructions to other kinds of set-ups. The advantage of Docker as a demonstration tool is that these procedures should be easy to replicate as-is, and it also takes care of creating and running the databases.

This article also assumes the PostgreSQL client is installed on the host.
Create the following Dockerfiles in the sub-directories pg96 and pg10 (if you're not running PostgreSQL in a Docker container, the instructions inside the Dockerfiles show what you'd need to replicate in your own environment):
# pg96/Dockerfile
FROM postgres:9.6

RUN apt-get update && apt-get install -y wget gnupg
RUN echo "deb [arch=amd64] http://packages.2ndquadrant.com/pglogical/apt/ jessie-2ndquadrant main" > /etc/apt/sources.list.d/2ndquadrant.list \
 && wget --quiet -O - http://packages.2ndquadrant.com/pglogical/apt/AA7A6805.asc | apt-key add - \
 && apt-get update \
 && apt-get install -y postgresql-9.6-pglogical

RUN echo "host replication postgres 172.18.0.0/16 trust" >> /usr/share/postgresql/9.6/pg_hba.conf.sample
RUN echo "host replication postgres ::1/128 trust" >> /usr/share/postgresql/9.6/pg_hba.conf.sample
RUN echo "shared_preload_libraries = 'pglogical'" >> /usr/share/postgresql/postgresql.conf.sample
RUN echo "wal_level = 'logical'" >> /usr/share/postgresql/postgresql.conf.sample
RUN echo "max_wal_senders = 20" >> /usr/share/postgresql/postgresql.conf.sample
RUN echo "max_replication_slots = 20" >> /usr/share/postgresql/postgresql.conf.sample
# pg10/Dockerfile
FROM postgres:10

RUN rm /etc/apt/trusted.gpg && apt-get update && apt-get install -y wget
RUN echo "deb [arch=amd64] http://packages.2ndquadrant.com/pglogical/apt/ stretch-2ndquadrant main" > /etc/apt/sources.list.d/2ndquadrant.list \
 && wget --quiet -O - http://packages.2ndquadrant.com/pglogical/apt/AA7A6805.asc | apt-key add - \
 && apt-get update \
 && apt-get install -y postgresql-10-pglogical

RUN echo "host replication postgres 172.18.0.0/16 trust" >> /usr/share/postgresql/10/pg_hba.conf.sample
RUN echo "host replication postgres ::1/128 trust" >> /usr/share/postgresql/10/pg_hba.conf.sample
RUN echo "shared_preload_libraries = 'pglogical'" >> /usr/share/postgresql/postgresql.conf.sample
RUN echo "wal_level = 'logical'" >> /usr/share/postgresql/postgresql.conf.sample
RUN echo "max_wal_senders = 20" >> /usr/share/postgresql/postgresql.conf.sample
RUN echo "max_replication_slots = 20" >> /usr/share/postgresql/postgresql.conf.sample
Let's assume both servers will run on the same machine with IP 10.0.1.10. The 9.6 instance is running on port 5432 and the new cluster will initially (before the switch) run on port 5433.
cd pg96 && docker build . -t postgresql-pglogical:9.6 && cd -
cd pg10 && docker build . -t postgresql-pglogical:10 && cd -
This is not a tutorial on Docker, but if you’re actually using Docker, it would be a good idea to push those images to your private registry.
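For example, with a hypothetical registry at registry.example.com:

docker tag postgresql-pglogical:9.6 registry.example.com/postgresql-pglogical:9.6
docker tag postgresql-pglogical:10 registry.example.com/postgresql-pglogical:10
docker push registry.example.com/postgresql-pglogical:9.6
docker push registry.example.com/postgresql-pglogical:10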
The first step is to stop the old 9.6 cluster and start the pglogical-enabled cluster with the old data (taking a backup first is always a good idea, by the way). Suppose your cluster data is located at "/var/lib/postgresql/9.6/main/" and your config files at "/etc/postgresql/9.6/main/". If "/etc/postgresql/9.6" and "/var/lib/postgresql/9.6" do not exist, don't worry: the script will create a new cluster for you (useful if you want to try this with new databases first, which is also a good idea; just map some temporary directories).
Create the following script at "/sbin/pg-scripts/start-pg" and make it executable. It will be used to start the database inside the container.
#!/bin/bash
version=$1
net=$2
setup_db(){
  pg_createcluster $version main -o listen_addresses='*' -o wal_level=logical \
    -o max_wal_senders=10 -o max_worker_processes=10 -o max_replication_slots=10 \
    -o hot_standby=on -o shared_preload_libraries=pglogical -- -A trust
  pghba=/etc/postgresql/$version/main/pg_hba.conf
  echo -e "host\tall\tappuser\t$net\ttrust" >> $pghba
  echo -e "host\treplication\tappuser\t$net\ttrust" >> $pghba
  echo -e "host\tall\tpostgres\t172.17.0.0/24\ttrust" >> $pghba
  echo -e "host\treplication\tpostgres\t172.17.0.0/24\ttrust" >> $pghba
  pg_ctlcluster $version main start
  psql -U postgres -c '\du' postgres|grep -q appuser || createuser -U postgres -l -s appuser
  pg_ctlcluster $version main stop
}
[ -d /var/lib/postgresql/$version/main ] || setup_db
exec pg_ctlcluster --foreground $version main start
This script will take care of creating a new cluster if one doesn't already exist. Although not really required for the replication to work, it also creates a new "appuser" database superuser authenticated with "trust" for simplicity's sake. It might be useful if you decide to use this script to spawn new databases for testing purposes. In that case, adapt the script to suit your needs, changing the user name or the authentication methods.
Let's run the 9.6 cluster on port 5432 (feel free to run it on another port and use a temporary directory in the mappings if you just want to give it a try):
docker run --rm -v /sbin/pg-scripts:/pg-scripts -v /var/lib/postgresql:/var/lib/postgresql \
  -v /etc/postgresql:/etc/postgresql -p 5432:5432 postgresql-pglogical:9.6 \
  /pg-scripts/start-pg 9.6 10.0.1.0/24
# since we're running in the foreground with the --rm option, run this in another terminal:
docker run --rm -v /sbin/pg-scripts:/pg-scripts -v /var/lib/postgresql:/var/lib/postgresql \
  -v /etc/postgresql:/etc/postgresql -p 5433:5432 postgresql-pglogical:10 \
  /pg-scripts/start-pg 10 10.0.1.0/24
The first argument to start-pg is the PG version; the second and last argument is the network used when creating pg_hba.conf (if it doesn't exist yet), to allow "appuser" to connect from that network using the "trust" authentication method.
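Once both containers are up, a quick sanity check could look like this (using the example IP and ports from above):

psql -h 10.0.1.10 -p 5432 -U appuser -c 'select version();' postgres
psql -h 10.0.1.10 -p 5433 -U appuser -c 'select version();' postgres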
If you're curious about how to run a Docker container as a systemd service, let me know in the comments section below and I may complement this article once I find some time, but it's not hard. There are plenty of documents on the internet explaining how, although our own service unit file is a bit different from what I've seen in most tutorials: it checks that the port is actually accepting connections when starting the service, and it doesn't pull the image from the registry if it is already available locally.
Once you make sure the old cluster is running fine with the postgresql-pglogical container, it's time to update your postgresql.conf file and restart the container. Use the following configuration as a starting point for both the 9.6 and 10 clusters:
wal_level = logical
max_worker_processes = 10
max_replication_slots = 10
max_wal_senders = 10
shared_preload_libraries = 'pglogical'
For pg_hba.conf, include the following lines (change the network settings if you're not using Docker, or if you're running the containers on a network other than the default one):
host all postgres 172.17.0.0/24 trust
host replication postgres 172.17.0.0/24 trust
Restart the servers and we should be ready to start the replication.
In the PG 9.6 database:
# take a dump of the schema that we'll use to restore in PG 10
pg_dump -Fc -s -h 10.0.1.10 -p 5432 -U appuser mydb > mydb-schema.dump
psql -h 10.0.1.10 -p 5432 -c 'create extension pglogical;' -U appuser mydb
psql -h 10.0.1.10 -p 5432 -c "select pglogical.create_node(node_name := 'provider', dsn := 'host=10.0.1.10 port=5432 dbname=mydb');" -U appuser mydb
psql -h 10.0.1.10 -p 5432 -c "select pglogical.replication_set_add_all_tables('default', ARRAY['public']);" -U appuser mydb

# I couldn't get sequence replication to work, so I'll suggest another method just before switching the database
# psql -h 10.0.1.10 -p 5432 -c "select pglogical.replication_set_add_all_sequences('default', ARRAY['public']);" -U appuser mydb
This marks all tables in the public schema for replication (the commented-out line would do the same for the sequences, but as noted I couldn't get sequence replication to work).
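If you want to double-check which tables ended up in the replication sets, you can peek at the pglogical catalog tables; the table and column names below are from the pglogical 2.x catalogs, so treat this as a hint rather than gospel:

psql -h 10.0.1.10 -p 5432 -c "select set_reloid from pglogical.replication_set_table;" -U appuser mydb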
In the PG 10 database:
# create and restore the schema of the database
createdb -U appuser -h 10.0.1.10 -p 5433 mydb
pg_restore -s -h 10.0.1.10 -p 5433 -U appuser -d mydb mydb-schema.dump
# install the pglogical extension and setup the subscriber and subscription
psql -h 10.0.1.10 -p 5433 -c 'create extension pglogical;' -U appuser mydb
psql -h 10.0.1.10 -p 5433 -c "select pglogical.create_node(node_name := 'subscriber', dsn := 'host=10.0.1.10 port=5433 dbname=mydb');" -U appuser mydb
psql -h 10.0.1.10 -p 5433 -c "select pglogical.create_subscription(subscription_name := 'subscription', provider_dsn := 'host=10.0.1.10 port=5432 dbname=mydb');" -U appuser mydb
From now on you can follow the status of the replication, in the subscriber (PG 10) database, with:
select pglogical.show_subscription_status('subscription');
Once the initialization is over and the databases are synced and replicating (this may take quite a while depending on your database size) you may start the switch.
At this point the replicated database is almost all set. I couldn't figure out how to replicate the sequence values, so if you're using serial integer primary key columns relying on sequences, you'll also want to set proper values for the sequences in the new cluster; otherwise you won't be able to insert new records that rely on the sequence's next value. Here's how you can do that: the first command generates a helper query which, when run against the old cluster (the second command), outputs the setval statements to be applied to the new cluster. Just to be safe, it adds a gap of 5000 to each sequence so that you have enough time to stop the old server after generating the setval statements, in case your database is very write-intensive. You should review that gap value depending on how much your database might grow between running those scripts and stopping the server.
psql -h 10.0.1.10 -p 5432 -U appuser -c "select string_agg('select ''select setval(''''' || relname || ''''', '' || last_value + 5000 || '')'' from ' || relname, ' union ' order by relname) from pg_class where relkind ='S';" -t -q -o set-sequences-values-generator.sql mydb
psql -h 10.0.1.10 -p 5432 -U appuser -t -q -f set-sequences-values-generator.sql -o set-sequences-values.sql mydb
# set the new sequence values in the new database (port 5433 in this example):
psql -h 10.0.1.10 -p 5433 -U appuser -f set-sequences-values.sql mydb
Then, basically, you change the port for the PG 10 cluster to 5432 (or whatever port the old cluster was using), stop the 9.6 cluster (Ctrl+C in the example above) and restart the new cluster. Finally, it's a good idea to also restart the apps using the database, in case they rely on custom types whose conversion rules depend on the row type OIDs.
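In the Docker set-up used in this article, the switch amounts to stopping both containers and starting the PG 10 one again published on the old host port; a sketch using the same mappings as above:

# after stopping the 9.6 container (Ctrl+C) and the PG 10 one:
docker run --rm -v /sbin/pg-scripts:/pg-scripts -v /var/lib/postgresql:/var/lib/postgresql \
  -v /etc/postgresql:/etc/postgresql -p 5432:5432 postgresql-pglogical:10 \
  /pg-scripts/start-pg 10 10.0.1.0/24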
This assumes your apps are able to gracefully handle disconnections of pooled connections by running some sort of connection validation before issuing SQL statements. Otherwise, it's probably a good idea to restart the apps whenever you restart the database after tweaking "postgresql.conf" and "pg_hba.conf".
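With Sequel, for example, the connection_validator pool extension handles this; a minimal sketch, assuming DB is your Sequel::Database (adjust to your own stack):

# hypothetical example for Sequel; other ORMs and pools have similar features
DB.extension :connection_validator
DB.pool.connection_validation_timeout = -1 # -1 validates connections on every checkout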
Once everything is running fine with the new database, you might want to clean things up. If that’s the case:
select pglogical.drop_subscription('subscription');
select pglogical.drop_node('subscriber');
drop extension pglogical;
I hope this helps you get your database upgraded with minimal downtime.
For some years I have been using rsnapshot to back up our databases and documents using an incremental approach. We create a new back-up every hour and retain the last 24 hourly back-ups, one back-up per day for the past 7 days and one per week for the past 4 weeks.
Rsnapshot is great. It uses hard-links to achieve incremental back-ups, saving a lot of space; it's basically a combination of "cp -al" and rsync. But we were facing a problem with the free inode count on our ext4 partition. By the way, NewRelic does not monitor the free inode count (df -i), so I found out about this problem the hard way, after the back-up stopped working due to the lack of free inodes.
I created a custom check in our own monitoring system to alert on low free inodes, and then tried to tweak some ext4 settings to avoid this problem in the new partition. We have 26GB spread over 2.6 million individually gzipped documents (they are served directly by nginx), which end up creating almost 100 million hard-links in that back-up partition. The original documents are also hard-linked among themselves, as part of a strategy to save space when the same document is used in multiple transactions (the documents are never changed); otherwise they would take several extra gigabytes.
Recently, my custom monitoring system sent me an alert that 75% of the inodes were used while only about 30% of the disk space was actually in use. So I decided to investigate other filesystems that deal with inodes dynamically.
That's how I found btrfs, a modern filesystem which not only has no fixed limit on inodes but, as I'll describe, also has some very interesting features for handling incremental back-ups in a faster and better way than rsnapshot.
Initially I wasn't thinking about replacing rsnapshot, but after reading about the support for subvolumes and snapshots in btrfs I changed my mind and decided to replace rsnapshot with a custom script. I did try for several hours to adapt rsnapshot to the workflow I wanted, without success; there's an rsnapshot issue related to btrfs support.
Before I talk about how btrfs helps our back-up system, let me explain a few issues I had with rsnapshot.
I've been living with some rsnapshot issues for the past few years. I want the full back-up procedure to take less than an hour so that we can run it every hour. I had to tweak its settings a few times to keep the script under an hour, but lately it was already taking almost 40 minutes to complete. A while back, before the tweaks, I even had to change the interval to back up every two hours.
One of the slow parts of rsnapshot is removing the last back-up snapshot when rotating. It doesn't matter whether you use "rm -rf" or any other method: removing a big tree of files is slow. An alternative would be to move the last snapshot (hourly.23) to the first position (hourly.0), which would save both the "rm -rf" and the "cp -al" time, skipping straight to the rsync phase, but I wasn't able to figure out how to make that happen with rsnapshot.
Also, some of the procedures could be done in parallel to speed up the process, but rsnapshot doesn't provide direct support for specifying this and it's hard to write proper shell scripts to manage those cases.
After reading about btrfs I figured the back-up procedure could be made much faster and simpler. So I created a Ruby script, which I'll show in the next section, and integrated it into our automation tools in one day. It has replaced rsnapshot on our back-up server and has been running pretty well for the last two days, taking about 8 minutes to complete the procedure on each run.
So, let me explain the strategy I wanted to implement, to help you understand the script.
As I said, btrfs supports subvolumes. Btrfs implements copy-on-write (CoW), which allows both creating and deleting snapshots of subvolumes instantly (in constant time). That means we replace the slow "rm -rf hourly.23" with the instantaneous "btrfs subvolume delete hourly.23", and "cp -al …" with the instantaneous "btrfs subvolume snapshot …".
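As a rough sketch of the hourly rotation under this model (assuming a single "latest" subvolume holding the synced data; the actual script below handles nested subvolumes and error checking):

[ -e hourly.23 ] && btrfs subvolume delete hourly.23   # constant time, replaces "rm -rf"
for n in $(seq 22 -1 0); do
  [ -e hourly.$n ] && mv hourly.$n hourly.$((n+1))     # plain renames, cheap
done
btrfs subvolume snapshot latest hourly.0               # constant-time CoW snapshot, replaces "cp -al"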
In order for a regular user to delete subvolumes with btrfs, the filesystem must be mounted with the user_subvol_rm_allowed option. Also, deleting a subvolume doesn't work if there are other subvolumes inside it, so they must be removed first; there's no switch or tool in the btrfs-progs package to delete them recursively. This is important for understanding the script.
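That mount option goes into the back-up partition's mount options; for example (device and mount point below are hypothetical):

# /etc/fstab
/dev/sdb1  /var/backups  btrfs  defaults,user_subvol_rm_allowed  0  0
# or, for an already mounted filesystem:
mount -o remount,user_subvol_rm_allowed /var/backups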
Our back-up procedure consists of getting a recent dump of two production PostgreSQL databases (the main database and the one used by Redmine) and syncing two directories containing files (the main application files and the files uploaded to Redmine).
The idea is to get everything into a static path ("latest") as the first step. The main reason is that if something goes wrong after syncing the documents (the slowest part), for example, we wouldn't lose the already transferred files the next time the script runs. So, basically, here's how I implemented it (there's a simpler strategy, which I'll explain next):
Once the procedure that brings "latest" up to date is finished, the script creates a tmp subvolume and takes a snapshot of each "latest" subvolume inside tmp; once everything works fine, the back-ups are rotated and tmp is moved to hourly.0. Removing hourly.23 in the rotation phase requires removing its inner subvolumes first.
After implementing this (it was an iterative process) I realized it could be simplified: "latest" would be a single subvolume and everything inside it regular files and directories. Then the "tmp" directory wouldn't be needed and, after rotating, a snapshot of "latest" would be used to create "hourly.0". I haven't updated the script yet because I'm not sure it's worth changing: the current layout is more modular, which is useful in case I ever want to take a snapshot of just part of the back-up. So the sample back-up script in the next section uses my current, tested approach, the one described first above.
The main database dump is over 500MB in PostgreSQL custom format, and it's much faster to rsync it over the previous copy than to transfer it from scratch with scp. Initially those database dumps were not stored in the "latest" directory and I used "scp" to copy them directly into the "tmp" directory, but I changed the strategy to save some time and bandwidth.
The script should exit with a message and a non-zero status code when something fails, so that cron notifies me whenever anything goes wrong (by setting MAILTO=my@email.com at the beginning of the crontab file). In that case it shouldn't touch the existing valid snapshots either.
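The crontab might look something like this (the schedule and install path below are just an illustration, not from the actual set-up):

MAILTO=my@email.com
0 * * * *  /usr/local/bin/backup hourly
30 2 * * * /usr/local/bin/backup daily
45 3 * * 0 /usr/local/bin/backup weekly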
It shouldn't run if the previous run hasn't finished yet, so there's a simple lock mechanism preventing that from happening in case a run takes over an hour to complete. The second attempt will fail and I'll get an e-mail telling me that happened.
It should also have a dry-run mode (which I call test mode) that outputs the commands without running them, which is useful while designing the back-up steps. It should also allow commands to run concurrently, and it uses indentation to show the order in which commands are run.
Finally, it logs the issued commands and their status (finished or failed), any command output (STDOUT or STDERR), the time each command took, and the total time at the end of the procedure.
Now that you understand what the script is supposed to do, here's the actual implementation.
#!/usr/bin/env ruby

require 'open3'
require 'thread'
require 'logger'
require 'time'

class Backup
  def run(args)
    @start_time = Time.now
    @backup_root_path = File.expand_path '/var/backups'
    #@backup_root_path = File.expand_path '~/backups'
    @log_path = "#{@backup_root_path}/backup.log"
    @tmp_path = "#{@backup_root_path}/tmp"

    @exiting = false
    Thread.current[:indenting_level] = 0

    setup_logger

    lock_or_exit

    log 'Starting back-up procedure'

    parse_args args.clone

    run_scripts if @action == 'hourly'

    rotate
    unlock
    report_completed
  end

  private

  def setup_logger
    File.write @log_path, '' unless File.exist? @log_path
    logfile = File.open(@log_path, File::WRONLY | File::APPEND)
    logfile.sync = true
    @logger = Logger.new logfile
    @logger.level = Logger::INFO
    @logger.datetime_format = '%Y-%m-%d %H:%M:%S'
    @logger_mutex = Mutex.new
  end

  def lock_or_exit
    if File.exist?(pidfile) && run_command("kill -0 #{pid = File.read pidfile}")
      abort "There's another backup in progress. Pid: #{pid} (from #{pidfile})."
    end
    File.write pidfile, Process.pid
  end

  def unlock
    File.unlink pidfile
  end

  def pidfile
    @pidfile ||= "#{@backup_root_path}/backup.pid"
  end

  def run_command!(cmd, success_in_test_mode = true, abort_on_stderr: false)
    run_command cmd, success_in_test_mode, abort_on_stderr: abort_on_stderr, abort_on_error: true
  end

  def run_command(cmd, success_in_test_mode = true, abort_on_stderr: false, abort_on_error: false)
    indented_cmd = ' ' * indenting_level + cmd
    Thread.current[:indenting_level] += 1
    if @test_mode
      @logger_mutex.synchronize{ puts indented_cmd }
      return success_in_test_mode
    end
    start = Time.now
    log "started: '#{indented_cmd}'"
    stdout, stderr, status = Open3.capture3 cmd
    stdout = stdout.chomp
    stderr = stderr.chomp
    success = status == 0
    log stdout unless stdout.empty?
    log stderr, :warn unless stderr.empty?
    if (!success && abort_on_error) || (abort_on_stderr && !stderr.empty?)
      die "'#{cmd}' failed to run with exit status #{status}, aborting."
    end
    log "finished: '#{indented_cmd}' (#{success ? 'successful' : "failed with #{status}"}) " +
      "[#{human_duration Time.now - start}]"
    success
  end

  def indenting_level
    Thread.current[:indenting_level]
  end

  def log(msg, level = :info)
    return if @test_mode
    @logger_mutex.synchronize{ @logger.send level, msg }
  end

  VALID_OPTIONS = ['hourly', 'daily', 'weekly'].freeze
  def parse_args(args)
    args.shift if @test_mode = (args.first == 'test')
    unless args.size == 1 && VALID_OPTIONS.include?(@action = args.first)
      abort "Usage: 'backup [test] action', where action can be hourly, daily or weekly.
If test is specified the commands won't run but will be shown."
    end
  end

  def die(message)
    log message, :fatal
    was_exiting = @exiting
    @exiting = true
    delete_tmp_path_if_exists unless was_exiting
    unlock
    abort message
  end

  def create_tmp_path
    delete_tmp_path_if_exists
    create_subvolume @tmp_path
  end

  def create_subvolume(path, skip_if_exists = false)
    return if skip_if_exists && File.exist?(path)
    run_script %Q{btrfs subvolume create "#{path}"}
  end

  def delete_tmp_path_if_exists
    delete_subvolume_if_exists @tmp_path, delete_children: true
  end

  def delete_subvolume_if_exists(path, delete_children: false)
    return unless File.exist?(path)
    Dir["#{path}/*"].each{|s| delete_subvolume_if_exists s } if delete_children
    run_script %Q{btrfs subvolume delete -c "#{path}"}
  end

  def run_script(script)
    run_command! script
  end

  def run_scripts(scripts = all_scripts)
    case scripts
    when Par
      il = indenting_level
      last_il = il
      scripts.map do |s|
        Thread.start do
          Thread.current[:indenting_level] = il
          run_scripts s
          last_il = [Thread.current[:indenting_level], last_il].max
        end
      end.each &:join
      Thread.current[:indenting_level] = last_il
    when Array
      scripts.each{|s| run_scripts s }
    when String
      run_script scripts
    when Proc
      scripts[]
    else
      die "Invalid script (#{scripts.class}): #{scripts}"
    end
  end

  Par = Class.new Array
  def all_scripts
    [
      Par[->{create_tmp_path}, "mkdir -p #{@backup_root_path}/latest", dump_main_db_on_d1,
        dump_tickets_db_on_d1],
      Par[local_docs_sync, local_tickets_files_sync, local_main_db_sync, local_tickets_db_sync],
      Par[main_docs_script, tickets_files_script, main_db_script, tickets_db_script],
    ]
  end

  def dump_main_db_on_d1
    %q{ssh backup@backup-server.com "pg_dump -Fc -f /tmp/main_db.dump } +
      %q{main_db_production"}
  end

  def dump_tickets_db_on_d1
    %q{ssh backup@backup-server.com "pg_dump -Fc -f /tmp/tickets.dump redmine_production"}
  end

  def local_docs_sync
    [
      ->{ create_subvolume local_docmanager, true },
      "rsync -azHq --delete-excluded --delete --exclude doc --inplace " +
        "backup@backup-server.com:/var/main-documents/production/docmanager/ " +
        "#{local_docmanager}/",
    ]
  end

  def local_docmanager
    @local_docmanager ||= "#{@backup_root_path}/latest/docmanager"
  end

  def local_tickets_files_sync
    [
      ->{ create_subvolume local_tickets_files, true },
      "rsync -azq --delete --inplace backup@backup-server.com:/var/redmine/files/ " +
        "#{local_tickets_files}/",
    ]
  end

  def local_tickets_files
    @local_tickets_files ||= "#{@backup_root_path}/latest/tickets-files"
  end

  def local_main_db_sync
    [
      ->{ create_subvolume local_main_db, true },
      "rsync -azq --inplace backup@backup-server.com:/tmp/main_db.dump " +
        "#{local_main_db}/main_db.dump",
    ]
  end

  def local_main_db
    @local_main_db ||= "#{@backup_root_path}/latest/postgres"
  end

  def local_tickets_db_sync
    [
      ->{ create_subvolume local_tickets_db, true },
      "rsync -azq --inplace backup@backup-server.com:/tmp/tickets.dump " +
        "#{local_tickets_db}/tickets.dump",
    ]
  end

  def local_tickets_db
    @local_tickets_db ||= "#{@backup_root_path}/latest/tickets-db"
  end

  def main_docs_script
    create_snapshot_cmd local_docmanager, "#{@tmp_path}/docmanager"
  end

  def create_snapshot_cmd(from, to)
    "btrfs subvolume snapshot #{from} #{to}"
  end

  def main_db_script
    create_snapshot_cmd local_main_db, "#{@tmp_path}/postgres"
  end

  def tickets_db_script
    create_snapshot_cmd local_tickets_db, "#{@tmp_path}/tickets-db"
  end

  def tickets_files_script
    create_snapshot_cmd local_tickets_files, "#{@tmp_path}/tickets-files"
  end

  LAST_DIR_PER_TYPE = {
    'hourly' => 23, 'daily' => 6, 'weekly' => 3
  }.freeze
  def rotate
    last = LAST_DIR_PER_TYPE[@action]
    path = ->(n, action = @action){ "#{@backup_root_path}/#{action}.#{n}" }
    delete_subvolume_if_exists path[last], delete_children: true
    n = last
    while (n -= 1) >= 0
      run_script "mv #{path[n]} #{path[n+1]}" if File.exist?(path[n])
    end
    dest = path[0]
    case @action
    when 'hourly'
      run_script "mv #{@tmp_path} #{dest}"
    when 'daily', 'weekly'
      die 'last hourly back-up does not exist' unless File.exist?(hourly0 = path[0, 'hourly'])
      create_tmp_path
      Dir["#{hourly0}/*"].each do |subvolume|
        run_script create_snapshot_cmd subvolume, "#{@tmp_path}/#{File.basename subvolume}"
      end
      run_script "mv #{@tmp_path} #{dest}"
    end
  end

  def report_completed
    log "Backup finished in #{human_duration Time.now - @start_time}"
  end

  def human_duration(total_time_sec)
    n = total_time_sec.round
    parts = []
    [60, 60, 24].each{|d| n, r = n.divmod d; parts << r; break if n.zero?}
    parts << n unless n.zero?
    pairs = parts.reverse.zip(%w(d h m s)[-parts.size..-1])
    pairs.pop if pairs.size > 2 # do not report seconds when irrelevant
    pairs.flatten.join
  end
end

Backup.new.run(ARGV) if File.expand_path($PROGRAM_NAME) == File.expand_path(__FILE__)
So, this is what I get when running it in test mode:
$ ruby backup.rb test hourly
btrfs subvolume create "/home/rodrigo/backups/tmp"
mkdir -p /home/rodrigo/backups/latest
ssh backup@backup-server.com "pg_dump -Fc -f /tmp/main_db.dump main_db_production"
ssh backup@backup-server.com "pg_dump -Fc -f /tmp/tickets.dump redmine_production"
btrfs subvolume create "/home/rodrigo/backups/latest/docmanager"
btrfs subvolume create "/home/rodrigo/backups/latest/tickets-files"
btrfs subvolume create "/home/rodrigo/backups/latest/postgres"
btrfs subvolume create "/home/rodrigo/backups/latest/tickets-db"
rsync -azHq --delete-excluded --delete --exclude doc --inplace backup@backup-server.com:/var/main-documents/production/docmanager/ /home/rodrigo/backups/latest/docmanager/
rsync -azq --delete --inplace backup@backup-server.com:/var/redmine/files/ /home/rodrigo/backups/latest/tickets-files/
rsync -azq --inplace backup@backup-server.com:/tmp/main_db.dump /home/rodrigo/backups/latest/postgres/main_db.dump
rsync -azq --inplace backup@backup-server.com:/tmp/tickets.dump /home/rodrigo/backups/latest/tickets-db/tickets.dump
btrfs subvolume snapshot /home/rodrigo/backups/latest/tickets-db /home/rodrigo/backups/tmp/tickets-db
btrfs subvolume snapshot /home/rodrigo/backups/latest/tickets-files /home/rodrigo/backups/tmp/tickets-files
btrfs subvolume snapshot /home/rodrigo/backups/latest/postgres /home/rodrigo/backups/tmp/postgres
btrfs subvolume snapshot /home/rodrigo/backups/latest/docmanager /home/rodrigo/backups/tmp/docmanager
mv /home/rodrigo/backups/tmp /home/rodrigo/backups/hourly.0
The “all_scripts” method is the one you should adapt for your needs.
I hope this script can serve as a base for your own back-up script in Ruby, in case I was able to convince you to give this strategy a try. Unless you are already using some robust back-up solution such as Bacula or another advanced system, this strategy might interest you: it's very simple to implement, takes little space and allows for fast incremental back-ups.
Please let me know in the comments section if you have any questions or would suggest any improvements. And if you think you've found a bug, I'd love to hear about it.
Good luck dealing with your back-ups. :)