A lot of users ask us if Rockstor can do replication, which is completely understandable. Who doesn’t need disaster recovery? Local redundancy is a must, and it is provided by the software RAID levels available in Rockstor. But that only helps with disaster scenarios caused by broken disk drives and the like. What about replicating data from one Rockstor appliance to another, efficiently and automatically on a schedule? Enter Rockstor’s replication framework.
The replication framework is a very useful feature and there are many aspects to it. It would be very confusing to write about all of it in one post, so I’ll focus on the basics here and cover topics such as recovering data, implementation details and others in future posts.
So what exactly is it?
Replication has different meanings in different storage products. In Rockstor, replication refers to a fast and efficient periodic backup of Shares to another Rockstor appliance. In tech speak, it’s called Asynchronous Replication.
Say you have one Rockstor appliance holding your data in multiple Shares. Furthermore, let’s say that one of those Shares is very important and you’d like it to be mirrored on another Rockstor appliance in a different building or a remote branch office. You can schedule a replication task in the web-ui so that your Share is backed up every few minutes, every few hours, or at some other interval you choose. The local Share is then mirrored by a remote Share. In case of a local disaster, the data contained (and possibly destroyed) in your local Share can be recovered from its mirror copy. An important note, however, is that the remote Share does not contain any changes made locally after the last backup.
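To make that last point concrete, here is a minimal Python sketch. It is not Rockstor code; the file names, timestamps and the helper recoverable_from_mirror are made up for illustration. It simply checks which local changes fall after the last replicated snapshot and so would be missing from the mirror.

```python
from datetime import datetime

def recoverable_from_mirror(change_time, last_backup_time):
    """A change is on the mirror only if it was made before the last backup."""
    return change_time <= last_backup_time

# Illustrative values only: the last replication ran at 12:00.
last_backup = datetime(2015, 1, 1, 12, 0)
changes = {
    "report.doc": datetime(2015, 1, 1, 11, 45),  # before the last backup
    "notes.txt": datetime(2015, 1, 1, 12, 7),    # after the last backup
}

# Changes made after the last backup exist only on the local Share.
lost = sorted(
    name for name, when in changes.items()
    if not recoverable_from_mirror(when, last_backup)
)
print(lost)
```

The shorter the replication interval, the smaller this window of unreplicated changes.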
How does it work?
As many of our users know, the underlying filesystem in Rockstor is BTRFS. One of the many super cool features of BTRFS is send/receive. It is the foundation on which Rockstor’s replication framework is built.
Suppose a Share is scheduled to be replicated every hour. A replication task will run every hour as scheduled. The first replication task runs right away and when it does, the first thing it does is create a Snapshot of the Share, a point-in-time read-only copy of the Share. Because of BTRFS’s copy-on-write capabilities, the Snapshot is created almost instantaneously and initially occupies practically no extra space. Then it copies the entire contents of the Snapshot to the remote Rockstor appliance, where the Share is reconstructed. So the first replication task can take a while, proportional to the amount of data in the Share and subject to constraints like network throughput. But the next replication task that runs an hour later creates a new Snapshot and sends only the changes since the last Snapshot taken an hour ago. The receiving side reconstructs the mirror from these changes and subsequent replication tasks follow the same lockstep pattern.
Cool! How do I set up replication?
It all sounds great, but is it easy to schedule a replication policy for a Share? Is it easy to recover data when needed? Is the framework resilient to network and other errors? The answer to all these questions is YES! Rockstor’s replication framework addresses all these concerns to provide a simple, reliable and resilient mechanism.
I’ll give a quick setup tutorial here, but for more information read the Rockstor documentation.
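For the curious, the snapshot-then-send pattern described above can be sketched with the plain btrfs command line tools. This is only a rough illustration of BTRFS send/receive, not Rockstor’s actual implementation: the paths are hypothetical, the helper names are mine, and the sketch only builds and prints the commands rather than running them.

```python
import shlex

def snapshot_cmd(share_path, snap_path):
    # -r creates a read-only snapshot; btrfs send only works on read-only ones.
    return ["btrfs", "subvolume", "snapshot", "-r", share_path, snap_path]

def send_cmd(snap_path, parent_snap=None):
    # With -p, only the differences against the parent snapshot are sent.
    cmd = ["btrfs", "send"]
    if parent_snap is not None:
        cmd += ["-p", parent_snap]
    cmd.append(snap_path)
    return cmd

def receive_cmd(dest_dir):
    # Runs on the receiving side and rebuilds the snapshot under dest_dir.
    return ["btrfs", "receive", dest_dir]

# Hypothetical paths, for illustration only.
share = "/mnt/pool/important"
snap1 = "/mnt/pool/snaps/important_1"
snap2 = "/mnt/pool/snaps/important_2"

# First run: full transfer of the first snapshot.
first_run = [snapshot_cmd(share, snap1), send_cmd(snap1)]
# Later run: new snapshot, send only the changes since snap1.
next_run = [snapshot_cmd(share, snap2), send_cmd(snap2, parent_snap=snap1)]

for cmd in first_run + next_run + [receive_cmd("/mnt/backup_pool")]:
    print(shlex.join(cmd))
```

The output of send is a stream, so in practice it is piped over the network into receive on the other machine. The incremental form works only if the parent snapshot already exists on the receiving side, which is why the tasks have to run in lockstep.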
Step 1: Pair two Rockstor appliances
In order for replication to work, there must be two Rockstor appliances that are aware of each other. Let’s call them Source and Destination. We’ll set up a Share on Source to be replicated on Destination every 10 minutes.
Log in to the web-ui of Source and click on the hostname/ip right below the logo in the top left corner. This takes you to the Appliances screen where you add a new appliance by clicking the Add Appliance button.
Once you click submit in the Add Appliance form, the Source connects to Destination using the admin credentials provided and upon success, adds it. Repeat this process on Destination and add Source as an appliance. Replication will fail if the two appliances are not aware of each other.
Step 2: Turn on Replication service
This can be done from the System -> Services screen in the web-ui. If the service is not enabled, you’ll be warned while adding the replication tasks described in the next step.
Step 3: Create a replication task
Replication tasks are like cron jobs: you can schedule one task for every Share you want replicated. This can be done by clicking the Add Replication Task button on the Storage -> Replication -> Send screen. Select the Share you want replicated from the drop-down. The Target appliance should be the IP of Destination in our case, and it should already be selected if there are no other appliances. Don’t change the default replication ports unless you are setting up replication over WAN (I’ll cover this in a different post). For the Target pool name, provide the name of a pool on Destination that you want to house the backup in.
Step 4: Watch the progress
Once the replication task is added, the first transfer from Source to Destination is triggered. This could take a while if the Share is large. On the Storage -> Replication -> Send screen, you’ll now see a table with details of the newly added replication task. Click on the link in the Last backup column for a detailed trail of the replication tasks that were executed so far. In my example, there are four failed attempts followed by the first successful attempt. The failed attempts occurred because I did not add the Source appliance to the Destination appliance. Once I did that, everything started working as expected.
You can see similar information on the Destination appliance as replication tasks execute over time. To see the receiving side of things, go to the Storage -> Replication -> Receive screen. Make sure to do this on the Destination appliance and not the Source. You’ll see the Destination side information of the replication task we added earlier. Click on the link in the Last Receive column to display the trail of replication tasks as seen from the receiving side.
We don’t see any information about the first four failed attempts because they couldn’t even be recorded: the Destination appliance was not aware of the sender and rejected it. Once we added the Source appliance, it started playing nice.
There’s more to write about, including setting up replication over WAN, resilience features, etc. I plan to cover them in separate posts soon.
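If it helps, here is a toy Python model of why the two trails differ. It is purely illustrative and not how Rockstor is implemented; the Appliance class and its methods are hypothetical. The sender records every attempt, while the receiver rejects unknown senders before recording anything.

```python
class Appliance:
    """Toy model of an appliance that only talks to appliances it knows."""

    def __init__(self, name):
        self.name = name
        self.known = set()        # names added via "Add Appliance"
        self.send_trail = []      # what the Send screen would list
        self.receive_trail = []   # what the Receive screen would list

    def add_appliance(self, other):
        self.known.add(other.name)

    def accept_from(self, source):
        if source.name not in self.known:
            return False          # rejected before anything is recorded
        self.receive_trail.append("succeeded")
        return True

    def replicate_to(self, dest):
        if dest.name not in self.known:
            raise ValueError("target appliance has not been added")
        ok = dest.accept_from(self)
        self.send_trail.append("succeeded" if ok else "failed")
        return ok


source = Appliance("Source")
destination = Appliance("Destination")

source.add_appliance(destination)   # pairing done in one direction only
for _ in range(4):
    source.replicate_to(destination)

destination.add_appliance(source)   # the missing half of Step 1
source.replicate_to(destination)

print(source.send_trail)
print(destination.receive_trail)
```

In this model the Source ends up with four failed entries and one successful one, while the Destination shows only the successful transfer, mirroring what I saw on the two screens.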