Backup and restore of a Managed Service Fabric Cluster
A key tenet of microservices is the concept that we store data as close to the microservice as possible. Some people espouse a database-per-service architecture, but I would argue that's really not in the spirit of microservices.
Service Fabric (SF) comes to the rescue by providing stateful storage: a somewhat nebulous world where data exists both in memory and on disk, wherever the service replicas live.
The "plus" side about a database-per-service concept is you can use conventional tried-and-true database tooling to get at the raw information, run backups, do reports, queries, etc...
This is not a thing with SF. You've got any number of reliable collections under the control of some stateful service instance, and ONLY that service can access the information.
That's scary enough. Theoretically, there IS a tool out there called the Service Fabric Backup Explorer (Preview). But as you can see, it's in "preview" and has been for four years. It has aged like a fine wine doesn't.
So if we can't poke at our own data with some sort of SSMS-esque tool, at least we can ensure the data can be backed up and readily restored - provably.
Backup
Enable backups on the cluster
You're going to need a storage account for the backups...
In this example, I'm using plain old Azure Blob storage, but - allegedly - you can also use Managed Identity Azure Blob Store (whatever that is) or a File Share.
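If you'd rather script this part, here's a minimal sketch using the Az PowerShell module. The resource group, account name, and region are placeholders of mine; swap in your own.

```powershell
# Create a storage account plus a "backups" container for the cluster to write into.
# Resource group, account name, and region below are placeholders.
$rg   = "my-sf-rg"
$acct = "mysfbackups"    # storage account names must be globally unique and lowercase

New-AzStorageAccount -ResourceGroupName $rg -Name $acct `
    -Location "eastus" -SkuName "Standard_LRS"

# Create the container the backup policy will point at.
$ctx = (Get-AzStorageAccount -ResourceGroupName $rg -Name $acct).Context
New-AzStorageContainer -Name "backups" -Context $ctx -Permission Off

# Emit the connection string you'll paste into the backup policy.
$key = (Get-AzStorageAccountKey -ResourceGroupName $rg -Name $acct)[0].Value
"DefaultEndpointsProtocol=https;AccountName=$acct;AccountKey=$key;EndpointSuffix=core.windows.net"
```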

You have to create a backup policy on the cluster
Like anything else in Azure, there is probably a way to PowerShell the hell out of all this, and at some point, when scale becomes a consideration, I'll look into that. For the time being, I'm just going to use Service Fabric Explorer (SFX) to create my policy (though I've sketched the scripted equivalent after the field list below).
You can create many different policies and spread them out according to need, but for my purposes, I really only need one backup policy that keeps my cluster reasonably safe from corruption. To do that, I'm going to ask for EVERYTHING to be backed up on a routine basis.
- Name is self-explanatory. Call it what you like
- AutoRestoreOnDataLoss is also self-explanatory. However, I would suggest caution: if the cluster detects data loss (however it does that), it will initiate a restore process that may be very helpful. OR, if that data requires some level of coherence with other data or external services outside the cluster, it could be a total cluster...phuck.
- MaxIncrementalBackups sets the number of backups you have between full backups. So, if you set it to 5, you can expect your backup sets to look like this where F is full and I is incremental or partial: "F I I I I I F I I I I I F..."
- ScheduleKind sets how the backups should proceed. If you ask for "TimeBased", you can set a daily or weekly schedule along with a list of run times. I USED "FrequencyBased", whereby I simply specify how often I want the backup to run, irrespective of time of day.
- Interval (for FrequencyBased) is a string in ISO8601 duration format. Basically, you specify a periodicity which can span years to seconds. In my case, I want a new backup (Full or incremental as the sequence goes) every 10 minutes so I use PT10M. If I wanted it to go every 3 days and 10 minutes, I would've put P3DT10M. But that would be weird for a backup...
- StorageKind is Azure Blob Store. I set that up in the previous step.
- FriendlyName is whatever I want to call that blob store.
- ConnectionString is the connection string of the storage account that holds the blob container.
- ContainerName is "backups". Just made it.
- RetentionPolicy is optional. I chose to use it.
- MinNumberOfBackups is the minimum number of backups SF will always keep around, no matter how old they get. It works hand-in-hand with...
- RetentionDuration, which is how long any given backup is kept before it's eligible to be aged out.
Let's say I have 20 as the minimum number of backups and a retention duration of 5 days. SF will keep every backup for up to 5 days...however, if the number of backups on hand (because of a long periodicity between them - say 36 hours?) doesn't exceed 20, then all of them are kept regardless of the retention duration. In other words, until I have more than 20 backup files to worry about, it will not age them out.
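Since I mentioned PowerShelling this: the Backup Restore Service is driven through the cluster's HTTP management endpoint (port 19080), so you can create the policy with Invoke-RestMethod. Treat this as a sketch, not gospel - the cluster address, certificate thumbprint, application name, and connection string are all placeholders, and you should verify the field names and api-version against the current REST docs.

```powershell
# Create the backup policy against the cluster's management endpoint.
# Managed clusters use client-certificate auth; the thumbprint is a placeholder.
$cluster = "https://mycluster.eastus.cloudapp.azure.com:19080"
$cert    = Get-Item "Cert:\CurrentUser\My\<CLIENT-CERT-THUMBPRINT>"

$policy = @{
    Name                  = "EverythingEvery10Min"
    AutoRestoreOnDataLoss = $false
    MaxIncrementalBackups = 5
    Schedule = @{
        ScheduleKind = "FrequencyBased"
        Interval     = "PT10M"        # ISO8601 duration, per the list above
    }
    Storage = @{
        StorageKind      = "AzureBlobStore"
        FriendlyName     = "MyBackupBlobStore"
        ConnectionString = "<STORAGE-CONNECTION-STRING>"
        ContainerName    = "backups"
    }
    RetentionPolicy = @{
        RetentionPolicyType    = "Basic"
        MinimumNumberOfBackups = 20
        RetentionDuration      = "P5D"
    }
} | ConvertTo-Json -Depth 5

Invoke-RestMethod -Method Post -Certificate $cert `
    -Uri "$cluster/BackupRestore/BackupPolicies/`$/Create?api-version=6.4" `
    -Body $policy -ContentType "application/json"

# Then point the application at the policy ("MyApp" is a placeholder app ID).
$enable = @{ BackupPolicyName = "EverythingEvery10Min" } | ConvertTo-Json
Invoke-RestMethod -Method Post -Certificate $cert `
    -Uri "$cluster/Applications/MyApp/`$/EnableBackup?api-version=6.4" `
    -Body $enable -ContentType "application/json"
```

Note that backup can be enabled at the application, service, or partition level; enabling it at the application level, as above, matches my "back up EVERYTHING" policy.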
Restore
Now that we have SF automatically creating periodic backups, how do we restore them?
In my simple(ish) case, I'm only dealing with single partitions across my various services. I haven't found a way to restore the entire application at once from SFX, but there's probably a way to do it in PowerShell (I've sketched the REST equivalent after these steps). For this example, I'm going to show how to restore a specific service to a previous state.
- Go to SFX and navigate to the service you want to restore.
- Click on the partition, then "Backups", and scroll down. You may not see any backups at first because they're framed by a temporal constraint. Adjust the start and end dates/times as necessary to expose a set of backups you can use.
- Here, I'm opting to restore a full backup, but you can see there are also incrementals. Obviously, incremental restores are faster. Clicking on that full backup, I get this:
- Note the "Restore" button circled in red. That initiates the process.
- When the process kicks off, it'll bring the entire service down one replica at a time. During that time, I expect the entire service to go offline rather than try some sort of rolling update, although I could be wrong; I haven't really tested this. Just monitor the service for "Active" after it goes into quorum loss - that's the signal that the backup has been restored.
One thing to point out (and it's probably obvious) is that if you change the schema of a reliable collection, there's a high probability the system cannot seamlessly restore a backup. In other words, if you've declared an object with three fields, are saving thousands of them in a reliable collection, and making backups, AND you then change that object to five fields and push an update, SF will probably wipe out your collection. Furthermore, if you try to restore a backup to that service, it will fail because the signature of the reliable collection objects has changed.
So, backup and restore is really just for that. Backup. Restore. Bad things happened, so fix it. It does not handle data migrations.
If anybody has some spare time, getting the service fabric backup explorer up to date and functional would be a useful diversion. How hard could it be?