I have a client application that calls an API endpoint fronting a managed Service Fabric cluster. I want to expose that endpoint, but I want to secure it with a signed X.509 certificate so there's an appropriate chain of trust.

Previously, I had done this using a certificate thumbprint. As it happens, the cert expired, which meant I needed to deploy another cert. And since that new cert had a different thumbprint (because the thumbprint is merely a hash of the cert's contents), my app broke.

The API is a stateless service that is hosted on every node of the cluster. It sits behind the load balancer and exposes a port to which my client app can connect. This port is secured with the following code snippets.

This function creates the service instance listener. Note that listenOptions.UseHttps calls GetCertificateFromStore.
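
Mine is modeled on the stock Service Fabric ASP.NET Core stateless service template - a sketch, where Api, Startup, and ServiceEndpoint are placeholder names for whatever your project uses:

    using System.Collections.Generic;
    using System.Fabric;
    using System.IO;
    using System.Net;
    using Microsoft.AspNetCore.Hosting;
    using Microsoft.ServiceFabric.Services.Communication.AspNetCore;
    using Microsoft.ServiceFabric.Services.Communication.Runtime;
    using Microsoft.ServiceFabric.Services.Runtime;

    internal sealed class Api : StatelessService
    {
        public Api(StatelessServiceContext context) : base(context) { }

        protected override IEnumerable<ServiceInstanceListener> CreateServiceInstanceListeners()
        {
            return new[]
            {
                new ServiceInstanceListener(serviceContext =>
                    new KestrelCommunicationListener(serviceContext, "ServiceEndpoint", (url, listener) =>
                        new WebHostBuilder()
                            .UseKestrel(options =>
                            {
                                // Listen on the port declared for ServiceEndpoint in ServiceManifest.xml.
                                int port = serviceContext.CodePackageActivationContext
                                    .GetEndpoint("ServiceEndpoint").Port;
                                options.Listen(IPAddress.IPv6Any, port, listenOptions =>
                                {
                                    // Secure the endpoint with the cert found by common name,
                                    // not by thumbprint.
                                    listenOptions.UseHttps(GetCertificateFromStore());
                                });
                            })
                            .UseContentRoot(Directory.GetCurrentDirectory())
                            .UseStartup<Startup>()
                            .UseUrls(url)
                            .Build()))
            };
        }
    }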

This function is how the cert is actually returned:
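
Roughly, in the same class as the listener above (the subject names are placeholders; picking the furthest expiration date mirrors what SF does when several certs share a common name):

    // Requires System, System.Linq, and System.Security.Cryptography.X509Certificates.
    private static X509Certificate2 GetCertificateFromStore()
    {
        // Common name of the cert that secures the endpoint (placeholder).
        var subjectName = "XXXXXXXXXX.com";

    #if DEBUG
        // Working locally? Use the self-signed dev cert instead (placeholder).
        subjectName = "dev.XXXXXXXXXXXXXXXXXXX";
    #endif

        // Certificates - Local Computer/Personal, a.k.a. LocalMachine/My.
        using (var store = new X509Store(StoreName.My, StoreLocation.LocalMachine))
        {
            store.Open(OpenFlags.ReadOnly | OpenFlags.OpenExistingOnly);

            // Find by subject name, then prefer the furthest expiration date
            // so a newly dropped-in replacement cert wins automatically.
            var cert = store.Certificates
                .Find(X509FindType.FindBySubjectName, subjectName, validOnly: false)
                .Cast<X509Certificate2>()
                .OrderByDescending(c => c.NotAfter)
                .FirstOrDefault();

            if (cert == null)
            {
                throw new InvalidOperationException(
                    $"No certificate matching '{subjectName}' in LocalMachine/My.");
            }

            return cert;
        }
    }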

I have created a dev cert that developers can use when working locally. Otherwise, the function searches by that statically defined subjectName variable - which corresponds to the common name of the cert I want to use.

So why common name?

When I have to update my cert (and I will no matter what because they all expire at some point), I can KEEP the common name the same. I can even drop the cert into the key vault ahead of time and SF will automatically select whatever cert (with the same common name) has the furthest expiration date. Therefore, in both cases (DEBUG or otherwise), I can get my cert by common name and not worry about changing thumbprints or waiting until expiration of a cert before adding an updated version to the key vault.

If I were only running locally, this would be almost all I need to do to secure my endpoint. The problem is that when Kestrel tries to use that cert, it does so running as a specific user - usually NETWORK SERVICE. For a user to employ a cert, it must have access to that cert's private key.

I'm assuming at this point that the X.509 cert has been imported into the Certificates - Local Computer/Personal store - this is the same thing as LocalMachine/My. This is important because you'll need to set the permissions on the cert, and this is one of the stores that lets you actually do that. It might be the only one... but I'm not going to prove it. ;)

Right-click on the cert you're planning to use to secure your HTTPS endpoint and open the permissions dialog for its private key (All Tasks > Manage Private Keys).

Then add the NETWORK SERVICE account to the Group or User names list and assign Read access.

At this point, you have all you need to run locally. 

RUNNING in the cloud? Different story - read on.

In a regular Service Fabric (SF) cluster, you could replicate this permissions dance for each node in your cluster and be OK. But it's not ideal - and it's downright impossible in a managed Service Fabric cluster, where you can't readily get to the individual machines in the scale set.

In this case, you need to allow Service Fabric to set that "Read" access on your cert.

  1. In the PackageRoot/ServiceManifest.xml file, you should have something like this:
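
    (ApiPkg and ServiceEndpoint are placeholder names; this is trimmed to the relevant Resources section, and the port matches the 8986 endpoint used later.)

    <!-- PackageRoot/ServiceManifest.xml (abridged) -->
    <ServiceManifest Name="ApiPkg" Version="1.0.0"
                     xmlns="http://schemas.microsoft.com/2011/01/fabric">
      ...
      <Resources>
        <Endpoints>
          <Endpoint Name="ServiceEndpoint" Protocol="https" Type="Input"
                    Port="8986" CertificateRef="ApiCert" />
        </Endpoints>
      </Resources>
    </ServiceManifest>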

    For running locally, the CertificateRef attribute is optional. For running in the cloud, it's what links this service to some important information in the ApplicationManifest.

  2. So let's look:
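
    Abridged, and with placeholder names (SFAppType, ApiPkg, NetworkServiceAccount), my ApplicationManifest is shaped like this:

    <!-- SFApp/ApplicationPackageRoot/ApplicationManifest.xml (abridged) -->
    <ApplicationManifest ApplicationTypeName="SFAppType" ApplicationTypeVersion="1.0.0"
                         xmlns="http://schemas.microsoft.com/2011/01/fabric">
      <Parameters>
        <Parameter Name="CertToUse" DefaultValue="*.XXXXXXXXXX.com" />
      </Parameters>
      <ServiceManifestImport>
        <ServiceManifestRef ServiceManifestName="ApiPkg" ServiceManifestVersion="1.0.0" />
        <Policies>
          <SecurityAccessPolicies>
            <!-- Grant the NETWORK SERVICE principal Read access to ApiCert. -->
            <SecurityAccessPolicy ResourceRef="ApiCert" PrincipalRef="NetworkServiceAccount"
                                  GrantRights="Read" ResourceType="Certificate" />
          </SecurityAccessPolicies>
        </Policies>
      </ServiceManifestImport>
      <Principals>
        <Users>
          <User Name="NetworkServiceAccount" AccountType="NetworkService" />
        </Users>
      </Principals>
      <Certificates>
        <!-- [CertToUse] resolves to the Parameter declared above. -->
        <SecretsCertificate Name="ApiCert" X509FindType="FindBySubjectName" X509FindValue="[CertToUse]" />
      </Certificates>
    </ApplicationManifest>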

    The CertificateRef parameter from the ServiceManifest file is going to link up with the SFApp/ApplicationPackageRoot/ApplicationManifest.xml file. This mapping is telling the cluster that the ApiCert:

    1. Is a certificate reference - ResourceType="Certificate"
    2. Should be found by subject name - X509FindType="FindBySubjectName"
    3. Is bound to a security access policy that links a principal user to the cert - AccountType="NetworkService".
    4. Is going to use the cert subject name supplied elsewhere in the file via the placeholder [CertToUse].

    The CertToUse parameter is also supplied by the ApplicationManifest towards the top of the file in the <Parameters> section like so:

    <Parameter Name="CertToUse" DefaultValue="*.XXXXXXXXXX.com" />

  3. The main reason for setting that up as a parameter is that I can override the default value in the ApplicationParameters folder - under, for example, Local.1Node.xml.

    Even though I've specifically set the NETWORK SERVICE read permissions on the local development cert, I can still use this pattern. In fact, I have to, since the ApplicationManifest outright declares that I'll be binding a cert to a principal and, ultimately, a microservice's endpoint. Unless I override this setting in my Local.1Node.xml file, it'll try to bind the production cert, which I don't have on my development machine (I'm only using self-signed certs for dev).

    So, that Local.1Node.xml file has this line:

    <Parameter Name="CertToUse" Value="dev.XXXXXXXXXXXXXXXXXXX" />

It seems silly that I have to specify which cert I want to use both in the source code (to Kestrel) AND in the application manifest. And it is... if I only want to run locally and can set the read permission on the cert in question for the account under which the application runs (NETWORK SERVICE). Otherwise, I need both.

But we're not done... since we're running in the cloud, we have to get the certificate we want installed on to the cluster. 

This is NOT the same thing as a management certificate used with SFX (Service Fabric Explorer). It could be - but that's ill-advised. We want a dedicated, fully signed X.509 certificate backing our API endpoint. At this point, we've established the manifest files and edited the source code to request and access the right cert.

Now - we need to actually make sure the cert gets out to the cluster. For this, we'll use the Azure Key Vault. 

  1. Get a certificate. You can procure one from any certificate authority. Microsoft actually brokers them for App Services, but I'm not sure how that works beyond the direct App Service integration. For this project, I just uploaded one we already bought from GoDaddy.
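
    From the command line, something like this gets it where it needs to go (every name below is a placeholder): import the PFX into Key Vault, then install that Key Vault cert onto the managed cluster's node type so it lands in LocalMachine/My on every node.

    # Import the purchased PFX into Key Vault.
    az keyvault certificate import \
        --vault-name MyKeyVault \
        --name ApiCert \
        --file wildcard-cert.pfx \
        --password "<pfx-password>"

    # Install the Key Vault cert onto the managed cluster's node type
    # (-g resource group, -c cluster name, -n node type name).
    az sf managed-node-type vm-secret add \
        -g MyResourceGroup \
        -c MyCluster \
        -n NodeType1 \
        --source-vault-id "/subscriptions/<sub-id>/resourceGroups/MyResourceGroup/providers/Microsoft.KeyVault/vaults/MyKeyVault" \
        --certificate-url "https://mykeyvault.vault.azure.net/secrets/ApiCert/<version>" \
        --certificate-store My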

  2. Ownership of the domain for the signed certificate is mission critical. Unless you somehow got a cert for ...eastus.cloudapp.azure.com, you've probably created a CNAME alias. If you haven't, let's do that now in GoDaddy. Other DNS managers are probably similar.

  3. Here, I'm creating the CNAME record that establishes the alias for my domain: dev.MYDOMAIN.com is going to point to my cluster at eastus.cloudapp.azure.com.
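
    In zone-file terms, the record is just this (GoDaddy's UI asks for the host, dev, and the "points to" value):

    ; dev.XXXXXXXXX.com -> the cluster's Azure FQDN
    dev    IN    CNAME    xxxxxxxxxxx.eastus.cloudapp.azure.com.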

  4. The final thing I need to do is map the load balancer in the cluster so HTTPS traffic (default port 443) can find my exposed endpoint in the cluster.
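
    With a managed cluster, you don't edit the load balancer directly - the rule is declared on the cluster resource itself. In an ARM template for Microsoft.ServiceFabric/managedClusters, it's a fragment like this (the probe settings are just the defaults I'd start with):

    "loadBalancingRules": [
      {
        "frontendPort": 443,
        "backendPort": 8986,
        "protocol": "tcp",
        "probeProtocol": "tcp"
      }
    ]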

  5. Now, when my client app wants to hit my SF API endpoint, it reaches out to my fully qualified domain through a DNS lookup. The cert that secures that domain is mapped to that same domain, XXXXXXXXXX.com. The cert has the appropriate runtime permissions and bindings to the 8986 endpoint because of the GetCertificateFromStore function as well as the callouts in the manifest files shown in steps 1 and 2. The load balancer transfers that generic HTTPS traffic (mapped by the DNS CNAME from dev.XXXXXXXXX.com to xxxxxxxxxxx.eastus.cloudapp.azure.com) from port 443 to port 8986. And finally, we can make the required API function call.

Easy.
