We Built Our Own Authorization Service

Sage • November 2, 2021

Setting the scene

Sage is on a mission to provide an integrated operating system for senior care, and ultimately to provide the best quality of life for older adults that modern technology can offer. Diving into the nuances of the domain, along with how we arrived at our current solution, is a topic for another blog series. Suffice it to say, the domain is complex but has lots of room for immediate improvement. Each of these improvements comes with its own set of challenges that the Sage platform must overcome to be effective as it scales.

The complexity and hierarchical nature of the commercial senior care facilities we’re deploying in pose one of our biggest challenges. Many times these facilities are part of a larger senior care network with organizational hierarchy that extends above single-facility management. It takes many different people doing different jobs to keep operations like these humming. For another layer of complexity, the nature of these businesses requires the production and management of highly-sensitive personally-identifiable information (PII) and protected health information (PHI) pertaining to the residents and staff of these facilities. At a minimum, it is our responsibility to ensure that only senior living staff that require access to this sensitive information ever get it, and that we restrict access to only the information that is relevant to performing their job function.

In order to address this problem, we require a system for authorization that effectively manages this access. Below I’ll discuss more about our requirements for this authorization system, why we decided to build our own, and how we did it.

Prior art and why we built our own thing

There are a few solutions out in the wild that do roughly what we want, but none that are a perfect match. We found that the products that offer the simplest solutions aren’t expressive enough in their representation of permissions to accurately represent our authorization model. Some don’t offer all of the guarantees that we need.

There are several robust offerings as well, but the major issue with them is that they’re complex pieces of software that require specific knowledge to wield effectively. They can also be tricky to deploy and manage. Understanding these tools well enough to use them, implement our model, and manage the infrastructure constituted a large risk for where we are as a company.

A final reason, and this shouldn’t be undersold, is that by relying on a third-party provider for our permissions management we’d effectively be signing up to have them be the “database” for all of our access controls. This fact should be really scary when considering how this software is going to evolve. As we build out more features and need to perform increasingly complex operations for our users, we need to keep our permissions persistence in sync with everything else in the platform. This becomes much harder when relying on a third party that you can’t control to host the API and data. For auxiliary or specialized functionality (e.g. Twilio sending SMS messages via a proprietary network) this can make sense, but it was difficult to justify for such a critical piece of our infrastructure.

Because we’re small and in a hurry, and our time-boxed search didn’t turn up any solutions that we were thrilled with, we decided to explore what it would look like to implement our own thing. It ended up taking us two weeks to build an MVP that we have complete control over. With minor improvements, it’s still humming along four months later with live customers on our platform.

A quick aside on authorization vs. authentication

The rest of this post discusses the decisions we made around the design of our authorization system, but it would be helpful to quickly describe what I mean by authorization. First, some definitions:

Authentication (auth-n): who are you? Authentication is a means for identity verification, and is akin to checking a fingerprint or photo ID.
Authorization (auth-z): are you allowed to be here? Authorization is a means for determining if an authenticated user is allowed to perform the action that they are asking to perform on the resource(s) they want to perform the action on.

Authentication is a separate hard problem from authorization, but it has been solved many times by very talented groups of people. At Sage, we have no interest in building and managing our own means for authentication, and wouldn’t do it as well as others have anyway. We decided to leverage Auth0 as our identity provider and are very happy with how simple it is to integrate with on the identity management side.

Sage’s APIs consume JWT bearer tokens in the Authorization header in each request to verify the identity of the caller. We can use the verified identity included in the bearer token to then check the authorization of the caller to perform the requested operation. Adopting bearer tokens and ensuring that their use is ubiquitous has made reasoning about our security (at least on the authentication side) about as simple as it can be. Do you have a valid token? Great! You don’t? Go away.
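As a rough illustration of the first step in that flow, here is a minimal Java sketch that pulls a bearer token out of an Authorization header value. The class and method names are hypothetical, and the sketch deliberately stops short of verifying the token: checking the JWT signature and claims would be done with a JWT library and the identity provider's keys, which is not shown here.

```java
import java.util.Optional;

/** Hypothetical sketch; names are illustrative and not taken from Sage's code. */
final class BearerTokenSketch {

    private static final String PREFIX = "Bearer ";

    /**
     * Returns the raw token from a header value such as "Bearer abc.def.ghi".
     * This only extracts the token. It does NOT verify the signature or claims.
     */
    static Optional<String> extractBearerToken(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        String trimmed = headerValue.trim();
        // The auth scheme is compared case-insensitively.
        boolean hasPrefix = trimmed.length() > PREFIX.length()
                && trimmed.regionMatches(true, 0, PREFIX, 0, PREFIX.length());
        if (!hasPrefix) {
            return Optional.empty();
        }
        String token = trimmed.substring(PREFIX.length()).trim();
        return token.isEmpty() ? Optional.empty() : Optional.of(token);
    }

    public static void main(String[] args) {
        System.out.println(extractBearerToken("Bearer abc.def.ghi").isPresent()); // true
        System.out.println(extractBearerToken("Basic abc").isPresent());          // false
    }
}
```

A request with no usable token would be rejected before any authorization check is attempted.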

Some (more) definitions

There are a few terms that are used throughout the rest of this post that warrant some explanation:

Resource: any instance of a “noun” in the Sage platform that may require specific access controls (e.g. facility, staff member, resident, unit, etc). Instances of resources and resource types are not the same. We designed our system to grant permissions based on specific instances of resource types.
Scope: a specific action that an application is allowed to take on behalf of a user (e.g. resident:read-phi). The first part of the scope references the resource type that the scope operates on. The second part of the scope is the specific permission that is granted. In the example, the resource type is resident and the permission is read-phi, granting whoever has the scope permission to read resident PHI.
Actor: the user (human or robot) performing an action. All actors are resources within the Sage platform that are able to act on other resources. Actors will each have associated scopes for a set of resources (or not).
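To make the two-part scope format concrete, here is a minimal Java sketch that splits a scope string into its resource type and permission. The `Scope` record and its `parse` method are hypothetical names chosen for illustration, and splitting at the first colon is an assumption about the format based on the resident:read-phi example.

```java
import java.util.Objects;

/** Hypothetical sketch; names are illustrative and not taken from Sage's code. */
final class ScopeSketch {

    /** A scope such as "resident:read-phi": a resource type plus a permission. */
    record Scope(String resourceType, String permission) {

        /** Splits "<resource-type>:<permission>" at the first colon. */
        static Scope parse(String raw) {
            Objects.requireNonNull(raw, "raw");
            int colon = raw.indexOf(':');
            // Reject inputs with no colon, an empty type, or an empty permission.
            if (colon <= 0 || colon == raw.length() - 1) {
                throw new IllegalArgumentException(
                        "Expected <resource-type>:<permission>, got: " + raw);
            }
            return new Scope(raw.substring(0, colon), raw.substring(colon + 1));
        }

        @Override
        public String toString() {
            return resourceType + ":" + permission;
        }
    }

    public static void main(String[] args) {
        Scope scope = Scope.parse("resident:read-phi");
        System.out.println(scope.resourceType()); // resident
        System.out.println(scope.permission());   // read-phi
    }
}
```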

What properties should this system have?

The complexity of senior living organizations along with the strict requirements around the handling of PHI inform the characteristics that the system must have, which are:

We must be able to model many different combinations of permissions because of the size and diversity of the organizations we’re working with.
Each organization is hierarchical, so many times permissions are implied and inherited based on a role (e.g. facility administrators need to be able to perform all of the same job functions as the caregivers that they manage).
These organizations are dynamic so role definitions and assignments need to be flexible and mutable as the organizations evolve.
The authorization system needs to be completely auditable from “the beginning of time”, being able to identify exactly who had access to what information and when.

Beyond the properties for this system that are derived from our problem domain, there are a few more assertions needed to actually arrive at a reasonable solution. They are:

All requests to our API operate on at least one resource, but many times more than one. Because of this, all endpoints are authenticated.
Scopes alone are not enough to define whether or not a given user has permission to perform an action. Scopes in our platform are global concepts that reference resource types, not instances (e.g. resident:read-phi). In this example, if we relied on scopes alone, we'd be allowing all users who had the resident:read-phi scope to read all PHI on all residents. We instead need to ensure that a user is granted scopes only on specific resources. We explored alternative methods of modeling scopes but ultimately ruled them out.
It would be expensive and hard to embed all of a user’s specific scopes in a JWT that we pass to our API in each request.
The authorization system should be decoupled from our authentication system, and our authentication system should be used only to provide a deterministic identity for a user that can be used to look up and manage permissions.

Our design

For several reasons, we decided to build a REST service that provides a small API for checking and modifying permissions.

Model and graph

The service we designed is very simple. It’s currently 3500 lines of Java code, including tests. It includes just four core concepts and a graph to group them all together. Three of these concepts you’re already partially familiar with: resource, scope, and actor.

Resources are the things that we’re authorizing access to. Sage’s resource hierarchy looks something like this:

[Figure: diagram of Sage’s resource hierarchy]

Resources have a unique identifier, and they also have a reference to whatever resource is their parent in the model hierarchy. For example, if I have a resident resource RESIDENT-A and a facility resource FACILITY-A, I could model the resident resource as:

{
  "resourceIdentifier": "RESIDENT-A",
  "parentResourceIdentifier": "FACILITY-A"
}

Scopes are the permissions that we’re granting on specific resources and actors are just resources that are attempting to act on other resources.

The fourth core concept in the design is a resource operation. Resource operations are the glue that provides the association of an actor, a set of scopes, and a resource. Resource operations are always associated with exactly one actor and one target resource, and define the set of operations that the actor may perform on the target resource. For a staff member STAFF-MEMBER-A and a resident RESIDENT-A, an example resource operation would be something like:

{
  "actorResourceIdentifier": "STAFF-MEMBER-A",
  "targetResourceIdentifier": "RESIDENT-A",
  "scopes": [
    "resident:read-phi",
    "resident:write-phi"
  ]
}

In this example, the staff member has scopes resident:read-phi and resident:write-phi, meaning that the staff member is authorized to both read and write this resident's PHI.
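The two JSON shapes above map naturally onto small value types. The following is a minimal Java sketch under that assumption; the record names and the `allows` helper are hypothetical, and only the field names are taken from the JSON examples.

```java
import java.util.Set;

/** Hypothetical sketch; only the field names come from the JSON examples. */
final class ModelSketch {

    /** A node in the hierarchy. The parent is null for a root resource. */
    record Resource(String resourceIdentifier, String parentResourceIdentifier) {}

    /** Associates exactly one actor, one target resource, and a set of scopes. */
    record ResourceOperation(String actorResourceIdentifier,
                             String targetResourceIdentifier,
                             Set<String> scopes) {

        ResourceOperation {
            scopes = Set.copyOf(scopes); // defensive, immutable copy
        }

        /** True if this operation grants the given scope on its target. */
        boolean allows(String scope) {
            return scopes.contains(scope);
        }
    }

    public static void main(String[] args) {
        Resource resident = new Resource("RESIDENT-A", "FACILITY-A");
        ResourceOperation op = new ResourceOperation(
                "STAFF-MEMBER-A",
                resident.resourceIdentifier(),
                Set.of("resident:read-phi", "resident:write-phi"));

        System.out.println(op.allows("resident:read-phi")); // true
        System.out.println(op.allows("facility:write"));    // false
    }
}
```

The scope facility:write in the usage example is made up purely to show a denied check.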

The resource graph comes into play when dealing with implied permissions for users, and I’ll use an example to illustrate the point. Let’s say that I have a facility administrator STAFF-MEMBER-B that is responsible for managing all of facility FACILITY-D. We could grant individual permissions on every resource that the administrator should have access to, but how do you keep track of it all, and how do you manage updates to permissions when they move or leave? It gets really messy. Instead, what if we just grant them all of the permissions that they need on the facility resource FACILITY-D? This authorizes them to perform actions allowed by their granted scopes on all resources within the hierarchy under their facility. So given the resident resource RESIDENT-B with parent FACILITY-D:

{
  "resourceIdentifier": "RESIDENT-B", // resident id
  "parentResourceIdentifier": "FACILITY-D" // facility id
}

and the resource operation:

{
  "actorResourceIdentifier": "STAFF-MEMBER-B", // admin id
  "targetResourceIdentifier": "FACILITY-D", // facility id
  "scopes": [
    "resident:read-phi",
    "resident:write-phi"
  ]
}

we could authorize a request to read RESIDENT-B's PHI for STAFF-MEMBER-B by:

Looking up the resource operation for RESIDENT-B and checking to see if STAFF-MEMBER-B has the required scope resident:read-phi
When that lookup fails, looking up the resource operation for RESIDENT-B's parent, FACILITY-D, and checking to see if STAFF-MEMBER-B has the required scope, which would succeed
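The lookup steps above can be sketched as a walk from the target resource up through its parents. This is a minimal in-memory Java sketch, not the service's actual implementation: the class, the map-based storage, and the method names are all assumptions made for illustration, and it assumes the hierarchy contains no cycles.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory sketch of the parent-walking authorization check. */
final class GraphAuthSketch {

    /** Lookup key for a resource operation: one actor and one target. */
    private record Key(String actorId, String targetId) {}

    // Child identifier -> parent identifier. A root resource has no entry.
    private final Map<String, String> parentOf = new HashMap<>();
    // (actor, target) -> scopes granted to that actor on that target.
    private final Map<Key, Set<String>> grants = new HashMap<>();

    void addResource(String resourceId, String parentId) {
        if (parentId != null) {
            parentOf.put(resourceId, parentId);
        }
    }

    void grant(String actorId, String targetId, Set<String> scopes) {
        grants.put(new Key(actorId, targetId), Set.copyOf(scopes));
    }

    /**
     * Checks the target first, then each ancestor in turn.
     * Assumes the hierarchy is acyclic, so the walk always terminates.
     */
    boolean isAuthorized(String actorId, String targetId, String scope) {
        String current = targetId;
        while (current != null) {
            Set<String> scopes = grants.getOrDefault(new Key(actorId, current), Set.of());
            if (scopes.contains(scope)) {
                return true;
            }
            current = parentOf.get(current); // null once we pass the root
        }
        return false;
    }

    public static void main(String[] args) {
        GraphAuthSketch graph = new GraphAuthSketch();
        graph.addResource("FACILITY-D", null);
        graph.addResource("RESIDENT-B", "FACILITY-D");
        graph.grant("STAFF-MEMBER-B", "FACILITY-D",
                Set.of("resident:read-phi", "resident:write-phi"));

        // No grant on RESIDENT-B itself, but the grant on its parent applies.
        System.out.println(graph.isAuthorized("STAFF-MEMBER-B", "RESIDENT-B", "resident:read-phi")); // true
        System.out.println(graph.isAuthorized("STAFF-MEMBER-A", "RESIDENT-B", "resident:read-phi")); // false
    }
}
```

In this sketch each loop iteration corresponds to one lookup, so a check never needs more lookups than the number of levels between the target and the root.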

This traversal pattern will work for any arbitrary hierarchical model. There are many properties that are attractive about using a graph to model authorization. First, we can model all resources across all of our customers in the same graph. We parent each enterprise resource on a global root resource tying everything together. This has the property of allowing us to manage global administrator access on our stack by simply adding or removing scopes from the resource operation for a user on the root resource.

The graph also allows us to consolidate granted permissions quite nicely, and to reflect permissions based on how users think about the world, rather than having a complicated opaque layer that they can't reason about. "Facility admins have these permissions on the facility" is natural to explain.

Performance

The graph has some attractive performance characteristics. Creating a resource in the hierarchy only requires a single write, as everyone with implied permissions will automatically be authorized. Granting permissions on large swaths of the resource hierarchy can also be achieved with a single write to the correct resource in the graph (e.g. granting scopes on a facility for a facility administrator). The number of reads needed to determine whether a user is authorized to perform an action is at most the total depth of the graph, and in our case, that depth is five. Typically, the number of reads will be less than the max depth. The most expensive operation we have to contend with is listing or revoking all permissions for a user, which can be done with a single call to our service, but requires reading all records for that user. We can optimize this operation by adding an index to our PostgreSQL table on the actor resource identifier. We also expect operations that list or revoke all permissions to be relatively infrequent. So far, without any stack optimizations such as caching, calls to the authorization API are adding less than 100ms of latency to our end-to-end request times on the common read and write paths (check authorization, grant permissions).
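To illustrate why an index on the actor identifier helps the "list all permissions for a user" path, here is a small in-memory Java analogy. It is only a sketch: a `HashMap` keyed by actor stands in for a database index, and the class and method names are hypothetical rather than part of the service.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory analogy for a secondary index on the actor identifier. */
final class ActorIndexSketch {

    record ResourceOperation(String actorId, String targetId, Set<String> scopes) {}

    // The "table": every resource operation, in insertion order.
    private final List<ResourceOperation> table = new ArrayList<>();
    // The "index": actor identifier -> that actor's resource operations.
    private final Map<String, List<ResourceOperation>> byActor = new HashMap<>();

    void insert(ResourceOperation op) {
        table.add(op);
        byActor.computeIfAbsent(op.actorId(), k -> new ArrayList<>()).add(op);
    }

    /** Without the index: scans every row in the table. */
    List<ResourceOperation> listForActorByScan(String actorId) {
        return table.stream().filter(op -> op.actorId().equals(actorId)).toList();
    }

    /** With the index: touches only the rows that belong to the actor. */
    List<ResourceOperation> listForActor(String actorId) {
        return List.copyOf(byActor.getOrDefault(actorId, List.of()));
    }

    public static void main(String[] args) {
        ActorIndexSketch store = new ActorIndexSketch();
        store.insert(new ResourceOperation("STAFF-MEMBER-A", "RESIDENT-A", Set.of("resident:read-phi")));
        store.insert(new ResourceOperation("STAFF-MEMBER-B", "FACILITY-D", Set.of("resident:read-phi")));
        store.insert(new ResourceOperation("STAFF-MEMBER-B", "RESIDENT-A", Set.of("resident:write-phi")));

        System.out.println(store.listForActor("STAFF-MEMBER-B").size()); // 2
    }
}
```

Both methods return the same rows. The difference is how much data has to be examined to find them.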

Architecture

To implement the service, we used the same tech stack that we're using for our other services: Java, Dropwizard, Hibernate, PostgreSQL, Docker. The service is built to be stateless so that we can have a high-availability (HA) deployment from the start. Statelessness also enables horizontal scaling under higher future load without requiring changes to the service design.

The only non-standard decision we made is that we designed the data store to be append-only. All mutations of the resource graph happen as appends to the existing data, with no previous state ever being lost. This design solves a couple of major problems that we were faced with. First, it allows us to audit permissions over time. The graph is mutated but all past state is still present, so we're able to go back to arbitrary points in time and see who had access to what. This design choice also allows us to rewind history if we'd ever need to revert a damaging set of changes that were made to the graph.
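One way to picture an append-only permission store is as a log in which every change is a new entry and a point-in-time query replays the entries up to that moment. The Java sketch below works under that assumption. It is not the actual schema: the entry shape, the convention that each entry carries the full scope set for an actor and target (with an empty set meaning revoked), and all names are hypothetical.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of an append-only permission log with point-in-time reads. */
final class AppendOnlySketch {

    /** One immutable log entry: the full scope set for (actor, target) as of recordedAt. */
    record Entry(Instant recordedAt, String actorId, String targetId, Set<String> scopes) {}

    // Entries are only ever added, never updated or removed.
    // Assumption: entries are appended in chronological order.
    private final List<Entry> log = new ArrayList<>();

    void append(Instant recordedAt, String actorId, String targetId, Set<String> scopes) {
        log.add(new Entry(recordedAt, actorId, targetId, Set.copyOf(scopes)));
    }

    /** Returns the scopes the actor held on the target at the given instant. */
    Set<String> scopesAsOf(Instant pointInTime, String actorId, String targetId) {
        Set<String> result = Set.of();
        for (Entry entry : log) {
            if (entry.recordedAt().isAfter(pointInTime)) {
                continue; // recorded later than the moment being audited
            }
            if (entry.actorId().equals(actorId) && entry.targetId().equals(targetId)) {
                result = entry.scopes(); // the latest matching entry wins
            }
        }
        return result;
    }

    int size() {
        return log.size();
    }

    public static void main(String[] args) {
        AppendOnlySketch store = new AppendOnlySketch();
        Instant july = Instant.parse("2021-07-01T00:00:00Z");
        Instant august = Instant.parse("2021-08-01T00:00:00Z");

        store.append(july, "STAFF-MEMBER-A", "RESIDENT-A",
                Set.of("resident:read-phi", "resident:write-phi"));
        // Narrowing access is another append, so the July state is preserved.
        store.append(august, "STAFF-MEMBER-A", "RESIDENT-A", Set.of("resident:read-phi"));

        System.out.println(store.scopesAsOf(july, "STAFF-MEMBER-A", "RESIDENT-A").size());   // 2
        System.out.println(store.scopesAsOf(august, "STAFF-MEMBER-A", "RESIDENT-A").size()); // 1
    }
}
```

Because nothing is overwritten, reverting a bad change in this model is itself an append that restores an earlier scope set.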

The service is deployed in AWS ECS with a replication factor of three (one instance per availability zone), and is exposed to other services through a load balancer. We could easily lift the deployment of this service from ECS and instead deploy it in EKS (or our own Kubernetes cluster) without much rework.

Future follow-ups and potential limitations

We haven’t tapped into optimizations at all for this service yet, and have a pretty high ceiling for things we can do to improve performance. We haven’t invested at all in indexing our tables, or in adding caching in the application or as a service. When warranted, we’ll be able to tap into these methods to get an immediate boost in performance.

Based on our back-of-the-envelope calculations, we also have a pretty significant runway before we start reaching the limitations of PostgreSQL. We expect each facility to generate on the order of 1,000 resources and resource operations per month. Given this rate, we should have years of stability before we need to worry about doing anything more complex with our storage infrastructure. If and when we hit limitations of PostgreSQL, there are plenty of steps we can take to move forward. We could always move toward a store like DynamoDB, or something like CockroachDB. We could also pursue a new data layout and shard the tables based on some method of partitioning. We thankfully have a while before we’re going to need to pursue any of these options.
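The back-of-the-envelope estimate can be written down as simple arithmetic. In the Java sketch below, only the figure of roughly 1,000 new rows per facility per month comes from the text above. The facility count and time horizon are hypothetical inputs chosen purely to show the calculation, not real numbers.

```java
/** Hypothetical sketch of the growth estimate; inputs in main are made up. */
final class CapacitySketch {

    /** Total rows after the given number of months, assuming a constant rate. */
    static long projectedRows(long facilities, long rowsPerFacilityPerMonth, int months) {
        long perMonth = Math.multiplyExact(facilities, rowsPerFacilityPerMonth);
        return Math.multiplyExact(perMonth, (long) months);
    }

    public static void main(String[] args) {
        long facilities = 500;               // hypothetical, not a real figure
        long rowsPerFacilityPerMonth = 1000; // order of magnitude from the text
        int months = 60;                     // hypothetical five-year horizon

        System.out.println(projectedRows(facilities, rowsPerFacilityPerMonth, months)); // 30000000
    }
}
```

Changing the inputs shows how the total scales linearly with both the number of facilities and the time horizon.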

Conclusion

While I’m typically a proponent of leveraging off-the-shelf solutions whenever possible, it actually made sense for us to implement our own authorization system. So far the software has been stable, easy to maintain, and is working well for our use cases. This decision isn’t right for everyone, however, and architectural choices like this one deserve careful consideration. Hopefully this post was helpful in understanding a set of challenges that we’re facing at Sage, and perhaps it will make it easier for someone else to think through auth architecture in the future.