Designing a multi-tenant SaaS

This article is based on my talk Designing a multi-tenant SaaS given at Cloud Nord on October 12, 2023.

Kestra is a highly scalable data scheduling and orchestration platform that creates, executes, schedules and monitors millions of complex pipelines. For an introduction to Kestra, you can read my article on the subject.

One of the recent Kestra evolutions I was responsible for was multitenancy support. This article describes the design that went into adding this functionality.

Multitenancy and its different models

Multitenancy can be defined as a software architecture principle that enables software to serve multiple client organizations (tenants) from a single installation.

Multitenancy simulates several logical instances within a single physical instance. The aim is to control hosting and operating costs.

There are several multi-tenant models.

One instance per tenant

  • One application instance is started for each tenant; multitenancy is managed outside the application.
  • Pros: simplicity.
  • Cons: operating cost; you have to start one application instance per tenant.

One database per tenant

  • A database is started for each tenant; the only logic to implement is database selection.
  • Pros: simplicity, implementation cost.
  • Cons: operating cost; you have to start one database per tenant.

One schema per tenant

  • A database schema is created for each tenant; the only logic to implement is schema selection.
  • Pros: simplicity, implementation cost, operating cost.
  • Cons: requires a database that supports schemas; the single database is a SPOF (single point of failure).

Tenant within tables/messages

  • A tenantId field is added to each row of each table (see the sketch after this list).
  • Pros: flexibility, operating cost.
  • Cons: complexity, implementation cost.
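
To make this model concrete, here is a minimal sketch of what tenant-scoped data access can look like. The flows table, its columns and the FlowDao class are hypothetical illustrations, not Kestra's actual schema or code.

```java
// Minimal sketch of the "tenant within tables/messages" model (illustrative only):
// every row carries a tenant_id, and every query must filter on it.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class FlowDao {
    private final Connection connection;

    public FlowDao(Connection connection) {
        this.connection = connection;
    }

    // Hypothetical query: the 'flows' table and its columns are illustrative,
    // not Kestra's actual schema.
    public List<String> findFlowIds(String tenantId) throws Exception {
        String sql = "SELECT id FROM flows WHERE tenant_id = ?";
        try (PreparedStatement stmt = connection.prepareStatement(sql)) {
            stmt.setString(1, tenantId);
            try (ResultSet rs = stmt.executeQuery()) {
                List<String> ids = new ArrayList<>();
                while (rs.next()) {
                    ids.add(rs.getString("id"));
                }
                return ids;
            }
        }
    }
}
```

The key point is that the tenant identifier must be threaded through every single query; forgetting it in even one place silently leaks data across tenants.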

Kestra’s architecture

First of all, let’s explain Kestra’s architecture.

Kestra is separated into several components that communicate with each other by asynchronous messages via a queue. Flow metadata is stored in a repository.

The various components are:

  • The Executor: contains the orchestration logic.
  • The Scheduler: handles the various trigger events of a flow.
  • The Worker: executes the tasks of a flow.
  • The Indexer: optional component that indexes the queue into the database.
  • The Webserver: serves Kestra’s GUI and API.

The Worker is the only component that accesses the external systems needed to execute a flow (remote databases, web services, cloud services, …) as well as Kestra’s internal data storage.

Kestra offers two deployment modes: standalone with all components in a single process, or microservice with one component per process.

Kestra offers two runners:

  • The JDBC runner: Queue and Repository are implemented via a database (H2, PostgreSQL, MySQL).
  • The Kafka runner: Kafka is used as the Queue and Elasticsearch as the Repository. This runner is only available in the enterprise edition.

Multitenancy at Kestra

The SaaS project

Kestra is working on a version available as a SaaS: Kestra Cloud.

To date, the constraints of the SaaS are as follows:

  • A “big” high-availability cluster with one Kafka runner per cloud provider/region.
  • Each tenant has its own resources (namespace, flow, execution).
  • Data isolation.
  • A user is global to all tenants.

The notion of namespace

A flow is in a namespace; namespaces are hierarchical, like a filesystem directory.

Namespaces enable specific configuration (task, secret, …) as well as the definition of role-based access in the enterprise version.
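
As an illustration, a hierarchical namespace can be thought of as a dotted path where parenthood is a prefix relationship. The helper below is a hypothetical sketch of that idea, not Kestra's actual API.

```java
// Hypothetical sketch: namespaces as dotted paths, where a parent namespace
// is a prefix of its children (not Kestra's actual API).
public final class Namespaces {

    // "company.team.project" is a descendant of "company.team" and of "company".
    public static boolean isDescendantOf(String namespace, String base) {
        return namespace.equals(base) || namespace.startsWith(base + ".");
    }

    public static void main(String[] args) {
        System.out.println(isDescendantOf("company.team.project", "company.team"));  // true
        System.out.println(isDescendantOf("company.team.project", "company.sales")); // false
    }
}
```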

One of the questions we asked ourselves was whether the notion of namespace could be of use to us in implementing Kestra multitenancy.

The evaluated models

Three different multitenancy models were evaluated.

  1. Tenant per namespace: each namespace is a tenant. This solution is close to the One schema per tenant solution, but using namespace, which is a property specific to Kestra.
  2. Tenant per base namespace: each base namespace is a tenant. A tenant can have several namespaces, the children of the base namespace. This is a variation on the previous model.
  3. Tenant via a tenantId property.

As one of Kestra’s runners uses Kafka and Elasticsearch, which do not support the notion of schema, only a variant of the Tenant within tables/messages model was possible. All three solutions therefore add the tenant either as a new property or via an existing property (namespace) in order to limit the changes required.

The choice

Tenant via a new tenantId property.

Using the namespace would have been convenient, as we already had namespace-based flow isolation as well as role-based access management. But it would have greatly reduced the functionality available to a user of our Cloud, as a namespace or base namespace could not have been used by different users. So the only model that met all our needs without limiting future functionality was the addition of a new tenantId property.

Implementation

  1. We add a property tenantId to all model objects.
  2. We filter on the tenantId column in every database query.
  3. We resolve the tenantId in the API layer (a minimal sketch follows below).
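
In code, the three steps boil down to something like the following sketch; Flow, FlowRepository, FlowController and AuthenticatedUser are hypothetical names used for illustration, not Kestra's actual classes.

```java
import java.util.List;

// 1. Every model object carries the tenant identifier (hypothetical model).
record Flow(String tenantId, String namespace, String id) {}

// 2. Every database query filters on tenantId (hypothetical repository).
interface FlowRepository {
    List<Flow> findByNamespace(String tenantId, String namespace);
}

// 3. The tenantId is resolved once in the API layer (for example from the
//    authenticated user) and passed down explicitly (hypothetical controller).
class FlowController {
    private final FlowRepository repository;

    FlowController(FlowRepository repository) {
        this.repository = repository;
    }

    List<Flow> listFlows(AuthenticatedUser user, String namespace) {
        String tenantId = user.currentTenantId();
        return repository.findByNamespace(tenantId, namespace);
    }
}

// Hypothetical representation of the caller's identity.
record AuthenticatedUser(String currentTenantId) {}
```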

Sounds simple, doesn’t it? 🤷

But a plan never goes off without hiccups!

Adding a property to a large number of classes, filters, queries, … brings a high risk of oversight, and therefore of bugs. And that's exactly what happened. Despite great care during implementation, there were a few places where the tenant was not passed (mainly due to the use of Lombok builders … I won't dwell on it). Similarly, some parts of Kestra require knowledge of all data (for example, all flows), so we had to ensure that the list of flows is filtered by tenant when it is requested from the API, but not when it is read by the Scheduler.

To facilitate the migration of our existing users, we allowed the use of a default tenant, which is the tenant whose identifier is null. This was the cause of many other bugs…
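
To illustrate why this null default tenant was error-prone, here is a hedged sketch of the kind of resolution logic involved; TenantResolver is a hypothetical name, and Kestra's real logic differs.

```java
// Hypothetical sketch of default-tenant resolution for migrated installations;
// Kestra's real logic differs, this only illustrates why null handling is error-prone.
class TenantResolver {
    static final String DEFAULT_TENANT = null; // pre-multitenancy data has no tenant id

    static String resolve(String requestedTenantId) {
        // A missing tenant falls back to the default (null) tenant, so every
        // query, cache key and comparison downstream has to be null-safe.
        return (requestedTenantId == null || requestedTenantId.isBlank())
                ? DEFAULT_TENANT
                : requestedTenantId;
    }
}
```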

In conclusion, multitenancy is essential when setting up a SaaS, and once you've carefully chosen your implementation model, you can expect a long, laborious and bug-prone implementation. To mitigate the risk of bugs, we chose to merge the multi-tenant PR at the start of the release cycle, which enabled us to test it for a month on our own test environments before delivering it to our users, uncovering many of the bugs that had been introduced along the way. I strongly recommend that you plan a substantial testing period, as we did.

The final word: implementing a multi-tenant architecture isn't easy; ideally, it should be implemented as early as possible in a code base.
