
Autonomous services

To go fast and not break things, we need resilient services. We need services that continue to function when related upstream and downstream services cannot. We want teams to be confident that an honest human error on their part will not incapacitate the system as a whole. At the same time, we need to keep the overall complexity of the system in check. Simple is always better. We want to compose our systems from simple and repeatable building blocks that we snap together. We want the system to evolve naturally by simply adding and removing uncoupled services.

Each autonomous service will fall under one of the three higher-order autonomous service patterns:

  1. Backend for Frontend (BFF)
  2. External Service Gateway (ESG)
  3. Control Service

We can create highly complex systems composed of many of these easily understandable services, without making the system itself overly complex.

Developers can easily work on one autonomous service without needing to pull in others. They can easily reason about the internals of the service they are changing.

At the heart of each autonomous subsystem is an event hub. The event hub eliminates the mayhem that would ensue if we were to create a patchwork of topics and queues to connect the various services. Instead, the event hub mediates between upstream producers and downstream consumers. The general flow of events through the event hub proceeds as follows:

  1. Upstream services publish domain events to the hub through a bus
  2. The bus routes events to one or more channels, such as a stream
  3. Downstream services consume events from the hub through a specific channel
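The three steps above can be sketched with a small in-memory hub. This is only an illustration under stated assumptions: the EventHub class, the rule shape, and the channel and event names are hypothetical, and a real hub would use a managed bus and streams rather than arrays.

```typescript
// Hypothetical in-memory event hub: a bus with routing rules in front of named channels.
type DomainEvent = { id: string; type: string; timestamp: number };
type Rule = { channel: string; matches: (event: DomainEvent) => boolean };

class EventHub {
  private readonly rules: Rule[] = [];
  private readonly channels: Record<string, DomainEvent[]> = {};

  // Routing rules are added to the bus; producers never see them.
  addRule(rule: Rule): void {
    this.rules.push(rule);
    this.channels[rule.channel] = this.channels[rule.channel] || [];
  }

  // Step 1: an upstream service publishes a domain event to the bus.
  publish(event: DomainEvent): void {
    // Step 2: the bus routes the event to every channel whose rule matches.
    for (const rule of this.rules) {
      if (rule.matches(event)) this.channels[rule.channel].push(event);
    }
  }

  // Step 3: a downstream service consumes from one specific channel.
  read(channel: string): DomainEvent[] {
    return (this.channels[channel] || []).slice();
  }
}

const hub = new EventHub();
hub.addRule({ channel: "stream-1", matches: () => true });
hub.addRule({ channel: "orders", matches: (e) => e.type.indexOf("order-") === 0 });

hub.publish({ id: "1", type: "order-placed", timestamp: 1 });
hub.publish({ id: "2", type: "menu-updated", timestamp: 2 });

const allEvents = hub.read("stream-1"); // receives both events
const orderEvents = hub.read("orders"); // receives only the order-placed event
```

Note that the producer calls only publish; adding or removing a channel changes the rules, not the producer.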

Dividing a system into autonomous subsystems

The goal of software architecture is to define boundaries that enable the components of the system to change independently.

Systems need to be divided into a manageable set of high-level subsystems that each have a single reason to change. These subsystems will constitute the major bounded contexts of the system. We will apply the Single-Responsibility Principle (SRP) along different dimensions to help us arrive at boundaries that enable change. This allows us to facilitate organizational scale with separate groups managing the individual subsystems.

The subsystems have to be autonomous, which gives autonomous organizations the confidence to continuously innovate within their subsystems. This is achieved by creating bulkheads between the subsystems (see External Service Gateway).

Logical places to look for architectural boundaries:

  1. By actor: A logical place to start carving up a system into subsystems is along the external boundaries with the external actors. These are users and the external systems that directly interact with the system. Following the SRP, each subsystem might be responsible to one and only one actor.
    • In the example of food delivery, we may have a separate subsystem for each category of the user: Customer, Driver, and Restaurant. We may also want a subsystem for each category of the external system, such as relaying orders to the restaurant's ordering systems, processing payments, and pushing notifications to customers.
    • The grouping for enterprise systems may be more complex. We will need to find a good way to organize the actors into cohesive groups, and these groups may align with the business units.
  2. By business unit: Another good place to look for architectural boundaries is between business units, and an org chart can provide useful insights. If the org structure of a company is unstable, we may have to look deeper into the work the business units actually perform.
  3. By business capability: Ultimately, we want to draw our architectural boundaries around the actual business capabilities that the company provides. Each autonomous subsystem should encapsulate a single business capability or at most a set of highly cohesive capabilities.
  4. By data life cycle: Over the course of the life of a piece of data, the actors that use and interact with the data will change and so will their requirements. Bringing the data life cycle into the equation will help uncover some overlooked subsystems. We will usually find these subsystems near the beginning and the end of the data life cycle.

Creating subsystem bulkheads

Our aim is to fortify all the architectural boundaries in a system so that autonomous teams can forge ahead with experiments, confident that the blast radius is contained when they make mistakes. At the subsystem level, we are essentially enabling autonomous organizations to manage their autonomous subsystems independently. We fortify our autonomous subsystems with the following bulkheads:

  1. Separate cloud accounts: This prevents overloading cloud accounts with too many unrelated workloads, which puts all workloads at risk. At a bare minimum, development and production environments must be in separate accounts. But we can do better by having separate accounts per subsystem, per environment. This should be tailored to the size of the team and scaled as the company grows. Cloud account separation helps control the tech debt that naturally accumulates as the number of resources within an account grows, improves security posture by limiting the attack surface of each account, reduces competition for limited resources, makes cost allocation simple and error resistant, and makes observability and governance more accurate and informative because monitoring tools tag all metrics by account.
  2. External domain events: Within a subsystem, services communicate via internal domain events. Across subsystem boundaries, changes to these contracts require more regulated communication and coordination between teams. Communication increases lead time, which is the opposite of what we want. We want to limit the impact that this has on internal lead time, so that we are free to innovate within our autonomous subsystems. We essentially want to hide the internal information and not air our dirty laundry in public. Instead, we will perform all inter-subsystem communication via external domain events. These external events will have much stronger contracts with stronger backward compatibility requirements. We intend for these contracts to change slowly to help create a bulkhead between subsystems. Domain-Driven Design refers to this technique as context mapping, such as when we use domain aggregates with the same terms in multiple bounded contexts, but with different meanings.

See the External Service Gateway (ESG) pattern in the upcoming sections. Each subsystem will treat related subsystems as external systems. We will bridge the internal event hubs of related subsystems to create an event-first topology. Each subsystem will define an egress gateway that defines what events it is willing to share and hides everything else. Each subsystem will also define ingress gateways that act as an anti-corruption layer, consuming upstream external domain events and transforming them into its internal formats.

Autonomous service patterns

There are three high-level autonomous service patterns that all our services will fall under. At the boundaries of our autonomous subsystems are the Backend For Frontend (BFF) and External Service Gateway (ESG) patterns. Between the boundary patterns lies the Control service pattern. Each of these patterns is responsible to a different kind of actor, and hence supports different types of changes.

Example reference:

                Restaurant Subsystem
                         /\
Payment Processor <-> Customer <-> Order Subsystem
                         \/
                 Delivery Subsystem

The Customer subsystem of our example Food Delivery system might have the following actors:

  • The Customer will be the user of this subsystem and the dominant actor.
  • The Restaurant Subsystem will publish external domain events regarding the restaurants and their menus.
  • A Payment Processor must authorize the customer's payment method.
  • The subsystem will exchange OrderPlaced and OrderReceived external domain events with the Order Subsystem.
  • The Delivery Subsystem will publish external domain events regarding the status of the order.

Backend For Frontend (BFF)

This pattern works at the boundary of the system to support end users. Each BFF service supports a specific frontend micro-app, which supports a specific actor.

In the Customer subsystem of the Food Delivery System example, we might have BFFs to browse restaurants and view their menus, sign up and maintain account preferences, place orders, view delivery status, and view order history. These BFFs typically account for about 40% of the services in the subsystem.

A listener function consumes domain events from the event hub and caches entities in materialized views that support queries. The synchronous API provides command and query operations that support the specific user interface. A trigger function reacts to the mutations caused by commands and produces domain events to the event hub.
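The three parts of a BFF can be sketched as follows. This is a minimal sketch, assuming in-memory objects in place of a database and the event hub; the entity shapes, the event types restaurant-updated and order-placed, and all function names are hypothetical.

```typescript
// Hypothetical BFF for placing orders. Plain objects stand in for the
// database tables and for the event hub.
type DomainEvent = { id: string; type: string; timestamp: number; [entity: string]: any };
type Restaurant = { id: string; name: string };
type Order = { id: string; restaurantId: string; items: string[] };

const restaurantView: Record<string, Restaurant> = {}; // materialized view that supports queries
const orderTable: Record<string, Order> = {};          // data owned by this BFF
const published: DomainEvent[] = [];                   // stand-in for the event hub

// Listener: consumes domain events from the hub and caches entities.
function listener(event: DomainEvent): void {
  if (event.type === "restaurant-updated") {
    restaurantView[event.restaurant.id] = event.restaurant;
  }
}

// Synchronous API, query side: served entirely from the local view.
function getRestaurant(id: string): Restaurant | undefined {
  return restaurantView[id];
}

// Synchronous API, command side: mutates the BFF's own data.
function placeOrder(order: Order): void {
  orderTable[order.id] = order;
  trigger(order); // assumption: in a real service a change stream would invoke the trigger
}

// Trigger: reacts to the mutation and produces a domain event to the hub.
function trigger(order: Order): void {
  published.push({ id: `evt-${order.id}`, type: "order-placed", timestamp: Date.now(), order });
}

listener({ id: "e1", type: "restaurant-updated", timestamp: 1, restaurant: { id: "r1", name: "Pizza Place" } });
placeOrder({ id: "o1", restaurantId: "r1", items: ["margherita"] });
```

Because queries are answered from the cached view, the BFF keeps working when the upstream Restaurant subsystem is unavailable.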

External Service Gateway (ESG)

The External Service Gateway (ESG) pattern works at the boundary of the system to provide an anti-corruption layer that encapsulates the details of interacting with other systems, such as third-party, legacy, and sister subsystems. ESG services act as a bridge to exchange events between the systems.

In the Customer subsystem of our example Food Delivery System, we might have ESGs to receive menus from the Restaurant subsystem, forward orders to the Order subsystem, and receive the delivery status from the Delivery subsystem. The Order subsystem would have ESGs that integrate with the various order systems used by restaurants. The Delivery subsystem would have an ESG to integrate with a push notifications provider. These ESGs typically account for 50% or more of the services in a subsystem.

An egress function consumes internal events from the event hub and then transforms and forwards the events out to the other system. An ingress function reacts to external events in another system and then transforms and forwards those events to the event hub.
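As a rough sketch of the two directions, the functions below translate between an internal event and an external contract. OrderPlaced and OrderReceived come from the example above; the field names of the external contracts and the internal event types order-placed and order-received are assumptions made for illustration.

```typescript
type DomainEvent = { id: string; type: string; timestamp: number; [entity: string]: any };

// Hypothetical external contracts shared with the Order subsystem.
type ExternalOrderPlaced = {
  eventId: string; eventType: "OrderPlaced"; occurredAt: string; orderId: string; total: number;
};
type ExternalOrderReceived = {
  eventId: string; eventType: "OrderReceived"; occurredAt: string; orderId: string;
};

// Egress: consume an internal event, expose only the agreed fields, hide everything else.
function egress(event: DomainEvent): ExternalOrderPlaced | undefined {
  if (event.type !== "order-placed") return undefined; // not shared outside the subsystem
  return {
    eventId: event.id,
    eventType: "OrderPlaced",
    occurredAt: new Date(event.timestamp).toISOString(),
    orderId: event.order.id,
    total: event.order.total,
  };
}

// Ingress: anti-corruption layer that maps an external event to the internal format.
function ingress(external: ExternalOrderReceived): DomainEvent {
  return {
    id: external.eventId,
    type: "order-received",
    timestamp: Date.parse(external.occurredAt),
    order: { id: external.orderId },
  };
}

const shared = egress({
  id: "e1", type: "order-placed", timestamp: 0,
  order: { id: "o1", total: 25, internalNotes: "never leaves the subsystem" },
});
const hidden = egress({ id: "e2", type: "cart-updated", timestamp: 0 });
const received = ingress({
  eventId: "x1", eventType: "OrderReceived", occurredAt: "1970-01-01T00:00:01.000Z", orderId: "o1",
});
```

Keeping the mapping in one place means internal event definitions can change freely as long as the egress output stays backward compatible.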

Control

Control services help minimize coupling between services by mediating the collaboration between boundary services. These services are completely asynchronous. They consume events, perform logic, and produce new events to record the results and trigger downstream processing.

We use these services to perform complex event processing and implement business processes. They leverage the systemwide event sourcing pattern. In the Delivery subsystem of our food delivery example, we might have a control service that implements a state machine to orchestrate the delivery process under many different circumstances. Control services typically account for about 10% of the services in a subsystem.

A listener function consumes lower-order events from the event hub and correlates and collates them in a micro events store. A trigger function applies rules to the correlated events and publishes higher-order events back to the event hub.
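A control service can be sketched in the same style. The sketch below assumes a single rule for the Delivery subsystem: once both a driver-assigned and an order-prepared event have been correlated for the same order, a higher-order delivery-ready event is published. The event types and the rule are hypothetical, and an in-memory object stands in for the micro event store.

```typescript
type DomainEvent = { id: string; type: string; timestamp: number; partitionKey: string };

const microEventStore: Record<string, DomainEvent[]> = {}; // events correlated by partitionKey
const published: DomainEvent[] = [];                       // stand-in for the event hub

// Listener: correlates and collates lower-order events.
function listener(event: DomainEvent): void {
  const correlated = microEventStore[event.partitionKey] || [];
  correlated.push(event);
  microEventStore[event.partitionKey] = correlated;
  trigger(event.partitionKey); // assumption: a change stream would invoke this in a real service
}

// Trigger: applies rules to the correlated events and publishes a higher-order event.
function trigger(key: string): void {
  const types = microEventStore[key].map((e) => e.type);
  const ready = types.indexOf("driver-assigned") >= 0 && types.indexOf("order-prepared") >= 0;
  const alreadyPublished = published.some((e) => e.partitionKey === key && e.type === "delivery-ready");
  if (ready && !alreadyPublished) {
    published.push({ id: `${key}-delivery-ready`, type: "delivery-ready", timestamp: Date.now(), partitionKey: key });
  }
}

listener({ id: "1", type: "driver-assigned", timestamp: 1, partitionKey: "order-1" });
const afterFirstEvent = published.length;  // rule not yet satisfied
listener({ id: "2", type: "order-prepared", timestamp: 2, partitionKey: "order-1" });
const afterSecondEvent = published.length; // rule satisfied, higher-order event published
listener({ id: "3", type: "order-prepared", timestamp: 3, partitionKey: "order-1" });
```

The check against previously published events keeps the rule from firing twice when a lower-order event is redelivered.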

Events

Event bus

The event bus is the entry point into the event hub. It provides a level of indirection so that we do not couple producers to specific consumer channels. Producers only need to know about the bus. We will add rules to route events to specific consumer channels without impacting the producers.

The event bus also acts as the outbound bulkhead that protects upstream services from downstream failures and outages. An upstream service can publish an event to the bus and forget about it. It can trust that the bus will eventually deliver the event to all interested consumers, through the various channels (that is, streams).

Domain Events

We send the domain event from the producer service to all the consumer services via the event hub. To support this flow of information, we need a standard event envelope, we need to include the domain model state in each event, and we need to consider the substitution of producers and consumers, both internally and externally.

Event envelope

The event envelope defines a standard set of fields that all events must contain. This allows the bus, any channel, and all consumers to handle any event. The bus relies on these fields to perform content-based routing to specific channels.

Consumers utilize these fields to filter and dispatch events within their stream processing pipelines. The event lake uses these fields to organize events for storage and retrieval.

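interface Event<T> {
  id: string
  type: string
  timestamp: number
  partitionKey?: string
  tags: { [key: string]: string | number }
  // <entity>: the payload (of type T) lives in a field named after the
  // domain entity, for example order; an index signature stands in for it here
  [entity: string]: any
  raw?: any
  eem?: any
}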
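To show how a consumer might use the envelope fields to filter and dispatch, here is a small sketch of a stream processing pipeline keyed on the type field. The createPipeline helper, the Envelope alias, and the event types are hypothetical; only the field names come from the envelope described above.

```typescript
// Local copy of the envelope fields, so the sketch is self-contained.
type Envelope = {
  id: string;
  type: string;
  timestamp: number;
  partitionKey?: string;
  tags: { [key: string]: string | number };
  [entity: string]: any;
};

type Handler = (event: Envelope) => void;

// Builds a pipeline that dispatches each event to the handler registered for
// its type and silently filters out types the consumer does not care about.
function createPipeline(handlers: Record<string, Handler>) {
  return (events: Envelope[]): number => {
    let handled = 0;
    for (const event of events) {
      if (Object.prototype.hasOwnProperty.call(handlers, event.type)) {
        handlers[event.type](event);
        handled++;
      }
    }
    return handled;
  };
}

const seen: string[] = [];
const pipeline = createPipeline({
  "order-placed": (e) => seen.push(`placed:${e.order.id}`),
});

const handled = pipeline([
  { id: "1", type: "order-placed", timestamp: 1, tags: { region: "us-east-1" }, order: { id: "o1" } },
  { id: "2", type: "menu-updated", timestamp: 2, tags: { region: "us-east-1" }, menu: { id: "m1" } },
]);
```

The consumer never inspects the payload to decide whether an event is relevant; the standard fields are enough.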

To implement systemwide event sourcing in an upstream service, we can execute either of the following two variations on the pattern: stream-first or database-first.

Chaine's variation of this Event is to call it an Envelope, with the following structure:

interface Envelope<T> {
  /** Unique identifier for the event. */
  id: string
  /** Type of the event and its expected payload, such as thing-submitted, with a namespace prefix for external events. */
  type: string
  /**
   * Time the event was stored in the event store, in epoch milliseconds (since Jan 1, 1970).
   * Epoch time is well suited to time-series DBs, where indexing is based on time.
   */
  timestamp: number
  /** Generally the entity id or correlation id, so that related events can be processed together. */
  partitionKey?: string
  /** General information such as account, region, source, function name, and pipeline. */
  tags: { [key: string]: string | number }
  /** The actual event contract. */
  event: EventImplementation<T>
  /** Raw data produced by the source of the event, in its native format, so that no information is lost. */
  raw?: string
  /**
   * Data associated with a particular event.
   * @see {@link Metadata}
   */
  metadata?: Metadata
}

Event carrier state transfer

Our domain events are more than just notifications. They represent the fact that something happened and they contain the data that provides the context of the event. Downstream services use this data to make decisions and to create materialized views. This is referred to as event carrier state transfer.

The **<entity>** or event field in the event envelope represents the payload of a specific event type. It contains a snapshot of all the relevant data that is available in the service when the actor performs the action that produces the event.

For example, two BFF services may provide different actors with the ability to work on different aspects of the same domain entity. Both services have a lean, read-only copy of the domain entity's summary information, but each is responsible for maintaining different subsections of the domain model.

When the BFFs publish events, they include the summary information for context, along with the detailed information they own. However, neither needs to maintain a copy of the other's subdomain model or include it in its events. Instead, downstream services that need the full picture merge the subdomains in their own materialized views.
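A downstream merge of this kind might look like the sketch below. It assumes a hypothetical order entity, two hypothetical event types published by different BFFs, a shallow merge, and in-order delivery; real services would also have to deal with out-of-order and duplicate events.

```typescript
type DomainEvent = { id: string; type: string; timestamp: number; [entity: string]: any };

// Materialized view owned by the downstream service, keyed by entity id.
const view: Record<string, Record<string, unknown>> = {};

// Merges the snapshot carried by each event into the view. Summary fields are
// overwritten by the latest event; subsections owned by different producers accumulate.
function materialize(event: DomainEvent): void {
  const incoming = event.order;
  if (!incoming) return; // not an event about the entity this view cares about
  view[incoming.id] = { ...view[incoming.id], ...incoming };
}

// Published by a hypothetical ordering BFF: summary plus the items it owns.
materialize({
  id: "e1", type: "order-submitted", timestamp: 1,
  order: { id: "o1", summary: { status: "submitted" }, items: ["margherita"] },
});

// Published by a hypothetical delivery BFF: summary plus the address it owns.
materialize({
  id: "e2", type: "delivery-requested", timestamp: 2,
  order: { id: "o1", summary: { status: "in-delivery" }, deliveryAddress: "1 Main St" },
});

const merged = view["o1"]; // contains summary, items, and deliveryAddress
```

Neither producer had to know about the other's subsection for the downstream view to end up with both.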

Substitution

Substitution is when multiple sources, or producers, produce the same domain event. To support substitution, the **<entity>** field of a specific event type should conform to a standard canonical format for the specific domain entity. This format defines the contract between producers and consumers.

Internal versus external

Internal domain events define the contracts within a subsystem, and external domain events define the contracts across subsystems.

We give strong backward compatibility guarantees to downstream subsystems for the external domain events. This allows teams to experiment and change the definition of the internal domain events more freely.

Routing and channel topology

Routing is the purpose of the event hub and helps decouple producers from consumers. Producers send events to the bus and it is responsible for routing them to the various consumer channels.

Many consumers may share a single channel, but a consumer generally subscribes to a single channel. The hub typically owns shared channels and the associated routing rules, whereas a service will own a dedicated channel and add its own routing rules to the bus.

A subsystem may have many channels. There are a variety of things to consider when defining a topology of channels, such as isolation, message size, priority, volume, and the number of consumers.

Event Sourcing

The Event Sourcing pattern provides the foundation for eventual consistency and the subsequent flexibility that allows us to create reactive and evolutionary systems. It proposes turning events into facts, instead of just ephemeral (not lasting) messages. We accomplish this by retaining all events and storing them in perpetuity. The events serve as an audit trail of historical activity so that we do not lose information.
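The idea of events as retained facts can be sketched with an append-only store that is replayed to derive current state. The EventStore class, the idempotent append by id, and the order event types are assumptions for illustration, not a prescribed implementation.

```typescript
type DomainEvent = { id: string; type: string; timestamp: number; partitionKey: string };

// Append-only store: events are never updated or deleted.
class EventStore {
  private readonly events: DomainEvent[] = [];

  // Appending is idempotent on the event id, so redelivered events are harmless.
  append(event: DomainEvent): void {
    if (!this.events.some((e) => e.id === event.id)) this.events.push(event);
  }

  // Replays the facts for one entity in the order they happened.
  replay(partitionKey: string): DomainEvent[] {
    return this.events
      .filter((e) => e.partitionKey === partitionKey)
      .sort((a, b) => a.timestamp - b.timestamp);
  }
}

// Current state is derived from the history rather than stored separately.
function currentStatus(history: DomainEvent[]): string | undefined {
  return history.length ? history[history.length - 1].type : undefined;
}

const store = new EventStore();
store.append({ id: "2", type: "order-delivered", timestamp: 20, partitionKey: "o1" });
store.append({ id: "1", type: "order-placed", timestamp: 10, partitionKey: "o1" });
store.append({ id: "1", type: "order-placed", timestamp: 10, partitionKey: "o1" }); // duplicate delivery
store.append({ id: "3", type: "order-placed", timestamp: 15, partitionKey: "o2" });

const history = store.replay("o1"); // audit trail for order o1, oldest first
const latest = currentStatus(history);
```

Because the history is kept, a new consumer can be added later and still rebuild its own view from the same facts.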

Event streams

An event stream is a messaging channel that acts as a temporal event store for downstream consumers. We leverage streams for all inter-service communication channels. They act as the input queue for the various stages in our eventually consistent system.

Autonomous services listen for events on a stream so that they can react in near real time. From their perspective, a channel or stream is the source of events. In other words, streams provide temporal event sourcing to downstream services.

Stream-first event sourcing

Distributed transactions