Delivery guarantees

Mechanism

The Notificaties API standard (and by extension Open Notificaties) operates on a simple yet powerful message delivery mechanism: webhooks.

A webhook is, in essence, nothing more than an HTTP endpoint exposed by a server, to which HTTP requests (messages) can be sent. Upon receiving such a request, the webhook receiver is responsible for processing its content appropriately.

Webhooks are registered by parties interested in receiving notifications. The webhook registration is recorded and saved in Open Notificaties. Whenever (another) party publishes a notification, it does so by making an HTTP POST request to the Open Notificaties API. Open Notificaties, in turn, checks which parties should receive this notification and forwards the message to the registered webhooks.
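To make the receiving side concrete, here is a minimal sketch of a webhook receiver using only the Python standard library. It accepts a JSON POST and confirms receipt with HTTP 204. The names WebhookHandler, received and serve are illustrative assumptions, and authentication and real processing are left out. Returning 400 for a malformed body is a design choice of this sketch: Open Notificaties will treat it as a failed delivery and retry.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class WebhookHandler(BaseHTTPRequestHandler):
    """Illustrative webhook receiver: accepts a JSON POST and confirms with 204."""

    # In-memory log of received notifications (illustration only).
    received = []

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            notification = json.loads(self.rfile.read(length))
        except ValueError:
            # Malformed JSON or undecodable bytes. Any non-2xx status counts
            # as a failed delivery for the sender, so it will be retried.
            self.send_response(400)
            self.end_headers()
            return
        # Process the notification here; this sketch only stores it.
        self.received.append(notification)
        # 204 No Content confirms receipt, as the Notificaties API standard requires.
        self.send_response(204)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def serve(host="127.0.0.1", port=8000):
    """Hypothetical helper: run the receiver until interrupted."""
    HTTPServer((host, port), WebhookHandler).serve_forever()
```

Calling serve() starts the receiver on port 8000; its URL would then be registered as the callback of a subscription.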

Flow

After a notification (or CloudEvent) is received via the API, all subscriptions that need to receive it are fetched and a ScheduledNotification is created in the database for each subscription. A container/pod running Celery Beat (started with the ./bin/celery_beat.sh command) schedules a background task that runs every NOTIFICATION_SEC_INTERVAL seconds. This task picks up at most NOTIFICATION_LIMIT scheduled notifications (see Environment configuration reference > Celery) and creates tasks that send each notification to the callback_url of its subscription. Successful ScheduledNotifications are removed; failed ones get an updated execute_after timestamp so they are retried (according to exponential backoff) until they either succeed or reach the retry limit.
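The flow above can be sketched as an in-memory simulation of one run of the periodic task. This is not the Open Notificaties implementation: the field names, process_batch and retry_delay are hypothetical, the list stands in for the database table, and sending happens inline instead of in separate Celery tasks. The backoff uses the default settings described under "Retry mechanism".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List


@dataclass(eq=False)  # compare by identity, like distinct database rows
class ScheduledNotification:
    callback_url: str      # hypothetical field names
    payload: dict
    execute_after: datetime
    attempt: int = 0       # number of retries already performed


def retry_delay(attempt: int, backoff_factor: int = 25, base_factor: int = 4,
                backoff_max: int = 52000) -> int:
    """Seconds until the next retry: exponential backoff, capped at backoff_max."""
    return min(backoff_max, backoff_factor * base_factor ** attempt)


def process_batch(queue: List[ScheduledNotification],
                  send: Callable[[str, dict], bool],
                  now: datetime, limit: int, max_retries: int = 7) -> None:
    """One run of the periodic task: handle at most `limit` due notifications."""
    due = sorted(
        (n for n in queue if n.execute_after <= now),
        key=lambda n: (n.execute_after, n.attempt),
    )[:limit]
    for notification in due:
        if send(notification.callback_url, notification.payload):
            queue.remove(notification)          # delivered: remove it
        elif notification.attempt >= max_retries:
            queue.remove(notification)          # retry limit reached: give up
        else:
            # Failed: schedule the next retry using exponential backoff.
            delay = retry_delay(notification.attempt)
            notification.attempt += 1
            notification.execute_after = now + timedelta(seconds=delay)
```

The limit parameter plays the role of NOTIFICATION_LIMIT; it is left without a default because this sketch does not assume the real default value.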

Failure modes

Even though the mechanism is simple, the underlying infrastructure is not. There is always a chance that a message is not delivered properly, a problem shared by all message broker systems.

The Notificaties API standard defines that recipients of a message/notification have to reply with an HTTP 204 status code to confirm that the message was received. However, to complicate things further, this confirmation response may also be lost. To summarize, the following scenarios are possible:

  • Open Notificaties delivers message and receives confirmation (happy flow)

  • Open Notificaties delivers message but does not receive a confirmation (failure mode)

  • Open Notificaties fails to deliver the message successfully (failure mode)

Now there are essentially two mitigation modes available:

  • at-most-once delivery

  • at-least-once delivery

Delivering a message exactly once is not possible since the underlying infrastructure (“the internet”) may fail for whatever reason.

Mitigations

Open Notificaties operates in “at-least-once” delivery mode. This means that whenever a delivery attempt succeeds, no more delivery attempts are made and whenever no confirmation is received, another delivery attempt will be made.

A failure can be in the form of an HTTP status code that does not indicate success (so anything other than HTTP 200, HTTP 201, HTTP 202, HTTP 204) or network errors such as connection or timeout errors (or anything else going wrong). In practice, Open Notificaties will consider any 2xx response status code as “message is delivered, no further attempts must be made”.
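The success/failure rule above can be sketched in Python, assuming the standard library's urllib as HTTP client (Open Notificaties' own delivery code may differ). The function names is_delivered and attempt_delivery are hypothetical.

```python
import http.client
import urllib.error
import urllib.request


def is_delivered(status_code: int) -> bool:
    """Any 2xx status means: delivered, make no further attempts."""
    return 200 <= status_code < 300


def attempt_delivery(url: str, payload: bytes, timeout: float = 10.0) -> bool:
    """POST the payload once; True means delivered, False means retry later."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return is_delivered(response.status)
    except urllib.error.HTTPError:
        # urlopen raises HTTPError for 4xx/5xx responses: not delivered.
        return False
    except (OSError, http.client.HTTPException):
        # Connection errors, timeouts (both OSError subclasses) and malformed
        # HTTP responses: not delivered either.
        return False
```

Note that a False result cannot distinguish "the subscriber never got the message" from "the confirmation was lost", which is exactly why redelivery can produce duplicates.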

Note

Webhook subscribers must be able to handle multiple deliveries of the same message! If they received a message correctly but failed to reply with a success response, Open Notificaties will deliver the same message again.
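One way for a subscriber to tolerate repeated deliveries is to make processing idempotent. The sketch below assumes that two deliveries with an identical JSON body are the same message and uses a SHA-256 hash of the canonical body as a deduplication key; notification_key and IdempotentProcessor are hypothetical names. If the payload carries a unique identifier (for CloudEvents, the combination of id and source), that is a better key.

```python
import hashlib
import json
from typing import Callable


def notification_key(notification: dict) -> str:
    """Assumed deduplication key: SHA-256 of the canonical JSON body."""
    canonical = json.dumps(notification, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IdempotentProcessor:
    def __init__(self, process: Callable[[dict], None]):
        self._process = process
        self._seen = set()  # in production this would be persistent storage

    def handle(self, notification: dict) -> bool:
        """Process a notification once; return False for a repeated delivery."""
        key = notification_key(notification)
        if key in self._seen:
            return False  # duplicate: still reply with 204, but do not re-process
        self._process(notification)
        # Record the key only after processing succeeded, so a redelivery
        # after a processing failure is processed again.
        self._seen.add(key)
        return True
```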

Retry mechanism

By default, delivering notifications to subscribers is retried automatically: if the task that sends a notification fails, it is rescheduled and tried again until the maximum number of retries has been reached.

Autoretry explanation and configuration

Retry behaviour is implemented using exponential backoff with a delay factor. The time to wait until the next retry is calculated as follows:

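\[t = \min\left(\text{backoffMax},\ \text{backoffFactor} \cdot \text{baseFactor}^{c}\right)\]

where t is the time in seconds, c is the number of retries that have already been performed, and backoffMax is the maximum delay set by Notification delivery retry backoff max.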
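As a worked example, plugging in the default settings listed below (delay factor 25, base factor 4, maximum delay 52000 seconds) gives the following delays. The maximum only takes effect at c = 6, where the uncapped value would exceed it:

\[t_{0} = 25 \cdot 4^{0} = 25\ \text{s}, \qquad t_{3} = 25 \cdot 4^{3} = 1600\ \text{s}, \qquad t_{6} = \min\left(52000,\ 25 \cdot 4^{6}\right) = \min(52000,\ 102400) = 52000\ \text{s}\]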

This behaviour can be configured using setup_configuration and also via the admin interface at Configuratie > Notificatiescomponentconfiguratie:

  • Notification delivery max retries: the maximum number of retries the task queue will do if sending a notification has failed. Default is 7.

  • Notification delivery retry backoff: a boolean or a number. If this option is set to True, autoretries will be delayed following the rules of binary exponential backoff. If this option is set to a number, it is used as a delay factor. Default is 25.

  • Notification delivery retry backoff max: an integer specifying a number of seconds. If Notification delivery retry backoff is enabled, this option sets the maximum delay in seconds between task autoretries. Default is 52000 seconds.

  • Notification delivery base factor: the base factor used for exponential backoff. This can be increased or decreased to spread retries over a longer or shorter time period. Default is 4.
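To see how these settings interact, the sketch below computes the delay before each retry and the total retry window for a given configuration. It assumes Notification delivery retry backoff is set to a number (the delay factor) rather than True, and the parameter names are illustrative, not the actual configuration keys.

```python
def retry_schedule(max_retries=7, backoff_factor=25, base_factor=4, backoff_max=52000):
    """Delay in seconds before each retry: backoff_factor * base_factor**c, capped."""
    return [min(backoff_max, backoff_factor * base_factor ** c) for c in range(max_retries)]


def retry_window(**settings):
    """Total seconds between the first attempt and the final retry."""
    return sum(retry_schedule(**settings))


print(retry_window())               # 86125 (defaults: about 24 hours)
print(retry_window(base_factor=2))  # 3175 (smaller base factor: retries end within an hour)
```

Lowering the base factor packs the same number of retries into a much shorter period, while raising Notification delivery max retries extends the window by at most backoff_max seconds per extra retry once the cap is reached.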

When using the default configuration, the following delay will be added to each retry of a failing request:

Retry #   Delay added   Total elapsed
0         0s            0s
1         25s           25s
2         100s          2m 5s
3         400s          8m 45s
4         1600s         35m 25s
5         6400s         2h 22m 5s
6         25600s        9h 28m 45s
7         52000s        23h 55m 25s

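So with the default configuration, a subscribed webhook that is down for a short period (for example, one minute) is handled automatically. The final retry takes place roughly 24 hours after the first attempt, so a webhook that recovers before then still receives the notification.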
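The table can be reproduced from the formula with a short script, which is handy when checking a custom configuration. The helper names format_duration and retry_table are hypothetical, and the defaults mirror the settings listed above.

```python
def format_duration(seconds: int) -> str:
    """Format seconds like the table: '2h 22m 5s', '8m 45s', '0s'."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    parts = []
    if hours:
        parts.append(f"{hours}h")
    if minutes:
        parts.append(f"{minutes}m")
    if secs or not parts:
        parts.append(f"{secs}s")
    return " ".join(parts)


def retry_table(max_retries=7, backoff_factor=25, base_factor=4, backoff_max=52000):
    """Rows of (retry number, delay added, total elapsed) in seconds."""
    rows = [(0, 0, 0)]  # row 0 is the initial delivery attempt
    elapsed = 0
    for retry in range(1, max_retries + 1):
        # Retry number `retry` follows retry - 1 earlier retries (c in the formula).
        delay = min(backoff_max, backoff_factor * base_factor ** (retry - 1))
        elapsed += delay
        rows.append((retry, delay, elapsed))
    return rows


for retry, delay, elapsed in retry_table():
    print(f"{retry:>2}  {delay:>6}s  {format_duration(elapsed)}")
```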

Note

Because scheduled notifications are picked up in batches every NOTIFICATION_SEC_INTERVAL seconds (20s by default), notifications are not sent at the exact moment their delay expires. Scheduled notifications are ordered by their execute_after timestamp and attempt number, so that new notifications are prioritized. Under high load, depending on the number of queued scheduled notifications and the value of NOTIFICATION_LIMIT, notifications may be sent a few minutes later than scheduled.
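The effect of batching can be simulated as follows. This is a simplified model, assuming the periodic task runs exactly every interval seconds and takes at most limit due notifications per run, ordered by (execute_after, attempt); send_times is a hypothetical helper and limit deliberately has no default, since this sketch does not assume the real NOTIFICATION_LIMIT default.

```python
from typing import List, Optional


def send_times(execute_after: List[int], *, limit: int, interval: int = 20,
               attempts: Optional[List[int]] = None, start: int = 0) -> List[int]:
    """Return, per notification, the time (in seconds) at which a run picks it up."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    attempts = attempts or [0] * len(execute_after)
    # Same ordering as described above: execute_after first, then attempt.
    pending = sorted(range(len(execute_after)),
                     key=lambda i: (execute_after[i], attempts[i]))
    picked: List[Optional[int]] = [None] * len(execute_after)
    tick = start
    while pending:
        # One run: take at most `limit` notifications that are due at this tick.
        for i in [i for i in pending if execute_after[i] <= tick][:limit]:
            picked[i] = tick
        pending = [i for i in pending if picked[i] is None]
        tick += interval
    return picked
```

For example, with the default 20-second interval a retry due at t = 25s is only picked up at t = 40s, and 250 notifications that become due at the same moment with a limit of 100 are spread over three consecutive runs.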

Open Notificaties message broker

Under the hood, notifications are distributed by background workers to ensure API endpoint availability.

The results and metadata of the background tasks are stored in Redis, which is an in-memory key-value store. Redis is also used as a message broker.

Task metadata is important for keeping track of automatic delivery retries, so it is recommended to set up Redis as a highly-available and/or persistent storage.