Co-authored-by: Andrew Morgan <andrew@amorgan.xyz>
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
Co-authored-by: Eric Eastwood <erice@element.io>
Regressed in
https://github.com/element-hq/synapse/pull/18900#discussion_r2331554278
(see conversation there for more context)
### How is this a regression?
> To give this an update with more hindsight; this logic *was* redundant
with the early return and it is safe to remove this complexity
✅
>
> It seems like this actually has to do with completed vs incomplete
deferreds...
>
> To explain how things previously worked *without* the early-return
shortcut:
>
> With the normal case of **incomplete awaitable**, we store the
`calling_context` and the `f` function is called and runs until it
yields to the reactor. Because `f` follows the logcontext rules, it sets
the `sentinel` logcontext. Then in `run_in_background(...)`, we restore
the `calling_context`, store the current `ctx` (which is `sentinel`) and
return. When the deferred completes, we restore `ctx` (which is
`sentinel`) before yielding to the reactor again (all good
✅)
>
> With the other case where we see a **completed awaitable**, we store
the `calling_context` and the `f` function is called and runs to
completion (no logcontext change). *This is where the shortcut would
kick in but I'm going to continue explaining as if we commented out the
shortcut.* -- Then in `run_in_background(...)`, we restore the
`calling_context`, store the current `ctx` (which is same as the
`calling_context`). Because the deferred is already completed, our extra
callback is called immediately and we restore `ctx` (which is same as
the `calling_context`). Since we never yield to the reactor, the
`calling_context` is perfect as that's what we want again (all good
✅)
>
> ---
>
> But this also means that our early-return shortcut is no longer just
an optimization and is *necessary* to act correctly in the **completed
awaitable** case as we want to return with the `calling_context` and not
reset to the `sentinel` context. I've updated the comment in
https://github.com/element-hq/synapse/pull/18964 to explain the
necessity as it's currently just described as an optimization.
>
> But because we made the same change to
`run_coroutine_in_background(...)` which didn't have the same
early-return shortcut, we regressed the correct behavior ❌ . This is
being fixed in https://github.com/element-hq/synapse/pull/18964
>
>
> *-- @MadLittleMods,
https://github.com/element-hq/synapse/pull/18900#discussion_r2373582917*
### How did we find this problem?
Spawning from @wrjlewis
[seeing](https://matrix.to/#/!SGNQGPGUwtcPBUotTL:matrix.org/$h3TxxPVlqC6BTL07dbrsz6PmaUoZxLiXnSTEY-QYDtA?via=jki.re&via=matrix.org&via=element.io)
`Starting metrics collection 'typing.get_new_events' from sentinel
context: metrics will be lost` in the logs:
<details>
<summary>More logs</summary>
```
synapse.http.request_metrics - 222 - ERROR - sentinel - Trying to stop RequestMetrics in the sentinel context.
2025-09-23 14:43:19,712 - synapse.util.metrics - 212 - WARNING - sentinel - Starting metrics collection 'typing.get_new_events' from sentinel context: metrics will be lost
2025-09-23 14:43:19,713 - synapse.rest.client.sync - 851 - INFO - sentinel - Client has disconnected; not serializing response.
2025-09-23 14:43:19,713 - synapse.http.server - 825 - WARNING - sentinel - Not sending response to request <XForwardedForRequest at 0x7f23e8111ed0 method='POST' uri='/_matrix/client/unstable/org.matrix.simplified_msc3575/sync?pos=281963%2Fs929324_147053_10_2652457_147960_2013_25554_4709564_0_164_2&timeout=30000' clientproto='HTTP/1.1' site='8008'>, already dis
connected.
2025-09-23 14:43:19,713 - synapse.access.http.8008 - 515 - INFO - sentinel - 92.40.194.87 - 8008 - {@me:wi11.co.uk} Processed request: 30.005sec/-8.041sec (0.001sec, 0.000sec) (0.000sec/0.002sec/2) 0B 200! "POST /_matrix/client/unstable/org.matrix.simplified_msc3575/
```
</details>
From the logs there, we can see things relating to
`typing.get_new_events` and
`/_matrix/client/unstable/org.matrix.simplified_msc3575/sync` which led
me to trying out Sliding Sync with the typing extension enabled and
allowed me to reproduce the problem locally. Sliding Sync is a unique
scenario as it's the only place we use `gather_optional_coroutines(...)`
-> `run_coroutine_in_background(...)` (introduced in
https://github.com/element-hq/synapse/pull/17884) to exhibit this
behavior.
### Testing strategy
1. Configure Synapse to enable
[MSC4186](https://github.com/matrix-org/matrix-spec-proposals/pull/4186):
Simplified Sliding Sync which is actually under
[MSC3575](https://github.com/matrix-org/matrix-spec-proposals/pull/3575)
```yaml
experimental_features:
msc3575_enabled: true
```
1. Start synapse: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Make a Sliding Sync request with one of the extensions enabled
```http
POST
http://localhost:8008/_matrix/client/unstable/org.matrix.simplified_msc3575/sync
{
"lists": {},
"room_subscriptions": {
"!FlgJYGQKAIvAscfBhq:my.synapse.linux.server": {
"required_state": [],
"timeline_limit": 1
}
},
"extensions": {
"typing": {
"enabled": true
}
}
}
```
1. Open your homeserver logs and notice warnings about `Starting ...
from sentinel context: metrics will be lost`
Part of https://github.com/element-hq/synapse/issues/18905
Lints for ensuring we use `Clock.call_later` instead of
`reactor.callLater`, etc are coming in
https://github.com/element-hq/synapse/pull/18944
### Testing strategy
1. Configure Synapse to log at the `DEBUG` level
1. Start Synapse: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Wait 10 seconds for the [database profiling
loop](9cc4001778/synapse/storage/database.py (L711))
to execute
1. Notice the logcontext being used for the `Total database time` log
line
Before (`sentinel`):
```
2025-09-10 16:36:58,651 - synapse.storage.TIME - 707 - DEBUG - sentinel - Total database time: 0.646% {room_forgetter_stream_pos(2): 0.131%, reap_monthly_active_users(1): 0.083%, get_device_change_last_converted_pos(1): 0.078%}
```
After (`looping_call`):
```
2025-09-10 16:36:58,651 - synapse.storage.TIME - 707 - DEBUG - looping_call - Total database time: 0.646% {room_forgetter_stream_pos(2): 0.131%, reap_monthly_active_users(1): 0.083%, get_device_change_last_converted_pos(1): 0.078%}
```
Introduce `Clock.add_system_event_trigger(...)` to wrap system event
callback code in a logcontext, ensuring we can identify which server
generated the logs.
Background:
> Ideally, nothing from the Synapse homeserver would be logged against the `sentinel`
> logcontext as we want to know which server the logs came from. In practice, this is not
> always the case yet especially outside of request handling.
>
> Global things outside of Synapse (e.g. Twisted reactor code) should run in the
> `sentinel` logcontext. It's only when it calls into application code that a logcontext
> gets activated. This means the reactor should be started in the `sentinel` logcontext,
> and any time an awaitable yields control back to the reactor, it should reset the
> logcontext to be the `sentinel` logcontext. This is important to avoid leaking the
> current logcontext to the reactor (which would then get picked up and associated with
> the next thing the reactor does).
>
> *-- `docs/log_contexts.md`
Also adds a lint to prefer `Clock.add_system_event_trigger(...)` over
`reactor.addSystemEventTrigger(...)`
Part of https://github.com/element-hq/synapse/issues/18905
Introduce `Clock.call_when_running(...)` to wrap startup code in a
logcontext, ensuring we can identify which server generated the logs.
Background:
> Ideally, nothing from the Synapse homeserver would be logged against the `sentinel`
> logcontext as we want to know which server the logs came from. In practice, this is not
> always the case yet especially outside of request handling.
>
> Global things outside of Synapse (e.g. Twisted reactor code) should run in the
> `sentinel` logcontext. It's only when it calls into application code that a logcontext
> gets activated. This means the reactor should be started in the `sentinel` logcontext,
> and any time an awaitable yields control back to the reactor, it should reset the
> logcontext to be the `sentinel` logcontext. This is important to avoid leaking the
> current logcontext to the reactor (which would then get picked up and associated with
> the next thing the reactor does).
>
> *-- `docs/log_contexts.md`
Also adds a lint to prefer `Clock.call_when_running(...)` over
`reactor.callWhenRunning(...)`
Part of https://github.com/element-hq/synapse/issues/18905
Remove `sentinel` logcontext where we log in `setup`, `start`, and exit.
Instead of having one giant PR that removes all places we use `sentinel`
logcontext, I've decided to tackle this more piece-meal. This PR covers
the parts if you just startup Synapse and exit it with no requests or
activity going on in between.
Part of https://github.com/element-hq/synapse/issues/18905 (Remove
`sentinel` logcontext where we log in Synapse)
Prerequisite for https://github.com/element-hq/synapse/pull/18868.
Logging with the `sentinel` logcontext means we won't know which server
the log came from.
### Why
9cc4001778/docs/log_contexts.md (L71-L81)
(docs updated in https://github.com/element-hq/synapse/pull/18900)
### Testing strategy
1. Run Synapse normally and with `daemonize: true`: `poetry run
synapse_homeserver --config-path homeserver.yaml`
1. Execute some requests
1. Shutdown the server
1. Look for any bad log entries in your homeserver logs:
- `Expected logging context sentinel but found main`
- `Expected logging context main was lost`
- `Expected previous context`
- `utime went backwards!`/`stime went backwards!`
- `Called stop on logcontext POST-0 without recording a start rusage`
1. Look for any logs coming from the `sentinel` context
With these changes, you should only see the following logs (not from
Synapse) using the `sentinel` context if you start up Synapse and exit:
`homeserver.log`
```
2025-09-10 14:45:39,924 - asyncio - 64 - DEBUG - sentinel - Using selector: EpollSelector
2025-09-10 14:45:40,562 - twisted - 281 - INFO - sentinel - Received SIGINT, shutting down.
2025-09-10 14:45:40,562 - twisted - 281 - INFO - sentinel - (TCP Port 9322 Closed)
2025-09-10 14:45:40,563 - twisted - 281 - INFO - sentinel - (TCP Port 8008 Closed)
2025-09-10 14:45:40,563 - twisted - 281 - INFO - sentinel - (TCP Port 9093 Closed)
2025-09-10 14:45:40,564 - twisted - 281 - INFO - sentinel - Main loop terminated.
```
Closes: #18436
Implements:
https://github.com/matrix-org/matrix-spec-proposals/pull/4308
Follows: #18674
Adds an extension to Sliding Sync and a companion
endpoint needed for backpaginating missed thread subscription changes,
as described in MSC4308
---------
Signed-off-by: Olivier 'reivilibre <oliverw@matrix.org>
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
Also adds a section in the docs explaining the `sentinel` logcontext.
Spawning from https://github.com/element-hq/synapse/pull/18870
### Testing strategy
1. Run Synapse normally and with `daemonize: true`: `poetry run
synapse_homeserver --config-path homeserver.yaml`
1. Execute some requests
1. Shutdown the server
1. Look for any bad log entries in your homeserver logs:
- `Expected logging context sentinel but found main`
- `Expected logging context main was lost`
- `Expected previous context`
- `utime went backwards!`/`stime went backwards!`
- `Called stop on logcontext POST-0 without recording a start rusage`
- `Background process re-entered without a proc`
Twisted trial tests:
1. Run full Twisted trial test suite.
1. Check the logs for `Test starting with non-sentinel logging context ...`
Spawning from https://github.com/element-hq/synapse/pull/18871
[This change](6ce2f3e59d)
was originally used to fix CPU time going backwards when we `daemonize`.
While, we don't seem to run into this problem on `develop`, I still
think this is a good change to make. We don't need background tasks
running on a process that will soon be forcefully exited and where the
reactor isn't even running yet. We now kick off the background tasks
(`run_as_background_process`) after we have forked the process and
started the reactor.
Also as simple note, we don't need background tasks running in both halves of a fork.