while adding a test, several inconsistencies around socket management in
the face of an error happening during socket wait time were uncovered
and have been fixed:
* resolver-related connections weren't transitioning to the "closed" status,
because state wasn't being reset when handling connection errors.
* connections waiting for a DNS answer weren't pinned to the session; so
when an error happened, the pool accounting was corrupted, as it still
considered the connection in use; the connection is now sent back to the
pool.
* resolver wasn't cleaning up its state properly, remaining open after
dealing with a socket-level error; it now transitions to the closed state.
* the error-handling API has been aligned across connections and resolvers
(see the sketch below).
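the following is a minimal, hypothetical sketch of that aligned error-handling contract; the names (ErrorHandling, handle_socket_error, FakeConnection, FakeResolver) are illustrative and not the real internals:

```ruby
# hypothetical names, not the actual library internals: connections and resolvers
# expose the same error entry point, which resets state, transitions to :closed and
# notifies whoever holds them, so the pool accounting stays correct.
module ErrorHandling
  def handle_socket_error(error)
    @state = :closed              # transition out of :connecting/:open
    @on_error&.call(self, error)  # let the session/pool do its accounting
  end
end

class FakeConnection
  include ErrorHandling
  attr_reader :state

  def initialize(&on_error)
    @state = :idle
    @on_error = on_error
  end
end

class FakeResolver
  include ErrorHandling
  attr_reader :state

  def initialize(&on_error)
    @state = :idle
    @on_error = on_error
  end
end

conn = FakeConnection.new { |c, err| puts "checking #{c.class} back in after #{err.class}" }
conn.handle_socket_error(Errno::ECONNRESET.new)
puts conn.state # => closed
```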
this removes the exception handlers on the loop, which were buggy;
besides force-closing connections which shouldn't be closed, they did so
in an unsafe way: the selector selectables were traversed and
force-closed, which mutated the selectables list while it was being
iterated (this could leave selectables hanging there); this becomes more
problematic for persistent connections.
this was fixed by rescuing at the wait/select call level, and traversing
the list of IOs being inspected.
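the contrived ruby sketch below illustrates the hazard and the fix: deleting from the list while iterating it skips entries, whereas rescuing around the select call and walking a snapshot of the inspected IOs visits all of them (pure illustration, not the real selector code):

```ruby
selectables = [:a, :b, :c, :d]

# buggy pattern: the deletion mutates the array being traversed, so entries get skipped
visited = []
selectables.each do |sel|
  visited << sel
  selectables.delete(sel) # "force close"
end
puts visited.inspect # => [:a, :c]

# safe pattern: rescue around the wait/select call and iterate a copy of the inspected IOs
selectables = [:a, :b, :c, :d]
visited = []
begin
  raise IOError, "select failed" # stand-in for the error raised at the wait/select call
rescue IOError
  selectables.dup.each do |sel|
    visited << sel
    selectables.delete(sel)
  end
end
puts visited.inspect # => [:a, :b, :c, :d]
```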
another fix was error propagation to the resolvers which hold the
connections, which was not happening in the case of forceful termination,
since resolvers didn't implement #force_reset. This was fixed by moving
to #force_close and implementing it for resolvers, where it drops the
connections and propagates #force_close to them.
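as a rough illustration of that propagation (class names are invented, not the real API), a resolver's #force_close forwards the call to each held connection before dropping them:

```ruby
class SketchConnection
  attr_reader :state

  def initialize
    @state = :open
  end

  def force_close
    @state = :closed
  end
end

class SketchResolver
  def initialize(connections)
    @connections = connections
  end

  def force_close
    @connections.each(&:force_close) # propagate the forceful termination
    @connections.clear               # drop them so they don't linger in the resolver
  end
end

conns = [SketchConnection.new, SketchConnection.new]
SketchResolver.new(conns).force_close
puts conns.map(&:state).inspect # => [:closed, :closed]
```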
before, timeouts were calculated by evaluating all names in the queue, but
that was not accurate due to candidates, as there was always a candidate
with a lower timeout; instead, the preexisting @name, which already points
to the name currently being resolved, is used as a proxy to get the
correct timeouts list. This also results in a performance improvement.
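a loose sketch of the idea, assuming a timeouts table keyed by name (the structure is illustrative, not the actual resolver internals):

```ruby
timeouts = {
  "example.com"       => [1, 2, 4], # the name actually being resolved
  "example.com.local" => [0.5, 1]   # a search-domain candidate with lower timeouts
}

current_name = "example.com"        # stand-in for the preexisting @name
puts timeouts[current_name].inspect # => [1, 2, 4] (the candidate's list is not consulted)
```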
defaulting to unbounded, in order to preserve current behaviour; this will
cap the number of connections initiated for a given origin in a pool
(which, if the pool is not shared, will be per-origin), including
connections from separate option profiles.
a pool timeout is defined for checking out a connection when the limit is
reached.
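a possible usage sketch for these limits is shown below; the session call (HTTPX.with) and the option names (:pool_options, :max_connections_per_origin, :pool_timeout) are assumed from this description and may not match the released API exactly:

```ruby
require "httpx"

http = HTTPX.with(
  pool_options: {
    max_connections_per_origin: 4, # cap connections initiated per origin (default: unbounded)
    pool_timeout: 5                # seconds to wait for a checkout once the cap is hit
  }
)

puts http.get("https://example.com").status
```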
the change to read/write cancellation-driven timeouts as the default
timeout strategy revealed a performance regression; because these were
built on Timers, which never got unsubscribed, this meant that they were
kept beyond the duration of the request they were created for, and
needlessly got picked up for the next timeout tick.
This was fixed by adding a callback on timer intervals, which
unsubscribes them from the timer group when called; these are then
triggered once the timeout is no longer needed (request sent /
response received), thereby removing the overhead on subsequent
requests.
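a hypothetical sketch of that mechanism (not the real Timers code): each interval carries a callback that removes it from its group, so dropping it once the request is done leaves nothing to scan on later ticks:

```ruby
class Interval
  def initialize(duration, &on_empty)
    @duration = duration
    @on_empty = on_empty
  end

  def drop
    @on_empty.call(self) # unsubscribe from the group when no longer needed
  end
end

class TimerGroup
  def initialize
    @intervals = []
  end

  def after(duration)
    interval = Interval.new(duration) { |i| @intervals.delete(i) }
    @intervals << interval
    interval
  end

  def size
    @intervals.size
  end
end

group = TimerGroup.new
interval = group.after(60) # timeout registered for a request
interval.drop              # response received, timeout no longer needed
puts group.size # => 0, nothing left to scan on the next timeout tick
```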
An additional intervals array is also kept in the connection itself;
timeouts from timers are signalled via socket wait calls, however these
always resulted in timeouts, even when they shouldn't have (ex: an expect
timeout, after which the full payload should be sent instead), and with
the wrong exception class in some cases. By keeping the intervals from its
requests around, and monitoring whether there are relevant request
triggers, the connection can therefore handle a timeout or bail out (so
that timers can fire the correct callback).
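the sketch below illustrates that connection-side check under simplified assumptions (monotonic-clock deadlines per request); it is not the real connection implementation:

```ruby
class TimeoutAwareConnection
  def initialize
    @intervals = [] # [[deadline, callback], ...] registered per in-flight request
  end

  def register_timeout(seconds, &callback)
    @intervals << [Process.clock_gettime(Process::CLOCK_MONOTONIC) + seconds, callback]
  end

  def handle_wait_timeout
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    elapsed, @intervals = @intervals.partition { |deadline, _| deadline <= now }
    if elapsed.empty?
      :retry_wait                      # spurious wake-up: no request trigger is actually due
    else
      elapsed.each { |_, cb| cb.call } # let the proper callback fire with the right error
      :timed_out
    end
  end
end

conn = TimeoutAwareConnection.new
conn.register_timeout(0) { puts "request timed out" }
puts conn.handle_wait_timeout # prints "request timed out", then timed_out
```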
when closed, connections are now placed in a collection called
eden_connections; whenever a connection is matched for, after checking the
live connections and finding none, a match is looked up in the eden
connections; the match is accepted **if** the IP is considered fresh (the
address is still valid in the DNS cache, or the input was an IP or is in
/etc/hosts, or it's an external socket) and, for a TLS connection, the
stored TLS session has not expired; if these conditions are not met, the
connection is dropped from the eden and a new connection is started
instead; this therefore allows reusing ruby objects and TLS sessions,
while still respecting the DNS cache.
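a simplified sketch of that checkout order, with connections modelled as plain hashes and all predicate names invented for illustration:

```ruby
def checkout(origin, live, eden, dns_cache_valid)
  if (conn = live.find { |c| c[:origin] == origin })
    return conn
  end

  if (conn = eden.find { |c| c[:origin] == origin })
    ip_fresh   = conn[:ip_literal] || conn[:in_hosts] || conn[:external_socket] || dns_cache_valid
    tls_usable = !conn[:tls] || !conn[:tls_session_expired]

    eden.delete(conn)
    if ip_fresh && tls_usable
      live << conn
      return conn # the ruby object and its TLS session are reused, DNS cache respected
    end
    # otherwise the stale eden connection stays dropped and a new one is started below
  end

  { origin: origin, new: true } # stand-in for starting a brand new connection
end

eden = [{ origin: "https://example.com", tls: true, tls_session_expired: false }]
live = []
puts checkout("https://example.com", live, eden, true).inspect
```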
a behaviour has been observed behind a vpn where, when one of the
nameservers is unresponsive, the switch to the next nameserver wasn't
happening. Part of it was a bug in the timeout handling, but the rest
was the switch itself not happening (i.e. it'd fail on the first
server). This fixes it by switching to the next nameserver on query
error.
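a minimal sketch of the new behaviour, with a stubbed query so the example is self-contained; the real resolver logic is more involved:

```ruby
def resolve(name, nameservers, query)
  nameservers.each_with_index do |nameserver, idx|
    begin
      return query.call(nameserver, name)
    rescue IOError, SystemCallError => e
      raise if idx == nameservers.size - 1 # nothing left to switch to
      warn "#{nameserver} failed (#{e.class}), switching to the next nameserver"
    end
  end
end

# the first (unresponsive) server raises, the second one answers
query = lambda do |server, _name|
  raise Errno::ETIMEDOUT if server == "10.0.0.1"
  "93.184.216.34"
end

puts resolve("example.com", %w[10.0.0.1 1.1.1.1], query) # => 93.184.216.34
```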
While adding the test, the code to recover from an exhausted HTTP/1.1
connection proved not to be reliable, as it wasn't taking in-flight
requests nor the update-once keep-alive max requests counter into
account.
This has been fixed by implementing our own test dummy using
webrick.
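a rough idea of what such a webrick test dummy can look like (using a simplified global request counter rather than a real per-connection budget):

```ruby
require "webrick"

MAX_REQUESTS = 2 # keep-alive budget advertised by the dummy server
counter = 0

server = WEBrick::HTTPServer.new(
  Port: 8080,
  AccessLog: [],
  Logger: WEBrick::Log.new(File::NULL)
)

server.mount_proc("/") do |_req, res|
  counter += 1
  remaining = MAX_REQUESTS - counter
  res["Keep-Alive"] = "timeout=5, max=#{remaining}" if remaining.positive?
  res.keep_alive = remaining.positive? # force the connection closed once the budget is spent
  res.body = "ok"
  counter = 0 unless remaining.positive? # fresh budget for the next connection
end

trap("INT") { server.shutdown }
server.start
```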