Bug 1363 - sshd gets stuck: select() in packet_read_seqnr waits indefinitely
Summary: sshd gets stuck: select() in packet_read_seqnr waits indefinitely
Status: CLOSED FIXED
Alias: None
Product: Portable OpenSSH
Classification: Unclassified
Component: sshd (show other bugs)
Version: 4.2p1
Hardware: All All
: P2 major
Assignee: Damien Miller
URL: http://marc.info/?t=117394251600035
Keywords: patch
Depends on:
Blocks: V_5_1
  Show dependency treegraph
 
Reported: 2007-09-17 13:02 AEST by Matt Day
Modified: 2023-01-13 13:56 AEDT (History)
3 users (show)

See Also:


Attachments
latest version of fix -- this has been tested (4.09 KB, patch)
2007-09-17 13:02 AEST, Matt Day
no flags Details | Diff
revised patch (7.06 KB, patch)
2007-09-18 10:50 AEST, Damien Miller
no flags Details | Diff
Set remain_ms correctly (7.24 KB, patch)
2007-09-18 16:53 AEST, Damien Miller
no flags Details | Diff
activate server-side timeout after auth completes (7.70 KB, text/plain)
2008-01-20 12:17 AEDT, Damien Miller
dtucker: ok+
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Matt Day 2007-09-17 13:02:25 AEST
Created attachment 1348 [details]
latest version of fix -- this has been tested

This bug was discussed on openssh-unix-dev in March 2007:
  http://marc.info/?t=117394251600035

During the discussion, Darren Tucker created a fix for the problem and I (Matt Day) revised and tested it. The latest version of the patch is attached.

Original problem report:

I'm having a problem where sshd login sessions are occasionally
(as often as once a day) getting stuck indefinitely. I enabled debug
messages and got a backtrace of a stuck sshd, and I think I've found
the bug.

sshd version:
    OpenSSH_4.2p1 FreeBSD-20050903, OpenSSL 0.9.7e-p1 25 Oct 2004

Uncommented lines (ie. nondefault settings) in sshd_config:
    LogLevel DEBUG
    ClientAliveInterval 90
    Subsystem sftp /usr/libexec/sftp-server

SSH client:
    PuTTY version 0.58, default settings

OS/HW:
    FreeBSD 6.1-RELEASE running on 64-bit x86 ("amd64" platform)

Executive summary:
    The select() in packet_read_seqnr() waits indefinitely, resulting
    in stuck SSH sessions when networking problems interfere with
    key exchange. Would like to be able to set a timeout there, or
    send SSH keepalives during key exchange.

Periodically (every 60 minutes) the SSH client initiates rekeying
via key exchange. Here's an example of a successful rekeying:

Mar 11 19:02:35 SSH2_MSG_KEXINIT received
Mar 11 19:02:35 SSH2_MSG_KEXINIT sent
Mar 11 19:02:35 kex: client->server aes256-ctr hmac-sha1 none
Mar 11 19:02:35 kex: server->client aes256-ctr hmac-sha1 none
Mar 11 19:02:35 SSH2_MSG_KEX_DH_GEX_REQUEST_OLD received
Mar 11 19:02:35 SSH2_MSG_KEX_DH_GEX_GROUP sent
Mar 11 19:02:35 expecting SSH2_MSG_KEX_DH_GEX_INIT
Mar 11 19:02:38 SSH2_MSG_KEX_DH_GEX_REPLY sent
Mar 11 19:02:38 set_newkeys: rekeying
Mar 11 19:02:38 SSH2_MSG_NEWKEYS sent
Mar 11 19:02:38 expecting SSH2_MSG_NEWKEYS
Mar 11 19:02:38 set_newkeys: rekeying
Mar 11 19:02:38 SSH2_MSG_NEWKEYS received

In the failure case, sshd gets stuck during key exchange. The SSH   
session had been going fine for many hours, and then these were the 
last messages it logged:

Mar 11 20:02:38 SSH2_MSG_KEXINIT received
Mar 11 20:02:38 SSH2_MSG_KEXINIT sent
Mar 11 20:02:38 kex: client->server aes256-ctr hmac-sha1 none
Mar 11 20:02:38 kex: server->client aes256-ctr hmac-sha1 none
Mar 11 20:02:38 SSH2_MSG_KEX_DH_GEX_REQUEST_OLD received
Mar 11 20:02:38 SSH2_MSG_KEX_DH_GEX_GROUP sent
Mar 11 20:02:38 expecting SSH2_MSG_KEX_DH_GEX_INIT

The user was idle when this happened, but had a program running
that was generating output. That program became tty-blocked after
about 30 minutes, presumably because sshd wasn't draining its output,
and that's when I noticed the user's sshd was stuck and got a backtrace:

(gdb) where
#0  0x.. in select () from /lib/libc.so.6
#1  0x.. in packet_read_seqnr () from /usr/lib/libssh.so.3
#2  0x.. in packet_read () from /usr/lib/libssh.so.3
#3  0x.. in packet_read_expect () from /usr/lib/libssh.so.3
#4  0x.. in kexgex_server (kex=0x538900) at kexgexs.c:99
#5  0x.. in kex_setup () from /usr/lib/libssh.so.3
#6  0x.. in kex_input_kexinit () from /usr/lib/libssh.so.3
#7  0x.. in dispatch_run () from /usr/lib/libssh.so.3
#8  0x.. in process_buffered_input_packets () at serverloop.c:475
#9  0x.. in server_loop2 (authctxt=0x4) at serverloop.c:760
#10 0x.. in do_authenticated2 (authctxt=0x4) at session.c:2456
#11 0x.. in do_authenticated (authctxt=0x53a400) at session.c:227
#12 0x.. in main at sshd.c:1749

This backtrace agrees with the debug messages: it's in kexgex_server(),
calling packet_read_expect(SSH2_MSG_KEX_DH_GEX_INIT), which ultimately
calls select() from packet_read_seqnr().

The select call in packet_read_seqnr passes NULL for a timeout,
meaning it will wait forever. That explains why the comment above
it says "Note that no other data is processed until this returns,
so this function should not be used during the interactive session."
But, this was an interactive session.

I've set ClientAliveInterval in sshd_config so that SSH sessions
die in a timely manner when networking problems arise, but the
keepalive is apparently not sent during key exchange. The default
TCP keepalive on FreeBSD is unhelpful here; it only kicks in after
2 hours, and I need stuck SSH sessions to die a lot sooner. I want
to keep the FreeBSD TCP keepalive defaults.

Would it be possible for the select() in packet_read_seqnr to use
an optional timeout? Similarly, I believe the select() in
packet_write_wait has the same problem. Upon timeout, it would be
fine with me if the session died with an error logged. Alternatively,
if SSH keepalives were sent during key exchange, that would suffice.
Comment 1 Damien Miller 2007-09-17 18:53:59 AEST
Comment on attachment 1348 [details]
latest version of fix -- this has been tested

>+		while ((ret = select(connection_in + 1, setp, NULL, NULL,
>+		    packet_wait_tvp)) == -1 &&

I think you need to reset packet_wait_tv before each select call. Linux at least will modify the timeout parameter to the contain the amount of time remaining, and apparently this behaviour is permitted by POSIX.
Comment 2 Matt Day 2007-09-18 09:30:37 AEST
(In reply to comment #1)
> I think you need to reset packet_wait_tv before each select call. Linux
> at least will modify the timeout parameter to the contain the amount of
> time remaining, and apparently this behaviour is permitted by POSIX.

Good observation -- I agree.

In addition to the patch, this sort of select-loop with timeout appears in the following OpenSSH 4.7 places:
* conloop() (ssh-keyscan.c)
* timeout_connect (sshconnect.c)

I'm not familiar with that code, but must not be portable for the same reason.

In the patch, the purpose of the select-loop is to wait for the socket to become ready. In OpenSSH 4.7, the packet.c routines wait forever. So the patch introduces a timeout which is enabled on systems using the SSH keepalive.

So, I think the patch is actually *correct* on Linux. On systems like FreeBSD that do not change the select-timeval, the select-loop would start the timer over from the beginning each time EAGAIN or EINTR occurred. So if signals kept going off, sshd could still get stuck indefinitely in that loop.

So I believe the correct fix should keep track of its own time and pass the correct amount of remaining time period into select().
Comment 3 Damien Miller 2007-09-18 09:41:14 AEST
> In addition to the patch, this sort of select-loop with timeout appears
> in the following OpenSSH 4.7 places:
> * conloop() (ssh-keyscan.c)

I'm not so fussed about this one.

> * timeout_connect (sshconnect.c)

This is fixed in CVS -current. See:

http://www.openbsd.org/cgi-bin/cvsweb/src/usr.bin/ssh/sshconnect.c.diff?r1=1.201&r2=1.202

> So, I think the patch is actually *correct* on Linux. On systems like
> FreeBSD that do not change the select-timeval, the select-loop would
> start the timer over from the beginning each time EAGAIN or EINTR
> occurred. So if signals kept going off, sshd could still get stuck
> indefinitely in that loop.

Actually, the patch is incorrect everywhere and possibly badly broken on Linux. On Linux, the timeout will get smaller and smaller and will eventually timeout an active connection. On other platforms the behavior is undefined.

The logic that needs to be implemented is:
    1. Decrease the timeout on EINTR or EAGAIN and retry the select
    2. Reset the timeout on a successful read
Comment 4 Matt Day 2007-09-18 09:54:03 AEST
(In reply to comment #3)
> Actually, the patch is incorrect everywhere and possibly badly broken
> on Linux. On Linux, the timeout will get smaller and smaller and will
> eventually timeout an active connection. On other platforms the
> behavior is undefined.

Ah, you're right of course. I forgot the timeout value is being stored in a global by packet_set_timeout, which is only being called at session establishment time.
Comment 5 Damien Miller 2007-09-18 10:50:15 AEST
Created attachment 1350 [details]
revised patch

This implements the logic I described in comment #3
Comment 6 Tomas Mraz 2007-09-18 16:41:46 AEST
Shouldn't the ms_to_timeval(&timeout, ms_remain); be called inside the for(;;) loop? As you said the contents of the timeout is undefined on some platforms after returning from the select().
Comment 7 Damien Miller 2007-09-18 16:53:24 AEST
Created attachment 1351 [details]
Set remain_ms correctly

You are right: that diff only reset the timeout on successful reads, but not on interrupted select()s.

One concern though: does this impose a timelimit of (ClientAliveInterval * ClientAliveCountMax) on userauth? 

I'm not sure whether that is desirable, because we already have LoginGraceTime for that. I'm not sure what the solution is there, perhaps set the timeout to MAX(LoginGraceTime, ClientAliveInterval * ClientAliveCountMax), turn the timeout off after KEX, or never turn it on in the server to begin with?
Comment 8 Damien Miller 2007-11-03 11:43:00 AEDT
(In reply to comment #7)
I think the solution is not to enable the timeout in the server at all - it already has LoginGraceTime that kicks off right after accept().

So, just ignore the sshd.c hunk of this patch.
Comment 9 Darren Tucker 2007-11-03 12:06:20 AEDT
(In reply to comment #8)
> (In reply to comment #7)
> I think the solution is not to enable the timeout in the server at all
> - it already has LoginGraceTime that kicks off right after accept().
> 
> So, just ignore the sshd.c hunk of this patch.

The problem also exists in the session (eg during rekey) and LoginGraceTime doesn't help there.

We could set the timeout to MAX(ClientAliveInterval * ClientAliveCountMax, LoginGraceTime) before auth, and ClientAliveInterval * ClientAliveCountMax afterward.
Comment 10 Damien Miller 2008-01-20 12:17:53 AEDT
Created attachment 1446 [details]
activate server-side timeout after auth completes

Almost identical patch, but activate the sshd timeout right after the LoginGraceTime alarm is cancelled. I.e. LoginGraceTime will cover initial KEX and auth, and this new timeout will cover the rest of the running protocol (including re-KEX).

client behaviour is unchanged
Comment 11 Darren Tucker 2008-06-13 06:45:17 AEST
Thanks all.  This has been applied and will be in the 5.1 release as well as tomorrow's snapshop.
Comment 12 Damien Miller 2008-07-22 12:19:44 AEST
Mass update RESOLVED->CLOSED after release of openssh-5.1