Hey list,
at the very busy (many threads, many packets, 8 cores) DFN radsecproxy we recently got some crashes looking like
#0 0x00002aeb05fffe54 in SSL_write (s=0x0, buf=0x2aeb482a1d00, num=20) at ssl_lib.c:989 No locals. #1 0x000000000040ce3c in tlsserverwr (arg=0x2aeb48177620) at tls.c:339 cnt = 0 error = <optimized out> client = 0x2aeb48177620 replyq = 0x2aeb48291e60 reply = 0x2aeb482c2190 #2 0x00002aeb06650064 in start_thread (arg=0x2aeb3212d700) at pthread_create.c:309 __res = <optimized out> pd = 0x2aeb3212d700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {47189645776640, 1280935663696751663, 0, 47190015195840, 15, 47189645776640, 4904604809065540655, 4904632170130975791}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> __PRETTY_FUNCTION__ = "start_thread"
I think this is a race condition among 2 threads in the lines
https://git.nordu.net/?p=radsecproxy.git;a=blob;f=tls.c;h=567a6be3491751cb70... and https://git.nordu.net/?p=radsecproxy.git;a=blob;f=tls.c;h=567a6be3491751cb70...
(i.e. client->ssl will be nulled before SSL_write() is called )
The attached patch against 1.6.9 fixes this while trying to be non-invasive to the code flow. There is still the chance of running SSL_write on a 'broken' tls connection but this is handled by openssl and 'cnt' anyway -- one could perhaps adjust the debug message then :).
The same problem exists for the dtls-case for which I don't have a test case at the moment but a patch could look exactly the same.
https://git.nordu.net/?p=radsecproxy.git;a=blob;f=dtls.c;h=f8660925ab28caef6...
/Steffen
Steffen Klemer klemer@dfn.de wrote Fri, 17 Nov 2017 15:38:08 +0100:
I think this is a race condition among 2 threads in the lines
https://git.nordu.net/?p=radsecproxy.git;a=blob;f=tls.c;h=567a6be3491751cb70... https://git.nordu.net/?p=radsecproxy.git;a=blob;f=tls.c;h=567a6be3491751cb70...
(i.e. client->ssl will be nulled before SSL_write() is called )
Thanks for finding and report this (with a fix, none the less!).
This looks like RADSECPROXY-47 [0] which was identified earlier but sadly never addressed.
[0] https://project.nordu.net/projects/RADSECPROXY/issues/RADSECPROXY-47
The attached patch against 1.6.9 fixes this while trying to be non-invasive to the code flow. There is still the chance of running SSL_write on a 'broken' tls connection but this is handled by openssl and 'cnt' anyway -- one could perhaps adjust the debug message then :).
How do you mean openssl and 'cnt' handle this?
An alternative fix might be to simply check for client->ssl != NULL before calling SSL_write() outside of the replyq->mutex lock. Did you consider this? Worth investigating?
Hey Linus.
Thanks for your feedback.
I think this is a race condition among 2 threads in the lines
https://git.nordu.net/?p=radsecproxy.git;a=blob;f=tls.c;h=567a6be3491751cb70... https://git.nordu.net/?p=radsecproxy.git;a=blob;f=tls.c;h=567a6be3491751cb70...
(i.e. client->ssl will be nulled before SSL_write() is called )
Thanks for finding and report this (with a fix, none the less!).
This looks like RADSECPROXY-47 [0] which was identified earlier but sadly never addressed.
[0] https://project.nordu.net/projects/RADSECPROXY/issues/RADSECPROXY-47
Oh, as it was marked 'fixed' I thought that the whole if(client->ssl==NULL) stuff was the (non-complete) fix of this bug :).
The attached patch against 1.6.9 fixes this while trying to be non-invasive to the code flow. There is still the chance of running SSL_write on a 'broken' tls connection but this is handled by openssl and 'cnt' anyway -- one could perhaps adjust the debug message then :).
How do you mean openssl and 'cnt' handle this?
IIRC (and from the behavior of the patched version running here) openssl notices on its own that the connection is gone and adjusts the return-value ('cnt'-variable) which we check anyway. This in mind the only thing we have to do is to wake up the sleeping process which is what I (try to) do with the sslalive variable.
An alternative fix might be to simply check for client->ssl != NULL before calling SSL_write() outside of the replyq->mutex lock. Did you consider this? Worth investigating?
The race would be closer but you could still have tlsserverrd() execute 'client->ssl=NULL' between your proposed check and the call to SSL_write(). If we want to go this way I think the only option is a new mutex per tlsserverrd-tlsserverwr pair that hinders setting ssl=NULL before executing SSL_write() with it. But I don't see why this should be necessary as openssl seems to be clever enough.
While writing this: Where exactly do we ssl_free() the content of client->ssl?
Steffen Klemer klemer@dfn.de wrote Mon, 20 Nov 2017 14:27:24 +0100:
This looks like RADSECPROXY-47 [0] which was identified earlier but sadly never addressed.
[0] https://project.nordu.net/projects/RADSECPROXY/issues/RADSECPROXY-47
Oh, as it was marked 'fixed' I thought that the whole if(client->ssl==NULL) stuff was the (non-complete) fix of this bug :).
It's not marked as fixed.
The attached patch against 1.6.9 fixes this while trying to be non-invasive to the code flow. There is still the chance of running SSL_write on a 'broken' tls connection but this is handled by openssl and 'cnt' anyway -- one could perhaps adjust the debug message then :).
How do you mean openssl and 'cnt' handle this?
IIRC (and from the behavior of the patched version running here) openssl notices on its own that the connection is gone and adjusts the return-value ('cnt'-variable) which we check anyway. This in mind the only thing we have to do is to wake up the sleeping process which is what I (try to) do with the sslalive variable.
Is this true for the case where we break out of the loop in tlsserverrd() because !radsrv(rq) too? I'd guess no.
An alternative fix might be to simply check for client->ssl != NULL before calling SSL_write() outside of the replyq->mutex lock. Did you consider this? Worth investigating?
The race would be closer but you could still have tlsserverrd() execute 'client->ssl=NULL' between your proposed check and the call to SSL_write(). If we want to go this way I think the only option is a new
You're right, of course.
mutex per tlsserverrd-tlsserverwr pair that hinders setting ssl=NULL before executing SSL_write() with it. But I don't see why this should be necessary as openssl seems to be clever enough.
Perhaps if we would explicitly close in the !radsrv(rq) case mentioned above.
While writing this: Where exactly do we ssl_free() the content of client->ssl?
Seems like tlsservernew() does that, once tlsserverrd() has returned, using another copy of the pointer to the SSL object (!).