Hey Linus.
Thanks for your feedback.
I think this is a race condition among 2 threads in the lines
https://git.nordu.net/?p=radsecproxy.git;a=blob;f=tls.c;h=567a6be3491751cb70... https://git.nordu.net/?p=radsecproxy.git;a=blob;f=tls.c;h=567a6be3491751cb70...
(i.e. client->ssl will be nulled before SSL_write() is called )
Thanks for finding and report this (with a fix, none the less!).
This looks like RADSECPROXY-47 [0] which was identified earlier but sadly never addressed.
[0] https://project.nordu.net/projects/RADSECPROXY/issues/RADSECPROXY-47
Oh, as it was marked 'fixed' I thought that the whole if(client->ssl==NULL) stuff was the (non-complete) fix of this bug :).
The attached patch against 1.6.9 fixes this while trying to be non-invasive to the code flow. There is still the chance of running SSL_write on a 'broken' tls connection but this is handled by openssl and 'cnt' anyway -- one could perhaps adjust the debug message then :).
How do you mean openssl and 'cnt' handle this?
IIRC (and from the behavior of the patched version running here) openssl notices on its own that the connection is gone and adjusts the return-value ('cnt'-variable) which we check anyway. This in mind the only thing we have to do is to wake up the sleeping process which is what I (try to) do with the sslalive variable.
An alternative fix might be to simply check for client->ssl != NULL before calling SSL_write() outside of the replyq->mutex lock. Did you consider this? Worth investigating?
The race would be closer but you could still have tlsserverrd() execute 'client->ssl=NULL' between your proposed check and the call to SSL_write(). If we want to go this way I think the only option is a new mutex per tlsserverrd-tlsserverwr pair that hinders setting ssl=NULL before executing SSL_write() with it. But I don't see why this should be necessary as openssl seems to be clever enough.
While writing this: Where exactly do we ssl_free() the content of client->ssl?