Hey list,
at the very busy (many threads, many packets, 8 cores) DFN radsecproxy we recently got some crashes looking like
#0 0x00002aeb05fffe54 in SSL_write (s=0x0, buf=0x2aeb482a1d00, num=20) at ssl_lib.c:989 No locals. #1 0x000000000040ce3c in tlsserverwr (arg=0x2aeb48177620) at tls.c:339 cnt = 0 error = <optimized out> client = 0x2aeb48177620 replyq = 0x2aeb48291e60 reply = 0x2aeb482c2190 #2 0x00002aeb06650064 in start_thread (arg=0x2aeb3212d700) at pthread_create.c:309 __res = <optimized out> pd = 0x2aeb3212d700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {47189645776640, 1280935663696751663, 0, 47190015195840, 15, 47189645776640, 4904604809065540655, 4904632170130975791}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> __PRETTY_FUNCTION__ = "start_thread"
I think this is a race condition among 2 threads in the lines
https://git.nordu.net/?p=radsecproxy.git;a=blob;f=tls.c;h=567a6be3491751cb70... and https://git.nordu.net/?p=radsecproxy.git;a=blob;f=tls.c;h=567a6be3491751cb70...
(i.e. client->ssl will be nulled before SSL_write() is called )
The attached patch against 1.6.9 fixes this while trying to be non-invasive to the code flow. There is still the chance of running SSL_write on a 'broken' tls connection but this is handled by openssl and 'cnt' anyway -- one could perhaps adjust the debug message then :).
The same problem exists for the dtls-case for which I don't have a test case at the moment but a patch could look exactly the same.
https://git.nordu.net/?p=radsecproxy.git;a=blob;f=dtls.c;h=f8660925ab28caef6...
/Steffen