current git checkout crashes during peak time:
==============================================
Core was generated by `/usr/local/sbin/radsecproxy'.
Program terminated with signal SIGSEGV, Segmentation fault.
(gdb) set print pretty
(gdb) bt full
#0  0x00007fcca3c4f518 in __regexec (preg=preg@entry=0xf25fe8, string=string@entry=0x7fcc9801c4c0 "anonymous@charite.de", nmatch=nmatch@entry=0, pmatch=pmatch@entry=0x0, eflags=eflags@entry=0) at regexec.c:243
        err = <optimized out>
        start = 0
        length = 20
        dfa = <optimized out>
#1  0x00000000004057e7 in id2realm (realmlist=<optimized out>, id=id@entry=0x7fcc9801c4c0 "anonymous@charite.de") at radsecproxy.c:689
        entry = 0xf27370
        realm = 0xf25fd0
        subrealm = <optimized out>
#2  0x00000000004084d1 in findserver (realm=realm@entry=0x7fcca4c15ed8, username=username@entry=0x7fcc9800a410, acc=<optimized out>) at radsecproxy.c:1289
        srvconf = <optimized out>
        subrealm = <optimized out>
        server = 0x0
        id = 0x7fcc9801c4c0 "anonymous@charite.de"
#3  0x00000000004088e4 in radsrv (rq=rq@entry=0x7fcc98023fc0) at radsecproxy.c:1453
        msg = 0x7fcc98000a10
        attr = 0x7fcc9800a410
        userascii = 0x7fcc980216a0 "anonymous@charite.de"
        realm = 0x0
        to = 0x0
        from = 0x7fcc98010ba0
        ttlres = -1
        __func__ = "radsrv"
#4  0x000000000040fd04 in udpserverrd (arg=0xf273f0) at udp.c:282
        rq = 0x7fcc98023fc0
        sp = 0xf273f0
#5  0x00007fcca3f386ba in start_thread (arg=0x7fcca4c16700) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7fcca4c16700
        now = <optimized out>
        unwind_buf = {
          cancel_jmp_buf = {{
              jmp_buf = {140516914194176, 3582087618930932556, 0, 140731320876959, 140516914194880, 0, -3589535791242282164, -3589537871384937652},
              mask_was_saved = 0
            }},
          priv = {
            pad = {0x0, 0x0, 0x0, 0x0},
            data = {
              prev = 0x0,
              cleanup = 0x0,
              canceltype = 0
            }
          }
        }
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#6  0x00007fcca3c6e3dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.
* Ralf Hildebrandt Ralf.Hildebrandt@charite.de:
current git checkout crashes during peak time:
Config attached...
* Ralf Hildebrandt Ralf.Hildebrandt@charite.de:
current git checkout crashes during peak time:
OS: Ubuntu LTS 16.04
SSL: libssl1.0.0:amd64 1.0.2g-1ubuntu4.8 amd64  Secure Sockets Layer toolkit - shared libraries
Ralf Hildebrandt Ralf.Hildebrandt@charite.de wrote Fri, 7 Jul 2017 13:24:15 +0200:
current git checkout crashes during peak time:
This issue is being tracked in RADSECPROXY-77.
Ralf has been very helpful debugging this issue. I think we might've found it -- realm data structures are reference counted (also in a static configuration) but increasing and decreasing the count is not protected.
Two threads trying to increase the counter (id2realm()) simultaneously risk overwriting each other's update, resulting in the counter being one less than expected. This in turn makes the refcount go down to 0 too early (radsrv()), and the realm gets freed. After that it's just a matter of time before malloc() hands out the memory previously occupied by the realm and it gets overwritten.
This is consistent with the observed partial overwriting of the realm data structure and with the fact that it happens to only one of the two realms in Ralf's config. It's also consistent with the observation that this happens only once the frequency of requests is high enough for two requests to be handled _simultaneously_ in two threads running on two separate CPU cores.
Ralf is currently running with a patch that fixes this by taking a separate mutex before increasing or decreasing the reference count for a realm.
* Linus Nordberg linus@sunet.se:
Ralf is currently running with a patch that fixes this by taking a separate mutex before increasing or decreasing the reference count for a realm.
No crashes so far; I'd say we need to give this 2 more working days, but it sure seems to be crashing less than before (which was about 4 times a day).
* Ralf Hildebrandt Ralf.Hildebrandt@charite.de:
* Linus Nordberg linus@sunet.se:
Ralf is currently running with a patch that fixes this by taking a separate mutex before increasing or decreasing the reference count for a realm.
No crashes so far; I'd say we need to give this 2 more working days, but it sure seems to be crashing less than before (which was about 4 times a day).
No crashes today, either.
Ralf Hildebrandt Ralf.Hildebrandt@charite.de wrote Tue, 1 Aug 2017 15:34:16 +0200:
Ralf is currently running with a patch that fixes this by taking a separate mutex before increasing or decreasing the reference count for a realm.
No crashes so far; I'd say we need to give this 2 more working days, but it sure seems to be crashing less than before (which was about 4 times a day).
No crashes today, either.
Thanks for the update. I think it's time for radsecproxy-1.6.9.
Two things regarding this bug though.
Why didn't we hear about this until now? The offending code is far from new. Who else besides Ralf runs radsecproxy in a static configuration (i.e. no dynamicLookupCommand) on a multicore system and handles at least 10 requests/second? Would you mind grepping your logs for signs of crashes? 'createlistener' might be a good string to grep for.
I'm assuming that _reading_ a uint32_t without protection is going to be safe on all architectures we care about. Let me know if you think this is not true.
On 2 Aug 2017, at 09:58, Linus Nordberg linus@sunet.se wrote:
Ralf Hildebrandt Ralf.Hildebrandt@charite.de wrote Tue, 1 Aug 2017 15:34:16 +0200:
Ralf is currently running with a patch that fixes this by taking a separate mutex before increasing or decreasing the reference count for a realm.
No crashes so far; I'd say we need to give this 2 more working days, but it sure seems to be crashing less than before (which was about 4 times a day).
No crashes today, either.
Thanks for the update. I think it's time for radsecproxy-1.6.9.
Two things regarding this bug though.
Why didn't we hear about this until now?
Good question.
The offending code is far from new. Who else besides Ralf runs radsecproxy in a static configuration (i.e. no dynamicLookupCommand) on a multicore system and handles at least 10 requests/second?
In Germany, most of the eduroam federation members are already using radsecproxy. So I’m very surprised that there are no reports of crashes.
And we have a deadline of Dec. 1, 2017 to move all German eduroam federation members to the RadSec standard protocol, using radsecproxy or RADIATOR.
Finally, I’m very happy that you both found this bug, and I would be grateful if you could release 1.6.9 soon, ideally before the deadline. :-)
Would you mind grepping your logs for signs of crashes? 'createlistener' might be a good string to grep for.
I’ll forward your email to the German eduroam list, hopefully to find out more.
I'm assuming that _reading_ a uint32_t without protection is going to be safe on all architectures we care about. Let me know if you think this is not true.

_______________________________________________
radsecproxy mailing list
radsecproxy@lists.nordu.net
https://lists.nordu.net/listinfo/radsecproxy
Best regards Ralf
--
Verein zur Förderung eines Deutschen Forschungsnetzes e.V.
Alexanderplatz 1, D - 10178 Berlin
Tel.: 030 88 42 99 23  Fax: 030 88 42 99 70
http://www.dfn.de

Vorstand: Prof. Dr. Hans-Joachim Bungartz (Vorsitzender), Dr. Ulrike Gutheil, Dr. Rainer Bockholt
Geschäftsführung: Dr. Christian Grimm, Jochem Pattloch
paffrath paffrath@dfn.de wrote Wed, 2 Aug 2017 10:30:57 +0200:
And we have a deadline of Dec. 1, 2017 to move all German eduroam federation members to the RadSec standard protocol, using radsecproxy or RADIATOR.
Finally, I’m very happy that you both found this bug, and I would be grateful if you could release 1.6.9 soon, ideally before the deadline. :-)
I'm aiming for a release today.
Would you mind grepping your logs for signs of crashes? 'createlistener' might be a good string to grep for.
I’ll forward your email to the German eduroam list, hopefully to find out more.
Thanks. Please add the requirement that there need to be about 8 requests per second for one particular realm in order for this bug to show.
* paffrath paffrath@dfn.de:
Two things regarding this bug though.
Why didn't we hear about this until now?
Good question.
Maybe our setup is special? Is the number of RADIUS queries excessive? We're using Enterasys/Extreme hardware for our access points.
Maybe other brands use (more/better) caching?
The offending code is far from new. Who else besides Ralf runs radsecproxy in a static configuration (i.e. no dynamicLookupCommand) on a multicore system and handles at least 10 requests/second?
In Germany, most of the eduroam federation members are already using radsecproxy. So I’m very surprised that there are no reports of crashes.
Maybe nobody noticed. I was running radsecproxy from systemd (since that's what the distribution set up) and it was being restarted automatically.
* Ralf Hildebrandt Ralf.Hildebrandt@charite.de:
* paffrath paffrath@dfn.de:
Two things regarding this bug though.
Why didn't we hear about this until now?
Good question.
Maybe our setup is special? Is the number of RADIUS queries excessive? We're using Enterasys/Extreme hardware for our access points.
Maybe other brands use (more/better) caching?
Currently, we're running our radsecproxy host on a VMware VM with two virtual CPUs.