Observations on SCTP and Linux
When I was still doing Linux kernel work with netfilter/iptables in the
early 2000s, I was somebody who actually regularly had a look at the
new RFCs that came out. So I saw the SCTP RFCs, SIGTRAN RFCs, SIP and
RTP, etc. all released during those years. I was quite happy to see
that for new protocols like SCTP and later DCCP, Linux quickly received
a mainline implementation.
Now most people won't have used SCTP so far, but it has been used as
the transport layer in a lot of telecom protocols for more than a decade
now. Virtually all protocols that have traditionally been spoken over
time-division multiplex E1/T1 links have been migrated over to
SCTP-based protocol stacks.
Working on various Open Source telecom related projects, I of course
come into contact with SCTP every so often. Particularly some years
back when implementing the Erlang SIGTRAN code in erlang/osmo_ss7, and
most recently with the introduction of libosmo-sigtran with its
OsmoSTP, both part of the libosmo-sccp repository.
I've also had to work with various proprietary telecom equipment over
the years. Whether that's some eNodeB hardware from a large brand
telecom supplier, or whether it's an MSC of some other vendor. And they
all had one thing in common: Nobody seemed to use the Linux kernel SCTP
code. They all used proprietary implementations in userspace, using RAW
sockets on the kernel interface.
I always found this quite odd, knowing that this is the route that you
have to take on proprietary OSs without native SCTP support, such as
Windows. But on Linux? Why? Based on rumors, people find the Linux
SCTP implementation not mature enough, but hard evidence is hard to come
by.
As much as it pains me to say this, the kind of Linux SCTP bugs I have
seen within the scope of our work on Osmocom seem to hint that there is
at least some truth to this (see e.g.
https://bugzilla.redhat.com/show_bug.cgi?id=1308360 or
https://bugzilla.redhat.com/show_bug.cgi?id=1308362).
Sure, software always has bugs and will have bugs. But we at Osmocom
are 10-15 years "late" with our implementations of higher-layer
protocols compared to what the mainstream telecom industry does. So if
we find something, and we find it already during R&D of some userspace
code, not even under load or in production, then that seems a bit
unsettling.
With all their market power and plenty of Linux-based devices in the
telecom sphere, one would have expected those large telecom suppliers to
invest in improving the mainline Linux SCTP code. Why did none of them
do so? I mean, they all use the kernel's UDP and TCP, so the model works
for most of the other network protocols in the kernel, but why not for
SCTP? I guess it comes back to a fundamental lack of understanding of
how open source development works: that it is something that the given
industry/user base must invest in jointly.
The latest discovered bug
During the last months, I have been implementing SCCP, SUA, M3UA and
OsmoSTP (a Signal Transfer Point). They were required for an effort to
add 3GPP compliant A-over-IP to OsmoBSC and OsmoMSC.
For quite some time I was seeing some erratic behavior when at some
point the STP would not receive/process a given message sent by one of
the clients (ASPs) connected. I tried to ignore the problem initially
until the code matured more and more, but the problems remained.
It became even more obvious when using Michael Tuexen's m3ua-testtool,
where sometimes even the most basic test cases, consisting of sending
and receiving a single pair of messages like ASPUP -> ASPUP_ACK, were
failing. And when the test case was re-tried, the problem often
disappeared.
Also, whenever I tried to observe what was happening by means of strace,
the problem would disappear completely and never re-appear until strace
was detached.
Of course, given that I've written several thousands of lines of new
code, it was clear to me that the bug must be in my code. Yesterday I
was finally prepared to accept that it might actually be a Linux SCTP
bug. Not being able to reproduce that problem on a FreeBSD VM also
pointed clearly in this direction.
Now I could simply have collected some information and filed a bug
report (which some kernel hackers at RedHat have thankfully invited me
to do!), but I thought my use case was too complex. You would have to
compile a dozen different Osmocom libraries, configure the STP, run
the Scheme-language m3ua-testtool in Guile, etc. - I guess nobody
would have bothered to go that far.
So today I tried to implement a test case that reproduced the problem in
plain C, without any external dependencies. And for many hours, I
couldn't make the bug show up. I tried to be as close as possible to
what was happening in OsmoSTP: I used non-blocking mode on client and
server, used the SCTP_NODELAY socket option, used the sctp_recvmsg()
library wrapper to receive events, but the bug was not reproducible.
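
For readers who have not touched the SCTP socket API, the kind of setup
described above can be sketched roughly as follows. This is only an
illustration, not the actual test program: the function names
set_nonblocking() and sctp_socket_nonblock_nodelay() are made up here,
and the choice of a one-to-one (SOCK_STREAM) IPv4 socket is an
assumption. It uses only the kernel header linux/sctp.h, so it does not
need the lksctp-tools library.

```c
#include <fcntl.h>
#include <netinet/in.h>   /* IPPROTO_SCTP */
#include <sys/socket.h>
#include <unistd.h>
#include <linux/sctp.h>   /* SCTP_NODELAY */

/* Hypothetical helper: put any file descriptor into non-blocking mode.
 * Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Hypothetical helper: create a one-to-one style SCTP socket that is
 * non-blocking and has SCTP_NODELAY set.
 * Returns the file descriptor, or -1 on error. */
int sctp_socket_nonblock_nodelay(void)
{
	int on = 1;
	int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);

	if (fd < 0)
		return -1; /* e.g. kernel built without SCTP support */

	if (set_nonblocking(fd) < 0 ||
	    setsockopt(fd, IPPROTO_SCTP, SCTP_NODELAY, &on, sizeof(on)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A client would then connect() the returned descriptor and a server
would bind() and listen() on it, exactly as with TCP. On a system
without SCTP support the second helper simply returns -1.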
Some hours later, it became clear that there was one setsockopt() in
OsmoSTP (actually, libosmo-netif) which enabled all existing SCTP
events. I did this at the time to make sure OsmoSTP has the maximum
insight possible into what's happening on the SCTP transport layer, such
as address fail-overs and the like.
As it turned out, adding that setsockopt for SCTP_EVENTS to my test code
made the problem reproducible. After playing around with the flags,
it seems that enabling the SENDER_DRY_EVENT flag makes the bug appear.
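
To make the event subscription concrete, here is a minimal sketch of
what enabling the SCTP notifications with or without the sender-dry
event could look like. It is not the libosmo-netif code and not the
reproducer linked below: the helper names fill_sctp_events() and
subscribe_sctp_events() are hypothetical, and only those struct fields
that have been in linux/sctp.h for a long time are set.

```c
#include <string.h>
#include <netinet/in.h>   /* IPPROTO_SCTP */
#include <sys/socket.h>
#include <linux/sctp.h>   /* struct sctp_event_subscribe, SCTP_EVENTS */

/* Hypothetical helper: build an event mask with the classic SCTP
 * notifications enabled; the sender-dry event is optional. */
void fill_sctp_events(struct sctp_event_subscribe *ev, int want_sender_dry)
{
	memset(ev, 0, sizeof(*ev));
	ev->sctp_data_io_event = 1;
	ev->sctp_association_event = 1;
	ev->sctp_address_event = 1;          /* address fail-overs etc. */
	ev->sctp_send_failure_event = 1;
	ev->sctp_peer_error_event = 1;
	ev->sctp_shutdown_event = 1;
	ev->sctp_partial_delivery_event = 1;
	ev->sctp_adaptation_layer_event = 1;
	ev->sctp_authentication_event = 1;
	/* The event whose subscription made the problem reproducible */
	ev->sctp_sender_dry_event = want_sender_dry ? 1 : 0;
}

/* Hypothetical helper: apply the mask to an SCTP socket.
 * Returns 0 on success, -1 on error (as setsockopt() does). */
int subscribe_sctp_events(int fd, int want_sender_dry)
{
	struct sctp_event_subscribe ev;

	fill_sctp_events(&ev, want_sender_dry);
	return setsockopt(fd, IPPROTO_SCTP, SCTP_EVENTS, &ev, sizeof(ev));
}
```

Calling subscribe_sctp_events(fd, 1) corresponds to the "everything
enabled" situation described above, while subscribe_sctp_events(fd, 0)
leaves out only the sender-dry notification. One caveat, stated as an
assumption about the build environment: if the installed header is
newer than the running kernel, the struct may be larger than the kernel
accepts and the setsockopt() may fail.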
You can find my detailed report about this issue in
https://bugzilla.redhat.com/show_bug.cgi?id=1442784 and a program to
reproduce the issue at
http://people.osmocom.org/laforge/sctp-nonblock/sctp-dry-event.c
Inside the Osmocom world, luckily we can live without the
SENDER_DRY_EVENT and a corresponding work-around has been submitted and
merged as https://gerrit.osmocom.org/#/c/2386/
With that work-around in place, suddenly all the m3ua-testtool and
sua-testtool test cases are reliably green (PASSED) and OsmoSTP works
more smoothly, too.
What do we learn from this?
Free Software in the Telecom sphere is getting too little attention.
This is true even for those small portions of telecom relevant protocols
that ended up in the kernel, like SCTP or more recently the GTP module I
co-authored. They are getting too little attention in development, even
less attention in maintenance, and people seem to focus more on
not using it, rather than fixing and maintaining what is there.
It makes me really sad to see this. Telecoms is such a massive
industry, with billions upon billions of revenue for the classic telecom
equipment vendors. Surely, they would be able to co-invest in some
basic infrastructure like proper and reliable testing / continuous
integration for SCTP. More recently, we see millions and more millions
of VC cash burned by buzzword-flinging companies doing "NFV" and
"SDN". But they would rather reimplement network stacks in userspace
than fix, complete and test those few telecom infrastructure components
which we have so far, like the SCTP protocol :(
Where are the contributions to open source telecom parts from Ericsson,
Nokia (formerly NSN), Huawei and the like? I'm not even dreaming about
the actual applications / network elements, but merely the maintenance
of something as basic as SCTP. To be fair, Motorola was involved early
on in the Linux SCTP code, and Huawei contributed a long series of fixes
in 2013/2014. But that's not the kind of long-term maintenance
contribution that one would normally expect from the primary interest
group in SCTP.
Finally, let me thank the Linux SCTP maintainers. I'm not
complaining about them! They're doing a great job, given the arcane code
base and the fact that they are not working for a company that has
SCTP-based products as their core business. I'm sure they would love
more support and contributions from the Telecom world, too.