Fix race condition in error printing in ssl_server2.c #1287

mpg · 2018-01-22T10:14:34Z

Description

Fix a race condition in ssl_server2.c that caused spurious failures of ssl-opt.sh especially on hghly-loaded CI machines where it is likely for a process to not get a slice of CPU often enough.

This is part 2 of the replacement for #1269 - addressing the "incomplete server logs" issue.

I believe this is not worth a ChangeLog entry as this is hardly visible / of importance to users other than us.

Status

READY

Requires Backporting

Yes
[x] Mbed TLS 2.1 #1290
[x] Mbed TLS 1.3 #1291

gilles-peskine-arm

This doesn't look right: it's both too much and too little. I don't see any reason to treat the two places where received_sigterm is checked differently.

gilles-peskine-arm · 2018-01-22T15:31:52Z

programs/ssl/ssl_server2.c

@@ -2020,7 +2020,6 @@ int main( int argc, char *argv[] )
    if( received_sigterm )
    {
        mbedtls_printf( " interrupted by SIGTERM\n" );
-        ret = 0;


With this change to never clear ret, for a normal test run (no failures, no timing weirdness), the server logs now contain

interrupted by SIGTERM Last error was: -0x45 - NET - The context is invalid, eg because it was free()ed

Although the message mentions an error, this error was intended: the connection is supposed to have been closed mid-connection. I think it would be best to keep suppressing the MBEDTLS_ERR_NET_INVALID_CONTEXT here.

Under what circumstances do you get this message? It don't this this happens when running ssl-opt.sh because otherwise Default would fail. If it's when running the server and killing in manually, then I would tend not to care: I assume a human can make sense of this error, so I chose to care only about stupid scripts - or more precisely, only those stupid scripts that are part of our test system.

Also, a motivation for not clearing ret except after net_accept() is that, since I don't expect SIGTERM to be received at any other place than net_accept() under normal (successful) testing circumstances, I'm afraid reseting it in other places could hide errors. Perhaps it doesn't, but I couldn't come up with a strong argument to that effect.

As much as I hate false positives on the CI, the last thing I hate more is false negatives. So while I'm not sure reseting ret here could hide errors, I don't want to take any unnecessary risk.

Run ssl-opt.sh -p and check where you get interrupted by SIGTERM vs interrupted by signal. Under Default (and most other test cases), the signal is caught at the interrupted by signal location. interrupted by SIGTERM is only for cases where the client disconnects unexpectedly.

@mpg I'm not sure we're understanding each other here. This is what looks right to me. No clearing of errors in general, but do clear an the error resulting from interrupting the network operation, and use the same treatment in both places where a signal is detected.

@gilles: thanks for letting me know about ssl-opt.sh -p, I wasn't aware of that option, and it help fix my misconception: signals can indeed be received outside of net_accept() during normal (successful) testing.

Also, regarding my fear that reseting ret when the signal is received outside accept(), it looks like currently we don't have tests that ensure that the server does not see the client disconnecting unexpectedly, and if in the future we want to add some, we can still do so by checking for the "interrupted" message rather than ret.

So I'll adopt your fix and take the occasion to enhance the messages.

gilles-peskine-arm · 2018-01-22T15:35:01Z

programs/ssl/ssl_server2.c

@@ -2020,7 +2020,6 @@ int main( int argc, char *argv[] )
    if( received_sigterm )
    {
        mbedtls_printf( " interrupted by SIGTERM\n" );
-        ret = 0;


There is another place in the code where received_sigterm is checked. Why ignore ret here, but reset it in the other place, after the call to mbedtls_net_accept?

I think that resetting ret can hide bugs in the server.

When ret = 0 is removed from the post-signal-handler code after mbedtls_net_accept, the error under normal conditions is

. Waiting for a remote connection ... interrupted by signal Last error was: -0x4A - NET - Could not accept the incoming connection

This is caught by the “Default” tests which check that the server does not report any error.

I think that in this place, ret should be reset only if it was MBEDTLS_ERR_NET_ACCEPT_FAILED.

Because net_accept() is the only place where we expect to be interrupted by a SIGTERM when running ssl-opt.sh and I chose to care only about that.

I didn't add the test to reset only if ret was MBEDTLS_ERR_NET_ACCEPT_FAILED because I thought it was the only possible return code of net_accept(), but I failed to check and now I realise I was wrong, at least with UDP it can return other things, so I'll add that check.

The race goes this way: 1. ssl_recv() succeeds (ie no signal received yet) 2. processing the message leads to aborting handshake with ret != 0 3. reset ret if we were signaled 4. print error if ret is still non-zero 5. go back to net_accept() which can be interrupted by a signal We print the error message only if the signal is received between steps 3 and 5, not when it arrives between steps 1 and 3. This can cause failures in ssl-opt.sh where we check for the presence of "Last error was..." in the server's output: if we perform step 2, the client will be notified and exit, then ssl-opt.sh will send SIGTERM to the server, but if it didn't get a chance to run and pass step 3 in the meantime, we're in trouble. The purpose of step 3 was to avoid spurious "Last error" messages in the output so that ssl-opt.sh can check for a successful run by the absence of that message. However, it is enough to suppress that message when the last error we get is the one we expect from being interrupted by a signal - doing more could hide real errors. Also, improve the messages printed when interrupted to make it easier to distinguish the two cases - this could be used in a testing script wanted to check that the server doesn't see the client as disconnecting unexpectedly.

mpg · 2018-01-25T10:55:54Z

@gilles-peskine-arm I've addressed your comment. Since this is a 1-commit, < 10-lines changes, I assumed review history was not that important so I forced-pushed a new version. Please review again.

Note: backports were updated as well, so they're ready for review too.

gilles-peskine-arm

I'm not completely sure that this solves the race condition, but in any case it's a good change: the previous code definitely could have been hiding bugs by suppressing errors too eagerly.

Patater

LGTM

Mbedtls 2.28.9 mergeback

mpg requested review from gilles-peskine-arm and Patater January 22, 2018 10:14

mpg added bug mbed TLS team needs-review labels Jan 22, 2018

This was referenced Jan 22, 2018

Be more patient in SSL tests #1269

Closed

Backport 2.1: Iotssl 2005 slow start #1288

Merged

Backport 2.1: Fix race condition in error printing in ssl_server2.c #1290

Merged

Backport 1.3 : Fix race condition in error printing in ssl_server2.c #1291

Merged

gilles-peskine-arm requested changes Jan 22, 2018

View reviewed changes

mpg force-pushed the iotssl-2006-incomplete-server-logs branch from 950e35d to 4fa619f Compare January 25, 2018 10:41

gilles-peskine-arm approved these changes Jan 25, 2018

View reviewed changes

Patater approved these changes Jan 25, 2018

View reviewed changes

mpg added approved and removed needs-review labels Jan 29, 2018

Patater merged commit 4fa619f into Mbed-TLS:development Jan 29, 2018

mpg deleted the iotssl-2006-incomplete-server-logs branch September 10, 2019 10:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix race condition in error printing in ssl_server2.c #1287

Fix race condition in error printing in ssl_server2.c #1287

mpg commented Jan 22, 2018 •

edited

Loading

gilles-peskine-arm left a comment

gilles-peskine-arm Jan 22, 2018

mpg Jan 23, 2018

mpg Jan 23, 2018

gilles-peskine-arm Jan 23, 2018 •

edited

Loading

gilles-peskine-arm Jan 24, 2018

mpg Jan 25, 2018

gilles-peskine-arm Jan 22, 2018

mpg Jan 23, 2018

mpg commented Jan 25, 2018 •

edited

Loading

gilles-peskine-arm left a comment

Patater left a comment

Fix race condition in error printing in ssl_server2.c #1287

Fix race condition in error printing in ssl_server2.c #1287

Conversation

mpg commented Jan 22, 2018 • edited Loading

Description

Status

Requires Backporting

gilles-peskine-arm left a comment

Choose a reason for hiding this comment

gilles-peskine-arm Jan 22, 2018

Choose a reason for hiding this comment

mpg Jan 23, 2018

Choose a reason for hiding this comment

mpg Jan 23, 2018

Choose a reason for hiding this comment

gilles-peskine-arm Jan 23, 2018 • edited Loading

Choose a reason for hiding this comment

gilles-peskine-arm Jan 24, 2018

Choose a reason for hiding this comment

mpg Jan 25, 2018

Choose a reason for hiding this comment

gilles-peskine-arm Jan 22, 2018

Choose a reason for hiding this comment

mpg Jan 23, 2018

Choose a reason for hiding this comment

mpg commented Jan 25, 2018 • edited Loading

gilles-peskine-arm left a comment

Choose a reason for hiding this comment

Patater left a comment

Choose a reason for hiding this comment

mpg commented Jan 22, 2018 •

edited

Loading

gilles-peskine-arm Jan 23, 2018 •

edited

Loading

mpg commented Jan 25, 2018 •

edited

Loading