Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in error printing in ssl_server2.c #1287

Merged

Conversation

mpg
Copy link
Contributor

@mpg mpg commented Jan 22, 2018

Description

Fix a race condition in ssl_server2.c that caused spurious failures of ssl-opt.sh especially on hghly-loaded CI machines where it is likely for a process to not get a slice of CPU often enough.

This is part 2 of the replacement for #1269 - addressing the "incomplete server logs" issue.

I believe this is not worth a ChangeLog entry as this is hardly visible / of importance to users other than us.

Status

READY

Requires Backporting

Yes
[x] Mbed TLS 2.1 #1290
[x] Mbed TLS 1.3 #1291

Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right: it's both too much and too little. I don't see any reason to treat the two places where received_sigterm is checked differently.

@@ -2020,7 +2020,6 @@ int main( int argc, char *argv[] )
if( received_sigterm )
{
mbedtls_printf( " interrupted by SIGTERM\n" );
ret = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change to never clear ret, for a normal test run (no failures, no timing weirdness), the server logs now contain

interrupted by SIGTERM
Last error was: -0x45 - NET - The context is invalid, eg because it was free()ed

Although the message mentions an error, this error was intended: the connection is supposed to have been closed mid-connection. I think it would be best to keep suppressing the MBEDTLS_ERR_NET_INVALID_CONTEXT here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under what circumstances do you get this message? It don't this this happens when running ssl-opt.sh because otherwise Default would fail. If it's when running the server and killing in manually, then I would tend not to care: I assume a human can make sense of this error, so I chose to care only about stupid scripts - or more precisely, only those stupid scripts that are part of our test system.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, a motivation for not clearing ret except after net_accept() is that, since I don't expect SIGTERM to be received at any other place than net_accept() under normal (successful) testing circumstances, I'm afraid reseting it in other places could hide errors. Perhaps it doesn't, but I couldn't come up with a strong argument to that effect.

As much as I hate false positives on the CI, the last thing I hate more is false negatives. So while I'm not sure reseting ret here could hide errors, I don't want to take any unnecessary risk.

Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm Jan 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Run ssl-opt.sh -p and check where you get interrupted by SIGTERM vs interrupted by signal. Under Default (and most other test cases), the signal is caught at the interrupted by signal location. interrupted by SIGTERM is only for cases where the client disconnects unexpectedly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpg I'm not sure we're understanding each other here. This is what looks right to me. No clearing of errors in general, but do clear an the error resulting from interrupting the network operation, and use the same treatment in both places where a signal is detected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gilles: thanks for letting me know about ssl-opt.sh -p, I wasn't aware of that option, and it help fix my misconception: signals can indeed be received outside of net_accept() during normal (successful) testing.

Also, regarding my fear that reseting ret when the signal is received outside accept(), it looks like currently we don't have tests that ensure that the server does not see the client disconnecting unexpectedly, and if in the future we want to add some, we can still do so by checking for the "interrupted" message rather than ret.

So I'll adopt your fix and take the occasion to enhance the messages.

@@ -2020,7 +2020,6 @@ int main( int argc, char *argv[] )
if( received_sigterm )
{
mbedtls_printf( " interrupted by SIGTERM\n" );
ret = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is another place in the code where received_sigterm is checked. Why ignore ret here, but reset it in the other place, after the call to mbedtls_net_accept?

I think that resetting ret can hide bugs in the server.

When ret = 0 is removed from the post-signal-handler code after mbedtls_net_accept, the error under normal conditions is

  . Waiting for a remote connection ... interrupted by signal
Last error was: -0x4A - NET - Could not accept the incoming connection

This is caught by the “Default” tests which check that the server does not report any error.

I think that in this place, ret should be reset only if it was MBEDTLS_ERR_NET_ACCEPT_FAILED.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because net_accept() is the only place where we expect to be interrupted by a SIGTERM when running ssl-opt.sh and I chose to care only about that.

I didn't add the test to reset only if ret was MBEDTLS_ERR_NET_ACCEPT_FAILED because I thought it was the only possible return code of net_accept(), but I failed to check and now I realise I was wrong, at least with UDP it can return other things, so I'll add that check.

Unverified

This commit is not signed, but one or more authors requires that any commit attributed to them is signed.
The race goes this way:
1. ssl_recv() succeeds (ie no signal received yet)
2. processing the message leads to aborting handshake with ret != 0
3. reset ret if we were signaled
4. print error if ret is still non-zero
5. go back to net_accept() which can be interrupted by a signal
We print the error message only if the signal is received between steps 3 and
5, not when it arrives between steps 1 and 3.

This can cause failures in ssl-opt.sh where we check for the presence of "Last
error was..." in the server's output: if we perform step 2, the client will be
notified and exit, then ssl-opt.sh will send SIGTERM to the server, but if it
didn't get a chance to run and pass step 3 in the meantime, we're in trouble.

The purpose of step 3 was to avoid spurious "Last error" messages in the
output so that ssl-opt.sh can check for a successful run by the absence of
that message. However, it is enough to suppress that message when the last
error we get is the one we expect from being interrupted by a signal - doing
more could hide real errors.

Also, improve the messages printed when interrupted to make it easier to
distinguish the two cases - this could be used in a testing script wanted to
check that the server doesn't see the client as disconnecting unexpectedly.
@mpg mpg force-pushed the iotssl-2006-incomplete-server-logs branch from 950e35d to 4fa619f Compare January 25, 2018 10:41
@mpg
Copy link
Contributor Author

mpg commented Jan 25, 2018

@gilles-peskine-arm I've addressed your comment. Since this is a 1-commit, < 10-lines changes, I assumed review history was not that important so I forced-pushed a new version. Please review again.

Note: backports were updated as well, so they're ready for review too.

Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not completely sure that this solves the race condition, but in any case it's a good change: the previous code definitely could have been hiding bugs by suppressing errors too eagerly.

Copy link
Contributor

@Patater Patater left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mpg mpg added approved Design and code approved - may be waiting for CI or backports and removed needs-review Every commit must be reviewed by at least two team members, labels Jan 29, 2018
@Patater Patater merged commit 4fa619f into Mbed-TLS:development Jan 29, 2018
@mpg mpg deleted the iotssl-2006-incomplete-server-logs branch September 10, 2019 10:32
davidhorstmann-arm added a commit that referenced this pull request Aug 30, 2024

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Mbedtls 2.28.9 mergeback
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Design and code approved - may be waiting for CI or backports bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants