-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race condition in error printing in ssl_server2.c #1287
Fix race condition in error printing in ssl_server2.c #1287
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look right: it's both too much and too little. I don't see any reason to treat the two places where received_sigterm
is checked differently.
programs/ssl/ssl_server2.c
Outdated
@@ -2020,7 +2020,6 @@ int main( int argc, char *argv[] ) | |||
if( received_sigterm ) | |||
{ | |||
mbedtls_printf( " interrupted by SIGTERM\n" ); | |||
ret = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change to never clear ret
, for a normal test run (no failures, no timing weirdness), the server logs now contain
interrupted by SIGTERM
Last error was: -0x45 - NET - The context is invalid, eg because it was free()ed
Although the message mentions an error, this error was intended: the connection is supposed to have been closed mid-connection. I think it would be best to keep suppressing the MBEDTLS_ERR_NET_INVALID_CONTEXT
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under what circumstances do you get this message? It don't this this happens when running ssl-opt.sh
because otherwise Default
would fail. If it's when running the server and killing in manually, then I would tend not to care: I assume a human can make sense of this error, so I chose to care only about stupid scripts - or more precisely, only those stupid scripts that are part of our test system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, a motivation for not clearing ret
except after net_accept()
is that, since I don't expect SIGTERM
to be received at any other place than net_accept()
under normal (successful) testing circumstances, I'm afraid reseting it in other places could hide errors. Perhaps it doesn't, but I couldn't come up with a strong argument to that effect.
As much as I hate false positives on the CI, the last thing I hate more is false negatives. So while I'm not sure reseting ret
here could hide errors, I don't want to take any unnecessary risk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Run ssl-opt.sh -p
and check where you get interrupted by SIGTERM
vs interrupted by signal
. Under Default
(and most other test cases), the signal is caught at the interrupted by signal
location. interrupted by SIGTERM
is only for cases where the client disconnects unexpectedly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gilles: thanks for letting me know about ssl-opt.sh -p
, I wasn't aware of that option, and it help fix my misconception: signals can indeed be received outside of net_accept()
during normal (successful) testing.
Also, regarding my fear that reseting ret
when the signal is received outside accept()
, it looks like currently we don't have tests that ensure that the server does not see the client disconnecting unexpectedly, and if in the future we want to add some, we can still do so by checking for the "interrupted" message rather than ret
.
So I'll adopt your fix and take the occasion to enhance the messages.
programs/ssl/ssl_server2.c
Outdated
@@ -2020,7 +2020,6 @@ int main( int argc, char *argv[] ) | |||
if( received_sigterm ) | |||
{ | |||
mbedtls_printf( " interrupted by SIGTERM\n" ); | |||
ret = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is another place in the code where received_sigterm
is checked. Why ignore ret
here, but reset it in the other place, after the call to mbedtls_net_accept
?
I think that resetting ret
can hide bugs in the server.
When ret = 0
is removed from the post-signal-handler code after mbedtls_net_accept
, the error under normal conditions is
. Waiting for a remote connection ... interrupted by signal
Last error was: -0x4A - NET - Could not accept the incoming connection
This is caught by the “Default” tests which check that the server does not report any error.
I think that in this place, ret
should be reset only if it was MBEDTLS_ERR_NET_ACCEPT_FAILED
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because net_accept()
is the only place where we expect to be interrupted by a SIGTERM
when running ssl-opt.sh
and I chose to care only about that.
I didn't add the test to reset only if ret
was MBEDTLS_ERR_NET_ACCEPT_FAILED
because I thought it was the only possible return code of net_accept()
, but I failed to check and now I realise I was wrong, at least with UDP it can return other things, so I'll add that check.
The race goes this way: 1. ssl_recv() succeeds (ie no signal received yet) 2. processing the message leads to aborting handshake with ret != 0 3. reset ret if we were signaled 4. print error if ret is still non-zero 5. go back to net_accept() which can be interrupted by a signal We print the error message only if the signal is received between steps 3 and 5, not when it arrives between steps 1 and 3. This can cause failures in ssl-opt.sh where we check for the presence of "Last error was..." in the server's output: if we perform step 2, the client will be notified and exit, then ssl-opt.sh will send SIGTERM to the server, but if it didn't get a chance to run and pass step 3 in the meantime, we're in trouble. The purpose of step 3 was to avoid spurious "Last error" messages in the output so that ssl-opt.sh can check for a successful run by the absence of that message. However, it is enough to suppress that message when the last error we get is the one we expect from being interrupted by a signal - doing more could hide real errors. Also, improve the messages printed when interrupted to make it easier to distinguish the two cases - this could be used in a testing script wanted to check that the server doesn't see the client as disconnecting unexpectedly.
950e35d
to
4fa619f
Compare
@gilles-peskine-arm I've addressed your comment. Since this is a 1-commit, < 10-lines changes, I assumed review history was not that important so I forced-pushed a new version. Please review again. Note: backports were updated as well, so they're ready for review too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not completely sure that this solves the race condition, but in any case it's a good change: the previous code definitely could have been hiding bugs by suppressing errors too eagerly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Mbedtls 2.28.9 mergeback
Description
Fix a race condition in
ssl_server2.c
that caused spurious failures ofssl-opt.sh
especially on hghly-loaded CI machines where it is likely for a process to not get a slice of CPU often enough.This is part 2 of the replacement for #1269 - addressing the "incomplete server logs" issue.
I believe this is not worth a ChangeLog entry as this is hardly visible / of importance to users other than us.
Status
READY
Requires Backporting
Yes
[x] Mbed TLS 2.1 #1290
[x] Mbed TLS 1.3 #1291