HEXONET CP availability issues
Incident Report for HEXONET
Postmortem

Issue Summary

Following up on our regular updates on status.hexonet.net, we have captured the following details to provide further insight to our dedicated and supportive customers. Thank you again for your understanding during the brief downtime.

Commencing on the morning of Thursday, April 28th 2022, 03:15 +0000 (UTC), the HEXONET Control Panel was only partially available.

This was caused by a bug that, under certain conditions, crashed and restarted a server when a punycode domain was entered in the Control Panel search bar. The suggestion engine modified the given term into an invalid input.

A solution was put in place and full availability was restored at 2022-04-28 06:30 +0000 (UTC).

Root Cause

A simple search for a punycode domain in the search on hexonet.net causes our search engine to modify the entered term in order to suggest alternatives. The library we use to convert punycode is among the most popular solutions and was even bundled with Node.js itself, where the bundled copy is now deprecated: https://www.npmjs.com/package/punycode

The library implements RFC 3492 (Punycode), the encoding used by RFC 3490 (IDNA). RFC 3490 states the following about the “ToUnicode” operation:

ToUnicode never fails. If any step fails, then the original input sequence is returned immediately in that step.

It turned out that the library's conversion does fail for a certain searched input. Our suggestion engine modified that term and added a seemingly fitting TLD. The resulting term was an invalid input and caused a server crash and restart. Strangely, this particular term, as returned by a punycode conversion, is the only one we found that triggers the bug. As such, the likelihood of triggering the bug was extremely small.

Our implementation has a retry mechanism and re-sends the query after a little while if it does not receive the search results. In this case it never received them, because the query itself crashed the server, so it kept retrying and crashing the server repeatedly.

Our assumption is that a user left their browser open and never received the search result, so the browser kept retrying and continuously crashed the server.
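
This report does not state whether the retry logic itself was changed. Purely as an illustration of how a client-side retry of this kind can be bounded, the following sketch caps the number of attempts and backs off between them. The names searchWithRetry and sendQuery, the option names, and the default values are hypothetical and are not taken from the Control Panel code.

```javascript
// Hypothetical client-side retry with an attempt cap and exponential backoff.
// `sendQuery` is an assumed async function that resolves with search results
// or rejects when the server does not answer.
async function searchWithRetry(sendQuery, term, options = {}) {
  const maxAttempts = options.maxAttempts ?? 3;   // assumed default
  const baseDelayMs = options.baseDelayMs ?? 1000; // assumed default
  const sleep =
    options.sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));

  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await sendQuery(term);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Wait 1x, 2x, 4x ... the base delay before the next attempt.
        await sleep(baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  // Give up instead of retrying forever.
  throw lastError;
}
```

With a cap like this, a query that always fails is sent a fixed number of times and then surfaces an error to the caller, instead of being re-sent for as long as the browser tab stays open.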

Our two Control Panel servers share a server for session management, and our load balancer distributes traffic evenly between the two, so the repeated query crashed either server at random.

Each query extends the user session, and the session is managed on the server that was not affected by the error, so that user's session kept being extended rather than expiring.

In sum, the bug could repeatedly crash the Control Panel only under rare conditions and only in combination with this particular Control Panel setup.

Remediation and Prevention

After a preliminary fix, a full yet relatively simple patch was deployed to catch errors occurring after the conversion of punycode domains by our search engine.
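
As a rough illustration of what catching such errors can look like, the sketch below wraps a punycode-to-Unicode converter so that a thrown error falls back to the original input, in the spirit of the RFC statement quoted above. The helper name safeToUnicode is hypothetical, and the converter is passed in as a parameter. This is not the actual patch.

```javascript
// Hypothetical helper, shown only as a sketch of the general technique.
// `convert` is any function that turns a punycode domain into Unicode and
// may throw on malformed input.
function safeToUnicode(convert, domain) {
  try {
    return convert(domain);
  } catch (err) {
    // Mirror the RFC 3490 expectation: on failure, return the input unchanged
    // so that one bad search term cannot take the whole process down.
    return domain;
  }
}
```

With the npm package linked above installed, a call might look like safeToUnicode(punycode.toUnicode, term), where term is the string produced by the suggestion step.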

Timeline and Detailed Description of Impact

Thursday, April 28th 2022

03:15 +0000 (UTC)

We noticed, and received the first reports, that the HEXONET Control Panel was only partially available.

05:03 +0000 (UTC)

Analysis started.

06:30 +0000 (UTC)

A fix was put in place. Full availability of the HEXONET Control Panel was restored.

No Impact on Data Security

Data integrity was fully maintained. At no point in time was any data compromised.

__

As always, we remain dedicated to your success and hope this background information supports your understanding of all that transpired. If you have any questions or would like to connect with our team, please contact us at help@hexonet.support.

Your HEXONET Team

Posted May 09, 2022 - 07:57 UTC

Resolved
As previously posted, a mitigation in the form of a fix has been put in place and our CP is fully operational again.

A deeper analysis of the underlying issue is ongoing.
Also, we're working to ensure the particular issue will not come up again.
A full incident report will be provided afterwards.

Until then we regard the incident as resolved.

Many thanks for your patience, and please excuse any inconvenience caused.
Posted Apr 28, 2022 - 09:56 UTC
Monitoring
Our team has identified the source of the issue and put mitigations in place.
Availability of CP has been restored.
We're closely monitoring the situation and will continue updating here.
Posted Apr 28, 2022 - 07:21 UTC
Identified
HEXONET CP (Control Panel) is experiencing intermittent availability issues at https://account.hexonet.net/
You may receive an error message when trying to reach our CP.

Our teams are working on resolving the issue.

We apologize for the inconvenience.

We are actively monitoring the situation and we will update here as soon as possible.
Posted Apr 28, 2022 - 05:40 UTC
This incident affected: Hexonet Console (Web Interface).