Network fault in the LZR data center (resolved)
From: 09.06.2022 / 11:10
To: 13.06.2022 / 21:30
There is a major fault in the network of the virtualization environment. We are currently looking for the cause. Due to the malfunction, many VMs are currently not accessible.

=== June 09, 2022 ===
12:15: The root cause was found and has been eliminated.

14:30: There is another fault that leads to partial packet loss. We are working on this issue.

20:15: The cause of the packet loss has been identified. However, a fix is not immediately possible. Cisco TAC Support is involved and further analysis will be done tomorrow.

=== June 10, 2022 ===
10:30: The analysis with the Cisco TAC engineer has been running since 09:00. In parallel, we are evaluating options for returning to normal operation today in case the problem cannot be fixed.

13:30: The analysis has ended for today. The fault has not yet been located. As a workaround, one of the two firewalls has been removed from the cluster. Network traffic should now work properly again.

14:45: Contrary to our earlier assumption, network connections are still being disrupted. We are working on it.

17:00: Most systems should now be working again. Because the fault is intermittent, our monitoring cannot show whether individual VMs are still affected. If you are a VM admin and your VM is still experiencing a fault, please send an email to servicedesk@tu-dresden.de with the subject "VM-Störung <VM name>" and include the following information: the VM's IP address and what exactly is not working (access to which port, from which IP).
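For VM admins who want to gather the requested details, the following is a minimal sketch of a repeated TCP connection test, not an official tool. Because the fault is intermittent, a single attempt may succeed by chance, so the sketch tries several times and records each failure. The function name `probe_tcp` and its default values are assumptions chosen for illustration.

```python
import socket
import time


def probe_tcp(host, port, attempts=20, timeout=2.0, pause=0.5):
    """Open a TCP connection `attempts` times and return the failed attempts.

    Returns a list of (attempt_index, error_text) tuples; an empty list
    means every attempt connected successfully.
    """
    failures = []
    for i in range(attempts):
        try:
            # create_connection raises OSError on refusal, timeout, or no route.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError as exc:
            failures.append((i, repr(exc)))
        # Wait between attempts so that an intermittent fault has a chance to show.
        if i < attempts - 1:
            time.sleep(pause)
    return failures
```

Run from the client machine that cannot reach the VM, a call such as `probe_tcp("192.0.2.10", 22)` (a placeholder address and port; substitute your VM's values) returns the failed attempts. The number of failures, the port, and the client's own IP address are the details requested in the email.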

=== June 13, 2022 ===
08:30: On Monday, 13.06.2022, further analysis will take place together with our service provider in order to fix the fault permanently. For this purpose, the fault state must be reproduced.

15:00: Thank you very much for your feedback! Another minor error, which affected individual VMs, was found in our firewall configuration and has now been fixed. Tonight from 19:00, the main problem will be analyzed further with a Cisco firewall specialist. This may cause further disruptions, as the firewall cluster will probably have to be put back into operation for the analysis.

19:25: The firewall cluster is being reassembled. Connection issues are to be expected again.

21:30: The fault has been corrected and the firewall cluster is running properly in redundant mode again.

