There have been a few hiccups with our webmail service over the past couple of weeks. Most of these issues were about the webmail taking a considerable time to load. To balance the load, we decided to add a new server to the cluster. However, we faced a few setbacks after adding this server.
The problems started early yesterday, after we added the new server to the webmail cluster. This was done to improve response times and share the load across our existing servers. However, there was an issue with Ethernet bonding on this server that caused sudden spurts of packet loss. This did not catch our attention straight away because we were simultaneously facing a few issues with our datacenter, and we reckoned the packet loss was related to those. We did not recognize the real cause until we noticed the occasional switching of the virtual IP address.
We use HA software for resource management. Every cluster deployment has one to ensure availability. When this application sees packet loss on one IP address, it switches to the other virtual IP in its table to make sure that our set of servers is always available. So, the software switched our IP to the next available virtual IP. But, due to the network issues we had been facing with the datacenter, we had recently updated our IP address to point to another network, and the ARP cache on the switch was still pointing to the old server on the faulty network, as the switch didn't support multicast. Now we were faced with two issues: the first with broken bonding on our new server and the second with our switch.
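The failover behaviour described above can be sketched roughly as follows. This is only an illustration of the idea, not the logic of the HA software we actually run: the function name `pick_virtual_ip`, the 20% loss threshold, and the example addresses (taken from the documentation range 192.0.2.0/24) are all assumptions made for the sketch.

```python
def pick_virtual_ip(vip_table, current_vip, packet_loss, loss_threshold=0.2):
    """Return the virtual IP that should be active after one health check.

    vip_table      -- ordered list of virtual IPs known to the cluster
    current_vip    -- the virtual IP currently in use
    packet_loss    -- observed loss on current_vip, as a fraction (0.0 to 1.0)
    loss_threshold -- hypothetical cut-off above which we fail over
    """
    if packet_loss < loss_threshold:
        # Loss is within tolerance, so keep using the current virtual IP.
        return current_vip
    # Too much loss: move to the next entry in the table, wrapping around.
    idx = vip_table.index(current_vip)
    return vip_table[(idx + 1) % len(vip_table)]


vips = ["192.0.2.10", "192.0.2.11"]
active = pick_virtual_ip(vips, "192.0.2.10", packet_loss=0.5)
```

With intermittent packet loss, a check like this would flip the active virtual IP back and forth, which matches the occasional switching we noticed.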
We first fixed the bonding on the server and then set up a script that sent a unicast arping to clear/refresh the switch's ARP cache entries for the virtual IPs. It took 10 minutes for the cache to clear and we were back online. As of now, the webmail interface is working seamlessly and all issues have been fixed.
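For readers curious what a unicast ARP refresh looks like on the wire, here is a minimal sketch. It is not the script we ran. It builds a standard ARP request whose Ethernet destination is a single MAC address rather than the broadcast address, so the receiving device can update its cache entry for the sender's IP. The helper names, MAC addresses, and IP addresses are placeholders invented for the example, and `send_frame` assumes Linux with root privileges.

```python
import socket
import struct


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_arp_request(src_mac, src_ip, dst_mac, target_ip):
    """Build a 42-byte Ethernet frame carrying a unicast ARP request."""
    # Ethernet header: destination MAC, source MAC, EtherType 0x0806 (ARP).
    eth_header = mac_to_bytes(dst_mac) + mac_to_bytes(src_mac) + struct.pack("!H", 0x0806)
    # ARP header: Ethernet (1), IPv4 (0x0800), MAC length 6, IP length 4, request (1).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender MAC and IP: the receiver refreshes its entry for src_ip from these.
    arp += mac_to_bytes(src_mac) + socket.inet_aton(src_ip)
    # Target MAC is left as zeros in a request; target IP is the device we ask.
    arp += b"\x00" * 6 + socket.inet_aton(target_ip)
    return eth_header + arp


def send_frame(iface, frame):
    """Send a raw Ethernet frame (Linux only, needs root)."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as s:
        s.bind((iface, 0))
        s.send(frame)


# Placeholder values: the virtual IP announces itself to one specific device.
frame = build_arp_request(
    src_mac="02:00:00:00:00:01", src_ip="192.0.2.10",
    dst_mac="02:00:00:00:00:02", target_ip="192.0.2.1",
)
```

Calling `send_frame("eth0", frame)` once per virtual IP would put the frames on the wire; the interface name is again an assumption.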