Topic: Sporadic Downtime

Posted under General

Hey folks,

As you may have noticed, roughly every 24 hours e621 (as well as a number of other sites I host) goes offline for 15 to 45 minutes, then returns.

This is because of an as-yet-unresolved network issue causing sporadic outages in the datacenter where e621 is hosted.

We are attempting to diagnose and rectify the problem. Since all attempts to diagnose and resolve the 'link falling over' issue have proved fruitless, we are going to be swapping out the faulty 10-gigabit fiber line (and associated optics on both ends) for a pair of 2x1gbit copper lines as soon as possible. This has, naturally, been extremely frustrating for all involved. If this solution does not rectify it, we shall be replacing basically our entire core network with new gear, as we really cannot afford the downtime (especially for other customers).

So, please do not despair when the site goes down for 15-45 minutes, approximately once a day... I'm taking steps (as fast as reasonably possible) to resolve it.

Thank you all for your patience.

Varka

Updated

Well, I guess in the meantime we just have to *puts on sunglasses* deal with it.

Updated by anonymous

I was wondering what that was about.

Updated by anonymous

You may also notice that the site is sporadically slow. This appears to be caused by a link issue too, this time a capacity bottleneck.

It's not meant to be this slow, honest... (try pinging e621.net and watch the ping times rise and fall).
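If you'd rather see numbers than eyeball the ping output, here is a rough Python sketch that pulls the round-trip times out of ping's text output and summarizes how far they swing. It assumes the common `time=12.3 ms` output format, and the `-c` count flag assumes a Linux/macOS-style ping (Windows uses `-n`). The function names are made up for this example.

```python
import re
import statistics
import subprocess

# Assumption: ping prints each reply with a "time=12.3 ms" (or "time<1 ms") field.
_TIME_RE = re.compile(r"time[=<]([\d.]+)\s*ms")


def parse_ping_times(output: str) -> list[float]:
    """Extract round-trip times (in ms) from raw ping output."""
    return [float(value) for value in _TIME_RE.findall(output)]


def summarize(times: list[float]) -> dict[str, float]:
    """Return min, max, mean, and spread (max - min) of the samples."""
    if not times:
        raise ValueError("no ping replies found")
    return {
        "min": min(times),
        "max": max(times),
        "mean": statistics.fmean(times),
        "spread": max(times) - min(times),
    }


def ping_host(host: str, count: int = 10) -> str:
    """Run the system ping and return its text output.

    Assumption: a Linux/macOS-style ping that accepts "-c <count>".
    """
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
        check=False,  # lost packets give a non-zero exit code; still parse what came back
    )
    return result.stdout
```

Calling `summarize(parse_ping_times(ping_host("e621.net")))` a few times over the day should show the spread growing when the lag spikes hit.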

Again, we're working on balancing things out and getting this resolved.

Updated by anonymous

The site was going down? Never noticed. Especially compared to the issues the site used to have under old management.

Updated by anonymous

A progress update.

New networking gear (two Brocade 48-port managed gigabit switches, with 2x 10gbit ports each) is on its way. We expect to get it racked up around Tuesday, then configure it appropriately and test the configuration thoroughly, before swapping all our network ports on the two affected switches (switch4 and switch1, the two that have been causing all the problems) across to the new ones.

I expect that next Thursday we'll be swapping over to these and praying the problem goes away.

As for the current 'lag spike' issues, I've narrowed it down to a problem on the e621 server itself. Xen bridging appears to be adding lag spikes for random periods of time, proportional to the traffic being sent (from 100ms for 30-40mbit, to 1500ms for 130mbit), for no apparent reason. I will be fucking with the network configuration mainly during off-peak times (if I can) to try and resolve this.

In an attempt to diagnose the above issue, I have changed e621's primary IP from 66.160.196.207 to 199.167.134.31. I will be running both IPs for the rest of the day during the changeover, then turning off the old one once the DNS changes have propagated... so if you suddenly can't reach the site, this (and your ISP caching DNS records for longer than the time published on the records) is probably why.
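If you want to check which of the two addresses your own resolver is currently handing out, something like the Python sketch below should do it. The two IPs are the ones given above. The function names and the labels it returns are just illustrative.

```python
import socket

OLD_IP = "66.160.196.207"   # address being retired (from the post above)
NEW_IP = "199.167.134.31"   # new primary address (from the post above)


def resolve_ipv4(host: str) -> list[str]:
    """Return the sorted, de-duplicated IPv4 addresses the local resolver gives for host."""
    infos = socket.getaddrinfo(host, None, family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return sorted({info[4][0] for info in infos})


def classify(addresses, old_ip: str = OLD_IP, new_ip: str = NEW_IP) -> str:
    """Label a set of resolved addresses relative to the changeover."""
    found = set(addresses)
    if new_ip in found:
        return "new"      # your resolver has picked up the change
    if old_ip in found:
        return "old"      # probably a cached record that has not expired yet
    return "unknown"      # neither address; something else is going on
```

Running `classify(resolve_ipv4("e621.net"))` and getting "old" after the old IP has been switched off would point at a stale cache somewhere between you and the DNS servers.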

Thanks for your patience, everyone. <3

Updated by anonymous
