General question for the IT systems people
I'm the administrator for a small mail server that a few of our customers use. The e-mail services on it have died on their arse before – for no reason we could find. We didn't change or install anything; one day it had just had enough.
Some random "try this, try that" got it back to life in the end.
These things have to go wrong to keep us in a job 😀
Posted 8 years ago
Possibly a hardware issue, although you'd expect DR hardware to be in place. If it is a hardware failure then they might be carrying out a recovery from a backup server or tape.
Servers that are left on for a long time do sometimes fail – either heat related (rack-mount servers in particular can run very warm) or sometimes just hard drives packing in having been constantly spinning for years.
Posted 8 years ago
Is it impractical to have parallel systems to keep things live when a hardware failure on one borks it?
Perfectly practical but more expensive, especially if running high-availability clusters that will continue to function if one server fails.
Production systems normally have redundant drives and power supplies so that a failure of one of those won't take the system down; however, other failures (e.g. motherboard) can still take a server out.
Posted 8 years ago
CaptainFlashheart (Member)
Or they're too busy swotting up on their spoken Klingon for an upcoming convention?
Totally OT, but a worthwhile anecdote….
Near to Flash Towers in that London's Famous London is a bar called "Pages" where the Trekkies of Old London Town meet regularly. They go in full fig, speaking Klingon and much "Live long and prosper" among the almost authentic costumes. Many of them arrive by taxi, as even they are slightly embarrassed by tube travel in that garb.
Also near to Flash Towers there used to be a pub called "The Page". Many a Trekkie would leap in a cab and ask to be taken to the aforementioned, getting the wording slightly wrong. Always provided a laugh for the regulars as the Trekkies strolled in to the pub expecting to see it full of their own!
Posted 8 years ago
grahamb (Member)
As others have said, any responsible mail hosting service should have all this HA'd, & mail is one of the easier services to cluster.
IME, for large mail clusters, it's normally the shared storage to the mail servers or the user authentication that's as likely to be causing the problem.
Posted 8 years ago
I was a sys admin for a company that used Fasthosts when I arrived. After some serious downtime because of their hardware failure and then the password scandal I moved all our apps to Rackspace and haven't had any outages since.
Unexpected problems will always occur but some companies are better than others at minimizing the consequences. I can't believe Fasthosts are still in business.
Posted 8 years ago
Ex network admin for a medium-sized NHS trust
Posted 8 years ago
Hardware failures – processor overheating is one common fault. Disk issues tend not to be a big deal as most systems use some form of disk resilience, RAID 1 or 5.
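As an aside, the trade-off between the RAID levels mentioned above comes down to simple arithmetic. A rough sketch (the disk counts and sizes are made up for illustration):

```python
def raid_summary(level: int, disks: int, disk_tb: float):
    """Return (usable capacity in TB, number of disk failures tolerated)."""
    if level == 1:  # mirroring: capacity of one disk, survives all but one failing
        return disk_tb, disks - 1
    if level == 5:  # striping with parity: one disk's worth lost to parity
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return (disks - 1) * disk_tb, 1
    raise ValueError("only RAID 1 and 5 sketched here")

print(raid_summary(1, 2, 4.0))  # (4.0, 1): mirror pair survives one failure
print(raid_summary(5, 4, 4.0))  # (12.0, 1): 3 disks of data plus parity
```

RAID 5 gives much better capacity per disk, but either way a second failure during a rebuild still takes the array out, which is why backups come up later in the thread.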
Power faults – power to the main computer room goes. A room-level UPS will probably keep the room going for about 30 minutes, then the individual rack-level UPSes may kick in (if they're installed), giving another 15-20 minutes of life, as long as the local switches are still live.
Supervisor engines in Cisco core switches have had bouts of being temperature-sensitive, so they can go if there's a thermal event in the room, e.g. an air-con failure. A blown S3 module can take a few days to replace.
If it's a Windows system that hasn't been rebooted this week, it's about time for it to take a break.
BB – If I wanted to move from Fasthosts, where would you recommend?
Fasthosts hold a number of domain registrations for me, and host one very basic website. I have a handful of mailboxes (POP3 & SMTP) using the same domain name as the hosted site.
How hard is it to move all that over?
Posted 8 years ago
Currently Fasthosts POP3 and webmail are down.
That is inconvenient, but I'll survive.
The Fasthosts system status site says: "14:37 This issue is ongoing – our Engineers are continuing to work to resolve the issue.
Status of this entry is Open"
Now, why is it some systems "fail"? I mean this mail system has been much the same for the last 3 years; I don't think they screw around with it a great deal. Why would it suddenly stop working, and for an extended period at that? Don't they like to build simple, stable systems that stay up until the next big (tested) upgrade?
Seems odd that something could "stop working", that's all. 😕
Posted 8 years ago
BB – If I wanted to move from fasthosts, where would you recommend?
Well, Rackspace are certainly up there as one of the more reliable ones, but they are quite expensive. Their technical support is some of the best I have experienced and they'd be able to help you through the move, I'm sure. IIRC the other option that was going to be cheaper was http://mediatemple.net/ – I'd heard lots of good things about them.
For basic DNS management I also used http://www.zoneedit.com/ which worked really well.
Posted 8 years ago
waihiboy (Member)
servers just break for no reason at all…. COUGH
"A long time ago, in a comms room far far away"
I was the sysadmin for a very large financial place in Dublin years ago, back in the NT/2000 days (way in over my head – I got the job by luck and some lies from my old boss 'bigging me up').
It was just me and the IT manager, who was equally crap!
We had an Exchange server, and one day I was updating the software with a patch. I didn't realise I had to restart after the patch; of course it came back up with the blue screen of death…
To cut a long story short, we had to call in the cavalry (an outside IT firm) to sort it out as I didn't have a fekin clue… over 200 users going mental. We had to pay a small fortune to have the server rebuilt, and they had to stay overnight for 2 days to sort it. I ended up going for a pint with the guy who fixed it and he was laughing his head off because he knew I'd tried to fix it by taking the server apart and basically ****ing up the inside.
I think Exchange has come a long way since, but I still have nightmares of watching all the users through the glass door banging their heads on the desk, the comms room phone lighting up, and the total feeling of fear in the pit of my stomach. The look on the IT manager's face will stay with me forever as well…
I have changed career, thankfully.
Posted 8 years ago
kamina (Member)
Apart from the typical failing hardware, I have seen pretty significant downtime due to other things too…
– AC failing, causing machines to overheat and shut down. A new flash central disk array (SAN) had just been installed, but the disk (LUN) sharing configuration was wrong, causing servers to mount the wrong disks. The database server's file system got messed up, and the backups had not been properly tested and were not working… Caused about 24 hours of downtime for a service with about 1,000,000 distinct daily users.
– Electricity cut off for a few hours. The generators had not been properly tested and maintained and didn't work properly, so the servers had to be shut down to wait for the mains to return.
– Problem with an email system. This particular service was for small and medium-sized companies, so pretty high priority (compared to home users). The backups had a very short rotation for some strange reason, and there were problems getting the service back up. After a few days of fiddling, somebody finally got around to restoring the data from tape – only to find the backups had already been written over (caused quite a row; luckily my team was not involved).
Hardware problems are not usually very serious. They can cause some downtime of course if you don't have high availability planned (and make that high availability without bottlenecks…). The problem is that if you have two servers for high availability, you might also need two storage devices, two fibre switches, two regular switches, two routers, two internet connections from separate providers (one preferably a radio link), etc. A lot of people might consider that too big an expense compared to the risk.
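The "duplicate the whole chain, not just the servers" point above can be put in numbers with the standard series/parallel availability arithmetic. A sketch (the component availability figures are invented for illustration):

```python
def series(*availabilities: float) -> float:
    """Chain of components in series: all must be up, so multiply."""
    a = 1.0
    for x in availabilities:
        a *= x
    return a

def parallel(a: float, n: int = 2) -> float:
    """n independent copies: down only if every copy is down."""
    return 1 - (1 - a) ** n

# Hypothetical chain: server, storage, fibre switch, router, internet link.
chain = series(0.999, 0.999, 0.999, 0.999, 0.99)
print(f"single chain:    {chain:.4f}")
print(f"two full chains: {parallel(chain, 2):.6f}")
```

The takeaway matches the post: one weak link (here the internet connection) drags the whole chain down, and duplicating only the servers while sharing that link buys you very little.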
Most of the bigger problems I have seen have more to do with A) not planning how you will recover when you face a bad situation and/or B) not testing the recovery procedure (and your backups).
Posted 8 years ago
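Point B – actually testing your backups – is the step that tends to get skipped. A minimal sketch of what a restore test might check (paths are hypothetical and the restore itself is assumed to have already run); the idea is to compare restored files byte-for-byte against the originals rather than trusting the backup job's exit code:

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_restore(original_dir: Path, restored_dir: Path) -> bool:
    """True only if every original file came back byte-identical."""
    for src in original_dir.rglob("*"):
        if src.is_file():
            restored = restored_dir / src.relative_to(original_dir)
            if not restored.is_file() or file_digest(src) != file_digest(restored):
                return False
    return True
```

In the tape-overwrite story above, a test like this run on schedule would have caught the rotation problem long before the data was needed.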
simon_g (Subscriber)
To answer the question: they fail for a number of reasons. Well designed, highly available systems still have failures, you just don't notice them. Quite how much redundancy you want and how many scenarios you want to be able to cope with depends entirely on your budget.
Even Microsoft suffered a storage failure in their internal Exchange environment which left about 8000 people without mail for a few days. Sometimes a few things break at once in a way that can't immediately be fixed.
Posted 8 years ago
The topic ‘General question for the IT systems people’ is closed to new replies.