General question for the IT systems people


This topic has 29 replies, 14 voices, and was last updated 14 years ago by simon_g.

Viewing 30 posts - 1 through 30 (of 30 total)

Stoner
Free Member

Currently Fasthosts POP3 and webmail are down.
That is inconvenient, but I'll survive.

The Fasthosts system status site says: "14:37 This issue is ongoing – our Engineers are continuing to work to resolve the issue.
Status of this entry is Open"

Now, why is it some systems "fail"? I mean this mail system has been much the same for the last 3 years, I don't think they screw around with it a great deal, so why would it suddenly stop working, and indeed for an extended period? Don't they like to build simple, stable systems that stay up until the next big (tested) upgrade?

Seems odd that something could "stop working" that's all. 😕

Posted 14 years ago

cxi
Free Member

I'm administrator for a small mail server that a few of our customers use. The e-mail service on it has died on its arse before – for no reason. We didn't change or install anything, one day it had just had enough.

Some random "try this, try that" got it back to life in the end.

These things have to go wrong to keep us in a job 😀

Posted 14 years ago

epicsteve
Free Member

Possibly a hardware issue, although you'd expect DR hardware to be in place. If it is a hardware failure then they might be carrying out a recovery from a backup server or tape.

Servers that are left on for a long time do sometimes fail – either heat related (rack-mount servers in particular can run very warm) or sometimes just hard drives packing in having been constantly spinning for years.

Posted 14 years ago

Stoner
Free Member

I suppose I can imagine a hardware failure from wear and tear.

Is it impractical to have parallel systems to keep things live when a hardware failure on one borks it?

Posted 14 years ago

epicsteve
Free Member

Is it impractical to have parallel systems to keep things live when a hardware failure on one borks it?

Perfectly practical but more expensive, especially if running high-availability clusters that will continue to function if one server fails.

Production systems normally have redundant drives and power supplies so that a failure of one of those won't take the system down, however other failures (e.g. motherboard) can still take a server out.
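For what it's worth, the "carry on if one server fails" idea can be sketched in a few lines. This is only an illustration, not how any particular host does it: the host names are made up, and the health check is a stand-in for a real probe (for example, opening a TCP connection to the POP3 port).

```python
from typing import Callable, Optional, Sequence


def pick_live_server(
    servers: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first server that passes the health check, or None."""
    for server in servers:
        if is_healthy(server):
            return server
    return None


# Hypothetical host names, in order of preference.
servers = ["mail-a.example.com", "mail-b.example.com"]

# Pretend the primary has failed; a real check would probe the service.
down = {"mail-a.example.com"}
live = pick_live_server(servers, lambda s: s not in down)
print(live)
```

Real clusters also have to keep the mail store in sync between the boxes, which is where most of the cost and complexity goes.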

Posted 14 years ago

allthepies
Free Member

Could be hardware failure – firewall/router/disk/server etc etc
Or numpty error – some disk partition/logfile filled up and the mail app stopped.
Other faves are some kind of security key expiring – certificate etc.

Or something else 🙂
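The full-disk and expired-certificate cases are the sort of thing a small scheduled check can catch early. A rough sketch, assuming a 90% disk threshold and a 14-day certificate warning window (both numbers are arbitrary picks of mine, and the function names are made up for illustration):

```python
import shutil
from datetime import datetime, timedelta, timezone


def disk_nearly_full(path: str, threshold: float = 0.9) -> bool:
    """True if the filesystem holding `path` is at or above the threshold."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return False
    return usage.used / usage.total >= threshold


def cert_expires_soon(not_after: datetime, now: datetime, warn_days: int = 14) -> bool:
    """True if the certificate's expiry date falls within the warning window."""
    return not_after - now <= timedelta(days=warn_days)


# Example: a certificate that expires in 10 days should trigger a warning.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(cert_expires_soon(now + timedelta(days=10), now))
print(disk_nearly_full("."))
```

How you get the expiry date out of the certificate is left out here; the point is only that both failures are predictable days in advance.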

Posted 14 years ago

Stoner
Free Member

hmm, wonder how long the IT monkeys will take to turn on the spare box and get it all magic'd up?
🙂

email down since 13:00. Concern levels are rising to "Agitated".

Posted 14 years ago

CaptainFlashheart
Free Member

Have you tried pinging it?

Posted 14 years ago

Stoner
Free Member

ping…………ping…………ping………..

Posted 14 years ago

uplink
Free Member

Maybe the cleaner unplugged it & the IT monkeys haven't quite got to the bottom of it yet?

Or they're too busy swotting up on their spoken Klingon for an upcoming convention?

😉

Posted 14 years ago

CaptainFlashheart
Free Member

Or they're too busy swotting up on their spoken Klingon for an upcoming convention?

Totally OT, but a worthwhile anecdote….

Near to Flash Towers in that London's Famous London is a bar called "Pages" where the Trekkies of Old London Town meet regularly. They go in full fig, speaking Klingon and much "Live long and prosper" among the almost authentic costumes. Many of them arrive by taxi, as even they are slightly embarrassed by tube travel in that garb.

Also near to Flash Towers there used to be a pub called "The Page". Many a Trekkie would leap in a cab and ask to be taken to the aforementioned, getting the wording slightly wrong. Always provided a laugh for the regulars as the Trekkies strolled in to the pub expecting to see it full of their own!

Posted 14 years ago

grahamb
Free Member

As others have said, any responsible mail hosting service should have all this HA'd, & mail is one of the easier services to cluster.

IME, for large mail clusters, it's normally the shared storage to the mail servers or the user authentication that's as likely to be causing the problem.

Posted 14 years ago

bawbag
Free Member

I was a sys admin for a company that used Fasthosts when I arrived. After some serious downtime because of their hardware failure and then the password scandal I moved all our apps to Rackspace and haven't had any outages since.

Unexpected problems will always occur but some companies are better than others at minimizing the consequences. I can't believe Fasthosts are still in business.

Posted 14 years ago

BigButSlimmerBloke
Free Member

Ex network admin for medium sized nhs trust
Hardware failures – processor overheating is one common fault. Disk issues tend not to be a big deal as most systems use some form of disk resilience, RAID 1 or 5.
Power faults – power to the main computer room goes. A room level UPS will probably keep the room going for about 30 minutes, then the individual rack level UPSs may kick in (if they're installed), giving another 15-20 minutes of life, as long as the local switches are still live.
Supervisor engines in Cisco core switches have had bouts of being temperature sensitive, so they can go if there's a thermal event in the room, eg air con failure. Blown S3 module can take a few days to replace.
If it's a Windows system that hasn't been rebooted this week, it's about time for it to take a break.

Posted 14 years ago

Stoner
Free Member

BB – If I wanted to move from fasthosts, where would you recommend?

Fasthosts hold a number of domain registrations for me, and host one very basic website. I have a handful of mailboxes (POP3 & SMTP) using the same domain name as the hosted site.

How hard is it to move all that over?

Posted 14 years ago

nbt
Full Member

Posted 14 years ago

BigButSlimmerBloke
Free Member

To answer the OP question – they're machines and they break down. Like cars.

Posted 14 years ago

BigButSlimmerBloke
Free Member

To answer the OP question – they're machines and they break down. Like cars.

Posted 14 years ago

clubber
Free Member

Security patches are also a common cause of problems on systems that have been fine for years – apply them overnight and they cause a problem that either grows or only occurs when a particular service/etc is activated.

Posted 14 years ago

BigButSlimmerBloke
Free Member

..and sometimes they do things for no good reason, like post things twice

Posted 14 years ago

Stoner
Free Member

To answer the OP question – they're machines and they break down. Like cars.

x2

Oh the irony 🙂

Posted 14 years ago

bawbag
Free Member

BB – If I wanted to move from fasthosts, where would you recommend?

Well Rackspace are certainly up there as one of the more reliable ones but they are quite expensive. Their technical support is some of the best I have experienced and they'd be able to help you through the move I'm sure. IIRC the other option that was going to be cheaper was http://mediatemple.net/ and I'd heard lots of good things about them.

For basic DNS management I also used http://www.zoneedit.com/ which worked really well.

Posted 14 years ago

Stoner
Free Member

basic DNS management

what is that and am I using it?

Posted 14 years ago

bawbag
Free Member

you probably don't need it. It was only because as a company we had bought lots of domain names from different providers and it provided somewhere central to manage them e.g. http://www.mywebsite.com = 165.82.2.35

Sorry, I was probably just over complicating things. Good luck!

Posted 14 years ago

Stoner
Free Member

aha.
Cheers for pointers though.
If they keep this up I may well move….
3+hrs down email.

Concern now reported to be at "Might have to register on Facebook" levels…:)

Posted 14 years ago

epicsteve
Free Member

We partner with a few hosting providers and I'd also concur that Rackspace are one of the better ones.

Posted 14 years ago

waihiboy
Free Member

servers just break for no reason at all…. COUGH

"A long time ago, in a comms room far far away"

i was the system admin for a very large financial place in dublin years ago NT/2000 days. (way in over my head- got the job by luck and some lies by my old boss 'bigging me up')

it was just me and the IT manager, who was equally crap!

we had an exchange server and one day i was updating the software with a patch. I didn't realise i had to restart after the patch, of course it came back up with the blue screen of death….

to cut a long story short, we had to call in the cavalry (outside IT firm) to sort it out as i didn't have a fekin clue… over 200 users going mental, had to pay a small fortune to have the server re-built, they had to stay overnight for 2 days to sort it. ended up going for a pint with the guy who fixed it and he was laughing his head off because he knew i'd tried to fix it by taking the server apart and basically ****ing up the inside.

i think exchange server has come a long way since but i still have the nightmares of watching all the users through the glass door banging their heads on the desk and the comms room phone lighting up and the total feeling of fear in the pit of my stomach, the look on the IT manager's face will stay with me forever as well…

i have changed career thankfully.

Posted 14 years ago

kamina
Free Member

Apart from the typical failing hardware, I have seen pretty significant downtime due to other things too…

– AC failing causing machines to overheat and shut down. New flash central disk (SAN) just installed, but the disk (LUN) sharing configuration was wrong, causing the wrong servers to mount the wrong disks. The database server's file system gets messed up, backups have not been properly tested and are not working… Caused about 24 hours downtime for a service with about 1000000 distinct daily users.

– Electricity cut off for a few hours. Generators have not been properly tested and maintained and don't work properly. Servers have to be shut down to wait for mains to return.

– Problem with email system. This particular service was for smaller and medium size companies, so pretty high priority (compared to home users). Backups have a very short rotation for some strange reason, and there are problems getting the service back up. After a few days of fiddling somebody finally gets around to starting to restore the data from tape, only to find the backups had already been written over (caused quite a row, luckily my team was not involved).

Hardware problems are not usually very serious. They can cause some downtime of course if you don't have high availability planned (and make that high availability without bottlenecks…). Problem is that if you have two servers for high availability you might also need two storage devices, two fiber switches, two regular switches, two routers, two internet connections from separate providers (one preferably a radio link) etc. A lot of people might consider that to be too big an expense compared to the risk.
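To put rough numbers on that trade-off, here is a back-of-envelope sketch. The 99% per-component availability is an assumed figure for illustration only, and the sums assume components fail independently, which shared bottlenecks will break.

```python
from math import prod


def series(availabilities):
    """Everything in the chain must be up, so availabilities multiply."""
    return prod(availabilities)


def redundant(availability, copies=2):
    """A redundant group is down only if every copy is down at once."""
    return 1 - (1 - availability) ** copies


# Assumed figure: server, storage, fibre switch, switch, router at 99% each.
parts = [0.99] * 5

single = series(parts)                          # one of everything
doubled = series(redundant(a) for a in parts)   # two of everything

print(f"single chain:  {single:.4%}")
print(f"doubled chain: {doubled:.4%}")
```

On these assumptions the single chain is up roughly 95% of the time and the fully doubled one better than 99.9%, which is the gain you are weighing against paying for two of everything.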

Most of the bigger problems I have seen have more to do with A) Not planning on how you will recover when you face a bad situation and / or B) not testing the recovery procedure (and your backups).
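Point B can be as simple as restoring to a scratch area on a schedule and checking that what comes back matches. A minimal sketch, where a plain file copy stands in for whatever the real restore step is (tape, snapshot, etc.) and the function names are made up for illustration:

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def restore_matches(original: Path, backup: Path) -> bool:
    """Restore the backup into a scratch directory and compare checksums."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / original.name
        shutil.copy(backup, restored)  # stand-in for the real restore step
        return sha256_of(restored) == sha256_of(original)


# Demo with throwaway files: one good backup and one corrupted backup.
with tempfile.TemporaryDirectory() as demo:
    data = Path(demo) / "mailbox.dat"
    good = Path(demo) / "mailbox.bak"
    bad = Path(demo) / "mailbox.bad"
    data.write_bytes(b"important mail")
    good.write_bytes(b"important mail")
    bad.write_bytes(b"important ma")
    good_ok = restore_matches(data, good)
    bad_ok = restore_matches(data, bad)

print(good_ok, bad_ok)
```

On a live system the original keeps changing, so in practice you would compare against a checksum recorded at backup time rather than the current file.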

Posted 14 years ago

Stoner
Free Member

This is getting daft now.
Still an Open status! WTF are they doing there?

Anyone else affected by Fasthosts mail going tits up for 20hrs?

Posted 14 years ago

simon_g
Full Member

To answer the question: they fail for a number of reasons. Well designed, highly available systems still have failures, you just don't notice them. Quite how much redundancy you want and how many scenarios you want to be able to cope with depends entirely on your budget.

Even Microsoft suffered a storage failure in their internal Exchange environment which left about 8000 people without mail for a few days. Sometimes a few things break at once in a way that can't immediately be fixed.

Posted 14 years ago


The topic ‘General question for the IT systems people’ is closed to new replies.
