Viewing 30 posts - 1 through 30 (of 30 total)
  • General question for the IT systems people
  • Stoner
    Free Member

    Currently Fasthosts POP3 and webmail are down.
    That is inconvenient, but I'll survive.

    The Fasthosts system status site says: "14:37 This issue is ongoing – our Engineers are continuing to work to resolve the issue.
    Status of this entry is Open"

    Now, why is it some systems "fail"? I mean this mail system has been much the same for the last 3 years, and I don't think they screw around with it a great deal, so why would it suddenly stop working, and indeed for an extended period? Don't they like to build simple, stable systems that stay up until the next big (tested) upgrade?

    Seems odd that something could "stop working" that's all. 😕

    cxi
    Free Member

    I'm administrator for a small mail server that a few of our customers use. The e-mail service on it has died on its arse before – for no reason. We didn't change or install anything; one day it had just had enough.

    Some random "try this, try that" got it back to life in the end.

    These things have to go wrong to keep us in a job 😀

    epicsteve
    Free Member

    Possibly a hardware issue, although you'd expect DR hardware to be in place. If it is a hardware failure then they might be carrying out a recovery from a backup server or tape.

    Servers that are left on for a long time do sometimes fail – either heat related (rack-mount servers in particular can run very warm) or sometimes just hard drives packing in having been constantly spinning for years.

    Stoner
    Free Member

    I suppose I can imagine a hardware failure from wear and tear.

    Is it impractical to have parallel systems to keep things live when a hardware failure on one borks it?

    epicsteve
    Free Member

    Is it impractical to have parallel systems to keep things live when a hardware failure on one borks it?

    Perfectly practical but more expensive, especially if running high-availability clusters that will continue to function if one server fails.

    Production systems normally have redundant drives and power supplies so that a failure of one of those won't take the system down. However, other failures (e.g. motherboard) can still take a server out.
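
The trade-off described above can be put into rough numbers. This is only a sketch of the standard availability arithmetic: components that must all work multiply together, while redundant copies only fail if every copy fails. The component figures below are hypothetical, chosen purely to illustrate the sums, and the independence assumption is optimistic.

```python
import math


def series(availabilities):
    """Availability of a chain where every component must be up."""
    return math.prod(availabilities)


def parallel(availability, copies):
    """Availability of `copies` redundant components where any one suffices.

    Assumes failures are independent, which shared power, cooling or
    storage can easily violate.
    """
    return 1.0 - (1.0 - availability) ** copies


# Hypothetical figures, chosen only to illustrate the arithmetic.
DISK = 0.99
PSU = 0.99
MOTHERBOARD = 0.995

# One server: mirrored disks, dual power supplies, a single motherboard.
single_server = series([parallel(DISK, 2), parallel(PSU, 2), MOTHERBOARD])

# Two such servers in a cluster where either one can carry the service.
clustered_pair = parallel(single_server, 2)

print(f"one server:  {single_server:.5f}")
print(f"two servers: {clustered_pair:.7f}")
```

With these made-up inputs the single motherboard dominates the one-server figure, which is the point made above: redundant drives and power supplies help, but the non-redundant part sets the ceiling until you add a second box.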

    allthepies
    Free Member

    Could be hardware failure – firewall/router/disk/server etc etc
    Or numpty error – some disk partition/logfile filled up and the mail app stopped.
    Other faves are some kind of security key expiring – certificate etc.

    Or something else 🙂
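
For what it's worth, the full partition and the expiring certificate are both easy to watch for before they bite. A minimal Python sketch, assuming a 90% disk threshold and a 14-day certificate warning window; those numbers and the function names are invented for illustration, and this is not a claim about what Fasthosts actually runs.

```python
import shutil
import socket
import ssl
import time


def disk_usage_percent(path):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def days_until_expiry(not_after, now=None):
    """Whole days left before a certificate's notAfter timestamp.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT". Negative means already expired.
    """
    if now is None:
        now = time.time()
    return int((ssl.cert_time_to_seconds(not_after) - now) // 86400)


def fetch_not_after(host, port=443, timeout=5.0):
    """Fetch the notAfter field of the certificate served at host:port.

    With the default context an already-expired certificate fails the
    handshake with an ssl.SSLError, which is itself a useful alarm.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def warnings(path, not_after, disk_limit=90.0, cert_days=14, now=None):
    """Collect human-readable warnings; thresholds are illustrative assumptions."""
    found = []
    if disk_usage_percent(path) >= disk_limit:
        found.append("disk nearly full: " + path)
    if days_until_expiry(not_after, now) <= cert_days:
        found.append("certificate expires soon")
    return found
```

Something like `warnings("/var", fetch_not_after("mail.example.com"))` run from a scheduler would cover both cases; the path and hostname there are placeholders.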

    Stoner
    Free Member

    hmm, wonder how long the IT monkeys will take to turn on the spare box and get it all magic'd up?
    🙂

    email down since 13:00. Concern levels are rising to "Agitated".

    CaptainFlashheart
    Free Member

    Have you tried pinging it?

    Stoner
    Free Member

    ping…………ping…………ping………..

    uplink
    Free Member

    Maybe the cleaner unplugged it & the IT monkeys haven't quite got to the bottom of it yet?

    Or they're too busy swotting up on their spoken Klingon for an upcoming convention?

    😉

    CaptainFlashheart
    Free Member

    Or they're too busy swotting up on their spoken Klingon for an upcoming convention?

    Totally OT, but a worthwhile anecdote….

    Near to Flash Towers in that London's Famous London is a bar called "Pages" where the Trekkies of Old London Town meet regularly. They go in full fig, speaking Klingon and much "Live long and prosper" among the almost authentic costumes. Many of them arrive by taxi, as even they are slightly embarrassed by tube travel in that garb.

    Also near to Flash Towers there used to be a pub called "The Page". Many a Trekkie would leap in a cab and ask to be taken to the aforementioned, getting the wording slightly wrong. Always provided a laugh for the regulars as the Trekkies strolled into the pub expecting to see it full of their own!

    grahamb
    Free Member

    As others have said, any responsible mail hosting service should have all this HA'd, & mail is one of the easier services to cluster.

    IME, for large mail clusters, it's normally the shared storage to the mail servers or the user authentication that's as likely to be causing the problem.

    bawbag
    Free Member

    I was a sys admin for a company that used Fasthosts when I arrived. After some serious downtime because of their hardware failure and then the password scandal I moved all our apps to Rackspace and haven't had any outages since.

    Unexpected problems will always occur but some companies are better than others at minimizing the consequences. I can't believe Fasthosts are still in business.

    BigButSlimmerBloke
    Free Member

    Ex network admin for a medium-sized NHS trust.
    Hardware failures – typically processor overheating is one common fault. Disk issues tend not to be a big deal as most systems use some form of disk resilience, RAID 1 or 5.
    Power faults – power to the main computer room goes. A room-level UPS will probably keep the room going for about 30 minutes, then the individual rack-level UPSs may kick in (if they're installed), giving another 15-20 minutes of life, as long as the local switches are still live.
    Supervisor engines in Cisco core switches have had bouts of being temperature sensitive, so they can go if there's a thermal event in the room, e.g. air con failure. A blown S3 module can take a few days to replace.
    If it's a Windows system that hasn't been rebooted this week, it's about time for it to take a break.

    Stoner
    Free Member

    BB – If I wanted to move from Fasthosts, where would you recommend?

    Fasthosts hold a number of domain registrations for me, and host one very basic website. I have a handful of mailboxes (POP3 & SMTP) using the same domain name as the hosted site.

    How hard is it to move all that over?

    nbt
    Full Member

    BigButSlimmerBloke
    Free Member

    To answer the OP question – they're machines and they break down. Like cars.

    BigButSlimmerBloke
    Free Member

    To answer the OP question – they're machines and they break down. Like cars.

    clubber
    Free Member

    Security patches are also a common cause of problems on systems that have been fine for years – apply them overnight and they cause a problem that either grows or only occurs when a particular service/etc is activated.

    BigButSlimmerBloke
    Free Member

    ..and sometimes they do things for no good reason, like post things twice

    Stoner
    Free Member

    To answer the OP question – they're machines and they break down. Like cars.

    x2

    Oh the irony 🙂

    bawbag
    Free Member

    BB – If I wanted to move from fasthosts, where would you recommend?

    Well, Rackspace are certainly up there as one of the more reliable ones but they are quite expensive. Their technical support is some of the best I have experienced and they'd be able to help you through the move, I'm sure. IIRC the other option that was going to be cheaper was http://mediatemple.net/. I'd heard lots of good things about them.

    For basic DNS management I also used http://www.zoneedit.com/ which worked really well.

    Stoner
    Free Member

    basic DNS management

    what is that and am I using it?

    bawbag
    Free Member

    you probably don't need it. It was only because as a company we had bought lots of domain names from different providers and it provided somewhere central to manage them e.g. http://www.mywebsite.com = 165.82.2.35

    Sorry, I was probably just over complicating things. Good luck!

    Stoner
    Free Member

    aha.
    Cheers for pointers though.
    If they keep this up I may well move….
    3+hrs down email.

    Concern now reported to be at "Might have to register on Facebook" levels…:)

    epicsteve
    Free Member

    We partner with a few hosting providers and I'd also concur that Rackspace are one of the better ones.

    waihiboy
    Free Member

    servers just break for no reason at all…. COUGH

    "A long time ago, in a comms room far far away"

    I was the system admin for a very large financial place in Dublin years ago, NT/2000 days. (Way in over my head – got the job by luck and some lies by my old boss 'bigging me up'.)

    It was just me and the IT manager, who was equally crap!

    We had an Exchange server and one day I was updating the software with a patch. I didn't realise I had to restart after the patch, and of course it came back up with the blue screen of death….

    To cut a long story short, we had to call in the cavalry (outside IT firm) to sort it out as I didn't have a fekin clue… over 200 users going mental, had to pay a small fortune to have the server re-built, and they had to stay overnight for 2 days to sort it. Ended up going for a pint with the guy who fixed it and he was laughing his head off because he knew I'd tried to fix it by taking the server apart and basically ****ing up the inside.

    I think Exchange server has come a long way since, but I still have the nightmares of watching all the users through the glass door banging their heads on the desk and the comms room phone lighting up and the total feeling of fear in the pit of my stomach. The look on the IT manager's face will stay with me forever as well…

    i have changed career thankfully.

    kamina
    Free Member

    Apart from the typical failing hardware, I have seen pretty significant downtime due to other things too…

    – AC failing, causing machines to overheat and shut down. New flash central disk (SAN) just installed, but the disk (LUN) sharing configuration was wrong, causing the wrong servers to mount the wrong disks. The database servers' file systems get messed up, backups have not been properly tested and are not working… Caused about 24 hours downtime for a service with about 1,000,000 distinct daily users.

    – Electricity cut off for a few hours. Generators have not been properly tested and maintained and don't work properly. Servers have to be shut down to wait for mains to return.

    – Problem with email system. This particular service was for small and medium-sized companies, so pretty high priority (compared to home users). Backups have a very short rotation for some strange reason, and there are problems getting the service back up. After a few days of fiddling somebody finally gets around to starting to restore the data from tape, only to find the backups had already been written over (caused quite a row, luckily my team was not involved).

    Hardware problems are not usually very serious. They can cause some downtime of course if you don't have high availability planned (and make that high availability without bottlenecks…). Problem is that if you have two servers for high availability you might also need two storage devices, two fiber switches, two regular switches, two routers, two internet connections from separate providers (one preferably a radio link) etc. A lot of people might consider that to be too big an expense compared to the risk.

    Most of the bigger problems I have seen have more to do with A) Not planning on how you will recover when you face a bad situation and / or B) not testing the recovery procedure (and your backups).
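
Point B can be made routine with very little code. As a rough sketch (not any particular backup product's method): restore last night's backup into a scratch directory, then compare checksums against the live data. The function names and the directory layout are assumptions for illustration, and files that legitimately changed since the backup ran will also show up as differences.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large mailboxes need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Return relative paths that are missing or differ after a test restore."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for original in sorted(source_dir.rglob("*")):
        if not original.is_file():
            continue
        relative = original.relative_to(source_dir)
        restored = restored_dir / relative
        if not restored.is_file() or sha256_of(original) != sha256_of(restored):
            problems.append(str(relative))
    return problems
```

An empty list from `verify_restore` means the test restore matched; anything else is worth finding out about before the day you actually need the tape.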

    Stoner
    Free Member

    This is getting daft now.
    Still an Open status! WTF are they doing there?

    Anyone else affected by Fasthosts mail going tits up for 20hrs?

    simon_g
    Full Member

    To answer the question: they fail for a number of reasons. Well designed, highly available systems still have failures, you just don't notice them. Quite how much redundancy you want and how many scenarios you want to be able to cope with depends entirely on your budget.

    Even Microsoft suffered a storage failure in their internal Exchange environment which left about 8000 people without mail for a few days. Sometimes a few things break at once in a way that can't immediately be fixed.

The topic ‘General question for the IT systems people’ is closed to new replies.