Apart from the typical hardware failures, I have seen pretty significant downtime due to other things too...
- AC failure causing machines to overheat and shut down. A new flash-based central storage array (SAN) had just been installed, but the LUN sharing configuration was wrong, so the wrong servers mounted the wrong disks. The database servers' file systems got corrupted, and the backups had not been properly tested and were not working... Caused about 24 hours of downtime for a service with roughly a million distinct daily users.
- Electricity cut off for a few hours. The generators had not been properly tested and maintained and didn't work, so the servers had to be shut down to wait for mains power to return.
- Problem with an email system. This particular service was for small and medium-sized companies, so pretty high priority (compared to home users). The backups had a very short rotation for some strange reason, and there were problems getting the service back up. After a few days of fiddling, somebody finally got around to restoring the data from tape, only to find the backups had already been overwritten (caused quite a row; luckily my team was not involved).
Hardware problems are not usually very serious. They can of course cause some downtime if you don't have high availability planned (and make that high availability without bottlenecks...). The problem is that if you have two servers for high availability, you might also need two storage devices, two fiber switches, two regular switches, two routers, two internet connections from separate providers (one preferably a radio link), etc. A lot of people might consider that too big an expense compared to the risk.
Most of the bigger problems I have seen have more to do with A) not planning how you will recover when you hit a bad situation, and/or B) not testing the recovery procedure (and your backups).
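On point B, even a small scheduled job that actually restores the latest backup and checks the result would have caught most of the messes above. Here is a minimal sketch in Python of what I mean; the backup directory, the timestamped .tar.gz naming, and the per-file checksum MANIFEST are all assumptions for illustration, not any particular product's format:

```
#!/usr/bin/env python3
"""Restore-test sketch: unpack the newest backup archive into a scratch
directory and verify its files against a checksum manifest.

Hypothetical assumptions (adapt to your own setup):
  - backups land in /var/backups/myservice as timestamp-named .tar.gz files
  - each archive contains a MANIFEST file with lines "<sha256>  <relative path>"
"""
import hashlib
import sys
import tarfile
import tempfile
from pathlib import Path

BACKUP_DIR = Path("/var/backups/myservice")  # hypothetical location

def sha256(path: Path) -> str:
    """Hash a file in chunks so large restores don't blow up memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def main() -> int:
    archives = sorted(BACKUP_DIR.glob("*.tar.gz"))
    if not archives:
        print("FAIL: no backup archives found")
        return 1
    latest = archives[-1]  # newest by name, assuming timestamped names

    with tempfile.TemporaryDirectory() as scratch:
        # Actually restore the data, not just list the archive contents.
        with tarfile.open(latest) as tar:
            tar.extractall(scratch)

        manifest = Path(scratch) / "MANIFEST"
        if not manifest.exists():
            print(f"FAIL: {latest.name} has no MANIFEST")
            return 1

        bad = 0
        for line in manifest.read_text().splitlines():
            expected, name = line.split(maxsplit=1)
            restored = Path(scratch) / name
            if not restored.exists() or sha256(restored) != expected:
                print(f"FAIL: {name} missing or corrupt in {latest.name}")
                bad += 1
        if bad:
            return 1

    print(f"OK: {latest.name} restored and verified")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

The point isn't this particular script; it's that the restore runs end to end on a schedule, so you find out the rotation has already overwritten your data before you need it, not after.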