
Unexplained loads seen on a Linux webserver

45 posts in this topic

Been driving me mad for about 9 days. My dedicated server slows to a crawl, the server load shoots up, sites become unavailable, cpsrvd restarts and then after xx minutes it all goes back to normal.

 

We have been looking for DoS attacks, hackers, faulty hard drives, robots and spiders, rogue scripts etc. I am not good at this stuff, but I have had software and hardware guys trying to work it out. We have replaced all components except the hard drive so far.

 

However... it just happened again for 20-30 minutes and I took some screenshots of top while logged in via SSH. The first is with high load, and in the second, as you can see, the server load is just coming down. Anyone see anything as a possible cause???

 

[attached screenshot: post-544-1223299130_thumb.jpg]


kswapd0 is the virtual memory management daemon but your VM cache is at zero. Some mistake there surely?


Not too sure there. What flavour and version of Linux are you running, and is it booted to runlevel 3 or 5 (i.e. do you have the GUI running too)?

 

It looks like your pooter is waiting for IO, so the disk should be investigated - the computer is IO-limited. This link is quite interesting. The 97%+ wa value could also come from a poorly configured gigabit ethernet card on a slow bus - gigabit ethernet cards need to be on PCIe, or ideally fully integrated on the mainboard, to work properly; otherwise they risk drowning the system in interrupts.
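The interrupt theory is cheap to check while the box is struggling. A rough sketch, assuming a Linux box with a readable /proc (interface names and IRQ layout will differ on your machine):

```shell
# Snapshot /proc/interrupts twice, a second apart, and diff the counters.
# A NIC line whose count jumps by tens of thousands per second is suspect.
cat /proc/interrupts > /tmp/irq.1
sleep 1
cat /proc/interrupts > /tmp/irq.2
diff /tmp/irq.1 /tmp/irq.2 | head -20   # only the lines whose counts changed
```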

 

Oh, and look out for stupid 'indexing' tasks that run from time to time. Older SuSE versions have one that does this at random points and it's a real pain in the tits.


If it's waiting for IO and the swap daemon is the top process then you'd expect a prob with the swap file/disk. But according to top, there isn't one.

 

Maybe configure it with a swap file and see what happens then?
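Adding a swap file is quick to try. A minimal sketch - the /tmp path and tiny 16 MiB size are just for demonstration (in production you'd want something like a 1-2 GB /swapfile), and the final swapon step needs root:

```shell
SWAPFILE=/tmp/swapfile.demo
# Allocate the backing file (16 MiB here to keep the demo small)
dd if=/dev/zero of="$SWAPFILE" bs=1M count=16
chmod 600 "$SWAPFILE"                 # swap must not be world-readable
# Write the swap signature if mkswap is available
command -v mkswap >/dev/null && mkswap "$SWAPFILE"
# As root you would then enable it and check it shows up:
#   swapon "$SWAPFILE"
#   cat /proc/swaps
ls -l "$SWAPFILE"
```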


Seems to be that the WAIT time is 97.8% at the top, so everyone is saying that something is causing heavy I/O reads/writes to the hard disk. But we have no indication of what is running and causing the problem, although we are firewalled etc. and my hosting company do not think it is hackers/DoS/robots etc. from outside. They reckon it's an iffy script or process or something internal.

 

It looks like there is enough free RAM even when it is falling over??


Actually what's happening is the CPU is waiting for IO, not actually doing it, so you may not see a process with a high CPU load. Check the link in my previous post on vmstat and ifconfig to find out if it's the hard disk or network that's giving you trouble. It may be that the disk is on the way out and is slow, or is simply overloaded.

 

 

If one of the wa or hi parameters is high, it can indicate a real problem. Normally, the wa parameter shows how much time the CPU has wasted waiting for I/O. This I/O can come from the hard disk or from the network. Therefore, a high value on the wa parameter often indicates a slow hard disk or a slow network connection, which will require some fine tuning. To find out if it is the hard disk or the network, you can use vmstat and ifconfig. The ifconfig command shows statistics on packets handled by a network card, whereas the vmstat command provides information about the amount of traffic handled by a hard disk. The latter is displayed by the bi (blocks in) and bo (blocks out) parameters. If these are really high, the disk may be the cause of the high value in top's wa parameter. In that case it may be useful to upgrade the disk channel.
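Putting that advice into commands, something like the following shows both sides at once. This is a sketch: vmstat may not be installed everywhere (it's part of procps), and /proc/net/dev carries the same per-interface counters that ifconfig prints:

```shell
# Disk side: vmstat's bi/bo columns are blocks in/out per interval,
# and the wa column is the same IO-wait percentage top is showing.
if command -v vmstat >/dev/null; then
    vmstat 1 3          # three one-second samples; ignore the first summary line
fi
# Network side: per-interface RX/TX byte and packet counters
cat /proc/net/dev
```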

kswapd0 (the kernel swap daemon) wouldn't be that busy if you had enough RAM. More RAM will help with the IO performance as well, since free memory is used for disk caching. The reason you see free memory in top is probably that kswapd0 is busy killing processes which exhaust memory (how is swapping supposed to work if there is no swap partition?), thereby exacerbating the problem.
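One way to check that theory directly (a sketch; log locations vary by distro, and dmesg may need privileges): look for out-of-memory killer traces and confirm whether any swap is configured at all:

```shell
cat /proc/swaps          # nothing below the header line means no swap at all
# OOM-killer activity leaves traces in the kernel log and syslog:
dmesg 2>/dev/null | grep -iE 'out of memory|oom' | tail -5
grep -ihE 'out of memory|oom' /var/log/messages 2>/dev/null | tail -5
free -m 2>/dev/null      # overall memory and cache picture
true                     # the greps legitimately find nothing on a healthy box
```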


You mention that the cpsrvd service sucks up resources and eventually restarts. You wouldn't by chance be using cPanel as your management interface, would you? Older versions of cPanel have problems with memory leaks, which would crash the service and slow the server to a crawl. Sounds exactly like what's happening to you.

Get your provider to bring their cPanel installs up to date with all the latest patches. Getting all the other software updated and patched wouldn't hurt either. If you have full control of the server, then think about disabling the web GUI management interface and making your changes from the command line.

Upgrading the memory from 1GB to 2GB or more would also help, but if you're using an old version of cPanel this memory would eventually get sucked up too.
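If you can get a shell on it, sampling cpsrvd's memory over time would show a leak directly. A sketch - the sample count and one-second interval are just for demonstration (in practice you'd sample every minute for an hour), and cpsrvd is the cPanel service name mentioned above:

```shell
SAMPLES=3; INTERVAL=1; LOG=/tmp/cpsrvd-mem.log
: > "$LOG"
i=0
while [ "$i" -lt "$SAMPLES" ]; do
    # pid, resident and virtual size (KB) of cpsrvd; "-" if it isn't running here
    line=$(ps -C cpsrvd -o pid=,rss=,vsz= 2>/dev/null) || line="-"
    echo "$(date '+%F %T') ${line:--}" >> "$LOG"
    i=$((i + 1))
    sleep "$INTERVAL"
done
cat "$LOG"               # a steadily climbing RSS between restarts = leak
```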


Doesn't look like a memory leak, there's nearly 400MB free memory from the first screenshot.


 

> Seems to be that the WAIT time is 97.8% at the top, so everyone is saying that something is causing heavy I/O reads/writes to the hard disk.

Well, it's waiting for IO. Either there's a bottleneck due to underspecced or underperforming hardware, or it's a config error causing a stall. My money's on the latter. Either a buffer has filled up and can't be cleared, or the swap manager is trying to page some data out or in and fails. Are the disks RAIDed? If so, hardware or software RAID? Is there a cache on the RAID card? Battery-backed? Kernel drivers up to date for the hardware? Etc etc.


It's tricky 'cos all that stuff is in theory down to the guys who run the server - who are kinda spitting their dummies out now, saying there is nothing wrong with the server, that they have run loads of tests, and that I must have a rogue bit of software.

 

Then the guy who actually writes all the e-commerce software has been all over it, and says it ain't his software.

 

My current only plan is to get a brand new server, and rebuild my sites from scratch, watching like a hawk. But boy is that gonna be a drama.

 

It is weird that it kicks in at certain times. Like 9pm last night, and 1pm today for example. Basically it all goes horrid for 20 minutes, and then behaves after that.


 

> It's tricky 'cos all that stuff is in theory down to the guys who run the server - who are kinda spitting their dummies out now, saying there is nothing wrong with the server, that they have run loads of tests, and that I must have a rogue bit of software.
>
> Then the guy who actually writes all the e-commerce software has been all over it, and says it ain't his software.

That's typical finger-pointing...

 

 

> My current only plan is to get a brand new server, and rebuild my sites from scratch, watching like a hawk. But boy is that gonna be a drama.

Try sticking another 1 GB of main memory in first.


And add a swap file. It's odd that there isn't one.


But there's memory free... I'd try adding a swap partition (or a file) and see if that changes anything. If memory is the problem you should at least see some paging activity. I suspect you won't.

 

Also check your cron files/directories to see if there's some indexing action that kicks off at the times you see the problem.
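Concretely, these are the usual places to look for jobs scheduled around the times the load spikes (paths are the common defaults; a cPanel box also keeps its own maintenance scripts in cron):

```shell
# System-wide crontab and the drop-in directories
for f in /etc/crontab /etc/cron.d /etc/cron.hourly /etc/cron.daily; do
    [ -e "$f" ] && { echo "== $f"; ls -l "$f"; }
done
# Per-user crontabs need root to enumerate:
#   for u in $(cut -d: -f1 /etc/passwd); do crontab -l -u "$u" 2>/dev/null; done
true
```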

 

'ps' accesses structures under /proc that can be locked in a deadlock or livelock scenario, so I often use the command

 

top -bn1 | grep ' D'

 

(spaces important) to identify processes that are waiting on IO or locks during such partial hangs.

 

What sort of filesystems are you using locally on the machine (ext3/xfs/reiser), and are any of them quite full? Maybe try letting a "find <fs> -type f -exec dd if={} of=/dev/null bs=1k count=1 \;" run over the filesystems to see if that duplicates the problem (the command reads the first kilobyte of every file on the filesystem)? Defective storage hardware could conceivably be contributing to the problem here, but you'd expect errors in the messages file were that the case.


Have you tried continually running a process list to a file while this problem is occurring?
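That's worth setting up in advance so the data is already there when the next 20-minute episode hits. A sketch - GNU ps is assumed for the --sort flag, and the three one-second samples are just for demonstration (run it for real under nohup or screen with a longer interval):

```shell
OUT=/tmp/ps-snapshots.log
: > "$OUT"
for n in 1 2 3; do
    echo "=== $(date '+%F %T')" >> "$OUT"
    # Top CPU consumers at this instant; D-state processes show as well
    ps aux --sort=-%cpu 2>/dev/null | head -15 >> "$OUT"
    sleep 1
done
grep -c '^===' "$OUT"    # number of snapshots captured
```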


 

> It's tricky 'cos all that stuff is in theory down to the guys who run the server - who are kinda spitting their dummies out now, saying there is nothing wrong with the server, that they have run loads of tests, and that I must have a rogue bit of software.

This isn't an app problem, I'd put money on it. They need to get over themselves and do some proper troubleshooting.

 

When the server slows down, is everything affected, or is it only web serving? If it's everything, you can eliminate your apps by killing the httpd and mysql processes one at a time. If the server starts responding again, then it's down to the last one you killed. If not, it's a system/hardware thing.

