Free up RAM on an NFS server


agathongroup
01-30-2009, 06:59 AM
This is more of a Linux sysadmin question, but I thought I'd try here anyway. :-) We have an NFS server, running in the stock NAS appliance in 2.4.x, whose RAM usage slowly creeps up until it consumes all available memory. On most servers a workaround is to periodically restart certain services (e.g., Apache), but on this server the RAM usage appears to be entirely in the kernel (NFS caches or whatnot). None of the processes show particularly high RAM usage, so I'm assuming it's the kernel...

So I tried to restart nfs (service nfs restart) and hoo boy did that piss off the system. :-) I guess it didn't like that with still-mounted shares. Is there another way to flush the NFS caches and reclaim some of that RAM before the (swap-less) system starts randomly killing off processes?

Thanks,
Peter

PavelGeorgiev
01-30-2009, 12:23 PM
Peter,
It is OK if NFS (or any other service) takes up the available RAM for caches, and you should not try to battle this, as it is actually a good thing. It is not OK if a standard appliance (running with its minimal resources) runs out of memory and starts killing processes. If that actually happened, it's a problem with the appliance that we need to take care of; let us know if this is the case. We'll also need any info on how to reproduce the problem: how much memory the appliance used, how loaded it was, and which proc(s) were killed.

Regards,
Pavel

agathongroup
01-30-2009, 12:37 PM
It is OK if NFS (or any other service) takes up the available RAM for caches, and you should not try to battle this, as it is actually a good thing. It is not OK if a standard appliance (running with its minimal resources) runs out of memory and starts killing processes. If that actually happened, it's a problem with the appliance that we need to take care of; let us know if this is the case. We'll also need any info on how to reproduce the problem: how much memory the appliance used, how loaded it was, and which proc(s) were killed.

Normally, I'd agree — I rarely, if ever, start monkeying around with memory management tuning parameters. However, the NAS appliance in one of our applications (similar to LampCluster) has been periodically consuming its RAM and killing off processes. An example:

sshd invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
[<c010622b>] show_trace+0x1b/0x20
[<c0106256>] dump_stack+0x26/0x30
[<c0148fc6>] out_of_memory+0x1a6/0x1e0
[<c014a976>] __alloc_pages+0x276/0x2e0
[<c014c29a>] __do_page_cache_readahead+0x12a/0x2c0
[<c014c7a0>] do_page_cache_readahead+0x40/0x60
[<c01482ab>] filemap_nopage+0x2eb/0x3c0
[<c0154abc>] __handle_mm_fault+0x1fc/0x1170
[<c0114053>] do_page_fault+0x123/0xc44
[<c0105a5b>] error_code+0x2b/0x30
DMA per-cpu:
cpu 0 hot: high 0, batch 1 used:0
cpu 0 cold: high 0, batch 1 used:0
DMA32 per-cpu: empty
Normal per-cpu:
cpu 0 hot: high 90, batch 15 used:70
cpu 0 cold: high 30, batch 7 used:18
HighMem per-cpu: empty
Free pages: 3040kB (0kB HighMem)
Active:1796 inactive:212 dirty:0 writeback:35 unstable:0 free:760 slab:59006 mapped:3 pagetables:190
DMA free:1116kB min:124kB low:152kB high:184kB active:268kB inactive:52kB present:16384kB pages_scanned:42667 all_unreclaimable? yes
lowmem_reserve[]: 0 0 248 248
DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 248 248
Normal free:1924kB min:1948kB low:2432kB high:2920kB active:6916kB inactive:796kB present:253952kB pages_scanned:58340 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 1*4kB 1*8kB 1*16kB 0*32kB 1*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1116kB
DMA32: empty
Normal: 1*4kB 16*8kB 10*16kB 5*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1924kB
HighMem: empty
Swap cache: add 0, delete 0, find 0/0, race 0+0
Free swap = 0kB
Total swap = 0kB
printk: 82222 messages suppressed.

It's a standard NAS appliance with 256 MB of RAM, 0.6 CPU (if that matters), and shares mounted via NFS and SMB on a web server and an email server. I'm not sure that it's readily reproduced, but let me know if there's any other information I can provide.
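
For scale, the page counts in the summary line of the dump above can be converted to kilobytes. This is only a rough sketch: it assumes 4 kB pages (which the `1*4kB` entries in the buddy lists suggest), and the helper name `pages_to_kb` is made up for illustration.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, as the "1*4kB" buddy entries suggest

# Summary line copied from the OOM dump above
SUMMARY = ("Active:1796 inactive:212 dirty:0 writeback:35 unstable:0 "
           "free:760 slab:59006 mapped:3 pagetables:190")


def pages_to_kb(line, page_kb=PAGE_KB):
    """Parse 'name:count' pairs (counts are in pages) and return {name: kB}."""
    return {name.lower(): int(count) * page_kb
            for name, count in re.findall(r"(\w+):(\d+)", line)}


usage = pages_to_kb(SUMMARY)

# Print the categories from largest to smallest
for name, kb in sorted(usage.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<11}{kb:>8} kB")
```

Under that assumption, slab comes to 236024 kB on a 256 MB appliance, while active and inactive pages together are only about 8 MB. The `free` entry works out to 3040 kB, matching the "Free pages" line in the dump, which is a useful sanity check on the page size.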

PeterNic
01-30-2009, 02:57 PM
Peter,

We never proved this, but I suspected the FC3 kernel had a bug that caused cache memory not to be released properly if there is no swap in the OS. You can try one of two things:

1. Add a small swap and see if this resolves the problem (if not, it will eat up all the swap and still run out of memory).
2. Upgrade the NAS appliance to the very latest OS and kernel (this can also deal with potential bugs in the NFS server in old kernels).


Finally, you should be able to restart the NFS process -- or even restart the whole appliance. NFS with hard mounts should survive this even on the client side -- but it is quite a brutal solution.

Regards,
-- Peter

agathongroup
02-05-2009, 04:18 PM
Peter,

I've tried adding a small swap partition to one of our NAS appliances. I'll let you know what happens.

Peter G.

agathongroup
02-19-2009, 10:53 AM
As promised, an update:

NAS still puked after allocating all RAM and swap, kicking off the OOM killer and generally wreaking havoc on the appliance. :-(

Are there any plans to upgrade NAS to a more recent OS? If not, how would you suggest I fix this problem?

Thanks,
Peter

PeterNic
02-20-2009, 05:11 PM
Peter,

1. Can we get a reproducible scenario (i.e., what do we need to do to reproduce it)? Either post it here or submit a ticket at the helpdesk.

2. I am sure we will renew the NAS appliance... what I can suggest is essentially rebuilding the NAS appliance with either LUX or Ubuntu -- at least for the test. All this should take is to branch LUX, install the NFS server and put it under the same conditions. If it survives, we can either cook one for you or we can help you move over the 2-3 configuration scripts that take NAS properties and configure nfs, http and samba.

Regards,
-- Peter

agathongroup
03-12-2009, 05:23 PM
1. I can't reliably reproduce the problem. :-(

2. It just occurred to me that branching LUX64 might not actually help me. Both LUX64 and NAS run on the same kernel — 2.6.18.8-domU — and the NFS server runs in kernel space. Since the NFS server appears to be the bit sucking up RAM and not letting go, does it stand to reason that the problem might foreseeably occur in a branch of LUX64?

If so, and a kernel/OS upgrade isn't likely to get us decent results, then I'm not sure what the next step would be. Hopefully, though, my assumptions are incorrect about the NFS server being in the 3tera-supplied domU kernel (and, thus, not affected by an "upgrade" from FC3 to CentOS 5.2).

I'm kinda rambling... anyone have thoughts about this?

Thanks,
Peter

PeterNic
03-12-2009, 06:36 PM
1. I can't reliably reproduce the problem. :-(

Can you unreliably reproduce it?

I have no problem getting a small test app and running it for a week until it "pukes" (to borrow your term).


2. It just occurred to me that branching LUX64 might not actually help me. Both LUX64 and NAS run on the same kernel — 2.6.18.8-domU — and the NFS server runs in kernel space. Since the NFS server appears to be the bit sucking up RAM and not letting go, does it stand to reason that the problem might foreseeably occur in a branch of LUX64?

If so, and a kernel/OS upgrade isn't likely to get us decent results, then I'm not sure what the next step would be. Hopefully, though, my assumptions are incorrect about the NFS server being in the 3tera-supplied domU kernel (and, thus, not affected by an "upgrade" from FC3 to CentOS 5.2).

It still may help -- there are some things in user mode that affect how kernel mode operates. I think it is worth a try. Just have LUX64 with nfs enabled (or LUX) -- anything based on CentOS 5 or later.


I'm kinda rambling... anyone have thoughts about this?

In any case, this seems to have been a pest for a while and I would like to have it resolved (as, I am sure, you would, too). Give me what you can so I can try to reproduce -- I need a NAS and a what?


Best regards,
-- Peter

LeoKalev
03-13-2009, 03:58 AM
Quick note: if the memory is being eaten up in the kernel itself, the contents of /proc/slabinfo may provide a clue as to what is eating up the memory.

agathongroup
03-19-2009, 10:40 AM
Leo,

Thanks for the hint. I checked out slabinfo, and here's what I saw:


# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
rpc_tasks 1022694 1022700 192 20 1 : tunables 120 60 8 : slabdata 51135 51135 0


The output from free at the time:


             total       used       free     shared    buffers     cached
Mem:        262288     258304       3984          0       5844      15336
-/+ buffers/cache:     237124      25164
Swap:            0          0          0


(The swap partition we monkeyed around with earlier isn't set on boot for that system.) So it seems clear that NFS is the culprit... but that's kinda where I get stuck. :-( Any further thoughts based on that output?

Thanks,
Peter
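
To put the rpc_tasks line in perspective, here is a rough sketch that turns it into kilobytes and compares it with the total from free. It assumes 4 kB pages and the slabinfo column layout shown in the header line above; the function name `slab_cache_kb` is hypothetical.

```python
PAGE_KB = 4  # assumption: 4 kB pages on this kernel

# Line copied from the slabinfo output above
SLAB_LINE = ("rpc_tasks 1022694 1022700 192 20 1 : tunables 120 60 8 "
             ": slabdata 51135 51135 0")
MEM_TOTAL_KB = 262288  # "total" column from the free output above


def slab_cache_kb(line, page_kb=PAGE_KB):
    """Return (name, kB) for one slabinfo line in the layout shown above."""
    stats, _tunables, slabdata = line.split(":")
    # <name> <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
    name, _active, _num_objs, _objsize, _objperslab, pages_per_slab = stats.split()
    # slabdata <active_slabs> <num_slabs> <sharedavail>
    _label, _active_slabs, num_slabs, _sharedavail = slabdata.split()
    return name, int(num_slabs) * int(pages_per_slab) * page_kb


name, kb = slab_cache_kb(SLAB_LINE)
share = 100 * kb / MEM_TOTAL_KB
print(f"{name}: {kb} kB ({share:.0f}% of RAM)")
```

With those assumptions, the rpc_tasks cache alone holds 204540 kB, roughly 78% of the appliance's memory, which fits with the large "used" figure on the -/+ buffers/cache row.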

PeterNic
03-23-2009, 09:32 PM
Peter,

Can you give us a simple app that we can use to reproduce the problem (e.g., running bonnie++ on a LINUX connected to a NAS)...?

Regards,
-- Peter

agathongroup
04-03-2009, 08:37 AM
Sorry Peter, I just re-read the thread and you've asked for a sample app at least three separate times. I didn't mean to ignore you; it just happened that way. :-/

We haven't tried to reproduce the problem outside of the one application that is exhibiting it. That application is essentially a LAMP stack app with one web server and one email server connected to it. I'll have to see about reproducing in a simplified environment with bonnie++.

Incidentally, we went ahead and built a new NAS anyway, including near-dynamic volume resizing through the use of LVM. We'll probably be contacting you after it's been running for a while to help with the MON interaction. :-)

Thanks for all of the help!