Wednesday, October 7, 2009

Solaris 10 10/09 ZFS cache improvements

Larry Wake in an interview by Chhandomay Mandal provides a good overview of the new features of Solaris 10 10/09 (a.k.a. Update 8).


With respect to my work in filesystem caching strategies, this new Solaris release introduces three excellent new features.

First is the introduction of L2ARC cache support. This means that you can now use readzilla and writezilla SSD devices in any Sun server.
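As a rough illustration of what attaching an L2ARC device looks like, here is a minimal sketch. The pool name zpool matches the examples later in this post; the device name c1t2d0 and the helper function l2arc_add_cmd are assumptions made up for illustration. The function only prints the command so that you can review it before running it as root.

```shell
#!/bin/sh
# Sketch: build (but do not run) the command that adds an SSD to a pool
# as an L2ARC cache device. The general form is:
#   zpool add <pool> cache <device> [<device> ...]
# l2arc_add_cmd and the device name c1t2d0 are hypothetical.
l2arc_add_cmd() {
    pool=$1
    shift
    echo "zpool add $pool cache $*"
}

# Print the command for review; substitute your own pool and device.
l2arc_add_cmd zpool c1t2d0
```

After running the printed command, the device should be listed under a separate cache section in the output of zpool status.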

Second is the introduction of ZFS ARC cache controls through the primarycache (i.e. L1 ARC cache) and secondarycache (i.e. L2 ARC cache) filesystem properties. These new cache controls let you define what is in and, more importantly, what is not in the ZFS ARC cache. You may recall from my blog post on filesystem caching strategies that controlling the contents of the ZFS ARC cache can produce much better and more consistent performance for data-centric services such as directory server.

For example, let's say that I am deploying directory server (e.g. DSEE) with the following layout:
* DSEE Bits: ZFS filesystem zpool/bits - primary and secondary caches are disabled
* DS info and txn logs: ZFS filesystem zpool/logs - primary and secondary caches are disabled
* DS DB and ChangeLog: ZFS filesystem zpool/data - primary and secondary caches are enabled

By using the primary and secondary cache controls, I guarantee that, for the zpool ZFS pool, the only data stored in the ARC cache is the DS data and changelog.

Here is how to disable both primary and secondary cache for the zpool/logs filesystem at the time of filesystem creation:
# zfs create -o primarycache=none -o secondarycache=none zpool/logs
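The same command can be extended to the whole directory server layout described earlier. The following is a minimal sketch, not a tested deployment script. The POOL and DRY_RUN variables and the run and create_layout helpers are assumptions introduced for illustration. By default it only prints the commands; set DRY_RUN=0 and run it as root to actually create the filesystems.

```shell
#!/bin/sh
# Sketch: create the three filesystems from the DSEE layout.
# POOL, DRY_RUN, run and create_layout are hypothetical names.
POOL=${POOL:-zpool}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

create_layout() {
    # Binaries and logs: keep them out of both the L1 and L2 ARC.
    for fs in bits logs; do
        run zfs create -o primarycache=none -o secondarycache=none "$POOL/$fs"
    done
    # Database and changelog: cached. "all" is the default for both
    # properties; it is spelled out here only to make the intent explicit.
    run zfs create -o primarycache=all -o secondarycache=all "$POOL/data"
}

create_layout
```

Both properties also accept the value metadata, which caches only filesystem metadata and may be a useful middle ground for some workloads.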

Here is how to disable both primary and secondary cache for the zpool/logs filesystem after the filesystem has already been created:
# zfs set primarycache=none zpool/logs
# zfs set secondarycache=none zpool/logs
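To confirm that the settings took effect, you can read the properties back with zfs get. The sketch below assumes the scripted output format of zfs get -H -o property,value (one tab-separated property and value per line). The helper name check_uncached is made up for illustration.

```shell
#!/bin/sh
# Sketch: verify that every cache property read on stdin is set to "none".
# Expects the tab-separated output of:
#   zfs get -H -o property,value primarycache,secondarycache <filesystem>
# check_uncached is a hypothetical helper name.
check_uncached() {
    awk -F'\t' '
        { n++ }
        $2 != "none" { bad = 1 }
        END { exit ((n > 0 && !bad) ? 0 : 1) }
    '
}

# Only query ZFS when the zfs command is available on this host.
if command -v zfs >/dev/null 2>&1; then
    zfs get -H -o property,value primarycache,secondarycache zpool/logs |
        check_uncached && echo "zpool/logs is excluded from the ARC"
fi
```

The check fails on empty input, so a missing filesystem is not reported as uncached.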

Note that if you want to create pre-tuned ZFS filesystems and associate them with a zone at the same time that you create the zone, you can do this through The Zone Manager with the -r or -w flags. This is possible with the latest release through the extension that allows you to pass ZFS options like "primarycache=none;secondarycache=none;compression=gzip" to the -r or -w flags. Click here to see full usage help.

The third new feature is the breakout of ZFS ARC cache accounting in the ::memstat kernel metrics. Although the Solaris documentation doesn't make mention of this feature, I presume that it is present in support of the new ARC cache controls. You can see for yourself by running the following command:

# echo "::memstat" | nice -10 mdb -k

Note that you should not run this command on a production server as it may significantly reduce performance of the system while it scans through all physical memory. Note also that the time to complete running is proportional to the amount of physical memory installed in the server.
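If scanning all of physical memory is a concern, the current ARC size can usually be read from the kernel statistics instead. This is not mentioned above and is offered as an assumption: the sketch relies on the zfs:0:arcstats:size kstat, which reports the ARC size in bytes, and on a made-up helper named arc_size_mb that converts it to whole megabytes.

```shell
#!/bin/sh
# Sketch: report the ARC size in MB without running ::memstat.
# Expects on stdin the output of `kstat -p zfs:0:arcstats:size`,
# which has the form: zfs:0:arcstats:size<TAB><bytes>
# arc_size_mb is a hypothetical helper name.
arc_size_mb() {
    awk '{ printf "%d\n", $2 / 1048576 }'
}

# Only query the statistic when the kstat command is available on this host.
if command -v kstat >/dev/null 2>&1; then
    kstat -p zfs:0:arcstats:size | arc_size_mb
fi
```

Note that the ARC size reported this way is not necessarily identical to the ZFS File Data figure from ::memstat, so treat it as a quick estimate rather than a replacement.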

Here is a sample output of ::memstat metrics prior to Solaris 10 10/09:
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                      94731               370    9%
Anon                        35113               137    3%
Exec and libs                4544                17    0%
Page cache                 150191               586   14%
Free (cachelist)           394526              1541   38%
Free (freelist)            367163              1434   35%


Here is a sample of what I hope that you will see with Solaris 10 10/09:
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     428086              3344    3%
ZFS File Data               25006               195    0%
Anon                     13992767            109318   85%
Exec and libs                 652                 5    0%
Page cache                  24979               195    0%
Free (cachelist)             1809                14    0%
Free (freelist)           1979424             15464   12%
Total                    16452723            128536

Have a super day!

Brad
PS: As soon as I get the chance to download and install Solaris 10 10/09, I will check and confirm or deny the presence of the new memstat data.

7 comments:

Brad Diggs said...

I just confirmed that the ::memstat kernel metrics do break out ZFS data as mentioned in the blog post.

Have a super day!

Brad

r4sutton said...

Very helpful, thanks. Can you confirm if the /etc/system set command:

set zfs:zfs_arc_max=

is still valid in release 10/09?

r4sutton@mac.com

Brad Diggs said...

Yes, zfs:zfs_arc_max still applies to Solaris 10/09. However, always refer to the Solaris documentation, the ZFS Evil Tuning Guide and the ZFS Best Practices Guide for the latest tuning features of the ZFS filesystem.

Donovan said...

I am seeking information on how the ZFS cache is handled by the Solaris VM subsystem when application memory demand increases.

On one of our Sun servers, I noted that the ::memstat output shows all extra available memory going to the ZFS cache: 67% of the memory is in the ZFS cache according to ::memstat. The application only needs 19% (~6GB) of the 32GB. Traditionally, Solaris would free the older file cache segments as application memory demand increases. Can someone chime in here on the behavior of the Solaris VM manager in terms of freeing ZFS cache when an application demands more memory?

Based on my observations, I think ZFS-related memory segments are freed when the application demands memory.

Brad Diggs said...

Donovan,

The ZFS cache will free up space when an application needs it. However, if the application needs a lot of memory quickly, it may not get the memory fast enough and can time out. This is rare though. The best thing to do is to tune the upper boundary of the ZFS ARC so that it does not exceed a safe limit, which mitigates contention between the cache and applications. You can learn more on this topic from my blog post on the same subject:
Filesystem Cache Optimization Strategies

Trevor said...

Brad,

We experienced a problem on an M5000 where the whole box hung. The Sun engineers pointed to a possible memory shortage due to the over-usage of memory by the ZFS ARC cache and its inability to give it up. I would understand if the application had a timeout, but I am surprised and a little bit skeptical that it would cause the canister and the global zone to completely hang.

Have you ever seen this behaviour and resolved it by setting the zfs_arc_max lower?

Brad Diggs said...

Hello Trevor,

Thanks for your comment. I personally have not observed the specific scenario that you have described with current versions of Solaris (e.g. U7 or U8). However, I can envision a scenario where the kernel (or some other large consumer or heavy user of RAM) and the ZFS ARC compete for memory, which could slow things down from time to time.

You might want to make sure that you are on the latest version of Solaris, because earlier versions were more susceptible to RAM overlap and contention than current releases. For example, I encountered similar issues on my network backup server (basically little RAM plus lots of zones with rsyncs over ssh to each zone) with pre-U6 versions of Solaris 10. For both past and current versions, reducing the ZFS ARC size helped.

I hope that helps!

Brad