Thursday, March 26, 2009

Filesystem Cache Optimization Strategies

One often overlooked high performance feature of the Solaris operating system is the filesystem cache. The filesystem cache improves performance by temporarily storing data read from disk in available system memory. The time required to retrieve data from disk based storage varies anywhere from a few milliseconds to several hundred milliseconds depending on the type, configuration, performance, and utilization of the storage holding the data. Retrieving the same data from memory is considerably faster. The difference in speed comes from the length of the I/O path. A read operation from disk traverses the system bus, an I/O controller (SCSI, SATA, FC, ...), and a disk controller, gets the data from one or more spinning platters, and then traverses back through the same path to return the data to the process. Read operations from the filesystem cache go no further than the system bus because the system memory is usually on the main system board. Each step along their respective paths adds an incremental amount of time to the overall response time of the read operation.

Why Use Filesystem Cache
What are the potential benefits that come with a filesystem caching strategy? One or more of the following benefits may apply to your data centric application.
  • If the application is a 32-bit application, it may not be able to address more than 4GB of memory. Using this strategy may enable the application to benefit from far more memory because the data is read from the filesystem cache instead of the storage devices.
  • May deliver equal to or in some cases better performance than the application's native caching methods.
  • May not flood the underlying storage during application checkpoints (or garbage collections in the case of Java applications). For a very large database cache and slow or very busy disk storage, checkpoints can result in drops in performance simply because the filesystem is so busy flushing data from the transaction logs into the database files. Filesystem caching can eliminate many unnecessary disk read operations, which in turn enables better write throughput.
  • The probability of a memory leak existing in the filesystem or its cache is very low.
  • If a memory leak ever does occur in the application, you will be able to detect it sooner, track it longer, and let it run longer before running out of available memory.
  • If the application ever core dumps, the resulting core file could be significantly smaller for applications that otherwise would have a large internal cache. This has two benefits. First, it doesn't take nearly as long to upload the core file to support. Second, there is much less data in the core file for the engineers to cull through when diagnosing a problem.
  • May improve overall disk throughput because fewer read operations are going to disk. In some customer cases, this has made a very dramatic difference because the disk drives and I/O controllers were so busy with read operations that they could barely keep up with replication (write operations).
  • Can enable you to maximize application memory efficiency. Filesystem caching maps directory data directly into memory without any inflation. Some caches, like the database cache of Directory Server Enterprise Edition (DSEE) [nsslapd-dbcache], use roughly 1.2x as much memory as the on disk format. The DSEE entry cache uses roughly 4x as much memory as the on disk format because it converts the data to LDIF format for optimal consumption. Thus for DSEE, the filesystem cache offers the best memory efficiency, i.e. you can fit more data into the filesystem cache than you can into the db or entry caches.
Solaris Filesystem Caches: segmap cache
Although Solaris supports many different filesystems, the two most commonly used filesystems are UFS and ZFS. Prior to ZFS, the filesystems including UFS, NFS, QFS, and VxFS used a common filesystem cache called segmap. Segmap is a pre-allocated memory pool used to map portions of file data into pages of kernel virtual memory. Once file data is mapped into the segmap cache, subsequent read operations are served from either the segmap cache or its overflow cache, the Free cachelist, instead of from the underlying disk storage. Once data is evicted from the Free cachelist, it needs to be retrieved again from disk. See Richard McDougall's blog post to learn more about how segmap works.

Solaris Filesystem Caches: Vnode Page Mapping cache
Over the life span of the segmap design, there have been some performance penalties associated with page eviction and replacement from the cache. One example is that excessive cross calls can occur when unloading page mappings during eviction. This can result in excessive CPU utilization during a heavy volume of page evictions. These issues are addressed for UFS via CR 6256083, which implements a new lightweight file mapping mechanism that in most cases substitutes for segmap. This new file mapping mechanism is called the Vnode Page Mapping (VPM) cache. As of Solaris 10 update 6, VPM is implemented on SPARC and x86 64-bit/x64 architectures.

Solaris Filesystem Caches: ZFS Adaptive Replacement Cache (ARC)
The ZFS ARC cache has two levels of filesystem caching. The primary cache, called the primary ARC cache, uses physical memory, and the second cache, called the secondary L2ARC cache, uses solid state disks to cache data. The L2ARC is not currently in Solaris 10 update 6 but is in OpenSolaris 2008.11. The default configuration of the ZFS ARC (primary) cache is to use all unused memory for caching data. Note that one of the primary design goals for ZFS is to be self adjusting so that it does not require deep tuning skills to optimize it for specific use cases. With that in mind, some or all of the tuning options defined in this blog post may change in future versions of the ZFS filesystem. To ensure that you have the most up-to-date information on optimizing ZFS performance, see the ZFS Evil Tuning Guide and the ZFS Best Practices Guide.

Solaris Filesystem Caches: Memory Contention
Once server memory reaches capacity, the filesystem caches will give up memory to applications requesting memory. I refer to this overlapping use of memory as memory contention. The ZFS ARC and segmap filesystem caches handle memory contention a little differently.

The segmap cache architecture consists of two levels of caching. The primary level is the segmap cache itself. The second is the Free cachelist. This second cache is really free memory that happens to still occupy memory. So applications can claim memory from the Free cachelist with very little overhead.

The ZFS ARC cache, on the other hand, has a single cache that is maintained in the kernel itself. So when an application makes a request for memory, the overhead associated with freeing up that memory is a little greater than with the segmap cache. Consequently, the ZFS ARC cache may need a little more care to ensure that memory contention is avoided altogether.

The best way to avoid memory contention is to eliminate potential overlap of memory consumption. I will talk more about this later. For now, just consider that a data centric application may not reach optimal performance when it constantly contends for memory.

How Filesystem Cache Can Improve Performance
The easiest way to show how a filesystem cache can improve data centric application performance is to consider the find command. The find command, as its name implies, traverses a filesystem tree to find one or more files within it. To see the difference in performance between reading from disk and reading from the filesystem cache, run the following find command twice in a row. For reference later in this document, this find command looks for all files with the SUID or SGID bits set. This is a common command run by systems administrators to find potential security vulnerabilities.

# find / -type f \( -perm -004000 -o -perm -002000 \) -exec ls -lg {} \;

The first time it is run, the command reads the data from disk as it traverses the directory tree starting from the root (/) directory. As the find command reads the data from disk, that data is stored in the filesystem cache. More and more data is populated into the filesystem cache until either the command completes or the filesystem cache reaches maximum capacity. The second time the command is run, some or all of the data is retrieved from the filesystem cache instead of from the disk storage. For example, let's look at how long it takes to run the find command two times in a row. The first non-cached run took 32.93 seconds to complete. The second cached run took 1.74 seconds to complete. That is a 95% reduction in the time to complete the same command. Here is the memstat data from a fresh Solaris reboot and after the first run:

Fresh boot memstat data: Page Cache=69MB  Free (cachelist)=42MB
First run memstat data:  Page Cache=141MB Free (cachelist)=42MB

From the above data, we see that the find pushed approximately 72MB of data into the filesystem cache.
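The figures above can be double checked with a little arithmetic. The following is only a sketch; the helper name pct_reduction is made up for this illustration and is not part of Solaris.

```shell
#!/bin/sh
# Hypothetical helper: percentage reduction between two elapsed times.
# $1 = first (uncached) time in seconds, $2 = second (cached) time
pct_reduction() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) / a * 100 }'
}

# Timings reported above for the two find runs.
pct_reduction 32.93 1.74

# Growth of the page cache between the fresh boot and the first run, in MB.
echo $((141 - 69))
```

The second value matches the approximately 72MB quoted above, and the first rounds to the 95% quoted earlier.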

Now that you see the value of leveraging the filesystem cache to improve data centric application performance, let's look at the configuration considerations for properly implementing a filesystem cache strategy.

Establish A Safe Ceiling
One of the goals of an effective filesystem caching strategy is to avoid memory contention. To do this, configure the filesystem cache to use just enough memory that the operating system and applications won't contend with the filesystem cache for available memory. Consider a server that has 32GB of RAM. In this example, let's assume that the sum of memory consumed by Solaris, the primary application, and all other programs is approximately 8GB. Let's also assume that the primary application is a data centric application that interacts with as much as 80GB of data. Over a sufficient amount of time, the filesystem cache eventually fills to capacity as the primary application interacts with its 80GB of data.

You can get an estimate of the sum of memory consumed by all running processes with the following command.

# ps -eo rss | awk ' BEGIN { t = 0; }{ t += $1; } END { print t; } '

This command returns, in KB, the sum of the resident set size (rss) memory consumed by all running processes. Alternatively, use the modular debugger (mdb) to get a big picture view of how memory is allocated. Consider the following sample memstat output.
# echo "::memstat"|mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                      94731               370    9%
Anon                        35113               137    3%
Exec and libs                4544                17    0%
Page cache                 150191               586   14%
Free (cachelist)           394526              1541   38%
Free (freelist)            367163              1434   35%
Among other things, this output tells us that 35% of memory is unused (Free freelist), that 52% of memory is allocated to the UFS filesystem cache (Page cache plus Free cachelist), and that 9% is used by the Kernel. Note that if this example were showing ZFS caching, the bulk of the percentage would show up in the Kernel line because the ZFS ARC cache is currently accounted for in the Kernel. In a future Solaris 10 version, the ZFS ARC caches will be broken out into their own lines.

Let's get back to our example. Assume that the data centric application, like most, follows the 80/20 rule. The 80/20 rule suggests that 20% of users drive 80% of a service load. Let's also assume that this 20% of users accesses 20% of the 80GB of data. The ideal performance of data centric applications occurs when you keep active data in memory so that very few read operations go to disk storage. In this example, the ideal cache size would be 20% of 80GB, which is 16GB. Let's consider three filesystem caching configurations to determine which would be the most optimal.

UFS Default Caching
The default UFS cache configuration for 64-bit Solaris allocates 12% (3.84GB) of physical memory (32GB) for the filesystem cache. With 8GB (25%) consumed by Solaris and the applications, this leaves approximately 63% of physical memory unused. This configuration would cache only 24% of the ideal 16GB memory allocation for optimal performance. As a result, the I/O subsystem would spend significantly more time performing read operations for data that is not in the filesystem cache. Thus overall performance of the application for data reads would be slower than if the data were in the filesystem cache.

ZFS Default Caching
The default ZFS cache configuration would consume all free memory (~24GB). However, it wouldn't leave any memory buffer to avoid contention with Solaris or other applications on the system. The performance of the application would be good because more than the ideal 16GB of data would be cached. However, there would be a certain amount of performance loss in the give and take contention between the filesystem cache and all other applications' consumption of physical memory. There is an additional issue that may come with the memory contention. If ZFS consumes all available memory while the application is offline, the application may take longer to start as it waits for ZFS to free enough memory.

Optimized ZFS Filesystem Caching
In this configuration for ZFS, we establish a safe upper boundary on memory consumption for the filesystem cache. The goal of the upper boundary is to avoid memory contention. In this example, we set the upper boundary of the filesystem cache to 22GB, which provides a 2GB safety buffer and is 6GB larger than the ideal size of 16GB. Thus, this will further improve application performance and avoid contention with Solaris or applications.
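The sizing arithmetic for this example can be sketched as follows. The variable names are made up for illustration, and the values simply mirror the assumptions stated in the example (32GB of RAM, 8GB in use, 80GB of data, 20% active, 2GB buffer).

```shell
#!/bin/sh
# All values are in GB and mirror the worked example in the text.
total_ram=32      # physical memory
used=8            # Solaris + primary application + other programs
data_size=80      # data the application interacts with
active_pct=20     # share of the data assumed active (80/20 rule)
buffer=2          # safety buffer left unallocated

ideal_cache=$((data_size * active_pct / 100))
ceiling=$((total_ram - used - buffer))

echo "ideal cache:  ${ideal_cache}GB"
echo "safe ceiling: ${ceiling}GB"
```

A ceiling that comes out below the ideal cache size would suggest the server does not have enough memory to hold the active data set.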

Clearly the optimized filesystem cache configuration offers the lowest risk with the best potential performance. Let's look at how to set the upper boundary of the UFS and ZFS filesystems.

Tuning ZFS Cache
The ZFS filesystem cache (ARC) size is specified as a hex value. Thus, to change the ZFS ARC cache from its default of no upper boundary to a fixed upper boundary of 22GB, you can use the following to determine the hexadecimal value.

# bash
# numGigs=22
# decVal=$((${numGigs}*(2**30)));
# echo "obase=16;ibase=10; ${decVal}" | bc

Apply this upper boundary by adding the following entry in /etc/system and reboot the server in order for the change to take effect.

set zfs:zfs_arc_max = 0x580000000
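As an alternative to bc, the same conversion can be sketched with printf. The helper name gib_to_hex is made up for this illustration, and the sketch assumes a shell with 64-bit integer arithmetic.

```shell
#!/bin/sh
# Hypothetical helper: convert a size in GB (2^30 bytes) to the hex form
# used for zfs_arc_max. Assumes the shell has 64-bit arithmetic.
gib_to_hex() {
    printf '0x%x\n' $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_hex 22    # prints 0x580000000
```

After the reboot, the limit actually in effect can likely be confirmed with kstat -p zfs:0:arcstats:c_max, assuming your release exposes the arcstats kstat.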

Unlock The Governors
Although using the filesystem cache may improve the performance of some applications, it is not necessarily best for all applications. Solaris is tuned by default for general purpose computing. Consequently, the default Solaris configuration is not optimized to realize the full potential of the filesystem cache for some applications.

For UFS, the FreeBehind UFS filesystem kernel option is used to prevent caching sequentially read data. The freebehind option needs to be re-configured to allow all data read in to go into the filesystem cache. To do this, add the following line to /etc/system.

set ufs:freebehind = 0
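Before rebooting, it can be worth confirming that the entry is present and not commented out. The following is a minimal sketch; the helper name system_setting is made up, and it assumes the usual /etc/system convention that comment lines begin with an asterisk.

```shell
#!/bin/sh
# Hypothetical helper: print any active "set module:variable" line and
# return non-zero if none is found. Comment lines (starting with "*")
# are ignored.
# $1 = file to inspect, $2 = tunable name, for example ufs:freebehind
system_setting() {
    awk -v name="$2" '
        /^[ \t]*\*/ { next }                 # skip comment lines
        $1 == "set" {
            split($2, kv, "=")               # tolerate "name=value" with no spaces
            if (kv[1] == name) { print; found = 1 }
        }
        END { exit found ? 0 : 1 }
    ' "$1"
}

# Usage on a Solaris host:
#   system_setting /etc/system ufs:freebehind || echo "not set"
```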

Let's examine the limiting effect of freebehind on UFS filesystem cache occupancy by looking at the cache before and after reading in large amounts of sequential data for two different configurations.

Here are the requisites for this experiment.
  • A non-production server that is running Solaris 10 Update 6 or greater,
  • with a minimum of 8GB of RAM,
  • a 10+GB disk drive that can be formatted with UFS and ZFS filesystems (ALL DATA ON THE DISK WILL BE DESTROYED after formatting),
  • and root access to the server.
Once you have found a server with sufficient storage, here is the sequence of steps for this experiment.
  1. Find or create a UFS filesystem with 6GB of free space. Change directories into that UFS filesystem.
     # cd /my_ufs_filesystem
  2. Create a 4GB file with randomized data.
     # openssl rand -out datafile 4294967296
  3. Reboot the server to clear the filesystem cache.
     # init 6
  4. Login as root to the server and run the following command to get a big picture view of memory allocation prior to populating the filesystem cache. Note the value of the “Page cache”.
     # echo "::memstat"|mdb -k
  5. Then, use dd to load the file contents into the filesystem cache.
     # dd if=datafile of=/dev/null bs=512k
  6. Run the following command and compare the “Page cache” and "Free (cachelist)" values with the previous run of the same command.
     # echo "::memstat"|mdb -k
Note that the sum of the “Page cache” and "Free (cachelist)" values has grown but not by 4GB. In order to put the full contents of the datafile into the filesystem cache, we need to make sure the filesystem cache is large enough to hold the file and we need to adjust freebehind to allow ungoverned population of the filesystem cache. To do this, we add the following line to /etc/system and then reboot the server.

set ufs:freebehind = 0

Load the datafile into the filesystem cache and then check the “Page cache” with the following two commands.

# dd if=datafile of=/dev/null bs=512k
# echo "::memstat"|mdb -k

Note that the sum of the “Page cache” and "Free (cachelist)" values has increased by the full 4GB compared to the prior iteration with the default values for freebehind and segmap.
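To compare the two iterations without reading the numbers by eye, the memstat output can be saved to files and subtracted. This is only a sketch: the helper names and file names are made up, and it assumes the MB value is the second-to-last column of each row, as in the sample memstat output shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: cached MB ("Page cache" + "Free (cachelist)") in a
# saved ::memstat snapshot file.
cached_mb() {
    awk '/^Page cache/ || /^Free \(cachelist\)/ { t += $(NF - 1) }
         END { print t + 0 }' "$1"
}

# Hypothetical helper: growth in cached MB between two snapshot files.
# $1 = snapshot taken before the dd, $2 = snapshot taken after it
cache_growth_mb() {
    echo $(( $(cached_mb "$2") - $(cached_mb "$1") ))
}

# Usage on the test server (file names are placeholders):
#   echo "::memstat" | mdb -k > before.txt
#   dd if=datafile of=/dev/null bs=512k
#   echo "::memstat" | mdb -k > after.txt
#   cache_growth_mb before.txt after.txt
```

With freebehind disabled and a large enough cache, the reported growth should be close to the size of the datafile in MB.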

Note that although not in the Solaris documentation, the segmap and freebehind options have been in Solaris since version 7 [2].

Avoid Diluting The Filesystem Cache
The filesystem cache is a general purpose facility intended to improve general performance by minimizing frequent disk storage read operations. The downside of it being general purpose is that any file can get loaded into the filesystem cache. Once the cache reaches capacity, old data is pushed out as new data is read in. This process can dilute the cache with data that you don't want in the cache. One common example that can dilute the filesystem cache is when an administrator runs the find command looking for setuid programs. Another example is copying or backing up large files from one filesystem to another. Any of these examples can push a significant amount of data out of the filesystem cache.

The best strategy to avoid filesystem cache dilution is to ensure that only the application data can be loaded into the filesystem cache. There are two parts to this strategy. First, make sure that the application data is in a filesystem by itself. Second, make sure that all other filesystems do not use the filesystem cache. This latter part is managed through the following methods.

UFS caching is disabled by adding the forcedirectio option in the /etc/vfstab to filesystems that you don't want cached. Here is a sample filesystem entry.
/dev/dsk/c0t0d0s1  /dev/rdsk/c0t0d0s1 /var  ufs   1    no   logging,forcedirectio
Note that there is a known deadlock condition [9] that should be avoided by not using the UFS "logging" option.
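To see which UFS filesystems are still using the filesystem cache, the vfstab can be scanned for entries that lack the forcedirectio option. This is a rough sketch; the helper name cached_ufs_mounts is made up, and it assumes the standard seven-field vfstab layout shown in the sample entry.

```shell
#!/bin/sh
# Hypothetical helper: list mount points of UFS entries in a vfstab-format
# file whose mount options do NOT include forcedirectio (still cached).
# $1 = path to the vfstab file
cached_ufs_mounts() {
    awk '
        /^[ \t]*#/ || NF < 7 { next }       # skip comments and short lines
        $4 == "ufs" && index("," $7 ",", ",forcedirectio,") == 0 { print $3 }
    ' "$1"
}

# Usage on a Solaris host:
#   cached_ufs_mounts /etc/vfstab
```

Ideally the only mount point listed is the filesystem that holds the application data.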

As for ZFS, there aren't any per filesystem cache controls as of Solaris 10 update 6. However, a future update will introduce the ability to disable the primary and secondary caches by filesystem. Both options are configured through the “zfs set” command. The following two sections, taken from the OpenSolaris 2008.11 zfs man page, describe the primary and secondary caches, their possible values, and the default values.
primarycache=all | none | metadata

Controls what is cached in the primary cache (ARC). If this property is set to "all", then both user data and metadata is cached. If this property is set to "none", then neither user data nor metadata is cached. If this property is set to "metadata", then only metadata is cached. The default value is "all".

secondarycache=all | none | metadata

Controls what is cached in the secondary cache (L2ARC). If this property is set to "all", then both user data and metadata is cached. If this property is set to "none", then neither user data nor metadata is cached. If this property is set to "metadata", then only metadata is cached. The default value is "all".
Note again that these two options don't yet exist in Solaris 10 (as of Solaris 10 update 6).
Here are the OpenSolaris 2008.11 commands for disabling both the primary and secondary caches of the ZFS dump filesystem.

# zfs set primarycache=none rpool/dump
# zfs set secondarycache=none rpool/dump
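When several filesystems need the same treatment, the commands can be generated from a list of dataset names. The sketch below only prints the commands so they can be reviewed first; the helper name gen_nocache_cmds and the dataset names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read dataset names on stdin and print the zfs
# commands that would disable both caches for every dataset except the
# one named in $1. It only prints commands; it does not run them.
gen_nocache_cmds() {
    keep="$1"
    while IFS= read -r ds; do
        if [ -z "$ds" ] || [ "$ds" = "$keep" ]; then
            continue
        fi
        echo "zfs set primarycache=none $ds"
        echo "zfs set secondarycache=none $ds"
    done
}

# Example with made-up dataset names; on a live system the list could
# come from: zfs list -H -o name
printf '%s\n' rpool/dump rpool/export tank/ds | gen_nocache_cmds tank/ds
```

Keep in mind that these properties are inherited, so a value set on a parent dataset also applies to any child that has no explicit value of its own.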

Match Data Access Patterns
Data access patterns typically fall into one of the following three categories.
  • Sequential Access – This access pattern reads data blocks one after another in sequence from the disk.
  • Random Access – This access pattern reads data blocks randomly throughout the disk.
  • Hybrid Access – This access pattern is some mixture of Sequential and Random access patterns.
Aligning the filesystem tuning according to the anticipated data access patterns can significantly improve performance of the filesystem and application interacting with the filesystem. For example, the changelog of just about any database has a sequential access pattern. However, the data access pattern of the database data may be random. Thus, tuning the filesystem containing the changelog for sequential access would improve its performance.
For ZFS, the main configuration option that improves sequential access patterns is pre-fetching. File level and block level pre-fetching are on by default. For random access patterns, you may want to consider disabling pre-fetching. To disable ZFS pre-fetching, add the following lines to /etc/system.

* Disable file level pre-fetching
set zfs:zfs_prefetch_disable = 1

* Disable block level pre-fetching.
set zfs:zfs_vdev_cache_bshift = 13

Unfortunately, because ZFS sets pre-fetching system wide, you should make sure that the predominant access pattern is random before making this change. Alternatively, you can put just the random access data in a ZFS filesystem with a random access configuration, and put the sequential data (like the changelog) in a UFS filesystem that is optimized for read-ahead pre-fetching.

UFS read-ahead is configured per filesystem indirectly via the maxcontig option. The maxcontig option itself is defined (from the Solaris 10 tunefs man page) as the “disk drive maximum transfer size” divided by “the disk block size”. If the disk drive's maximum transfer size cannot be determined, the default value for maxcontig is calculated from kernel parameters as follows:

If maxphys is less than ufs_maxmaxphys, which is 1 Mbyte, then maxcontig is set to maxphys. Otherwise, maxcontig is set to ufs_maxmaxphys.

UFS read-ahead is determined indirectly through the maxcontig value. A large maxcontig value makes UFS read further ahead than a small one. The ideal value should be determined based on the average data size that the application uses and through thorough testing. Note also that forcedirectio (as recommended earlier for non-data filesystems) not only bypasses the filesystem cache, it also disables read-ahead. [5]
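The maxcontig definition quoted above is a simple division, sketched below. The helper name maxcontig_blocks is made up, and the 1 Mbyte transfer size and 8 Kbyte UFS block size are assumed values for illustration; check the actual values for your system.

```shell
#!/bin/sh
# Hypothetical helper: maxcontig (in blocks) for a given maximum transfer
# size and filesystem block size, both in bytes.
maxcontig_blocks() {
    echo $(( $1 / $2 ))
}

# Assumed values: a 1 Mbyte transfer size and an 8 Kbyte UFS block size.
maxcontig_blocks 1048576 8192    # prints 128
```

According to the tunefs man page, the value is applied with the -a option, for example tunefs -a 128 followed by the raw device of the filesystem.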

Reducing the UFS read-ahead for the directory server of DSEE may improve throughput depending on the average entry size of the directory data. To decrease the UFS read-ahead, consider dropping the maxcontig from its default value to 16M through tunefs.

Consider Disabling vdev Caching
ZFS allocates a small amount of memory for each virtual device (a.k.a. vdev) that participates in a zpool. You may consider disabling this feature in order to eliminate all unnecessary caching. I have not done extensive testing with this to determine the full extent of the memory savings from disabling it, so I recommend that you test it out in the lab before trying it in production. Having said that, here is the value to set in /etc/system to disable vdev caching:

* Disable vdev caching
set zfs:zfs_vdev_cache_size = 0

Minimize Application Data Pagesize
Data centric applications like relational databases and directory services often specify a page size when writing data to storage. The page size determines the size of data to allocate for each write operation. In the case of directory services, the actual data size may be significantly smaller than the page size. A mismatched pagesize can result in degraded performance as a result of less efficient use of memory and writing more data to disk than is necessary. More information on how to determine if your DSEE pagesize is mismatched, and how to correct it, is available here.

Match Average I/O Block Sizes
The average block size of the application data should be the metric to which all other block sizes are matched. For example, the ZFS recordsize is 128KB by default. If the average block (or page) size of a directory server is 2KB, then the mismatch in size will result in degraded throughput for both read and write operations. One of the benefits of ZFS is that you can change the recordsize at any time, and it applies to all write operations from the time you set the new value going forward. With UFS, however, you have to set the blocksize at the time of creation. Fortunately, for most directory services deployments the default UFS blocksize is adequate.
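The size of the mismatch in this example can be put into numbers with a one line calculation. The helper name pages_per_record is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: how many application pages fit in one ZFS record.
# $1 = ZFS recordsize in bytes, $2 = application page size in bytes
pages_per_record() {
    echo $(( $1 / $2 ))
}

# Default 128KB recordsize versus a 2KB directory server page.
pages_per_record 131072 2048    # prints 64
```

In other words, each 2KB page touched can involve a record 64 times its size. The recordsize is set per filesystem with zfs set recordsize=<size> <dataset>, where the size and dataset name depend on your deployment.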

Consider The Pros and Cons of Cache Flushes
The default ZFS configuration assumes the disks used to make up a zpool volume are just disks and not a storage array. If the disks are actually a storage array with a cache, consider disabling cache flushes by adding the following entry to /etc/system.

* If using caching storage array, disable cache flushes.
set zfs:zfs_nocacheflush = 1

Prime The Filesystem Cache
If the data of your application can fit entirely in the available filesystem cache, you may prefer to load the data into the cache before starting the application. This could improve the consistency of response time for all operations that would otherwise read data from disk storage. If the data is much larger than the available cache, the performance benefit will not be as great.

There are a few ways to prime the filesystem cache. One way is to push the data in through the null device. For example, the following command could be used to load all of the DSEE data files of a directory instance that lives in /ds into the filesystem cache:

# find /ds/db -type f -name "*.db3" -exec dd if={} of=/dev/null bs=512k \;
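The same idea can be wrapped in a small function that also reports how many files were read, which is handy in a startup script. This is a sketch only; the helper name prime_cache is made up, and the directory and pattern are passed in by the caller.

```shell
#!/bin/sh
# Hypothetical helper: read every file under directory $1 whose name
# matches pattern $2, discard the data, and print how many files matched.
prime_cache() {
    # dd reports its statistics on stderr, so silence them.
    find "$1" -type f -name "$2" -exec dd if={} of=/dev/null bs=512k \; 2>/dev/null
    find "$1" -type f -name "$2" | wc -l | tr -d ' '
}

# Usage matching the command above:
#   prime_cache /ds/db "*.db3"
```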

Before closing out this blog post, I want to highlight again that one of the design goals of the ZFS filesystem is that you shouldn't have to tune anything. It will have the intelligence necessary to optimally tune itself. That being said, any of the /etc/system settings mentioned in this blog could change with future versions of ZFS. I highly suggest that you bookmark the ZFS Evil Tuning Guide and the ZFS Best Practices Guide, as they will keep up with the ZFS performance tuning options as they evolve over time.

That concludes this blog entry. I hope that you find this information useful. Thanks to Arnaud Lacour, Steve Sistare, Mark Maybee, Marcos Ares, Pedro Vazquez, and others who contributed to this blog. Your help and input were tremendously helpful.

End Notes
1 - ZFS Evil Tuning Guide
2 - Historical freebehind and segmap references: Segmap Tuning, Understanding Perforce Server Performance, Performance Oriented System Administration, NFS CookBook - Based on Solaris 8, 9 (written prior to Solaris 10)
3 - Solaris 10 Solaris Tunable Parameters Reference Manual: freebehind
4 - Solaris 10 Solaris Tunable Parameters Reference Manual: segmap_percent
5 - Pages 158-160 of System Performance Tuning - By Gian-Paolo D. Musumeci, Mike Loukides, Michael Kosta Loukides
6 - System Administration Commands: tunefs
7 - Directory Server Databases and Usage of db_stat
8 - Three way deadlock involving UFS logging and directio
9 - UFS Directio Implementation
10 - Segmap default and max values

11 - Understanding Memory Allocation and File System Caching in OpenSolaris



Kevin said...

Hey Bradds,
Your blog about FS Cache is extremely useful. Thank you so much for posting it.
Solaris Administrator
