Wednesday, February 10, 2010

Directory Data Priming Strategies

With the introduction of Sun's (and now Oracle's) flash technologies, directory services architecture is going to begin to evolve in a new direction. One possible new direction is to use flash to extend the ZFS filesystem cache sufficiently to hold all of the directory data. In this filesystem-cache-centric directory service architecture, we want to find the most efficient path for loading the directory data into the filesystem cache. That is the focus of this blog post.

Priming the ZFS filesystem cache can significantly decrease the time required to reach optimal performance. This blog post compares, contrasts and walks you through three ZFS filesystem cache priming strategies. Note that all graphs shown represent the same exact directory server instance with the same data but primed with different methods. Further, the same SLAMD job was applied for each iteration.

Disclaimer
Although the content of this blog post speaks to delivering optimal directory server performance, it is not intended to represent the best possible performance that can be delivered by DSEE 7.

LDAP Priming
The first strategy is to let the directory service prime the filesystem cache as data is retrieved through the natural operation of the directory service. I call this method LDAP Priming because the LDAP search, modify, add and delete operations are what load data into the filesystem cache. The benefit of this method is that no action is required to implement it. Additionally, if the memory or flash is not large enough to hold all the directory data, then LDAP priming is the only method that will ensure the most active data stays in the filesystem cache. Most current implementations do not have sufficient memory to hold all of the data, so this is the only practical method for those architectures. However, the intent of this blog post campaign is to help you begin to consider a new flash-centric architecture that targets having enough flash to contain all of the directory data for optimal performance.

The problems with the LDAP priming method include the following:
  • It will take much longer to reach optimal performance because data is not systematically loaded into the filesystem cache.
  • Overall response time will be at least 2 to 4 times greater until all the data is loaded into the filesystem cache.
  • Disk I/O will be significantly higher during LDAP priming. This may reduce overall LDAP write throughput as well as inject latency into replication propagation delay.
Graph 1 below illustrates the best case LDAP priming scenario where all of the directory data is systematically loaded through an LDAP search rate job. Note that the disks containing the data were near 98% busy until all the directory data completed loading into the filesystem cache.
Graph 1: LDAP Primed Load
dd Priming

The second priming strategy is to load the data directly into the filesystem cache via the Solaris 10 dd command. In this strategy, the dd command copies the directory server db3 files into /dev/null which in effect loads the data into the filesystem cache. The following command illustrates the basic dd syntax.
dd if=/db/telco_id2entry.db3 of=/dev/null bs=32k
The primary benefit of priming the data into the filesystem cache with dd is that once complete, the directory server will be able to retrieve all primed data from memory (e.g. DRAM) rather than from disk. The only downside to using this method is that from a best practice perspective, you may have to wait until the priming is complete before starting up the directory server instance. Otherwise, the priming process may degrade directory performance due to keeping the disks very busy while reading in all the data.
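As a minimal sketch, the single-file command above could be wrapped in a loop that reads every db3 file in a database directory before the directory server is started. The /db default comes from the example command; the function name prime_dir and the assumption that all db3 files sit in one directory are hypothetical.

```shell
# Hypothetical helper: read every *.db3 file in a directory into /dev/null.
# Prints the number of files that were read successfully.
prime_dir() {
    local dir file count
    dir=${1:-/db}          # /db is the path used in the example command
    count=0
    for file in "${dir}"/*.db3
    do
        # Skip the unexpanded glob when no db3 files exist.
        [ -f "${file}" ] || continue
        dd if="${file}" of=/dev/null bs=32k 2>/dev/null && count=$((count + 1))
    done
    echo "${count}"
}
```

Calling prime_dir /db and then starting the instance keeps the priming reads from competing with LDAP traffic for the disks.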

Graph 2 below shows the search rate versus response time of a dd primed directory server instance. Note that the throughput and response times are consistent throughout the job as opposed to the case with LDAP priming where there is a ramp up to the optimal throughput.
Graph 2: dd Primed Load
Note that in this SLAMD job, the disks containing the directory data were less than 1% busy throughout the entire job because all of the data had been loaded into the filesystem cache.


Multi-pass dd Priming
The third option is to run the script multiple times with a larger dd blocksize. Using a larger dd blocksize results in better throughput which may translate to an overall shorter load time to prime all the directory data into the filesystem cache. For example, the following table lists the results of several iterations with different dd blocksize values.




Table 1: Priming Analysis Results

Blocksize  Passes  dd Time   % of data  LDAP Time  Avg Primed SearchRate  Avg Primed Response Time
---------  ------  --------  ---------  ---------  ---------------------  ------------------------
N/A        0       0min      0%         37min      8500/sec               18ms
32k        1       62.2min   99%        2min       9145/sec               17.5ms
128k       1       18min     72%        6min       8400/sec               18ms
128k       2       34min     92%        4min       8925/sec               17.5ms
256k       4       40min     93%        8min       9067/sec               17.5ms
N/A        1       * 24min   100%       0min       * 9000/sec             * 17.5ms


* The last row in this table represents conservative estimates for priming the ZFS cache via a sub-command of the zfs or zpool commands. This represents full throttle throughput of three partitions of an F20 PCIe card where each partition drives 15MB/sec of throughput. Note, however, that if the F20 PCIe card were sub-divided into 7 partitions where 6 of the 7 partitions were dedicated to ZFS L2ARC cache, the load time could theoretically be as low as 12 minutes.
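The arithmetic behind these estimates can be sketched as follows. This assumes the data to be loaded is roughly the 62.4GB of L2ARC content shown in the zpool output later in this post (about 63898MB); the post does not state the exact size used for the estimate, so treat that figure and the function name prime_minutes as assumptions.

```shell
# Hypothetical helper: estimated load time in minutes.
# $1 = data size in MB, $2 = number of flash partitions, $3 = MB/sec per partition
prime_minutes() {
    awk -v mb="$1" -v parts="$2" -v rate="$3" \
        'BEGIN { printf "%.1f\n", mb / (parts * rate) / 60 }'
}

# prime_minutes 63898 3 15   prints 23.7 (three partitions at 15MB/sec)
# prime_minutes 63898 6 15   prints 11.8 (six partitions at 15MB/sec)
```

Under that assumption the results land close to the 24 and 12 minute figures quoted in the footnote.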

The first row where the blocksize value is "N/A" represents the LDAP primed data. It is critical to understand that the LDAP primed case does not represent the actual time that it would take for a production directory server instance to reach a fully primed state. This is because very few if any customers would employ a system to systematically load all directory data via LDAP searches after a server reboot.

The two key takeaways from Table 1 are as follows. First, the priming methods are intended to reduce the overall time taken to load the data into memory so that once the directory starts, the search rate throughput and response time will be near optimum values.

Second, once the directory data is fully primed, the relative throughput and response time are roughly equivalent regardless of the priming method. This finding suggests that the dd blocksize has no correlation to the nsslapd-db-page-size or ZFS recordsize.

Based on the results in Table 1 as well as other extensive tests that I have done as a part of this project, I highly recommend using the single pass 32k dd blocksize in order to load the most data into the filesystem cache prior to starting the directory server instance. This ensures the most consistent starting point for the directory server.
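For comparison, the multi-pass rows in Table 1 could be driven by a loop along these lines. This is only a sketch: the function name multipass_prime and its argument order are hypothetical, and it simply repeats a whole-file dd read the requested number of times.

```shell
# Hypothetical helper: read each file with dd for a given number of passes.
# $1 = dd blocksize (e.g. 128k), $2 = number of passes, remaining args = files
multipass_prime() {
    local bs passes pass file
    bs=$1
    passes=$2
    shift 2
    pass=1
    while [ "${pass}" -le "${passes}" ]
    do
        for file in "$@"
        do
            dd if="${file}" of=/dev/null bs="${bs}" 2>/dev/null || return 1
        done
        echo "pass ${pass} of ${passes} complete"
        pass=$((pass + 1))
    done
}
```

For example, multipass_prime 128k 2 /db/telco_id2entry.db3 would correspond to the two-pass 128k row.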

dd-Priming Optimizations for ZFS
At the present time, the dd priming method takes much longer than it should because the data cannot be sequentially loaded into the filesystem cache. When ZFS detects that data is being streamed, it stops caching the read data. Hopefully at some point in the future, the zfs or zpool commands will be extended to include a sub-command that can be used to load data into the L1 and L2 ARC caches. If this feature were implemented, I am certain that priming the data would take significantly less time. Until such a ZFS feature exists, you will have to load the data in a single pass with a dd blocksize of 32k (or less) or iterate through multiple passes with a larger dd blocksize value.

I have found the following script to work reasonably well. If the number of backgrounded processes grows significantly, you may need to inject a sleep, or add in a counter that limits the number of concurrent dd processes. In my case, the dd commands completed so quickly that there were never more than 10 concurrent dd processes.
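#!/usr/bin/bash
# Read each file given on the command line in 32KB chunks,
# one backgrounded dd process per chunk.
ArcChunk=$((2**15)) # 32KB
for file in "$@"
do
    FileSize=`ls -al "${file}" | awk '{ print $5 }'`
    # Round up so that the trailing partial chunk is read as well.
    FileChunks=$(( (FileSize + ArcChunk - 1) / ArcChunk ))
    for (( c=0; c < FileChunks; c++ ))
    do
        # iseek seeks c blocks into the input file (Solaris dd).
        dd if="${file}" of=/dev/null \
            bs=${ArcChunk} \
            iseek=${c} \
            count=1 > /dev/null 2>&1 &
    done
done
# Wait for all backgrounded dd processes to finish.
wait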
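If the number of backgrounded dd processes does become a problem, the counter mentioned above could look something like the following sketch. The cap of 10 (MaxJobs) and the function name are assumptions, and the sketch uses the POSIX skip= operand instead of the Solaris iseek= operand so that it also runs on other platforms.

```shell
ArcChunk=32768   # 32KB, as in the script above
MaxJobs=10       # assumed cap on concurrent dd processes; tune as needed

# Hypothetical helper: read one file in 32KB chunks, never running more
# than MaxJobs dd processes at once. Prints the number of chunks read.
prime_file_limited() {
    local file size chunks c running
    file=$1
    size=$(ls -l "${file}" | awk '{ print $5 }')
    chunks=$(( (size + ArcChunk - 1) / ArcChunk ))
    c=0
    running=0
    while [ "${c}" -lt "${chunks}" ]
    do
        dd if="${file}" of=/dev/null bs="${ArcChunk}" \
            skip="${c}" count=1 2>/dev/null &
        running=$((running + 1))
        # Once the cap is reached, drain the whole batch before continuing.
        if [ "${running}" -ge "${MaxJobs}" ]
        then
            wait
            running=0
        fi
        c=$((c + 1))
    done
    wait
    echo "${chunks}"
}
```

Draining a full batch at a time is cruder than keeping exactly MaxJobs processes busy, but it only needs a counter and the wait builtin.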
SSD Partitioning
One observation that I made during this effort was that the very best write throughput that I could push through a single solid state partition was around 17MB/sec. I added 2 more partitions and was able to drive all three partitions at up to 15MB/sec each. I would have liked to re-partition the F20 card from 4 partitions to 7 and make all 7 partitions part of the L2ARC cache, but I ran out of time before this blog post. I suspect that each of the 7 partitions would be able to push 17MB/sec. The key takeaway from this finding is that more partitions may provide better overall performance than fewer partitions.

Reference Data
Here is some reference data to provide the context of my setup.
Directory Server: X4150 with two 2.33GHz quad-core CPUs, 32GB of DRAM, eight 10k 146GB SAS disk drives, and one F20 PCIe card. The server was configured with two non-global dsee zones. A single directory server instance was running in each zone. The two directory server instances were configured for multimaster replication. However, all search load was focused on the first of the two zones.

Here is a sample of the zpool iostat -v output. What you are looking for is the capacity of the F20 partitions used as cache devices (e.g. c2t[1-3]d0). Looking at this particular output, we see that 62.4GB of 68.7GB was used for L2ARC. I observed throughout all tests that there is always some unused capacity in each of the F20 partitions.
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
db           104G   712G      0      0      0      0
  c0t1d0    17.3G   119G      0      0      0      0
  c0t2d0    17.3G   119G      0      0      0      0
  c0t3d0    17.3G   119G      0      0      0      0
  c0t4d0    17.3G   119G      0      0      0      0
  c0t5d0    17.3G   119G      0      0      0      0
  c0t6d0    17.3G   119G      0      0      0      0
  c2t0d0     272K  22.9G      0      0      0      0
cache           -      -      -      -      -      -
  c2t1d0    20.9G  1.98G      0      0      0      0
  c2t3d0    20.7G  2.15G      0      0      0      0
  c2t2d0    20.8G  2.06G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
logs        29.7G   106G      0      0      0      0
  c0t7d0    29.7G   106G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
rpool       76.7G  59.3G      1      6  80.4K   183K
  c0t0d0s0  76.7G  59.3G      1      6  80.4K   183K
----------  -----  -----  -----  -----  -----  -----
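The 62.4GB figure is the sum of the used column for the three cache devices. A small sketch of how that sum could be pulled from saved output is shown below; the function name l2arc_used_gb is hypothetical, and it assumes the cache devices are listed directly under the "cache" line with sizes reported in G, as in this sample.

```shell
# Hypothetical helper: sum the "used" column of the cache devices from
# zpool iostat -v style output on stdin. Only sizes ending in G are counted.
l2arc_used_gb() {
    awk '
        $1 == "cache"         { in_cache = 1; next }
        /^-+/                 { in_cache = 0 }
        in_cache && $2 ~ /G$/ { total += $2 + 0 }   # "20.9G" + 0 yields 20.9
        END                   { printf "%.1f\n", total }
    '
}
```

Feeding it the sample above, for example from a saved file with l2arc_used_gb < iostat.txt, prints 62.4.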
Here also is a sample of the ::memstat kernel data. The ZFS File Data row shows that the L1 ARC was limited to about 6GB.
# echo "::memstat"|mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     647543              2529    8%
ZFS File Data             1549168              6051   18%
Anon                       611003              2386    7%
Exec and libs                7761                30    0%
Page cache                  82585               322    1%
Free (cachelist)            12556                49    0%
Free (freelist)           5475766             21389   65%

Total                     8386382             32759
Physical                  8177488             31943
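The MB column can be cross-checked from the page counts. The sketch below assumes a 4KB page size, which is consistent with the totals shown but is not stated in the output itself; the function name pages_to_mb is hypothetical.

```shell
# Hypothetical helper: convert a ::memstat page count to whole megabytes.
# $1 = page count, $2 = page size in bytes (4096 assumed if omitted)
pages_to_mb() {
    local pages page_size
    pages=$1
    page_size=${2:-4096}
    echo $(( pages * page_size / 1048576 ))
}

# pages_to_mb 1549168   prints 6051  (ZFS File Data)
# pages_to_mb 8386382   prints 32759 (Total)
```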

Summary
The key takeaway for this blog post is that directory data priming can promote more consistent directory server performance for directory service architectures that are designed to contain all of the directory data in the ZFS ARC either via sufficiently large DRAM or flash technologies.

That concludes this blog post.

Have an extremely prime day!

Brad

Wednesday, February 3, 2010

Flash Memory Basics

In preparation for blogging on the application of flash memory technologies to directory services, this blog post will cover the basics of flash memory. This will help people begin to understand why flash through the ZFS secondary [filesystem] cache (a.k.a. L2ARC) and ZFS Intent Log (a.k.a. ZIL) can improve overall directory performance. Further, the use of flash memory and ZFS will also enable radical new directory services architectures. More on those applications in a future blog post. For now, let's review the basics of flash memory.

Flash memory is a non-volatile storage medium that is read and written to through electrical erasure and reprogramming. Flash memory has better kinetic shock absorption properties, lower power consumption and much greater IOPS than hard disk drives. This combination of features makes flash memory a great intermediate storage medium between hard disks and DRAM in the storage hierarchy. As an aside, read Brendan Gregg's blog post on how Sun's ZFS makes harnessing the best of flash memory possible and easily accessible by all applications through the ZFS Intent Log and ZFS secondary cache (a.k.a. L2ARC).

Flash memory consists of an array of memory cells in the form of floating-gate transistors. At the present time, there are two categories of flash memory devices: single-level cell (SLC) and multi-level cell (MLC) devices. SLC devices store a single bit of information per cell and MLC devices store multiple bits of information per cell. The inherent danger that comes with MLC devices is that the flash memory loses more data per block than SLC when memory cells fail. StorageSearch.com has a great article titled "Are MLC SSDs Ever Safe in Enterprise Apps" that gives an in-depth look at this issue. Sun recognizes this tradeoff and has elected to stick with SLC based flash memory for its performance and better reliability characteristics.

The two best-known limitations of flash memory are erase-before-write and write endurance. Erase-before-write requires that an occupied data block (i.e. typically a 512KB group of 4KB pages) must be erased before new data can be written to that block. The performance implication of this limitation is that the device will appear to have very high write performance until all blocks have been written to for the first time. Once the effects of erase-before-write kick in, the overall write throughput can drop significantly.

Write endurance simply means that each memory cell has a limited number of times that it can be erased before the memory cell fails. Current flash memory device write-erase-cycles range from 100,000 to 1,000,000 where MLC devices represent the low end and SLC devices are on the high end.

Flash memory producers typically employ one or more of the following strategies to address these limitations.
  • Wear Leveling techniques distribute writes across the full array of memory cells in order to avoid pockets of cell failure. Wear leveling is also used to relocate bad blocks of data to working cells as well.
  • Garbage Collection is employed within the flash memory controller to consolidate occupied data blocks and erase the freed blocks during idle device cycles. Garbage collection helps to prevent erase-before-write during write operations by keeping the erased block pool as large as possible. You might say that Garbage Collection is automated defragmentation while the flash memory is idle.
  • The TRIM command, when invoked by the operating system, tells the flash device to free up blocks that have been marked for deletion. For example, when you empty the trash bin of a Microsoft Windows desktop, the files are marked for deletion but not actually deleted. If Microsoft Windows is configured to run the TRIM command when files are deleted, the blocks occupied by the deleted files will be erased and made available for use by new data. If blocks were only marked as deleted and not actually erased, the flash memory device would fill up such that it would have to erase-before-write on every write operation because there wouldn't be any free (i.e. pre-erased) blocks available. Fortunately, the use of flash memory by ZFS via the ZFS intent log and ZFS secondary cache does not need to invoke the TRIM command because the data is deleted instead of just being marked for deletion.
  • Reserve Capacity is employed to ensure that the flash memory device has excess capacity for bad block management and working space for garbage collection. Sun's flash memory devices reserve approximately 25% per flash module for this purpose.
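As a rough illustration of the wear leveling idea in the list above, here is a toy model in which every write goes to the block with the fewest erases so far. It is only a sketch: real flash controllers use their own proprietary algorithms, and the function name and the block and write counts are made up for the example.

```shell
# Toy wear leveling model: each write is sent to the least-erased block.
# $1 = number of blocks, $2 = number of writes (both illustrative values)
wear_level() {
    awk -v blocks="$1" -v writes="$2" 'BEGIN {
        for (i = 0; i < blocks; i++) erases[i] = 0
        for (w = 0; w < writes; w++) {
            min = 0
            for (i = 1; i < blocks; i++)
                if (erases[i] < erases[min]) min = i
            erases[min]++
        }
        # Print the per-block erase counts on one line.
        for (i = 0; i < blocks; i++)
            printf "%s%d", (i ? " " : ""), erases[i]
        printf "\n"
    }'
}

# wear_level 4 10   prints: 3 3 2 2
```

No block ends up more than one erase ahead of any other, which is the property that avoids the pockets of cell failure described above.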
A third, lesser-known limitation of some flash memory devices is the issue of write cache reliability. The core issue here is that most flash memory devices use DRAM to buffer data before it is written to the flash memory device in order to increase throughput. The danger with this kind of buffering is that during a power loss, all data in the DRAM cache is susceptible to being lost if not committed to flash memory in time. Sun's flash modules mitigate this problem by implementing a super-capacitor to give the on-board DRAM buffer memory time to commit all buffered data to flash memory before the DRAM power is exhausted.

Going forward, Sun's Flash Modules, the F20 Flash Accelerator PCIe card, and the F5100 Flash Array will, through the ZFS intent log and ZFS secondary cache, make the power and performance of flash memory accessible to all applications, including directory services.












That wraps up this blog post.

Have a flashtastic day!




Brad