Wednesday, February 10, 2010

Directory Data Priming Strategies

With the introduction of Sun's (and now Oracle's) flash technologies, directory services architecture is going to begin to evolve in a new direction. One possible direction is to use flash to extend the ZFS filesystem cache far enough to hold all of the directory data. In this filesystem-cache-centric directory service architecture, we want to find the most efficient path for loading the directory data into the filesystem cache. That is the focus of this blog post.

Priming the ZFS filesystem cache can significantly decrease the time required to reach optimal performance. This blog post compares, contrasts and walks you through three ZFS filesystem cache priming strategies. Note that all graphs shown represent the same exact directory server instance with the same data but primed with different methods. Further, the same SLAMD job was applied for each iteration.

Although the content of this blog post speaks to delivering optimal directory server performance, it is not intended to represent the best possible performance that can be delivered by DSEE 7.

LDAP Priming
The first strategy is to let the directory service prime the filesystem cache as data is retrieved through the natural operation of the directory service. I call this method LDAP Priming because the LDAP search, modify, add and delete operations are what load data into the filesystem cache. The benefit of this method is that no action is required to implement it. Additionally, if the memory or flash is not large enough to hold all the directory data, then LDAP priming is the only method that will ensure the most active data stays in the filesystem cache. Most implementations today do not have sufficient memory to hold all of the data, so this is the only practical method for those architectures. However, the intent of this blog post campaign is to help you begin to consider a new flash-centric architecture that targets having enough flash to contain all of the directory data for optimal performance.

The problems with the LDAP priming method include the following:
  • It will take much longer to reach optimal performance because data is not systematically loaded into the filesystem cache.
  • Overall response time will be at least 2 to 4 times greater until all the data is loaded into the filesystem cache.
  • Disk I/O will be significantly higher during LDAP priming. This may reduce overall LDAP write throughput as well as inject latency into replication propagation delay.
Graph 1 below illustrates the best case LDAP priming scenario where all of the directory data is systematically loaded through an LDAP search rate job. Note that the disks containing the data were nearly 98% busy until all the directory data had finished loading into the filesystem cache.
Graph 1: LDAP Primed Load
dd Priming

The second priming strategy is to load the data directly into the filesystem cache via the Solaris 10 dd command. In this strategy, the dd command copies the directory server db3 files into /dev/null which in effect loads the data into the filesystem cache. The following command illustrates the basic dd syntax.
dd if=/db/telco_id2entry.db3 of=/dev/null bs=32k
The primary benefit of priming the data into the filesystem cache with dd is that once complete, the directory server will be able to retrieve all primed data from memory (e.g. DRAM) rather than from disk. The only downside to using this method is that from a best practice perspective, you may have to wait until the priming is complete before starting up the directory server instance. Otherwise, the priming process may degrade directory performance due to keeping the disks very busy while reading in all the data.
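As a minimal sketch of that ordering, the function below primes every db3 file in a directory with one sequential 32k pass each, so that the server start can be chained after it. The function name prime_db_files is hypothetical, the /db path is taken from the example command above, and the start step is left as a comment because it depends on how your instance is managed.

```shell
#!/bin/bash
# Sketch: prime every db3 file in a directory with one sequential 32k dd pass each.
# prime_db_files is a hypothetical helper name, not part of any product.
prime_db_files() {
  local dir=$1 file count=0
  for file in "${dir}"/*.db3; do
    [ -f "${file}" ] || continue   # skip the literal pattern if nothing matched
    dd if="${file}" of=/dev/null bs=32k > /dev/null 2>&1 || return 1
    count=$((count + 1))
  done
  echo "primed ${count} file(s) from ${dir}"
}

# Prime first, then start the directory server instance (start command not shown):
# prime_db_files /db && <start your directory server instance here>
```

Chaining the start with && means the instance only comes up once every file has been read through, which matches the best practice described above.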

Graph 2 below shows the search rate versus response time of a dd primed directory server instance. Note that the throughput and response times are consistent throughout the job as opposed to the case with LDAP priming where there is a ramp up to the optimal throughput.
Graph 2: dd Primed Load
Note that in this SLAMD job, the disks containing the directory data were less than 1% busy throughout the entire job because all of the data had been loaded into the filesystem cache.

Multi-pass dd Priming
The third option is to run the script multiple times with a larger dd blocksize. Using a larger dd blocksize results in better throughput which may translate to an overall shorter load time to prime all the directory data into the filesystem cache. For example, the following table lists the results of several iterations with different dd blocksize values.

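Table 1: Priming Analysis Results

[Table contents not fully recoverable. Surviving column headings: dd Time | % of ... | Avg Primed ... | Avg Primed ...]
[Surviving row fragment: N/A1* 24min100%0min* 9000/sec* 17.5ms]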

* The last row in this table represents conservative estimates for priming the ZFS cache via a sub-command of the zfs or zpool commands. This represents full throttle throughput of three partitions of an F20 PCIe card where each partition drives 15MB/sec of throughput. Note, however, that if the F20 PCIe card were sub-divided into 7 partitions, with 6 of the 7 dedicated to ZFS L2 ARC cache, the load time could theoretically be as low as 12 minutes.
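The 12 minute figure follows from simple arithmetic. The sketch below assumes that the 24 minute estimate corresponds to three partitions at 15MB/sec each, and that throughput scales linearly with the number of partitions, which is an assumption rather than a measured result. The load_minutes helper is a hypothetical name.

```shell
#!/bin/bash
# Sketch: estimated load time = data size / aggregate partition throughput.
# load_minutes is a hypothetical helper; sizes are in MB, rates in MB/sec.
load_minutes() {
  local data_mb=$1 partitions=$2 mb_per_sec=$3
  echo $(( data_mb / (partitions * mb_per_sec) / 60 ))
}

# 24 minutes at 3 x 15MB/sec implies roughly this much data moved:
DATA_MB=$(( 24 * 60 * 3 * 15 ))   # 64800 MB
load_minutes "${DATA_MB}" 3 15    # prints 24
load_minutes "${DATA_MB}" 6 15    # prints 12
```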

The first row where the blocksize value is "N/A" represents the LDAP primed data. It is critical to understand that the LDAP primed case does not represent the actual time that it would take for a production directory server instance to reach a fully primed state. This is because very few if any customers would employ a system to systematically load all directory data via LDAP searches after a server reboot.

The two key takeaways from Table 1 are as follows. First, the priming methods are intended to reduce the overall time taken to load the data into memory so that once the directory starts, the search rate throughput and response time will be near optimum values.

Second, once the directory data is fully primed, the relative throughput and response time are roughly equivalent regardless of the priming method. This finding suggests that the dd blocksize has no correlation to the nsslapd-db-page-size or ZFS recordsize.

Based on the results represented in this graph as well as other extensive tests that I have done as a part of this project, I highly recommend using the single pass 32k dd blocksize in order to load the most data into the filesystem cache prior to starting the directory server instance. This ensures the most consistent starting point for the directory server.

dd-Priming Optimizations for ZFS
At the present time, the dd priming method takes much longer than it should because the data cannot be sequentially loaded into the filesystem cache. When ZFS detects that data is being streamed, it stops caching the read data. Hopefully at some point in the future, the zfs or zpool commands will be extended to include a sub-command that can be used to load data into the L1 and L2 ARC caches. If this feature were implemented, I am certain that priming the data would take significantly less time. Until such a ZFS feature exists, you will have to load the data in a single pass with a dd blocksize of 32k (or less) or iterate through multiple passes with a larger dd blocksize value.

I have found the following script to work reasonably well. If the number of backgrounded processes grows significantly, you may need to inject a sleep, or add a counter that limits the number of concurrent dd processes. In my case, the dd commands completed so quickly that there were never more than 10 concurrent dd processes.
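ArcChunk=$((2**15)) # 32KB
for file in "${@}"
do
  # File size in bytes (fifth column of the ls output)
  FileSize=`ls -al "${file}" | awk '{ print $5 }'`
  # Number of 32KB blocks, rounded up (reconstructed; this line was missing from the listing)
  FileChunks=$(( (FileSize + ArcChunk - 1) / ArcChunk ))
  for (( c=0; c< ${FileChunks}; c++ ))
  do
    # Read one block at block offset c, in the background
    dd if="${file}" of=/dev/null \
       bs=${ArcChunk} \
       iseek=${c} \
       count=1 > /dev/null 2>&1 &
  done
done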
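If the background dd processes do pile up, the limit mentioned above can be sketched as follows. This is one possible variant under stated assumptions, not the script I ran: MAX_JOBS and prime_file are hypothetical names, and the POSIX skip= operand is used in place of iseek= so that the sketch also runs with dd implementations that lack iseek.

```shell
#!/bin/bash
# Sketch: chunked dd priming with a cap on concurrent background reads.
# MAX_JOBS and prime_file are hypothetical names, not from the original script.
ARC_CHUNK=$((2**15))        # 32KB
MAX_JOBS=${MAX_JOBS:-10}    # maximum number of dd processes in flight

prime_file() {
  local file=$1 size chunks c
  size=$(wc -c < "${file}")
  chunks=$(( (size + ARC_CHUNK - 1) / ARC_CHUNK ))   # round up for a partial last block
  for (( c = 0; c < chunks; c++ )); do
    # Pause while MAX_JOBS reads are already running.
    while (( $(jobs -rp | wc -l) >= MAX_JOBS )); do
      sleep 1
    done
    dd if="${file}" of=/dev/null bs=${ARC_CHUNK} skip=${c} count=1 > /dev/null 2>&1 &
  done
  wait                      # let the remaining reads finish
  echo "${chunks}"          # report how many blocks were requested
}

# Prime each file named on the command line (no-op when none are given).
for f in "$@"; do
  prime_file "${f}"
done
```

A sleep-based check is coarse, but it avoids depending on newer shell features and is in keeping with the "inject a sleep" suggestion.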
SSD Partitioning
One observation that I made during this effort was that the best write throughput I could push through a single solid state partition was around 17MB/sec. I added 2 more partitions and was able to drive all three partitions at up to 15MB/sec each. I would have liked to re-partition the F20 card from 4 partitions to 7 and make all 7 partitions part of the L2ARC cache. However, I didn't get that done in time for this blog post. I suspect that each of the 7 partitions will be able to push 17MB/sec. The key takeaway from this finding is that more partitions may provide better overall performance than fewer partitions.

Reference Data
To provide context, here are the details of my setup.
Directory Server: X4150 with two 2.33GHz quad-core CPUs, 32GB of DRAM, eight 10k 146GB SAS disk drives, and one F20 PCIe card. The server was configured with two non-global dsee zones. A single directory server instance was running in each zone. The two directory server instances were configured for multimaster replication. However, all search load was focused on the first of the two zones.

Here is a sample of the zpool iostat -v output. What you are looking for is the capacity of the F20 partitions (e.g. c2t[1-3]d0). Looking at this particular output, we see that 62.4GB of 68.7GB was used for L2ARC. I observed throughout all tests that there is always some unused capacity in each of the F20 partitions.
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
db           104G   712G      0      0      0      0
  c0t1d0    17.3G   119G      0      0      0      0
  c0t2d0    17.3G   119G      0      0      0      0
  c0t3d0    17.3G   119G      0      0      0      0
  c0t4d0    17.3G   119G      0      0      0      0
  c0t5d0    17.3G   119G      0      0      0      0
  c0t6d0    17.3G   119G      0      0      0      0
  c2t0d0     272K  22.9G      0      0      0      0
cache           -      -      -      -      -      -
  c2t1d0    20.9G  1.98G      0      0      0      0
  c2t3d0    20.7G  2.15G      0      0      0      0
  c2t2d0    20.8G  2.06G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
logs        29.7G   106G      0      0      0      0
  c0t7d0    29.7G   106G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
rpool       76.7G  59.3G      1      6  80.4K   183K
  c0t0d0s0  76.7G  59.3G      1      6  80.4K   183K
----------  -----  -----  -----  -----  -----  -----
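The L2ARC usage can be checked by summing the used and avail columns of the three cache devices. The sketch below does that with awk, using the three cache device rows copied from the output above. The l2arc_totals helper is a hypothetical name, and the sketch assumes every value carries a G suffix. The totals it prints land close to the 62.4GB and 68.7GB figures quoted above, with any small difference down to rounding in the displayed values.

```shell
#!/bin/bash
# Sketch: sum the used and avail columns for the cache devices.
# l2arc_totals is a hypothetical helper; it assumes every value ends in G.
# awk reads the leading numeric part of a string such as "20.9G" as 20.9.
l2arc_totals() {
  awk '{ used += $2; avail += $3 }
       END { printf "used=%.1fG avail=%.2fG\n", used, avail }'
}

l2arc_totals <<'EOF'
c2t1d0    20.9G  1.98G      0      0      0      0
c2t3d0    20.7G  2.15G      0      0      0      0
c2t2d0    20.8G  2.06G      0      0      0      0
EOF
```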
Here also is a sample of the memstat kernel data. The ZFS File Data row shows that the L1 ARC was limited to about 6GB.
# echo "::memstat"|mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     647543              2529    8%
ZFS File Data             1549168              6051   18%
Anon                       611003              2386    7%
Exec and libs                7761                30    0%
Page cache                  82585               322    1%
Free (cachelist)            12556                49    0%
Free (freelist)           5475766             21389   65%

Total                     8386382             32759
Physical                  8177488             31943
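The MB column can be reproduced from the page counts. The sketch below assumes 4KB pages, which is the base page size on x86 Solaris. The pages_to_mb helper is a hypothetical name.

```shell
#!/bin/bash
# Sketch: convert a ::memstat page count to MB, assuming 4KB pages by default.
# pages_to_mb is a hypothetical helper name.
pages_to_mb() {
  local pages=$1 page_kb=${2:-4}
  echo $(( pages * page_kb / 1024 ))
}

pages_to_mb 1549168   # ZFS File Data row, prints 6051
pages_to_mb 8386382   # Total row, prints 32759
```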

The key takeaway for this blog post is that directory data priming can promote more consistent directory server performance for directory service architectures that are designed to contain all of the directory data in the ZFS ARC either via sufficiently large DRAM or flash technologies.

That concludes this blog post.

Have an extremely prime day!


1 comment:

Constantin said...

Hi Brad,

excellent analysis, congratulations!

Your priming observations are not limited to LDAP servers. I'm sure that other types of databases as well as other applications with significant data sets would equally well benefit from dd priming.

The usefulness of priming the L2ARC you have found suggests that ZFS should be given subcommands to do efficient priming. One could imagine something like "zfs prime tank/fs1 /tank/fs1/data1.db /tank/fs1/data2.db ..." This would have two benefits: 1) ZFS could walk through the priming data faster than dd, and 2) ZFS would be able to mark the data as "frequently used" in the L2ARC even though it has effectively only been used once (which is the essence of priming: Make the system aware of frequently used data).

Thanks for posting this, especially the graphs show how useful L2ARC is and how priming can improve this even further.