Monday, March 1, 2010

Measuring network throughput via nicstat

When conducting directory services benchmark analysis, one of the areas that you should carefully monitor is network throughput. In several benchmarks that I have conducted, the throughput seemed very low relative to what I expected. When I looked at the network, I realized that the throughput was constrained by the limits of the network interface itself. For example, a 100Mbit network interface card (NIC) delivers at most 12.5MBytes/sec, a 1Gbit NIC at most 125MBytes/sec, and a 10Gbit NIC at most 1250MBytes/sec.
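
As a quick sanity check on those ceilings, here is a minimal sketch that converts a nominal link speed into its upper bound in MBytes/sec. It assumes decimal units (1 MByte = 1,000,000 bytes) and ignores protocol overhead, so real throughput will be lower. The function name link_ceiling_mbytes is my own and is not part of any tool.

```shell
# Convert a nominal link speed in Mbit/s to its ceiling in MBytes/s.
# Assumes decimal units and 8 bits per byte; protocol overhead is ignored.
link_ceiling_mbytes() {
    awk -v mbit="$1" 'BEGIN { printf "%.1f\n", mbit / 8 }'
}

for speed in 100 1000 10000; do
    echo "${speed} Mbit/s -> $(link_ceiling_mbytes "${speed}") MBytes/s"
done
```

If the rates reported by your monitoring tool sit close to these values, the network rather than the directory server is probably the bottleneck.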

The most effective Solaris command that I have found to succinctly measure network throughput is nicstat (thanks, Brendan). You can install CSWnicstat from Blastwave or compile it yourself. To compile nicstat, install your preferred compiler on a development server and build it there. Then copy the binary to the production server whose network bandwidth you want to monitor.

Here is the gcc invocation that I used to compile nicstat:

# gcc nicstat.c -o nicstat -lkstat -lgen -lsocket -lrt

One issue that I encountered was that there was no 10GigE NIC defined in the code. I added nxge to the list of supported interfaces on line 105, and it compiled with no problems.

Line 105 before the change:

static char *g_network[] = { "be", "bge", "ce", "ci", "dmfe", "e1000g", "el",

Line 105 after the change:

static char *g_network[] = { "nxge", "be", "bge", "ce", "ci", "dmfe", "e1000g", "el",

Here is a sample invocation:

$ nicstat -zsi nxge0 10
Time Int rKb/s wKb/s
04:38:16 nxge0 432.278 339.883
04:38:26 nxge0 461.772 1026.978
04:38:36 nxge0 465.267 1035.693
04:38:46 nxge0 468.381 1045.200
04:38:56 nxge0 459.422 1020.522
04:39:06 nxge0 453.317 1009.152
04:39:16 nxge0 466.867 1043.582
04:39:26 nxge0 466.785 1041.348
04:39:36 nxge0 466.367 1040.200
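
When a benchmark runs for a long time, it can help to reduce a captured nicstat log to its peak rates. The following is a small sketch that assumes the four-column summary layout shown above (time, interface, read rate, write rate). The function name nicstat_peaks and the file name nicstat.out are hypothetical.

```shell
# Report the peak read and write rates from captured nicstat summary output.
# Only lines that start with an HH:MM:SS timestamp are considered.
nicstat_peaks() {
    awk '$1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ {
             if ($3 + 0 > r) r = $3 + 0
             if ($4 + 0 > w) w = $4 + 0
         }
         END { printf "peak read %.3f, peak write %.3f\n", r, w }'
}

# Usage (nicstat.out is a file holding output saved from nicstat):
#   nicstat_peaks < nicstat.out
```

The peaks are reported in whatever units nicstat printed, so compare them against the interface ceiling in the same units.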

Enjoy!

Brad

Participate in a Welcome to Oracle+Sun event near you...

In case you missed it, Oracle is hosting over 70 "Oracle + Sun Welcome" events around the world. If you want to know what the combined software and hardware portfolio looks like going forward, be sure to participate in one of these events.

You can also watch the Oracle + Sun Strategy Webcast series to get a high level perspective of the various product strategies as well.

I personally plan on participating in the Dallas event on March 24th. I hope to see you there.

Have a super day!

Brad


Wednesday, February 10, 2010

Directory Data Priming Strategies

With the introduction of Sun's (and now Oracle's) flash technologies, directory services architecture is going to begin to evolve in a new direction. One possible new direction will be to emphasize using flash to extend the ZFS filesystem cache sufficiently to hold all of the directory data. In this filesystem-cache-centric directory service architecture, we want to find the most efficient path for loading the directory data into the filesystem cache. That is the focus of this blog post.

Priming the ZFS filesystem cache can significantly decrease the time required to reach optimal performance. This blog post compares, contrasts and walks you through three ZFS filesystem cache priming strategies. Note that all graphs shown represent the same exact directory server instance with the same data but primed with different methods. Further, the same SLAMD job was applied for each iteration.

Disclaimer
Although the content of this blog post speaks to delivering optimal directory server performance, it is not intended to represent the best possible performance that can be delivered by DSEE 7.

LDAP Priming
The first strategy is to let the directory service prime the filesystem cache as data is retrieved through the natural operation of the directory service. I call this method LDAP Priming because the LDAP search, modify, add and delete operations are what load data into the filesystem cache. The benefit of this method is that there is no action required to implement it. Additionally, if the memory or flash is not large enough to hold all the directory data, then LDAP priming is the only method that will ensure the most active data stays in the filesystem cache. Most current implementations today do not have sufficient memory to hold all of the data. So, this is the only practical method for those architectures to ensure that the most active data gets loaded into the cache. However, the intent of this blog post campaign is to help you begin to consider a new flash centric architecture that targets having enough flash to contain all of the directory data for optimal performance.

The problems with the LDAP priming method include the following:
  • It will take much longer to reach optimal performance because data is not systematically loaded into the filesystem cache.
  • Overall response time will be at least 2 to 4 times greater until all the data is loaded into the filesystem cache.
  • Disk I/O will be significantly higher during LDAP priming. This may reduce overall LDAP write throughput as well as inject latency into replication propagation delay.

Graph 1 below illustrates the best case LDAP priming scenario, where all of the directory data is systematically loaded through an LDAP search rate job. Note that the disks containing the data were near 98% busy until all the directory data completed loading into the filesystem cache.

Graph 1: LDAP Primed Load

dd Priming
The second priming strategy is to load the data directly into the filesystem cache via the Solaris 10 dd command. In this strategy, the dd command copies the directory server db3 files into /dev/null which in effect loads the data into the filesystem cache. The following command illustrates the basic dd syntax.
dd if=/db/telco_id2entry.db3 of=/dev/null bs=32k
The primary benefit of priming the data into the filesystem cache with dd is that once complete, the directory server will be able to retrieve all primed data from memory (e.g. DRAM) rather than from disk. The only downside to using this method is that from a best practice perspective, you may have to wait until the priming is complete before starting up the directory server instance. Otherwise, the priming process may degrade directory performance due to keeping the disks very busy while reading in all the data.
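
To prime more than one file, the dd command above can be wrapped in a loop. This is a minimal sketch, assuming all database files sit in a single directory and end in .db3. The function name prime_db_files is my own, and the 32k block size simply mirrors the example above.

```shell
# Read every db3 file in a directory once, discarding the data, so that the
# filesystem cache is populated before the directory server starts.
prime_db_files() {
    db_dir="$1"
    for f in "${db_dir}"/*.db3; do
        [ -f "${f}" ] || continue          # skip if the glob matched nothing
        dd if="${f}" of=/dev/null bs=32k 2>/dev/null && echo "primed ${f}"
    done
}

# Usage (the /db path follows the earlier example):
#   prime_db_files /db
```

Running the files one after another keeps the disk access pattern simple, at the cost of a longer wall-clock time than a parallel approach.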

Graph 2 below shows the search rate versus response time of a dd primed directory server instance. Note that the throughput and response times are consistent throughout the job as opposed to the case with LDAP priming where there is a ramp up to the optimal throughput.
Graph 2: dd Primed Load
Note that in this SLAMD job, the disks containing the directory data were less than 1% busy throughout the entire job because all of the data had been loaded into the filesystem cache.

Multi-pass dd Priming
The third option is to run dd over the data multiple times with a larger dd blocksize. Using a larger dd blocksize results in better throughput, which may translate to an overall shorter load time to prime all the directory data into the filesystem cache. For example, the following table lists the results of several iterations with different dd blocksize values.

Table 1: Priming Analysis Results

Blocksize  Passes  dd Time   % of data  LDAP Time  Avg Primed SearchRate  Avg Primed Response Time
N/A        0       0min      0%         37min      8500/sec               18ms
32k        1       62.2min   99%        2min       9145/sec               17.5ms
128k       1       18min     72%        6min       8400/sec               18ms
128k       2       34min     92%        4min       8925/sec               17.5ms
256k       4       40min     93%        8min       9067/sec               17.5ms
N/A        1       * 24min   100%       0min       * 9000/sec             * 17.5ms

* The last row in this table represents conservative estimates for priming the ZFS cache via a sub-command of the zfs or zpool commands. This represents full throttle throughput of three partitions of an F20 PCIe card where each partition drives 15MB/sec of throughput. Note however, if the F20 PCIe card was sub-divided into 7 partitions where 6 of the 7 partitions were dedicated to ZFS L2 ARC cache, the load time could theoretically be as low as 12 minutes.
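
The arithmetic behind those estimates can be sketched as follows. The data size of 64,800 MB is not a measured value; it is an assumption back-computed from the 24 minute estimate at 3 partitions of 15MB/sec each. The function name prime_minutes is hypothetical.

```shell
# Estimate priming time in minutes from data size (MB), the number of flash
# partitions, and the throughput of each partition (MB/sec).
prime_minutes() {
    awk -v mb="$1" -v parts="$2" -v rate="$3" \
        'BEGIN { printf "%.1f\n", mb / (parts * rate) / 60 }'
}

prime_minutes 64800 3 15    # three partitions
prime_minutes 64800 6 15    # six partitions dedicated to L2ARC
```

Doubling the number of partitions halves the estimate only if each partition really sustains the same throughput in parallel, which is the assumption behind the 12 minute figure.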

The first row where the blocksize value is "N/A" represents the LDAP primed data. It is critical to understand that the LDAP primed case does not represent the actual time that it would take for a production directory server instance to reach a fully primed state. This is because very few if any customers would employ a system to systematically load all directory data via LDAP searches after a server reboot.

The two key takeaways from Table 1 are as follows. First, the priming methods are intended to reduce the overall time taken to load the data into memory so that once the directory starts, the search rate throughput and response time will be near optimum values.

Second, once the directory data is fully primed, the relative throughput and response time are roughly equivalent regardless of the priming method. This finding suggests that the dd blocksize has no correlation to the nsslapd-db-page-size or ZFS recordsize.

Based on the results in Table 1 as well as other extensive tests that I have done as a part of this project, I highly recommend using the single pass 32k dd blocksize in order to load the most data into the filesystem cache prior to starting the directory server instance. This ensures the most consistent starting point for the directory server.

dd-Priming Optimizations for ZFS
At the present time, the dd priming method takes much longer than it should because the data cannot be sequentially loaded into the filesystem cache. When ZFS detects that data is being streamed into the filesystem, it stops caching the read data. Hopefully at some point in the future, the zfs or zpool commands will be extended to include a sub-command that can be used to load data into the L1 and L2 ARC caches. If this feature were implemented, I am certain that priming the data would take significantly less time. Until such a ZFS feature exists, you will have to load the data in a single pass with a dd blocksize of 32k (or less) or iterate through multiple passes with a larger dd blocksize value.

I have found the following script to work reasonably well. If the number of backgrounded processes grows significantly, you may need to inject a sleep or add a counter that limits the number of concurrent dd processes. In my case, the dd commands completed so quickly that there were never more than 10 concurrent dd processes.
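#!/usr/bin/bash
# Prime the files given as arguments by reading them in 32KB chunks,
# one backgrounded dd process per chunk.
ArcChunk=$((2**15)) # 32KB
for file in "$@"
do
    # Size in bytes; round up so the final partial chunk is read too.
    FileSize=`ls -al "${file}" | awk '{ print $5 }'`
    FileChunks=$(( (FileSize + ArcChunk - 1) / ArcChunk ))
    for (( c=0; c < FileChunks; c++ ))
    do
        # iseek is the Solaris dd operand that seeks the input by blocks.
        dd if="${file}" of=/dev/null \
            bs=${ArcChunk} \
            iseek=${c} \
            count=1 > /dev/null 2>&1 &
    done
done
wait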
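
If the number of concurrent dd processes does become a problem, one way to add the counter mentioned above is to drain each batch of background jobs before starting the next. This is only a sketch under a few assumptions: the function name prime_file_throttled and its defaults are my own, and it uses the portable skip operand rather than the Solaris-specific iseek so that it also runs with other dd implementations.

```shell
# Read a file in fixed-size chunks with at most max_jobs dd processes
# in flight at any one time. Prints the number of chunks read.
prime_file_throttled() {
    file="$1"; chunk="${2:-32768}"; max_jobs="${3:-10}"
    size=$(wc -c < "${file}")
    chunks=$(( (size + chunk - 1) / chunk ))   # round up for the last chunk
    running=0
    for (( c = 0; c < chunks; c++ )); do
        dd if="${file}" of=/dev/null bs="${chunk}" skip="${c}" count=1 \
            > /dev/null 2>&1 &
        running=$(( running + 1 ))
        if [ "${running}" -ge "${max_jobs}" ]; then
            wait                               # drain this batch first
            running=0
        fi
    done
    wait
    echo "${chunks}"
}

# Usage (path follows the earlier dd example):
#   prime_file_throttled /db/telco_id2entry.db3 32768 10
```

Batching is coarser than a true job pool because each batch waits for its slowest dd, but it needs nothing beyond the plain wait builtin.
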
SSD Partitioning
One observation that I made during this effort was that the very best write throughput that I could push through a single solid state partition was around 17MB/sec. I added 2 more partitions and was able to drive all three partitions at up to 15MB/sec each. I would have liked to re-partition the F20 card from 4 partitions to 7 and make all 7 partitions a part of the L2ARC cache. However, I did not get that done in time for this blog post. I suspect that each of the 7 partitions will be able to push 17MB/sec. The key takeaway from this finding is that more partitions may provide better overall performance than fewer partitions.

Reference Data
For reference, here is some data to provide the context of my setup.
Directory Server: X4150 with 2 2.33GHz quad-core CPUs, 32GB of DRAM, 8 10k 146GB SAS disk drives, and one F20 PCIe card. The server was configured with two non-global dsee zones. A single directory server instance was running in each zone. The two directory server instances were configured for multimaster replication. However, all search load was focused on the first of the two zones.

Here is a sample of the zpool iostat -v output. What you are looking for is the capacity of the F20 partitions (e.g. c2t[1-3]d0). Looking at this particular output, we see that 62.4GB of 68.7GB was used for L2ARC. I observed throughout all tests that there is always some unused capacity in each of the F20 partitions.
               capacity     operations    bandwidth
pool used avail read write read write
---------- ----- ----- ----- ----- ----- -----
db 104G 712G 0 0 0 0
c0t1d0 17.3G 119G 0 0 0 0
c0t2d0 17.3G 119G 0 0 0 0
c0t3d0 17.3G 119G 0 0 0 0
c0t4d0 17.3G 119G 0 0 0 0
c0t5d0 17.3G 119G 0 0 0 0
c0t6d0 17.3G 119G 0 0 0 0
c2t0d0 272K 22.9G 0 0 0 0
cache - - - - - -
c2t1d0 20.9G 1.98G 0 0 0 0
c2t3d0 20.7G 2.15G 0 0 0 0
c2t2d0 20.8G 2.06G 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
logs 29.7G 106G 0 0 0 0
c0t7d0 29.7G 106G 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
rpool 76.7G 59.3G 1 6 80.4K 183K
c0t0d0s0 76.7G 59.3G 1 6 80.4K 183K
---------- ----- ----- ----- ----- ----- -----
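
To pull the L2ARC usage out of output like this without adding it up by hand, a small awk filter is enough. This sketch assumes the layout shown above, where the cache devices follow a line starting with "cache" and end at the next dashed separator, and it assumes every used value is reported in G. The function name l2arc_used and the file name zpool.out are hypothetical.

```shell
# Sum the "used" column of the cache devices in captured zpool output.
# Assumes all cache device sizes carry the same G suffix.
l2arc_used() {
    awk '$1 == "cache" { in_cache = 1; next }
         /^-+/         { in_cache = 0 }
         in_cache      { used += $2 + 0 }
         END { printf "L2ARC used: %.1fG\n", used }'
}

# Usage (zpool.out holds the saved output):
#   l2arc_used < zpool.out
```
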
Here also is a sample output of the memstat kernel data. This shows that the L1 ARC was limited to about 6GB (see the ZFS File Data row).
# echo "::memstat"|mdb -k
Page Summary Pages MB %Tot
------------ ---------------- ---------------- ----
Kernel 647543 2529 8%
ZFS File Data 1549168 6051 18%
Anon 611003 2386 7%
Exec and libs 7761 30 0%
Page cache 82585 322 1%
Free (cachelist) 12556 49 0%
Free (freelist) 5475766 21389 65%

Total 8386382 32759
Physical 8177488 31943

Summary
The key takeaway for this blog post is that directory data priming can promote more consistent directory server performance for directory service architectures that are designed to contain all of the directory data in the ZFS ARC either via sufficiently large DRAM or flash technologies.

That concludes this blog post.

Have an extremely prime day!

Brad

Wednesday, February 3, 2010

Flash Memory Basics

In preparation for blogging on the application of flash memory technologies to directory services, this blog post will cover the basics of flash memory. This will help people begin to understand why flash through the ZFS secondary [filesystem] cache (a.k.a. L2ARC) and ZFS Intent Log (a.k.a. ZIL) can improve overall directory performance. Further, the use of flash memory and ZFS will also enable radical new directory services architectures. More on those applications in a future blog post. For now, let's review the basics of flash memory.

Flash memory is a non-volatile storage medium that is read and written to through electrical erasure and reprogramming. Flash memory has better kinetic shock absorption properties, lower power consumption and much greater IOPS than hard disk drives. This combination of features makes flash memory a great intermediate storage medium between hard disks and DRAM memory in the storage stratum hierarchy. As an aside, read Brendan Gregg's blog post on how Sun's ZFS makes harnessing the best of flash memory possible and easily accessible by all applications through the ZFS Intent Log and ZFS secondary cache (a.k.a. L2ARC).

Flash memory consists of an array of memory cells in the form of floating-gate transistors. At the present time, there are two categories of flash memory devices: single-level cell (SLC) and multi-level cell (MLC) devices. SLC devices store a single bit of information per cell and MLC devices store multiple bits of information per cell. The inherent danger that comes with MLC devices is that the flash memory loses more data per block than SLC does when memory cells fail. StorageSearch.com has a great article titled Are MLC SSDs Ever Safe in Enterprise Apps that gives a great in-depth look at this issue. Sun recognizes this tradeoff and has elected to stick with SLC based flash memory for its performance and better reliability characteristics.

The two most well known limitations of flash memory are erase-before-write and write endurance. Erase-before-write requires that an occupied data block (typically a 512KB group of 4KB pages) be erased before new data can be written to that block. The performance implication of this limitation is that the device will appear to have very high write performance until all blocks have been written to for the first time. Once the effects of erase-before-write kick in, the overall write throughput can drop significantly.

Write endurance simply means that each memory cell has a limited number of times that it can be erased before the memory cell fails. Current flash memory device write-erase-cycles range from 100,000 to 1,000,000 where MLC devices represent the low end and SLC devices are on the high end.

Flash memory producers typically employ one or more of the following strategies to address these limitations.
  • Wear Leveling techniques distribute writes across the full array of memory cells in order to avoid pockets of cell failure. Wear leveling is also used to relocate data from bad blocks to working cells.
  • Garbage Collection is employed within the flash memory controller to consolidate occupied data blocks and erase the freed blocks during idle device cycles. Garbage collection helps to prevent erase-before-write during write operations by keeping the erased block pool as large as possible. You might say that Garbage Collection is automated defragmentation while the flash memory is idle.
  • The TRIM command, when invoked by the operating system, tells the flash device to free up blocks that have been marked for deletion. For example, when you empty the trash bin of a Microsoft Windows desktop, the files are marked for deletion but not actually deleted. If Microsoft Windows is configured to run the TRIM command when files are deleted, the blocks occupied by the deleted files will be erased and made available for use by new data. If blocks were only marked as deleted and not actually erased, the flash memory device would fill up such that it would have to erase-before-write on every write operation because there wouldn't be any free (i.e. pre-erased) blocks available. Fortunately, the use of flash memory by ZFS via the ZFS intent log and ZFS secondary cache does not need to invoke the TRIM command because the data is deleted instead of just being marked for deletion.
  • Reserve Capacity is employed to ensure that the flash memory device has excess capacity for bad block management and working space for garbage collection. Sun's flash memory devices reserve approximately 25% per flash module for this purpose.
A third less known limitation of some flash memory devices is the issue of write cache reliability. The core issue here is that most flash memory devices use DRAM to buffer data before it is written to the flash memory device in order to increase throughput performance. The danger with this kind of buffering is that during a power loss, all data in DRAM cache is susceptible to being lost if not committed to flash memory in time. Sun's flash modules mitigate this problem by implementing a super-capacitor to give the on-board DRAM buffer memory time to commit all buffered data to flash memory before the DRAM power is exhausted.

Sun's Flash Modules, the F20 Flash Accelerator PCIe card, and the F5100 Flash Array, through the ZFS intent log and ZFS secondary cache, will make the power and performance of flash memory accessible to all applications, including directory services, going forward.

That wraps up this blog post.

Have a flashtastic day!

Brad

Wednesday, January 27, 2010

DSEE Import Rate Improvements through ZFS ARC

Hello all,

My friend Wajih Ahmed blogged today about how the ZFS filesystem Adaptive Replacement Cache can be used to improve DSEE import speed. I observed similar behavior during our 100M entry benchmark at the Sun Benchmark Center.

With the ZFS primarycache=all on the ZFS filesystem containing the ldif import file, it only took approximately 5 hours to import 100M entries. I attempted an import with primarycache set to none and let it run for a long time but was unwilling to wait for it to complete. I estimate that it easily would have taken 18-24 hours to complete. Instead, for fun I set the primarycache to all while the import was running. Below is the excerpt from the import output.

[18/Dec/2009:05:36:58 -0800] - import telco: Processed 18689616 entries -- average rate 1414.1/sec, recent rate 1417.3/sec, hit ratio 100% -- written to database.
[18/Dec/2009:05:37:18 -0800] - import telco: Processed 18718370 entries -- average rate 1414.1/sec, recent rate 1425.6/sec, hit ratio 100% -- written to database.
[18/Dec/2009:05:37:38 -0800] - import telco: Processed 18753933 entries -- average rate 1414.6/sec, recent rate 1607.9/sec, hit ratio 100% -- written to database.
[18/Dec/2009:05:37:58 -0800] - import telco: Processed 18879468 entries -- average rate 1422.0/sec, recent rate 4027.4/sec, hit ratio 100% -- written to database.
[18/Dec/2009:05:38:18 -0800] - import telco: Processed 18999088 entries -- average rate 1428.8/sec, recent rate 6128.9/sec, hit ratio 100% -- written to database.
[18/Dec/2009:05:38:38 -0800] - import telco: Processed 19115830 entries -- average rate 1435.4/sec, recent rate 5909.0/sec, hit ratio 100% -- written to database.
[18/Dec/2009:05:38:58 -0800] - import telco: Processed 19232989 entries -- average rate 1442.1/sec, recent rate 5847.6/sec, hit ratio 100% -- written to database.
[18/Dec/2009:05:39:18 -0800] - import telco: Processed 19348836 entries -- average rate 1448.6/sec, recent rate 5825.1/sec, hit ratio 100% -- written to database.
....
[18/Dec/2009:05:43:58 -0800] - import telco: Processed 20999677 entries -- average rate 1539.9/sec, recent rate 5883.3/sec, hit ratio 100% -- written to database.
[18/Dec/2009:05:44:18 -0800] - import telco: Processed 21117566 entries -- average rate 1546.3/sec, recent rate 5932.6/sec, hit ratio 100% -- written to database.
[18/Dec/2009:05:44:38 -0800] - import telco: Processed 21234080 entries -- average rate 1552.5/sec, recent rate 5860.1/sec, hit ratio 100% -- written to database.
[18/Dec/2009:05:44:58 -0800] - import telco: Processed 21352390 entries -- average rate 1558.9/sec, recent rate 5870.6/sec, hit ratio 100% -- written to database.

Notice that the recent and average import rates remained around 1400/sec up until I enabled the primarycache. Once enabled, the recent import rate jumped from 1400/sec to above 6000/sec in less than 60 seconds. The recent import rate then remained around 5800/sec for the remainder of the import, while the average rate climbed steadily. That is a boost of more than 4 times in performance!
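
If you want to quantify the jump from your own import log, the recent rate can be extracted from each line and compared. The following is a rough sketch that assumes the log format shown in the excerpt above. The function name recent_rate_speedup and the file name import.log are hypothetical.

```shell
# Extract the "recent rate" from each import progress line and report the
# lowest rate, the highest rate, and the ratio between them.
recent_rate_speedup() {
    sed -n 's|.*recent rate \([0-9.]*\)/sec.*|\1|p' |
    awk '{ v = $1 + 0 }
         NR == 1 { min = max = v }
         v < min { min = v }
         v > max { max = v }
         END { if (NR) printf "min %.1f max %.1f speedup %.2fx\n", min, max, max / min }'
}

# Usage (import.log holds the saved import output):
#   recent_rate_speedup < import.log
```

Comparing the extremes overstates the sustained gain slightly, so treat the ratio as an upper bound on the improvement.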

In Wajih's case, the improvement wasn't as great. I suspect that is due to the differences between our servers. As you can see from the following specs, my server was faster in every way. Further, the storage containing the LDIF import file was dramatically different as well.

Here were the specs of Wajih's server:
Server: Sun X4150
DRAM: 32GB of 667MHz DDR2
CPU Cores: 8 2.33GHz
Zpool Containing LDIF: 146GB from 1 15k SAS disk
Entry Count: 3M
LDIF File Size: 14GB

Here were the specs of my server:
Server: Sun X4270
DRAM: 72GB of 1066MHz DDR3
CPU Cores: 8 HT enabled 2.93GHz (Total threads = 16)
Zpool Containing LDIF: 2TB from 20 virtual disks supplied by Sun StorageTek 9980.
Entry Count: 100M
LDIF File Size: 556GB

Just for fun, I looked at the import rate differential for OpenDS using the same import file. The OpenDS import rate went from around 1,200/sec to 12,000/sec. That is a 10 times boost. Wow!!!

The net of both of our findings is that it is important that you put the LDIF file on a ZFS filesystem that has the primarycache enabled for optimal DSEE import rate.
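
For completeness, here is a minimal sketch of checking and enabling the property with the zfs get and zfs set sub-commands. The dataset name db/ldif is a placeholder for the filesystem that holds your LDIF file, and the ZFS variable is my own convention: by default it only prints the commands so that you can review them before running them for real.

```shell
# Dry run by default: prints the zfs commands instead of executing them.
# Set ZFS=zfs in the environment to actually run them.
ZFS="${ZFS:-echo zfs}"

enable_primarycache() {
    dataset="$1"
    ${ZFS} get primarycache "${dataset}"       # show the current setting
    ${ZFS} set primarycache=all "${dataset}"   # cache both data and metadata
}

enable_primarycache db/ldif    # db/ldif is a placeholder dataset name
```
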

Have a super day!

Brad

Wednesday, January 20, 2010

DSEE 7 Entry Compression Rocks!

The purpose of this blog post is to explore the benefits of DSEE 7 entry compression. In short, the top two benefits are reducing the storage footprint by more than 50% and extending search performance by more than 50% through increased caching potential.

Disclaimer
Although the content of this blog post applies to performance, the purpose is to compare relative performance improvement and not the best possible performance that can be delivered by DSEE 7.

The Goal
One of the most important factors for optimal directory server performance is maximizing the use of entry caching. I made the case in my previous blog post on filesystem caching strategies that maximum memory (or more specifically caching) efficiency is achieved by using the filesystem cache to store directory data.

The Problem
One of the biggest performance constraints for a directory server arises when data interaction transitions from memory to disk. Reading and writing from disk can be an order of magnitude or two slower than from memory. The ideal configuration would be to fit all directory data into memory such that almost all interactions with the data are through memory instead of disk. However, for many larger directory deployments this is just not possible. Solid state disk technologies (a.k.a. flash) through the secondary filesystem cache (a.k.a. L2ARC) and the ZFS intent log of the ZFS filesystem are extending the caching potential far beyond the limitations of RAM and at a lower cost.

Another challenge is that directory entry data expands over time due to the addition of operational attributes and replication metadata. This expansion can erode caching efficiency by reducing the amount of data that can fit into memory and flash. Fortunately, DSEE has reduced this erosion by improving how the directory server stores data in the db files. For example, version 5.x kept a history of changes for multi-valued attributes that allowed significant entry expansion over time. With DSEE 6.x and now 7.0, entry growth over time has been greatly reduced.

Let's look at an example of how an entry can grow over time. To do this we will examine the capacity consumed by an entry in LDIF form before it gets imported into the directory, just after it is imported, and lastly after all of the attributes have been modified. In order to make the pre and post-replicated comparison fair, I determined the average attribute data size by dividing the sum of the pre-imported attribute values, which came to 3660 bytes, by the number of modifiable attributes (68). This comes to 53.8 bytes. In order to be conservative, I used 52. Then I used SLAMD to modify all the attributes with a randomly generated 52 character value. SLAMD computes a single random 52 character value and then replaces the existing value of each of the 68 attributes with the new value.

The pre-import entry in LDIF format consumed 4992 bytes. Importing the entry into the directory followed immediately by exporting it again adds operational attributes and changes the password value to a hashed value. This increases the size of the LDIF form of the entry to 5161 bytes (i.e. 3.4% growth). Finally, after the entry has been modified several times we export the data into its LDIF format again. The new size of the post-replicated entry is 9107 bytes, which is 82.4% larger than the pre-import entry (put another way, 45.2% of the post-replicated entry is growth).
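
The growth relative to the pre-import size can be worked out with a one-line helper. This is just the arithmetic on the byte counts quoted above; the function name growth_pct is my own.

```shell
# Percentage growth of an entry from its size before to its size after.
growth_pct() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f%%\n", (after - before) / before * 100 }'
}

growth_pct 4992 5161    # pre-import -> post-import
growth_pct 4992 9107    # pre-import -> post-replicated
```

Measuring against the pre-import size is the relevant baseline here, because that is the size a server would typically have been sized to hold in memory.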

If this LDIF representation was what determined the potential growth, the caching efficiency would be reduced by half if the server had been sized to hold only the pre-imported capacity fully in memory. Fortunately, DSEE 6 and 7 do a good job of minimizing this growth such that it isn't as much of a problem as it had been in the past.

The Discovery
In a recent DSEE 7 benchmark involving Sun's F20 PCIe flash cards (more on this in a future blog) we explored the use of LDAP entry compression to determine all of its possible benefits. Here were the most significant findings.
  1. The storage footprint was reduced by as much as 66%.
  2. The directory was able to cache greater than 50% more entries into the filesystem cache.
  3. Entry compression helped to minimize further the effects of average entry growth.
  4. The nsslapd-db-page-size could be smaller and more consistent with entry growth over time.
The primary target of that benchmark was to explore the use of expanding entry caching potential through the use of solid state devices in the ZFS secondary cache (e.g. L2ARC). If the L2ARC enables greater entry caching potential, then entry compression can extend the caching potential even further by holding more entries in the same amount of memory and flash.

Entry Compression Basics
Before exploring the benefits of entry compression, let's see how compression in general can help to reduce the impacts of entry growth by normalizing the capacity consumed by an entry. Note again that the LDIF form from an export is not representative of the format and storage consumed in the binary db3 files. The binary db3 form is much more compact and does not grow as much as the LDIF form. That being said, the following table lists the capacities consumed by an entry that has had several different compression algorithms applied to it. The percentages are the percentage decrease in size per compression algorithm compared to the non-compressed size.

The first row represents a pre-import entry. The pre-import entry size represents the amount of storage occupied by a single entry in LDIF format that has not yet been imported into a directory.

The second row is the post-import entry. The post-import entry represents the amount of storage consumed by a single entry after it has been imported into the directory. This post-import entry adds operational attributes and re-formats some attributes like the userPassword.

The third row is the post-replication entry. The post-replication entry represents the amount of storage occupied by an entry for which almost every attribute has been modified multiple times with a 52 character sequence of random ASCII. Note though that the same 52 character value was used for every attribute.

The fourth row is the scrambled entry. This scrambled entry is just a duplicate of the post-replication entry but I manually replaced each of the 52 character values with unique random values.

Table: Entry Compression Comparison
File Contents          None  Lempel-Ziv (Z)  Gzip (gz-6)   Gzip (gz-9)   bzip2 (bz2)
Pre-import Entry       4992  3592 (28.0%)    2042 (59.1%)  2042 (59.1%)  2426 (51.4%)
Post-import Entry      5161  3817 (26.0%)    2277 (55.9%)  2277 (55.9%)  2646 (48.7%)
Post-replicated Entry  9107  2752 (69.8%)    1383 (84.8%)  1373 (84.9%)  1359 (85.1%)
Scrambled Entry        8972  5665 (36.9%)    3964 (55.8%)  3960 (55.9%)  4028 (55.1%)

The key observation from this table is that, regardless of the compression algorithm applied, the relative size of the compressed entries remains similar over the life of an entry. For example, the post-replicated size of the entry compressed with Lempel-Ziv (the Solaris compress command) is nearly the same as its compressed pre-import size. However, the uncompressed size of the post-replicated entry is nearly 2 times the pre-import entry size.

An interesting secondary observation is that the compressed post-replicated entries are all smaller than the compressed pre and post-import entries for all compression algorithms. This most likely occurred because the same randomly generated value was assigned to all 68 attributes. This creates repetitive data that is more compressible than if each attribute value were unique. In order to prove this point, I created the fourth artificial row called the scrambled entry. The scrambled row is just a duplicate of the post-replicated entry but with truly random 52 character values for each of the 68 modifiable attributes. You can see that the size is more in line with the expected growth but still proportionately smaller than the uncompressed size.
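
You can run the same kind of comparison on your own exported entries. The sketch below covers only the two gzip levels from the table and reports the percentage decrease in the same way. The function name compression_savings and the file name entry.ldif are hypothetical.

```shell
# Compare the raw size of a file with its size after gzip -6 and gzip -9.
compression_savings() {
    file="$1"
    raw=$(wc -c < "${file}")
    for level in 6 9; do
        packed=$(gzip -"${level}" -c "${file}" | wc -c)
        awk -v l="${level}" -v r="${raw}" -v p="${packed}" \
            'BEGIN { printf "gzip -%d: %d -> %d bytes (%.1f%% smaller)\n",
                            l, r, p, (1 - p / r) * 100 }'
    done
}

# Usage (entry.ldif holds a single exported entry):
#   compression_savings entry.ldif
```

Running it against both a repetitive entry and a scrambled one should reproduce the pattern described above, where repeated values compress far better than unique ones.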

Side Bar: Entry Compression vs. ZFS Compression
I have long been an advocate for ZFS and believe strongly that each of its features brings a tremendous amount of value. The compression feature thus would be a natural consideration for this analysis. The ZFS compression feature enables per-filesystem compression. The compression options include the following:
  • off, or no compression, which is the default,
  • on, or lzjb which is a low overhead lossless compression algorithm,
  • gzip, or Gnu zip where the compression level is set to 6
  • gzip-N where N represents an integer from 1 to 9. 1 is the fastest and 9 is slowest but offers the best compression ratio.
With its variety of compression algorithms, it seems like ZFS would take precedence over the single Lempel-Ziv compression algorithm that is used by DSEE 7. However, entry compression has one critical advantage over ZFS compression.

Entry compression's advantage is that the data is compressed before being stored on disk. This means that the storage occupied on disk, in memory, in the DSEE db cache, and in the L1ARC and L2ARC caches is the compressed size.

Conversely, ZFS compresses the data as it is being written to disk. Thus, the only place that benefits from the compression is the disk. When data is read from the compressed ZFS filesystem, the memory consumed is the uncompressed size. This pretty much rules out ZFS compression, since one of the most desirable features for maximum directory performance is caching efficiency.

Storage Footprint Reduction
Clearly one of the most important advantages of compression is that it reduces the amount of disk capacity required to store the data. DSEE 7 entry compression delivers, reducing the storage footprint by 50-60%.

Let's look at the baseline storage efficiencies added by compression for freshly imported 100k and 1M entry DSEE 7 directory server instances with a DB page size of 16k. The following tables compare the sum of the db and index files, excluding the changelog, as reported in GB by the following command (kudos to Terry Gardner for this script goodness).

ls -l *.db3 | grep -v cl | awk ' BEGIN { t = 0; }{ t += $5; } END { print t/1024/1024/1024; }'

Table: Post-import DSEE 7 DB Compression Comparison
DS Version  Entry Count  Uncompressed  Compressed  % Compression
DSEE 7.0    100k         1.57GB        551MB       64.9%
DSEE 7.0    1M           15.7GB        5.5GB       64.97%

Table: Post-replication DSEE 7 DB Compression Comparison
DS Version  Entry Count  Uncompressed  Compressed  % Compression
DSEE 7.0    100k         1.63GB        594MB       65%
DSEE 7.0    1M           15.76GB       5.57GB      64.66%
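
As a quick sanity check on the "% Compression" column, here is a minimal Python sketch of the arithmetic. The function name percent_reduction is my own for illustration, and I am assuming the column means "space saved as a percentage of the uncompressed size" with decimal units (1GB = 1000MB), which is what reproduces the post-import rows.

```python
def percent_reduction(uncompressed: float, compressed: float) -> float:
    """Return the space saved by compression as a percentage of the uncompressed size.

    Both arguments must use the same unit (for example, both in GB or both in MB).
    """
    if uncompressed <= 0:
        raise ValueError("uncompressed size must be positive")
    return (1.0 - compressed / uncompressed) * 100.0


# Post-import, 1M entries: 15.7GB uncompressed vs. 5.5GB compressed.
print(round(percent_reduction(15.7, 5.5), 2))   # 64.97

# Post-import, 100k entries: 1.57GB (taken as 1570MB) vs. 551MB.
print(round(percent_reduction(1570, 551), 1))   # 64.9
```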

The Cost of Compression
The cost of compression is the tradeoff between more efficient use of disk and memory and higher CPU utilization. Each time the directory server reads or writes an entry, it uncompresses or compresses that entry, respectively, which uses more CPU than reading or writing the entry as is. The logical question is, "Is the overhead worth the tradeoff?" My emphatic response at this point in time is yes. Unfortunately, we didn't perform a detailed analysis of the difference in CPU utilization between running with and without compression. My anecdotal observation was that the CPU overhead was not significant enough to make a difference in overall performance; I estimate that the CPU impact is less than 10%. When the server nears full capacity at 100% CPU utilization, though, the contention for CPU will be more noticeable in terms of response time. I'm sure someone will want to study this aspect to the N-th degree. If you happen to take on this job, please share your results with the community.

Increased Caching Potential
Because the data is compressed before it is stored on disk, it is also in compressed form in the filesystem cache. Consequently, the caching potential improves in line with the on-disk compression, which is between 50% and 60%. Put differently, entries that are 50-60% smaller let the same filesystem cache hold roughly 2 to 2.5 times as many entries as with uncompressed data.
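
One rough way to quantify the caching gain is the following Python sketch. It assumes every entry shrinks by the same fraction and ignores any per-entry cache overhead; the function name capacity_multiplier is hypothetical. If compression shrinks the data by a fraction r, a cache of fixed size holds 1 / (1 - r) times as many entries.

```python
def capacity_multiplier(reduction_pct: float) -> float:
    """Return how many times more entries fit in a fixed-size cache.

    reduction_pct is the size reduction from compression, in percent
    (for example, 50 means compressed entries are half the original size).
    """
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in the range [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


# The 50-60% reduction range quoted in this post.
for pct in (50, 60):
    print(f"{pct}% smaller entries -> {capacity_multiplier(pct):.1f}x as many entries cached")
```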

Let's look at some real data. I configured a DS instance without entry compression that contains 1M entries. The storage configuration consists of two ZFS pools, where the first is a 6-disk striped volume containing the db and the other is a single disk for the transaction and informational logs (i.e. access, error, and audit). The sum of the db3 files excluding the changelog consumed approximately 15.7GB. I constrained the primary cache (a.k.a. L1ARC) of the ZFS filesystem to 6GB so that the data couldn't all fit into the L1ARC cache. Then I ran three different SLAMD SearchRate jobs to determine the throughput of each configuration. The graph descriptions below spell out the results, search span and average resource consumption. Click on each graph to see the details of each.

Without Entry Compression
Left Graph: 10459 ops/sec - 10k of 1M entries without flash - 100% busy CPU - 0% busy disk
Middle Graph: 3175 ops/sec - 1M of 1M entries without flash - 68% busy CPU - 99% busy disk
Right Graph: 5675 ops/sec - 1M of 1M entries with flash - 100% busy CPU - 1% busy disk


Next, I reconfigured the same 1M entry DS instance with entry compression enabled. The sum of the db3 files excluding the changelog consumed approximately 5.5GB. Again, the L1ARC was constrained to 6GB. I ran the same three SLAMD SearchRate jobs to determine the throughput of each configuration. The graph descriptions below spell out the results, search span and average resource consumption. Click on each graph to see the details of each.

With Entry Compression
Left Graph: 10513 ops/sec - 10k of 1M entries without flash - 100% busy CPU - 0% busy disk
Middle Graph: 9672 ops/sec - 1M of 1M entries without flash - 100% busy CPU - 24% busy disk
Right Graph: 9749 ops/sec - 1M of 1M entries with flash - 100% busy CPU - 0% busy disk


Note that the middle graph shows a small degree of disk activity. This is because the db3 files are not the only data using the L1ARC cache, so not all of the 5.52GB of db3 data fit into the 6GB of L1ARC. This slightly reduced the throughput. However, as you see in the right graph, as soon as the data was cached by the L2ARC (i.e. solid state flash device), the disks dropped back to 0% busy and the throughput increased.

As you can see from these results, implementing DSEE 7 entry compression improved performance and flash extended the performance benefit even further.
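
To put numbers on that improvement, here is a small Python sketch that divides the compressed throughput by the uncompressed throughput for each of the three SearchRate jobs. The ops/sec figures are copied from the graph descriptions above; the dictionary layout and the speedup helper are my own for illustration.

```python
# (ops/sec without entry compression, ops/sec with entry compression)
results = {
    "10k of 1M entries, without flash": (10459, 10513),
    "1M of 1M entries, without flash": (3175, 9672),
    "1M of 1M entries, with flash": (5675, 9749),
}


def speedup(without_compression: int, with_compression: int) -> float:
    """Return the throughput ratio of the compressed run to the uncompressed run."""
    return with_compression / without_compression


for label, (uncompressed_ops, compressed_ops) in results.items():
    print(f"{label}: {speedup(uncompressed_ops, compressed_ops):.2f}x")
# 10k of 1M entries, without flash: 1.01x
# 1M of 1M entries, without flash: 3.05x
# 1M of 1M entries, with flash: 1.72x
```

The biggest gain shows up in the disk-bound case (1M of 1M entries without flash), which is where fitting more entries into the cache matters most.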

Simplifying DB Page Size
One of the DSEE 7 settings that needs to be configured correctly in order to deliver maximum throughput is the nsslapd-db-page-size (a.k.a. DB page size) attribute. The DB page size is used by the Sleepycat DB when storing directory entries on disk. Maximum throughput is achieved when each entry fits on its own page. This is achieved by configuring the page size to be at least 4 times larger than the average post-replicated entry size. If the size of the key or the data exceeds 25% of the DB page size, the database does not place that data in the B-Tree leaf page; it instead creates an overflow page that contains data from only that one entry.*
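
The "at least 4 times the average entry size" rule can be sketched in a few lines of Python. This is only a first approximation under stated assumptions: the candidate page sizes are the four examined in this post, the function name smallest_page_size is hypothetical, and the sample entry sizes are made up. The real overflow threshold reported by db_stat sits slightly below 25% of the page size (for example, 1007 bytes for a 4096-byte page), so always confirm with db_stat in a lab.

```python
# Candidate DB page sizes in bytes (the four sizes examined in this post).
PAGE_SIZES = (4096, 8192, 16384, 32768)


def smallest_page_size(avg_entry_bytes: int):
    """Return the smallest candidate page size that is at least 4x the entry size.

    Returns None when even the largest candidate is too small.
    """
    for size in PAGE_SIZES:
        if avg_entry_bytes * 4 <= size:
            return size
    return None


# Hypothetical average post-replicated entry sizes, in bytes.
print(smallest_page_size(1500))   # 8192
print(smallest_page_size(3000))   # 16384
print(smallest_page_size(9000))   # None
```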

Let's determine the optimum DB page size for our entries. To do this, we run the db_stat command from Sleepycat that corresponds exactly to the DSEE 7 platform and version that we are running. See this link to learn more about overflow pages and how to run db_stat to determine whether your DSEE 7 db3 files are in an overflow condition. Note that it is not recommended that you run db_stat on your own. The version and patch level of db_stat required to show the proper data must exactly match the version used to build the respective directory server version. With that very important caveat out of the way, the following table lists the overflow page results for various DB page sizes for fresh imports (post-import) and after all the entries have been modified (post-replication) in a 100k entry directory.

Table: Uncompressed Overflow Analysis of id2entry for 100k Entry
DSEE 7        Post-import    Post-import     Post-replication  Post-replication
DB Page Size  Key/Data Size  Overflow Pages  Key/Data Size     Overflow Pages
4096 (4k)     1007           200000          1007              200000
8192 (8k)     2031           100000          2031              100000
16384 (16k)   4079           100000          4079              100000
32768 (32k)   8175           0               8175              0

Now, let's consider how entry compression might impact the DB page size. If the entry is compressed before it is stored on disk, then the data that has to fit within a page will be correspondingly smaller. Thus, we can most likely use a smaller DB page size. Further, note that the compressed entry size stays much closer to constant over time, even as the uncompressed entry grows. Thus, the probability of needing to change the DB page size in the future is much smaller. The table below looks at the same data as in the previous table but this time with entry compression enabled.

Table: Compressed Overflow Analysis of id2entry for 100k Entry
DSEE 7        Post-import    Post-import     Post-replication  Post-replication
DB Page Size  Key/Data Size  Overflow Pages  Key/Data Size     Overflow Pages
4096 (4k)     1007           100000          1007              100000
8192 (8k)     2031           100000          2031              0
16384 (16k)   4079           0               4079              0
32768 (32k)   8175           0               8175              0

Note that the post-replication overflow pages for 8k dropped from 100% of the entries to 0% because, once all of the entries were modified, the greater compression for larger entries kicked in and fit the entries back into a single page.

A Friendly Reminder
After you have determined the optimal DB page size, configure the ZFS recordsize (or UFS blocksize) of all directory related filesystems to match the DB page size. This applies to all filesystems containing the db, change log, or transaction log, or even the operational logs (e.g. access, audit, and error). For example, if the DB page size is set to 16384, then the ZFS recordsize should be set to 16k.

Command Reference
Now that you realize the potential of entry compression, let's look at how to implement it. Enabling entry compression is as simple as running the following two commands with the appropriate parameters.

dsconf set-suffix-prop -p port-number dc=example,dc=com compressed-entries:all
dsconf set-suffix-prop -p port-number dc=example,dc=com compression-mode:DSZ
However, based on the DB page size analysis mentioned earlier, you may want to discover the new optimum DB page size with compression enabled in your lab environment before rolling it out into production.

Here is a sample command for setting the ZFS recordsize of a specific ZFS filesystem.
zfs set recordsize=16k db/ds1
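
If you script your deployments, a small Python helper along these lines can derive that command from the DB page size so the two never drift apart. This is only a sketch: the function name recordsize_command is hypothetical, it only builds the command string (it does not run zfs), and db/ds1 is the sample filesystem from the command above.

```python
def recordsize_command(db_page_size: int, filesystem: str) -> str:
    """Build the zfs command that matches the recordsize to the DB page size.

    db_page_size is in bytes and must be a power of two that is a whole
    number of kilobytes (for example, 16384 for a 16k page).
    """
    is_power_of_two = db_page_size > 0 and (db_page_size & (db_page_size - 1)) == 0
    if not is_power_of_two or db_page_size < 1024:
        raise ValueError("DB page size must be a power of two of at least 1024 bytes")
    return f"zfs set recordsize={db_page_size // 1024}k {filesystem}"


print(recordsize_command(16384, "db/ds1"))   # zfs set recordsize=16k db/ds1
```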

Conclusion
The net takeaway from this post is that DSEE 7 entry compression can significantly reduce your storage footprint, increase caching efficiency, mitigate the impacts of entry growth over time and enable you to more stably tune the DB page size.

I hope you have a very compressed and cache efficient day!

Brad
PS: Huge thanks to Benoit Chaffanjon and the Sun Benchmark Center for allowing us to use their facility and expert assistance to do these benchmarks. It would not have been possible without their contributions.

Special thanks also to all of the contributors from directory engineering, directory marketing, and specialists from the field. These great folks include the following: Pedro Vazquez, Ludovic Poitou, Arnaud Lacour, Mark Craig, Fabio Pistolesi, Nick Wooler, Etienne Remillon, Wajih Ahmed, Jeffery Tye and Terry Gardner.