Wednesday, January 20, 2010

DSEE 7 Entry Compression Rocks!

Update: Updated to reflect new Oracle documentation links...

The purpose of this blog is to explore the benefits of DSEE 7 entry compression. In short, the two key benefits are reducing the storage footprint by more than 50% and improving search performance by more than 50% through increased caching potential.

Although this post deals with performance, its purpose is to compare relative performance improvements, not to show the best possible performance that DSEE 7 can deliver.

The Goal
One of the most important factors for optimal directory server performance is maximizing the use of entry caching. I made the case in my previous blog post on filesystem caching strategies that maximum memory (or more specifically caching) efficiency is achieved by using the filesystem cache to store directory data.

The Problem
One of the biggest performance constraints on a directory server arises when data access moves from memory to disk. Reading and writing from disk can be an order of magnitude or two slower than from memory. The ideal configuration would fit all directory data into memory so that almost all interactions with the data go through memory instead of disk. However, for many larger directory deployments this is just not possible. Solid state disk technologies (a.k.a. flash), used through the ZFS secondary filesystem cache (a.k.a. L2ARC) and the ZFS intent log, are extending the caching potential far beyond the limitations of RAM and at a lower cost.

Another challenge is that directory entry data expands over time due to the addition of operational attributes and replication metadata. This expansion can erode caching efficiency by reducing the amount of data that can fit into memory and flash. Fortunately, DSEE has reduced this erosion by improving how the directory stores data in the db files. For example, version 5.x kept a history of changes for multi-valued attributes that caused significant entry expansion over time. With DSEE 6.x and now 7.0, entry growth over time has been greatly reduced.

Let's look at an example of how an entry can grow over time. To do this we will examine the capacity consumed by an entry in LDIF form before it gets imported into the directory, just after it is imported, and lastly after all of the attributes have been modified. In order to make the pre- and post-replicated comparison fair, I determined the average attribute data size by dividing the total size of the pre-imported attribute values (3660 bytes) by the number of modifiable attributes (68). This comes to 53.8 bytes. In order to be conservative, I used 52. Then I used SLAMD to modify all the attributes with a randomly generated 52 character value. SLAMD computes a single random 52 character value and then replaces the existing value of each of the 68 attributes with the new value.

The pre-import entry in LDIF format consumed 4992 bytes. Importing the entry into the directory and then immediately exporting it again adds operational attributes and changes the password value to a hashed value. This increases the size of the LDIF form of the entry to 5161 bytes (i.e. about 3% growth). Finally, after the entry has been modified several times, we export the data into its LDIF format again. The new size of the post-replicated entry is 9107 bytes, about 82% larger than the pre-import entry (put another way, 45.2% of the final entry is growth).
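
The percentages here depend on which size is taken as the base. A minimal Python sketch, using only the byte counts quoted in this post (the function and variable names are mine), makes the arithmetic explicit:

```python
# Byte counts quoted in the post for a single entry in LDIF form.
PRE_IMPORT = 4992
POST_IMPORT = 5161
POST_REPLICATED = 9107


def growth_pct(before, after):
    """Growth of `after` relative to `before`, as a percentage."""
    return (after - before) / before * 100.0


def share_of_final_pct(before, after):
    """Portion of the final size that is growth, as a percentage."""
    return (after - before) / after * 100.0


# Average value size that led to the 52-character SLAMD value.
avg_attr_size = 3660 / 68

print(f"average attribute size: {avg_attr_size:.1f} bytes")
print(f"import growth:          {growth_pct(PRE_IMPORT, POST_IMPORT):.1f}%")
print(f"replication growth:     {growth_pct(PRE_IMPORT, POST_REPLICATED):.1f}%")
print(f"growth share of final:  {share_of_final_pct(PRE_IMPORT, POST_REPLICATED):.1f}%")
```

Relative to the pre-import size, the modified entry grows by roughly 82%; the 45.2% figure is the share of the final 9107 bytes that is growth.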

If this LDIF representation was what determined the potential growth, the caching efficiency would be reduced by half if the server had been sized to hold only the pre-imported capacity fully in memory. Fortunately, DSEE 6 and 7 do a good job of minimizing this growth such that it isn't as much of a problem as it had been in the past.

The Discovery
In a recent DSEE 7 benchmark involving Sun's F20 PCIe flash cards (more on this in a future blog) we explored the use of LDAP entry compression to determine all of its possible benefits. Here were the most significant findings.
  1. The storage footprint was reduced by as much as 66%.
  2. The directory was able to cache greater than 50% more entries into the filesystem cache.
  3. Entry compression helped to minimize further the effects of average entry growth.
  4. The nsslapd-db-page-size could be smaller and more consistent with entry growth over time.
The primary goal of that benchmark was to explore expanding the entry caching potential with solid state devices in the ZFS secondary cache (i.e. L2ARC). If the L2ARC enables greater entry caching potential, then entry compression can extend that potential even further by holding more entries in the same amount of memory and flash.

Entry Compression Basics
Before exploring the benefits of entry compression, let's see how compression in general can help to reduce the impacts of entry growth by normalizing the capacity consumed by an entry. Note again that the LDIF form from an export is not representative of the format and storage consumed in the binary db3 files. The binary db3 form is much more compact and does not grow as much as the LDIF form. That being said, the following table lists the capacities consumed by an entry that has had several different compression algorithms applied to it. The percentages are the percentage decrease in size per compression algorithm compared to the non-compressed size.

The first row represents a pre-import entry. The pre-import entry size represents the amount of storage occupied by a single entry in LDIF format that has not yet been imported into a directory.

The second row is the post-import entry. The post-import entry represents the amount of storage consumed by a single entry after it has been imported into the directory. This post-import entry adds operational attributes and re-formats some attributes like the userPassword.

The third row is the post-replication entry. The post-replication entry represents the amount of storage occupied by an entry for which almost every attribute has been modified multiple times with a 52 character sequence of random ASCII. Note, though, that the same 52 character value was used for every attribute.

The fourth row is the scrambled entry. This scrambled entry is just a duplicate of the post-replication entry but I manually replaced each of the 52 character values with unique random values.

Table: Entry Compression Comparison
| File Contents | None | Lempel-Ziv (Z) | Gzip (gz-6) | Gzip (gz-9) | bzip2 (bz2) |
|---|---|---|---|---|---|
| Pre-import Entry | 4992 | 3592 (28.0%) | 2042 (59.1%) | 2042 (59.1%) | 2426 (51.4%) |
| Post-import Entry | 5161 | 3817 (26.0%) | 2277 (55.9%) | 2277 (55.9%) | 2646 (48.7%) |
| Post-replicated Entry | 9107 | 2752 (69.8%) | 1383 (84.8%) | 1373 (84.9%) | 1359 (85.1%) |
| Scrambled Entry | 8972 | 5665 (36.9%) | 3964 (55.8%) | 3960 (55.9%) | 4028 (55.1%) |

The key observation from this table is that, regardless of the compression algorithm applied, the size of the compressed entry stays relatively stable over the life of an entry. For example, the post-replicated entry compressed with Lempel-Ziv (the Solaris compress command) is no larger than its compressed pre-import size. The uncompressed post-replicated entry, however, is nearly 2 times the pre-import entry size.

An interesting secondary observation is that the compressed post-replicated entries are all smaller than the compressed pre-import and post-import entries, for every compression algorithm. This most likely occurred because the same randomly generated value was assigned to all 68 attributes. This creates repetitive data that is more compressible than if each attribute value were unique. In order to prove this point, I created the fourth artificial row called the scrambled entry. The scrambled row is just a duplicate of the post-replicated entry but with truly random 52 character values for each of the 68 modifiable attributes. You can see that the size is more in line with the expected growth but still proportionately smaller than the uncompressed size.
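
The effect of repeated values can be reproduced with a small Python sketch. This is not the benchmark data: the entry is synthetic, the attribute names are made up, and Python's zlib (DEFLATE, the algorithm behind gzip) and bz2 modules stand in for the command-line tools used in the table, so the absolute sizes will differ.

```python
import bz2
import random
import string
import zlib

ATTR_COUNT = 68   # modifiable attributes per entry, from the post
VALUE_LEN = 52    # characters per generated value, from the post


def random_value(rng):
    """Return a random string of VALUE_LEN ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(rng.choice(alphabet) for _ in range(VALUE_LEN))


def build_entry(values):
    """Build a synthetic LDIF-like entry; the attribute names are made up."""
    lines = [f"attr{i:02d}: {v}" for i, v in enumerate(values)]
    return ("\n".join(lines) + "\n").encode("ascii")


rng = random.Random(2010)  # fixed seed so runs are repeatable
shared = random_value(rng)
same_value_entry = build_entry([shared] * ATTR_COUNT)
unique_value_entry = build_entry([random_value(rng) for _ in range(ATTR_COUNT)])

for name, entry in (("same value", same_value_entry),
                    ("unique values", unique_value_entry)):
    gz6 = len(zlib.compress(entry, 6))
    gz9 = len(zlib.compress(entry, 9))
    bz = len(bz2.compress(entry))
    print(f"{name:13s} raw={len(entry)} zlib-6={gz6} zlib-9={gz9} bz2={bz}")
```

Both synthetic entries have the same uncompressed size, but the one that repeats a single value compresses far better, which is the same pattern as the post-replicated and scrambled rows.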

Side Bar: Entry Compression vs. ZFS Compression
I have long been an advocate for ZFS and believe strongly that each of its features brings a tremendous amount of value. The compression feature thus would be a natural consideration for this analysis. The ZFS compression feature enables per-filesystem compression. The compression options include the following:
  • off, or no compression, which is the default,
  • on, or lzjb which is a low overhead lossless compression algorithm,
  • gzip, or GNU zip, where the compression level is set to 6,
  • gzip-N, where N is an integer from 1 to 9; 1 is the fastest and 9 is the slowest but offers the best compression ratio.

With its variety of compression algorithms, it seems like ZFS would take precedence over the single Lempel-Ziv compression algorithm that is used by DSEE 7. However, there is one critical advantage that entry compression has over ZFS compression.

Entry compression's advantage is that the data is compressed before being stored on disk. This means that the storage occupied on disk, in memory, in the DSEE db cache, and in the L1ARC and L2ARC caches is the compressed size.

Conversely, ZFS compresses the data as it is being written to disk. Thus, the only place that benefits from the compression is the disk. When data is read from the compressed ZFS filesystem, the memory consumed is the uncompressed size. This pretty much rules out ZFS compression, since one of the most desirable features for maximum directory performance is caching efficiency.

Storage Footprint Reduction
Clearly one of the most important advantages of compression is reducing the amount of disk capacity required to store the data. DSEE 7 entry compression delivers, reducing the storage footprint by 50-60%.

Let's look at the baseline storage efficiencies added by compression for freshly imported 100k and 1M entry DSEE 7 directory server instances with a DB page size of 16k. The following tables compare the sum of the db and index files, excluding the changelog, measured with the following command (kudos to Terry Gardner for this script goodness).

ls -l *.db3 | grep -v cl | awk ' BEGIN { t = 0; }{ t += $5; } END { print t/1024/1024/1024; }'
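
For readers who prefer Python to awk, the following is a rough equivalent of the pipeline above. It is a sketch, not part of DSEE: the function names are mine, and it assumes (as the grep -v cl filter does) that the changelog files are the ones with "cl" in their name.

```python
from pathlib import Path


def db3_total_bytes(db_dir):
    """Sum the sizes of *.db3 files in db_dir, skipping changelog files.

    Mirrors the shell pipeline: any file whose name contains "cl" is
    excluded, which is how the one-liner filters out the changelog.
    """
    return sum(
        p.stat().st_size
        for p in Path(db_dir).glob("*.db3")
        if p.is_file() and "cl" not in p.name
    )


def to_gib(size_bytes):
    """Convert bytes to GiB, as the awk script's t/1024/1024/1024 does."""
    return size_bytes / 1024 / 1024 / 1024


if __name__ == "__main__":
    # Run from the directory holding the suffix's db3 files.
    print(f"{to_gib(db3_total_bytes('.')):.2f} GiB")
```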

Table: Post-import DSEE 7 DB Compression Comparison

| DS Version | Entry Count | Uncompressed | Compressed | % Compression |
|---|---|---|---|---|
| DSEE 7.0 | 100k | 1.57GB | 551MB | 64.9% |
| DSEE 7.0 | 1M | 15.7GB | 5.5GB | 64.97% |

Table: Post-replication DSEE 7 DB Compression Comparison

| DS Version | Entry Count | Uncompressed | Compressed | % Compression |
|---|---|---|---|---|
| DSEE 7.0 | 100k | 1.63GB | 594MB | 65% |
| DSEE 7.0 | 1M | 15.76GB | 5.57GB | 64.66% |

The Cost of Compression
The cost of compression is a tradeoff: more efficient use of disk and memory in exchange for higher CPU utilization. Each time an entry is read or written, the directory server instance uncompresses or compresses that entry, respectively, which uses more CPU than reading or writing the entry as is. The logical question is, "Is the overhead worth the tradeoff?" My emphatic response at this point in time is yes. Unfortunately, we didn't perform a detailed analysis of the difference in CPU utilization with and without compression. My anecdotal observation was that the CPU overhead was not significant enough to make a difference in overall performance; I estimate that the CPU impact is less than 10%. When the server nears full capacity at 100% CPU utilization, though, the contention for CPU will be more noticeable in terms of response time. I'm sure someone will want to study this aspect to the N-th degree. If you happen to take on this job, please share your results with the community.

Increased Caching Potential
By compressing the data before storing it on disk, the data is also in a compressed format in the filesystem cache. Consequently, caching potential improves in proportion to the on-disk compression, which is 50-60%. Since each entry occupies 50-60% less space, the same filesystem cache can hold roughly 2 to 2.5 times as many entries as with uncompressed data.
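
A quick way to sanity-check the caching gain: if every entry shrinks by a fraction r, a cache of fixed size holds 1/(1-r) times as many entries. The sketch below is illustrative arithmetic only (the function name is mine), and it assumes that all cached data compresses at the same ratio.

```python
def capacity_multiplier(size_reduction):
    """How many times more entries fit in a fixed-size cache.

    size_reduction is the fraction by which each entry shrinks
    (0.5 means entries are half their original size).
    """
    if not 0.0 <= size_reduction < 1.0:
        raise ValueError("size_reduction must be in [0, 1)")
    return 1.0 / (1.0 - size_reduction)


for reduction in (0.50, 0.60, 0.65):
    factor = capacity_multiplier(reduction)
    print(f"{reduction:.0%} smaller entries -> {factor:.2f}x entries in the same cache")
```

A 50% reduction doubles the number of entries that fit, and a 60% reduction gives 2.5 times as many, so "more than 50% more entries" is a conservative way to state the benefit.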

Let's look at some real data. I configured a DS instance without entry compression that contains 1M entries. The storage configuration consists of two ZFS pools: the first is a 6 disk striped volume containing the db, and the other is a single disk for the transaction and informational logs (i.e. access, error, and audit). The sum of the db3 files excluding the changelog consumed approximately 15.7GB. I constrained the primary cache (a.k.a. L1ARC) of the ZFS filesystem to 6GB so that the data couldn't all fit into the L1ARC cache. Then I ran three different SLAMD SearchRate jobs to determine the throughput of each configuration. The graph descriptions below spell out the results, search span and average resource consumption.

Without Entry Compression

Left Graph: 10459 ops/sec - 10k of 1M entries without flash - 100% busy CPU - 0% busy disk
Middle Graph: 3175 ops/sec - 1M of 1M entries without flash - 68% busy CPU - 99% busy disk
Right Graph: 5675 ops/sec - 1M of 1M entries with flash - 100% busy CPU - 1% busy disk

Next I reconfigured the same 1M entry DS instance with entry compression enabled. The sum of the db3 files excluding the changelog consumed approximately 5.5GB. Again, the L1ARC was constrained to 6GB. I ran the same three SLAMD SearchRate jobs to determine the throughput of each configuration. The graph descriptions below spell out the results, search span and average resource consumption.

With Entry Compression
Left Graph: 10513 ops/sec - 10k of 1M entries without flash - 100% busy CPU - 0% busy disk
Middle Graph: 9672 ops/sec - 1M of 1M entries without flash - 100% busy CPU - 24% busy disk
Right Graph: 9749 ops/sec - 1M of 1M entries with flash - 100% busy CPU - 0% busy disk

Note that the middle graph experienced a small degree of disk activity. This is because the db3 files are not the only data using the L1ARC cache. Thus, not all of the 5.52GB of db3 data fit into the 6GB of L1ARC, which slightly reduced the throughput. However, as you can see in the right graph, as soon as the data was cached by the L2ARC (i.e. solid state flash device), the disks dropped back to 0% busy and the throughput increased.

As you can see from these results, implementing DSEE 7 entry compression improved performance and flash extended the performance benefit even further.
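
To put numbers on that, here is a short Python sketch that computes the relative gain for each scenario from the ops/sec figures listed above. Only the throughput values come from the benchmark; the dictionary layout and function name are mine.

```python
# Search throughput (ops/sec) reported above for the 1M-entry instance:
# (without entry compression, with entry compression)
RESULTS = {
    "10k of 1M entries, no flash": (10459, 10513),
    "1M of 1M entries, no flash": (3175, 9672),
    "1M of 1M entries, with flash": (5675, 9749),
}


def improvement_pct(uncompressed, compressed):
    """Throughput gain of the compressed run over the uncompressed run."""
    return (compressed - uncompressed) / uncompressed * 100.0


for scenario, (plain, compressed) in RESULTS.items():
    print(f"{scenario:30s} {plain:6d} -> {compressed:6d} ops/sec "
          f"({improvement_pct(plain, compressed):+.1f}%)")
```

When the searched entries already fit in cache (the 10k span), compression makes almost no difference (about +0.5%). Across the full 1M entries, throughput roughly triples without flash (about +205%) and improves by about 72% with flash.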

Simplifying DB Page Size
One of the DSEE 7 configuration settings that needs to be set correctly in order to deliver maximum throughput is the nsslapd-db-page-size (a.k.a. DB page size) attribute. The DB page size is used by the Sleepycat DB when storing directory entries on disk. Maximum throughput is achieved when each entry fits within a single page, without overflow. This is achieved by configuring the page size to be at least 4 times larger than the average post-replicated entry size. If the size of the key or the data exceeds 25% of the DB page size, then instead of placing that data into the B-Tree leaf page, the database creates an overflow page that will only contain data from that one entry.
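
The 25% rule can be sketched as a small Python helper. This is an approximation for illustration only: the function names and the example entry sizes are hypothetical, and the exact threshold reported by db_stat (the Key/Data Size values shown later in this post) is slightly below a quarter of the page.

```python
PAGE_SIZES = (4096, 8192, 16384, 32768)  # page sizes examined in this post


def overflows(entry_bytes, page_size):
    """Apply the 25% rule: data larger than a quarter page goes to overflow."""
    return entry_bytes > page_size / 4


def smallest_page_size(avg_entry_bytes):
    """Smallest candidate page size at least 4x the average entry size."""
    for size in PAGE_SIZES:
        if not overflows(avg_entry_bytes, size):
            return size
    raise ValueError("entry is too large for the candidate page sizes")


# Hypothetical stored entry sizes in bytes, not measurements from the benchmark.
for label, entry_size in (("uncompressed", 5000), ("compressed", 1800)):
    print(f"{label}: {entry_size} bytes -> page size {smallest_page_size(entry_size)}")
```

With these made-up sizes the uncompressed entry needs a 32k page while the compressed one fits the rule at 8k. Real sizing should be based on the stored (db3) entry size after replication, not the LDIF size.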

Let's determine the optimum DB page size for our entries. To do this we run the db_stat command from Sleepycat that corresponds exactly to the DSEE 7 platform and version that we are running. See this link to learn more about overflow pages and how to run db_stat to determine whether your DSEE 7 db3 files are in an overflow condition. Note that it is not recommended that you run db_stat on your own: the version and patch level of db_stat required to show the proper data must exactly match the version used to build the respective directory server version. With that very important caveat out of the way, the following table lists the overflow page results for various DB page sizes for fresh imports (post-import) and after having modified all the entries (i.e. post-replication) of a 100k entry directory.

Table: Uncompressed Overflow Analysis of id2entry for 100k Entry
| DSEE 7 DB Page Size | Post-import Key/Data Size | Post-import Overflow Pages | Post-replication Key/Data Size | Post-replication Overflow Pages |
|---|---|---|---|---|
| 4096 (4k) | 1007 | 200000 | 1007 | 200000 |
| 8192 (8k) | 2031 | 100000 | 2031 | 100000 |
| 16384 (16k) | 4079 | 100000 | 4079 | 100000 |
| 32768 (32k) | 8175 | 0 | 8175 | 0 |

Now, let's consider how entry compression might impact the DB page size. If the entry is compressed before it is stored on disk, then the data that must fit within a page will be correspondingly smaller. Thus, we can most likely use a smaller DB page size. Further, note that the compressed entry size remains more constant over time even as the uncompressed entry grows. Thus, the probability of needing to change the DB page size in the future is much smaller. The table below looks at the same data as in the previous table but this time with entry compression enabled.

Table: Compressed Overflow Analysis of id2entry for 100k Entry

| DSEE 7 DB Page Size | Post-import Key/Data Size | Post-import Overflow Pages | Post-replication Key/Data Size | Post-replication Overflow Pages |
|---|---|---|---|---|
| 4096 (4k) | 1007 | 100000 | 1007 | 100000 |
| 8192 (8k) | 2031 | 100000 | 2031 | 0 |
| 16384 (16k) | 4079 | 0 | 4079 | 0 |
| 32768 (32k) | 8175 | 0 | 8175 | 0 |

Note that the post-replication overflow pages for 8k dropped from 100% to 0% because, once all of the entries were modified, the greater compression for larger entries kicked in and fit the entries back into a single page.

A Friendly Reminder
After you have determined the optimal DB page size, configure the ZFS recordsize (or UFS blocksize) of all directory related filesystems to match the DB page size. This applies to all filesystems containing the db, change log, or transaction log, or even the operational logs (e.g. access, audit, and error). For example, if the DB page size is set to 16384, then the ZFS recordsize should be set to 16k.

Command Reference
Now that you realize the potential of entry compression, let's look at how to implement it. Enabling entry compression is as simple as running the following two commands with the appropriate parameters.

dsconf set-suffix-prop -p port-number dc=example,dc=com compressed-entries:all
dsconf set-suffix-prop -p port-number dc=example,dc=com compression-mode:DSZ
However, based on the DB page size analysis mentioned earlier, you may want to discover the new optimum DB page size with compression enabled in your lab environment before rolling it out into production.

Here is a sample command for setting the ZFS recordsize of a specific ZFS filesystem.
zfs set recordsize=16k db/ds1

The net takeaway from this post is that DSEE 7 entry compression can significantly reduce your storage footprint, increase caching efficiency, mitigate the impacts of entry growth over time and enable you to more stably tune the DB page size.

I hope you have a very compressed and cache efficient day!

PS: Huge thanks to Benoit Chaffanjon and the Sun Benchmark Center for allowing us to use their facility and expert assistance to do these benchmarks. It would not have been possible without their contributions.

Special thanks also to all of the contributors from directory engineering, directory marketing, and specialists from the field. These great folks include the following: Pedro Vazquez, Ludovic Poitou, Arnaud Lacour, Mark Craig, Fabio Pistolesi, Nick Wooler, Etienne Remillon, Wajih Ahmed, Jeffery Tye and Terry Gardner.



Benoitf said...

Interesting... I'm planning to move from 5.2 to 7.0, and to optimize our DS environment (about 1.2 million entries, planning to grow up to 2.5 million)

About DSEE optimization, I'm trying pretty hard to figure out what is the best CPU for running DSEE?

T Series?
M Series?

Someone on a Sun forum states that T series is best for DSEE doing mostly read requests, while M would be better for write requests...

I'm talking about a database that would fit in the server memory, so I/O is not in the picture here...


chaffanjon said...

Both SPARC64VII and Intel Nehalem are excellent choices for running DSEE. The M3000@2.75GHz deserves your special attention...


Brad Diggs said...

CMT, SPARC64VII, Nehalem and AMD Opteron are all good choices depending on your requirements. CMT has slower single thread performance but offers many threads and great power savings. SPARC64VII, Nehalem and AMD Opteron all offer good single thread performance. However, SPARC64VII and AMD Opteron seem to scale better once you get past 4 cores.


Benoitf said...

chaffanjon, Brad, thanks for your answers!

Actually, I was wondering :

Will DSEE perform better with :

1. A slower clock speed, with more cores/threads (CMT)
2. A faster clock speed, with less cores/threads (SPARC64VII)

This is in the context where the DSEE server performs mostly reads (more than 99% of the requests are reads), and I/O is not an issue (the database fits in the RAM).

At first, I was thinking about going with the M3000, but I was wondering if a CMT processor might perform better (since I can get, for example, a T5120 and a M3000 for about the same price).


Brad Diggs said...

If I were choosing between M3000 and CMT, I would choose M3000 because it will offer better response time per operation. This is mainly because most of my customers have rather rigid response time criteria. However, we have had customers for which power consumption per rack unit was a much higher priority. For that case, CMT will be a much better solution.

Hope that helps!


Benoitf said...


Our customer really doesn't care (yet!) about power consumption, but is really concerned about response-time, so I'm going to try out the M3000.

Thanks for your help!