Typical Progression
They may start by loading the file with ldapmodify. They quickly realize that a large LDIF file can take a long time to load because ldapmodify processes the LDIF sequentially rather than in parallel. A customer who is a programmer may write a script or program that splits the load across parallel threads. This works fairly well at first but can overload the directory service with writes. Parallelizing can also process prerequisite objects out of order and cause errors during the load. For example, if one thread tries to add entries to a branch of the directory that does not yet exist, those adds will fail.
Bulk Loading Best Practices
So what can be done to solve this conundrum? They can apply several bulk loading best practices that will enable them to safely and reliably load the data as quickly as possible. Below are the bulk loading best practices that I have found to enable safe, reliable and fast bulk loading of LDIF data over the LDAP protocol.
- Make sure that the requisite Directory Information Tree (DIT) is in place before leaf entries are created.
- Monitor the data load progress to ensure that it does not exceed a specified upper boundary of operations per time interval.
- Monitor the directory service to ensure that general query response time is not increasing due to the bulk load.
- Monitor bulk load client and server utilization to ensure that the resources do not exceed a specified capacity.
- Modulate the flow of updates in order to stay within specified or monitored performance metrics.
- Have the ability to pause a bulk load that is in progress. This is very important because some customers have a fixed maintenance window in which to complete a bulk load. If the load looks like it is going to exceed the maintenance window, you need the ability to pause the load and then resume it in the next maintenance window.
- Segment the data to be loaded so that if interrupted, the amount of data to be re-played is minimal. For example, interrupting the load of a file containing 1M entries will be more problematic than a file containing only 100 entries.
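Several of the practices above (segmenting, throttling, and pause/resume) can be combined in one small control loop. The Python sketch below is only an illustration under stated assumptions. The send_chunk callback is hypothetical and stands in for whatever actually writes to the directory (for example, a wrapper around ldapmodify). The pause file, checkpoint file, and default rate limit are placeholder names and values, not settings of any real tool.

```python
import os
import time


def chunked(records, chunk_size):
    """Yield (index, chunk) pairs of at most chunk_size records each."""
    for n, start in enumerate(range(0, len(records), chunk_size)):
        yield n, records[start:start + chunk_size]


def read_checkpoint(path):
    """Return the index of the next chunk to load (0 if no checkpoint exists)."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except FileNotFoundError:
        return 0


def run_load(records, send_chunk, chunk_size=1000, max_ops_per_sec=500.0,
             pause_file="bulkload.pause", checkpoint_file="bulkload.ckpt",
             clock=time.monotonic, sleep=time.sleep):
    """Load records chunk by chunk; return True once every chunk was sent."""
    resume_at = read_checkpoint(checkpoint_file)
    started = clock()
    sent = 0
    for n, chunk in chunked(records, chunk_size):
        if n < resume_at:
            continue  # already loaded in an earlier run
        if os.path.exists(pause_file):
            return False  # operator asked for a pause; the checkpoint is current
        send_chunk(chunk)  # hypothetical callback that performs the LDAP writes
        sent += len(chunk)
        # Record progress only after the chunk succeeded, so an interrupted
        # run replays at most one chunk.
        with open(checkpoint_file, "w") as fh:
            fh.write(str(n + 1))
        # Throttle: sleep until the average rate is back under the limit.
        delay = started + sent / max_ops_per_sec - clock()
        if delay > 0:
            sleep(delay)
    return True
```

With this design, creating the pause file stops the load at the next chunk boundary, and running the same command again in the next maintenance window (after removing the pause file) resumes from the checkpoint. A real loader would also want to adjust the rate based on the monitored server response time rather than use a fixed limit.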
Reference Example
I researched this problem years ago and wrote a bash script that incorporates most of these bulk loading best practices. This script is not intended to be a production-ready tool for loading your data but rather an educational example of one way to apply these best practices. You can download the DSBulkLoader script here.
To see the usage of the DSBulkLoader, run it with the -H flag.
Note that the script depends on the bash shell and the GNU stream editor (gsed). gsed was necessary because the Solaris sed breaks under duress. See the script help (-H) output for more information on this topic.
Usage Examples
Here are a few examples of its syntax.
Example 1: Add 100,000 new subscriber entries.
The following command will add 100,000 new subscriber entries to the directory service. The bulk loader will run as many as 4 (default) threads simultaneously. Each thread will process 1,000 entries (default chunk size) before closing the LDAP connection for that thread.
# DSBulkLoader -f add100kSubscribers.ldif -h dsM1 -w /.pwf
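To make the thread and chunk numbers concrete: 100,000 entries at 1,000 entries per chunk gives 100 chunks, with at most 4 being processed at any moment. The Python sketch below mimics that model with a thread pool. It is not how the DSBulkLoader script is implemented (that is a bash script), and load_chunk is a hypothetical callback that would open a connection, write one chunk, and close the connection.

```python
from concurrent.futures import ThreadPoolExecutor


def load_in_parallel(records, load_chunk, threads=4, chunk_size=1000):
    """Split records into chunks and run load_chunk on up to `threads` at once."""
    chunks = [records[i:i + chunk_size]
              for i in range(0, len(records), chunk_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() returns results in chunk order, even if chunks finish out of order.
        return list(pool.map(load_chunk, chunks))
```

Here the defaults of 4 threads and 1,000 entries per chunk simply mirror the defaults mentioned above.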
Example 2: Add a single new attribute to all 100,000 subscriber entries.
The following command will add a single new attribute to all 100,000 subscriber entries. The bulk loader will run as many as 4 (default) threads simultaneously. Each thread will process 1,000 entries (default chunk size) before closing the LDAP connection for that thread.
# DSBulkLoader -f addAttr.ldif -h dsM1 -T mod -w /.pwf
Example 3: Modify a single attribute on all 100,000 subscriber entries.
The following command will modify a single attribute on all 100,000 subscriber entries. The bulk loader will run as many as 4 (default) threads simultaneously. Each thread will process 1,000 entries (default chunk size) before closing the LDAP connection for that thread.
# DSBulkLoader -f modAttr.ldif -h dsM1 -T mod -w /.pwf
Disclaimer
This script is in no way supported by me or my employer, Oracle. No warranty or guarantee is available for your use of the script. If you screw up your data, it's your fault.
That does it for this blog post. I hope that you find it useful.
Brad
PS: As always, the sample scripts provided are for reference and are not supported in any way.