While teaching a recent ForgeRock OpenDJ class, a student of mine observed an interesting behavior that at first seemed quite odd. While rebuilding his attribute indexes, the student found that the overall database size seemed to grow each time he performed a reindex operation. What seems obvious to me now sure made me scratch my head as I scrambled for an answer. I am sharing my findings here in the hopes that others will either a) find this information useful or b) find comic relief in my misfortune.
Note: If you are unclear about the information contained in OpenDJ’s database files, then I highly recommend that you read my posting entitled, Unlocking the Mystery behind the OpenDJ User Database. In that article I describe the overall structure of the Berkeley DB Java Edition database used by OpenDJ and how both entries (and indexes) are maintained in the same database.
In OpenDJ, the rebuild-index command is used to update any attribute indexes contained in the OpenDJ database. This is necessary after you make a configuration change that affects indexes (such as modifying the index entry limit). Indexes are database specific and you can elect to rebuild a single attribute index or rebuild all attribute indexes for a particular database.
The following syntax is used to rebuild ALL indexes associated with the dc=example,dc=com suffix and its use is what caused the frantic head-scratching to occur:
$ rebuild-index -h ldap.example.com -p 4444 -D "cn=Directory Manager" -w password -b dc=example,dc=com --rebuildAll
The student observed (and questioned) that every time he rebuilt the indexes, the aggregated size of the *.jdb files actually increased by a roughly constant amount. In the case of a rebuild-all, it was about 18 MB each time he ran the command; in the case of rebuilding a single index, it was only about 3 MB each time. But the increase was consistent each time he rebuilt the index(es). This continued until the database reached a certain size, at which point the consumption fell back to its original size (in our observations this occurred at roughly 200 MB when using the rebuild-all option).
The following details the output of the du -sh command on the userRoot database each time the rebuild-index command was run:
- 124 MB
- 142 MB (+ 18MB)
- 160 MB (+ 18MB)
- 178 MB (+ 18MB)
- 200 MB (+ 22MB)
- 123 MB (- 77MB)
This trend was consistent over several iterations.
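The arithmetic behind the figures above can be checked with a short sketch. The sizes are the ones listed in this post; the variable names are only illustrative.

```python
# Sizes (in MB) reported by du -sh after each rebuild-index run, as listed above.
sizes_mb = [124, 142, 160, 178, 200, 123]

# Change between each run and the one before it.
deltas = [after - before for before, after in zip(sizes_mb, sizes_mb[1:])]

# Total growth before the size dropped back down.
growth_before_drop = max(sizes_mb) - sizes_mb[0]

print(deltas)              # [18, 18, 18, 22, -77]
print(growth_before_drop)  # 76
```

In other words, roughly 76 MB of stale data accumulated over four rebuilds before it was reclaimed in a single step.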
We continued testing and observed that, in addition to the increasing size, the database files on the file system (*.jdb) were changing as well. What was once 000000002.jdb became 000000002.jdb and 000000003.jdb, and later became 000000003.jdb and 000000004.jdb. This occurred at the same time that we dropped back down to the 123 MB size and was the clue that unlocked the mystery.
Unlike the Berkeley Sleepycat database used in OpenDJ’s forefathers, when data is modified in the OpenDJ database, it is not immediately removed from the database. Instead it is marked for removal and the record essentially becomes inactive. Updated records are then appended to the end of the database in a log file fashion.
This process continues until OpenDJ cleaner threads detect that a database file contains less than 50% active records. Once that occurs, the cleaner threads migrate all active records from the file and append them to the end of the last file in the OpenDJ database (a new file is created if necessary). Once migrated, the cleaner threads delete the database file containing the stale entries.
During the rebuild process, old index values in each of the *.jdb files are marked as inactive and new indexes are added to the database. Simply marking these indexes as inactive does not eliminate their existence in the database and they continue to consume disk space. This process continues until the point where the cleaner threads detect that old indexes account for more than 50% of the database entries. At this point, the migration process occurs, new *.jdb files are created to store the new indexes, old stale *.jdb files are deleted (hence the *.jdb file name changes), and the disk space is returned.
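The grow-then-shrink pattern can be reproduced with a toy model. What follows is a minimal sketch, not the actual Berkeley DB Java Edition code: the LogStore class, its method names, the eight-record file capacity, and the per-file 50% check are all assumptions made for illustration. Each "file" is a list of records, a rewrite marks the old record inactive and appends a new one, and the cleaner deletes any older file whose active share drops below the threshold after migrating its remaining active records.

```python
class LogStore:
    """Toy append-only store (hypothetical; not the real Berkeley DB JE code)."""

    def __init__(self, file_capacity=8, min_utilization=0.5):
        self.file_capacity = file_capacity      # records per file (assumed)
        self.min_utilization = min_utilization  # cleaner threshold (assumed)
        self.files = {}      # file number -> list of [key, is_active]
        self.location = {}   # key -> (file number, position) of the active record
        self.next_file_no = 0

    def _append(self, key):
        # Start a new file when there is none yet or the newest one is full.
        if not self.files or len(self.files[max(self.files)]) >= self.file_capacity:
            self.files[self.next_file_no] = []
            self.next_file_no += 1
        file_no = max(self.files)
        self.files[file_no].append([key, True])
        self.location[key] = (file_no, len(self.files[file_no]) - 1)

    def write(self, key):
        # An update never overwrites in place: the old record is only marked
        # inactive, and the new version is appended to the end of the log.
        if key in self.location:
            file_no, pos = self.location[key]
            self.files[file_no][pos][1] = False
        self._append(key)

    def clean(self):
        # Check every file except the newest one (still being written to).
        newest = max(self.files)
        for file_no in sorted(self.files):
            if file_no == newest:
                continue
            records = self.files[file_no]
            active = [key for key, is_active in records if is_active]
            if len(active) / len(records) < self.min_utilization:
                for key in active:      # migrate the active records...
                    self._append(key)
                del self.files[file_no]  # ...then delete the stale file

    def size(self):
        # Total records on "disk", active or not.
        return sum(len(records) for records in self.files.values())


store = LogStore()
for key in ["e1", "e2", "e3", "i1", "i2", "i3", "i4", "i5"]:
    store.write(key)   # three entries and five index records fill file 0

sizes = []
for index_key in ["i1", "i2", "i3", "i4", "i5"]:
    store.write(index_key)   # "rebuild" one index record
    store.clean()
    sizes.append(store.size())

print(sizes)                # [9, 10, 11, 12, 8]
print(sorted(store.files))  # [1]
```

The size climbs with every rewrite until file 0 falls below half active, at which point its three remaining entries move to file 1 and file 0 disappears. That mirrors both the sudden drop in disk usage and the change in file names described above.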
When an index is rebuilt, the whole btree is marked as deleted. But since it actually represents specific records of the database files, they will only be collected when the file itself reaches the threshold that triggers recollection.
With small databases, you will see the behavior you’re describing. With larger databases, this will be less noticeable as the number of index records will be larger and the cleanup point may be reached faster.
So there you go, mystery solved!
I recently read a Computerworld article that discussed the reluctance of physicians to share patient data with the patients themselves. The article referenced a survey conducted by Accenture and Harris Interactive that found that, of the 3,700 physicians asked, only 31% felt that patients should have full access to their own healthcare records.
“It found that 82% of U.S. physicians want patients to update their electronic health records with information about themselves, but only 31% believe patients should have full access to that record; 65% believe patients should have only limited access. Four percent said patients should have no access at all.”
This can best be represented by the following graphic from the Computerworld article:
This is old school thinking and is akin to asking someone to “show me yours and I will ‘think’ about showing you mine” (but probably won’t). How very one-sided.
When I first joined the Personal Data Ecosystem Consortium (PDEC), I did so because I believed that people should be allowed to take control of their own data. To me, “personal data” was roughly defined as identity and PII data; this was largely due to my identity background. But over the past year this has shifted towards healthcare data and while many of the same thoughts apply, the ROI on managing healthcare data can be much higher as it directly correlates to a person’s primary asset – their health.
Google Health, Microsoft HealthVault, CareZone – there is no shortage of applications designed to assist people in managing their healthcare data. While some efforts have failed, others remain hopeful. But as this survey demonstrates, there is still a long way to go to change the minds of those who are diagnosing and managing this data – the physicians themselves. If only they could understand that patients are uniquely capable of assisting in the management of their own healthcare. But in order to do so, patients need the data (and they need to understand what it means).
Over the past couple of years we have been developing applications that utilize the Lifedash platform. This allows our users to take control of their own data and selectively share it with others. Our latest application is CareSync and it is directly focused on healthcare. We are currently in a beta of the Web application and are piloting our Health Assistant services. Both of these offerings allow people to aggregate and manage their own healthcare in a collaborative environment, and to do so safely and securely. The feedback we have received from our participants has been overwhelmingly positive as people are losing faith in the healthcare system. They either want to (or feel forced to) take an active role in managing their own (or family’s) healthcare but to do so, they need the data.
With the reluctance of most physicians to share, obtaining it is challenging at best. There are, however, techniques that you can use to obtain this information, but they require persistence (the word “nagging” comes to mind). It should not be that way - after all, it is our data.
In the words of healthcare activist e-Patient Dave, just “give me my damn data!” Or as I would add, just “give me my damn data, help me to understand what you just gave me, and tell me how I compare to others in my situation!”
A humorous look at the role of a leader in any organization. If you have ever been a leader, I’m sure you can relate to this quote from an anonymous author.
As nearly everyone knows, a leader has practically nothing to do except to decide what is to be done; tell somebody to do it; listen to reasons why it should not be done or why it should be done in a different way; follow up to see if the thing has been done; discover that it has not; inquire why; listen to excuses from the person who should have done it; follow up again to see if the thing has been done, only to discover that it has been done incorrectly; point out how it should have been done; conclude that as long as it has been done, it may as well be left where it is; wonder if it is not time to get rid of a person who cannot do a thing right; reflect that the person probably has a spouse and a large family, and any successor would be just as bad and maybe worse; consider how much simpler and better matters would be now if he had done it himself in the first place; reflect sadly that he could have done it right in twenty minutes, and, as things turned out, he has had to spend two days to find out why it has taken three weeks for somebody else to do it wrong.
Did you know that Lord Nelson, England’s famous naval hero, suffered from seasickness his entire life?
“I am ill every time it blows hard and nothing but my enthusiastic love for the profession keeps me one hour at sea.”
(See YouTube video: Lord Nelson Seasickness Letter in Tunbridge Wells.)
How could the man who destroyed Napoleon’s fleet lead men into battle when he himself was fighting a battle within himself? He did so not only by learning to live with his weakness, but by learning to conquer it. And in so doing, he went on to become England’s greatest naval hero.
Most of us have situations in our own lives that challenge us on a day to day basis. These may be physical or they may be psychological, but rest assured, everyone who has ever tried to accomplish anything in life has had to overcome their own personal seasickness.
Oftentimes it is a private war, carried on quietly within our own lives. But unlike heroes like Nelson, no one will celebrate our victories, no one will recognize our successes, and no one will pin a medal to our chest for winning. But even without the fanfare from others, nothing can dim the quiet satisfaction of knowing in our own hearts that we did not give up!
The OpenDJ directory server is highly scalable and can process all sorts of requests from different types of clients over various protocols. The following diagram provides an overview of how OpenDJ processes these requests. (See The OpenDJ Architecture for a more detailed description of each component.)
Note: The following information has been taken from ForgeRock’s OpenDJ Administration, Maintenance and Tuning Class and has been used with the permission of ForgeRock.
Client requests are accepted and processed by an appropriate Connection Handler. The Connection Handler decodes the request according to the protocol (LDAP, JMX, SNMP, etc.) and either responds immediately or converts it into an LDAP Operation Object that is added to the Work Queue.
Analogy: I like to use the analogy of the drive-through window at a fast food restaurant when describing this process. You are the client making a request of the establishment. The Connection Handler is the person who takes your order; they take your request and enter it into their ordering system (the Work Queue). They do not prepare your food; their job is simply to take the order as quickly and efficiently as possible.
Worker Threads monitor and detect items on the Work Queue and respond by processing them in a first in, first out fashion. Requests may be routed or filtered based on the server configuration and then possibly transformed before the appropriate backend is selected.
Analogy: Continuing with the fast food analogy, the Worker Threads are similar to the people who prepare your food. They monitor the order system (Work Queue) for any new orders and process them in a first in, first out fashion.
Note: OpenDJ routing is currently limited to the server’s determination of the appropriate backend. In future versions, this may take on more of a proxy or virtual directory type of implementation.
The result is returned to the client by the Worker Threads using the callback method specified by the Connection Handler.
Analogy: Once your order is completed, the food (or the results of your request) is given to you by one of the Worker Threads who has been tasked with that responsibility. This is the only place where the analogy somewhat breaks down. In older fast food restaurants (ones with only one window) this may sometimes be the person who took your order in the first place. In our analogy, however, the Connection Handler never responds to your request. This model is more closely attuned to more recent fast food establishments where they have two windows and there is a clear delineation of duties between the order taker (Connection Handler) and the one who provides you with your food (the Worker Thread).
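The hand-off described above can be sketched in a few lines of Python. This is a simplified illustration of the pattern only, not OpenDJ code (OpenDJ itself is written in Java): the function names, the dictionary used as an "operation object", and the request strings are all assumptions made for the example. A single worker thread is used so that the first in, first out order is easy to see.

```python
import queue
import threading

STOP = object()  # sentinel telling the worker thread to exit


def connection_handler(raw_request, work_queue, callback):
    """Decode the request and queue it; the handler does no processing itself."""
    op_type, target = raw_request.split(maxsplit=1)
    operation = {"type": op_type.upper(), "target": target}
    # The callback travels with the operation so a worker can answer the client.
    work_queue.put((operation, callback))


def worker(work_queue):
    """Take operations off the queue in FIFO order and return the results."""
    while True:
        item = work_queue.get()
        if item is STOP:
            break
        operation, callback = item
        # Backend selection and the real processing would happen here.
        callback(f"{operation['type']} processed for {operation['target']}")


def run_demo(raw_requests):
    work_queue = queue.Queue()  # first in, first out
    results = []
    thread = threading.Thread(target=worker, args=(work_queue,))
    thread.start()
    for raw in raw_requests:
        connection_handler(raw, work_queue, results.append)
    work_queue.put(STOP)
    thread.join()
    return results


print(run_demo(["search uid=jdoe", "modify uid=asmith"]))
# ['SEARCH processed for uid=jdoe', 'MODIFY processed for uid=asmith']
```

As in the two-window drive-through, the order taker (connection_handler) only queues the request, and the result is handed back by the worker through the callback it was given.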
Other services such as access control processing (ACIs), Logging, and Monitoring provide different access points within the request processing flow and are used to control, audit, and monitor how the requests are processed.
So, what do OpenDJ and McDonald’s have in common? They are both highly efficient entities that have been streamlined to process requests in the most efficient manner possible.
I recently attended a high school reunion where a major draw involved the use of a photo booth. You remember photo booths, right? Kiosks where one or more people hide behind a curtain and take pictures of themselves in all sorts of poses. At the end of the session, the kiosk spits out copies of the pictures much to the chagrin of those who aren’t quite as photogenic as they initially thought they were. In our case, reunion attendees were treated to an assortment of funny hats, glasses, and mustaches before entering the booth. They posed with silly expressions, engaged in silly activities, and in some cases even took silly actions to the extreme (I will leave that to your own imagination).
The point I am trying to make is that once the curtain was closed and the camera light came on people began performing in ways that would be considered unheard of in other settings. Adults who mere minutes before were prim and proper were now raving exhibitionists behind the privacy of a thin veil of cloth. When the curtain was once again opened, they returned to their “normal” behavior and giggled as they left the booth with memories in hand.
So why the sudden change? How did a thin piece of cloth make any difference as to how they acted? The difference was not the curtain; the difference stemmed from their perception of privacy and the context of the situation. People tend to act differently in settings where they feel their actions are private and when the context of the situation is known, they oftentimes let their guard down and act more naturally (or more boldly as the case may be). Just think about Congressman Weiner and his Twitter outing, Alec Baldwin and his fatherly advice to his daughter, or even conversations that you may have had over email, chat, or text when you didn’t think anyone was looking. When people feel more secure in their settings (privacy) and know the rules by which to play (context), they oftentimes act in totally different ways.
The problem with this behavior in a digital society is that you are never truly off the grid and it is all too easy for things to be taken out of context when information is shared inadvertently. In our current digital society privacy is a facade, as few companies take privacy seriously and there are fewer online places where your information is truly secure. Unfortunately, that can also be said of our offline world as more and more of it is becoming digitized as well.
Even within the sacred confines of a photo booth our privacy is not really private at all. Ironically, photo booths now take digital photos which are then stored on the kiosk’s computer hard drive. While this expedites the printing process, the possibility of those photos being shared with unintended parties is very real. At least that is what I observed shortly after the reunion when pictures from the photo booth began appearing on Facebook. At first I thought that attendees were scanning their own photos and posting them. This thought was immediately dismissed when I saw my own pictures start to appear.
From what I can surmise, the operator of the photo booth provided digital copies of everyone’s photos to one of the reunion committee members who took it upon themselves to post the pictures to Facebook. I am not going to get into the legal, moral, or ethical issues behind this action, but suffice it to say, no notice was posted and no permission was granted. Now, I truly believe that those involved had the best interests of the reunion attendees in mind, but the problem is that they did not have the right to make that decision on their own.
Intersection cameras, movies on demand (on any device), automobiles that act as Wi-Fi hot spots, Internet connected scales, and yes, photo booths – these are only a few examples of how every aspect of our life is becoming affected (or even consumed) by digitalization. All of that content is finding its way into the hands of people who may have good intentions, but who do not understand the ramifications that disclosure of such information may have. As such, they may not take the same care that you or I might take with our own information and may share it with others – all under the guise of good intentions.
So what happens to our privacy when our information falls into the hands of others? Is it even possible to assume that they have our best interests in mind when their own companies make money by selling our data to the highest bidder? Can we assume that the context in which we operated is even valid when it may simply be a ruse to get us to let our guards down? Like Rip Van Winkle awaking from his 20-year slumber only to find a world that he no longer recognizes, we too must take care to resist our own apathetic slumber, or we will wake up to a world we no longer recognize.