An interesting question was posed on LinkedIn that asked, “If you were the architect of LinkedIn, MySpace, Facebook or other social networking sites and wanted to model the relationships amongst users and had to use LDAP, what would the schema look like?”
You can find the original post and responses here.
After reading the responses from other LinkedIn members, I felt compelled to add my proverbial $.02.
Directory Servers are simply special purpose data repositories. They are great for some applications and not so great for others. You can extend the schema and create a tree structure to model just about any kind of data for any type of application. But just because you “can” do something does not mean that you “should” do it.
The question becomes “Should you used a directory server or should you use a relational database?” For some applications a directory server would be a definite WRONG choice, for others it is clearly the RIGHT one, for yet others, the choice is not so clear. So, how do you decide?
Here are some simply rules of thumb that I have found work for me:
1) How often does your data change?
Keep in mind that directory servers are optimized for reads — this oftentimes comes at the expense of write operations. The reason is that directory servers typically implement extensive indexes that are tied to schema attributes (which by the way are tied to the application fields). So the question becomes, how often do these attributes change? If they do so often, then a directory server may not be the best choice (as you would be constantly rebuilding the indexes). If, however, they are relatively static, then a directory server would be a great choice.
2) What type of data are you trying to model?
If your data can be described in an attribute: value pair (i.e., name:Bill Nelson), then a directory server would be a good choice. If, however, your data is not so discrete, then a directory server should not be used. For instance, uploads to YouTube should NOT be kept in a directory server. User profiles in LinkedIn, however, would be.
3) Can your data be modeled in a hierarchical (tree-like) structure?
Directory servers implement a hierarchical structure for data modeling (similar to a file system layout). A benefit of a directory server is the ability to apply access control at a particular point in the tree and have that apply to all child elements in the tree structure. Additionally, you can start searching at a lower (child element) and increase your search performance times (much like selecting the proper starting point for the Unix “find” command). Relational databases cannot do this. You have to search all entries in the table. If your data lends to a hierarchical structure then a directory server might be a good choice.
I am a big fan on directory servers and have architected/implemented projects that sit 100% on top of a directory, 100% on top of relational databases, and a hybrid of both. Directory servers are extremely fast, flexible, scalable, and are able to handle the type of traffic you see on the Internet very well. Their ability to implement chaining, referrals, web services, and a flexible data modeling structure make them a very nice choice to use as a data repository to many applications, but I would not always lead with a directory server for every application.
So how do you decide which is best? It all comes down to the application, itself, and the way you want to access your data.
A site like LinkedIn might actually be modeled pretty well with a directory server as quite a bit of the content is actually static, lends well to an attribute:value pair, and can easily be modeled in a heirarchical structure. The user profiles for a site like facebook or YouTube could easily be modeled in a directory server, but I would NOT attempt to reference the YouTube or facebook uploads or the “what are you working on now” status with a directory server as it is constantly changing.
If you do decide to use a directory server, here are the general steps you should consider for development (your mileage may vary, but probably not too much):
- Evaluate the data fields that you want to access from your application
- Map the fields to existing directory server schema (extend if necessary).
- Build a heirarchical structure to model your data as appropriate (this is called the directory information tree, or DIT)
- Architect a directory solution based on where your applications reside thorughout the world (do you need one, two, or multiple directories?) and then determine how you want your data to flow through the system (chaining, referrals, replication)
- Implement the appropriate access control for attributes or the DIT in general
- Implement an effective indexing strategy to increase performance
- Test, test, test