Dashboard > ApacheDS Addons > ... > Developer's Guide > Search Issues
Search Issues
Added by Alex Karasulu, last edited by Alex Karasulu on Feb 03, 2006  (view change)
Labels: 
(None)


Search Issues

There are two related issues which need to be resolved pertaining to the search algorithm. These issues are listed below:

Both cause the search results returned to be effected by the schema. Solutions to these problems will be discussed here.

DIREVE-276 Search for super OC does not return subclasses if add op does not add complete objectClass lineage

This is a problem where for example entries of objectClass inetOrgPerson are not returned when they are in scope and a filter of (objectClass=person) is used for example. The following entry will not be matched:

objectClass: inetOrgPerson
sn: Karasulu
givenName: Alex
cn: Alex Karasulu
o: Apache Software Foundation

The following inetOrgPerson will however get matched by the filter:

objectClass: person
objectClass: inetOrgPerson
sn: Karasulu
givenName: Alex
cn: Alex Karasulu
o: Apache Software Foundation

The incorrect matching is due to literal matching in the search algorithm without taking into account objectClass inheritence.

Possible Solutions

One solution is to have the DSA manage the objectClass attributes in an entry. It would enforce the presence of all values corresponding to the lineage of the all STRUCTURAL and AUXILIARY objectClasses. Add operations would introduce missing ancestors. Delete and modify operations would enforce the presence of the lineage. This approach bloats the objectClass index which will eventually be made into a system index. Strategies for customizing this index for preformance are being formulated. Hence the bloat should not factor into the choice of the solution.

Another solution is to allow entries to exist without the entire objectClass lineage in the entry, and to use filter expansion to make sure all subclasses are selected by the filter. A disconjunction (OR) is used to select for all subclasses below the original filter expression. Using the example above the filter would be expanded like so:

(| (objectClass=person)
   (objectClass=organizationalPerson)
   (objectClass=residentialPerson)
   (objectClass=inetOrgPerson)
)

This filter would correctly select both example entries. Such a search expression is costly however that's inherent in poor search selection criteria where a general expression is used. One would hope that this is not a common occurrance. This filter expression is equivalent to conducting 4 separate search expressions. Because a disjunction is used the optimizer cannot use scan counts to guess the best plan for conducting the search. All nodes must be merged together in a union with OR'd expressions while conducting partial scans on the objectClass index. The search is conducted as follows:

  1. The optimizer will annotate each node with a scan count but this will do us no good since we have to scan for all values due to the OR expression
  2. An index enumerator is built for each leaf node to scan the objectClass index for matching values. The cursor for each leaf enumerator advances immediately in almost constant time to a matching value if one is present.
  3. All leaf index enumerators are added to a disjunction enumerator which exhausts one enumerator at a time while keeping track of duplicates and prefetching. Duplicates of index records are not returned.
  4. The disjunction enumerator is wrapped by a SearchResultFilteringEnumerator which fetches Attributes objects for each index record returned from the underlying disjunction filter. The Attributes objects are returned within SearchResult objects.
  5. Once the enumerator hierarchy is built it is returned. At this point before any results are returned at least 3 index entries will have been prefetched: one by the first leaf enumerator to be exhausted, one by the disjunction enumerator which makes the leaf enumerator prefetch again. And the third by the SearchResult enumerator which prefetches from the disjuction enumerator which in turn prefetches from the leaf enumerator. At this point the system is ready to be enumerator by a client.

For completeness the thread iterating through a loop to pull search results from the SearchResultFilteringEnumerator will drive the exhaustion of the first leaf enumerator. The disjunction filter will then switch to another enumerator and so on until all leaf enumerators have walked the index over entries matching the values they match. At this point the enumerators close from the bottom up.

As can be seen this is exactly like conduction 4 separate searches except duplicates are filtered out for you. Also there is a minute advantage where you don't have to pay a setup cost for multiple searches. The only thing cached in memory are integer ids of returned entries to prevent duplicates from being returned.

How do the two approaches compare?

The filter expansion approach presumes one or more users do not properly manage objectClasses. This means users that do add all ancestors in the objectClass attributes of entries will pay a price during search. Namely these users will have more duplicates filtered. For example there will always be overlaps between inetOrgPerson and person. So we're going to punish those that use the directory properly. Then again these people will most likely use specific filters with constrained scope . Joking aside the filter expansion approach will probably cost much more than just having the DSA inject and manage the objectClasses.

  1. Filter and schema analysis will be required during search with expansion, while similar calculation for attribute maintenance occurs during write operations which need not be as fast
  2. Filter expansion makes presumptions about user input for an entire DSA
  3. Filter expansion taxes the optimizer needlessly
  4. Filter expansion requires more memory to track duplicates
  5. Filter expansion realistically iterates over more index records during search time than does attribute maintenance
  6. Filter expansion is much more complicated

Drawbacks to using objectClass attribute maintenance:

  1. Slows down write operations a tiny bit (we can live with that)
  2. Bloats the objectClass index (think of what happens with top)

The only real concern for attribute maintenance is drawback (2). Here the objectClass index is going to have as many values for (top) as there are entries in the partition. This can be remedied by just not adding top, or making sure it is removed when entries are added or modified. On return top can be injected back into the entry's objectClass attribute. For all practical purposes search filters for (objectClass=top) or any substring expression matching top can be replaced by (objectClass=*). This saves the objectClass index a lot of hassle however it will always be a contension point and very heavily used with many duplicates for some values. This is why an optimized objectClass index is critical: the same applies to the hierarchy index but this is another story.

Decision Time

I don't like having adhoc rules like the one above to handle certain conditions but it does save us the hassle and performance hit we would take if we used filter expansion. In the end the decision to be made is clear. objectClass attribute maintenance is a better approach than filter expansion.

DIREVE-277 Less specific filters do not match entries with more specific attributes

This is the same problem as DIREVE-276 except juxtaposed onto attributes. Here filters on less specific attributes must match entries with more specific attributeType subclasses. Again using the two entires above I may try to match for (name=Alex). This filter will fail to match any of the example entries above when it should match both. Here givenName, sn, and cn all derive from name. So the non-specific filter should match both entires according to 2251.

ApacheDS again matches litterally for attributes and their values without considering the attribute hierarchy. Unfortunately unlike DIREVE-276 our options are limited. We really cannot add extra name attributes to the entry. First of all we couldn't track when to remove them to mandate the what you put in is what you get out rule. Secondly this just flat out violates the protocol. The only options we have are hard ones.

Possible Solutions

We could again use filter expansion with a disjunction. The example filter, (name=Alex), would expand out to several nodes since many attributes extend from name. However here's the expansion that would result in a correct match:

( | (cn=Alex) 
    (sn=Alex)
    (givenName=Alex)
)

This would correctly select both entries. However it has the same problems in terms of performance as does the filter expansion solution for DIREVE-276. Another approach may be to attribute attribute hierarchy awareness directly into the partition implementation, but there are severe drawbacks to this. Let's first explore how this could be done.

Adding AttributeType Hierarchy Awareness to Partitions

For the default partition implementation based on ldbm this means massive changes in the search algorithm. First all the subtypes of all attributeTypes in the filter expression must be determined. We need to do this for filter expansion as well so it's not as much of a drawback. When using indices we must make sure we use the indices of all subtypes as well in a disjunction. When resorting to master table scans we have to attempt matches on each candidate using all subtypes of every attribute in the filter. This means we must build an assertion tree of the disjunction to test the disjunction on the candidate entry.

Implementing this is error prone and will make more dense an already difficult peice of code. Furthermore all partition implementations would have to duplicate this behavoir based on their own implemenation. It's hard enough to implement new partitions. Adding this additional semantic will further burden implementors.

Furthermore the end result of these changes to the partition implementation are the same as filter expansion. There really is very little difference besides these major drawbacks.

Decision Time

It's clear the best option is with filter expansion in this case. There just is no better approach that I can see at this point in time. All are welcome to comment on new ideas for solving this problem.

Site running on a free Atlassian Confluence Open Source Project License granted to Safehaus. Evaluate Confluence today.
Powered by Atlassian Confluence, the Enterprise Wiki. (Version: 2.5.4 Build:#809 Jun 12, 2007) - Bug/feature request - Contact Administrators