In Zope/Plone applications, one often asks the catalog for an intersection of results from several indexes. For example, suppose I want all items where all of the following are true:
- Subject field is “sports”
- SearchableText matches the keyword query “stadium disab*”
- Path is equal to ‘/foo/bar’
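As a rough sketch, those three criteria map onto one catalog query in which each keyword names an index. The index names `Subject`, `SearchableText`, and `path` are the usual ones in a stock Plone catalog; the helper `build_query` and the `portal` variable in the comment are hypothetical, and the catalog call itself stays in a comment because it needs a running Plone site.

```python
def build_query(subject, text, path):
    """Return keyword arguments for a catalog search.

    Each key names an index; the catalog intersects the per-index results.
    """
    return {
        "Subject": subject,
        "SearchableText": text,
        "path": path,
    }


query = build_query("sports", "stadium disab*", "/foo/bar")

# Inside Plone (assumption: `portal` is the site root), roughly:
#   catalog = portal.portal_catalog
#   brains = catalog(**query)
```

How strictly the path criterion matches (the item alone, or the item plus everything beneath it) depends on the path index in use, so treat that part of the sketch as an assumption to check against your own site.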
There are three typical ways to achieve this in a user interface: (1) an advanced search form/page, (2) collections (saved searches), and (3) drill-down filters within faceted navigation.
This writeup focuses on the third case, faceted search: why it is useful, why it is difficult, and how something effective might be implemented in Plone.
What facets are:
- Aspects of a set of search results, usually a sub-set of results.
- Often, clickable links to refine your search within a larger set of results.
With faceted search navigation, filtering existing search results is the name of the game. If I search for a term (often a full text search), it may be helpful to ask me whether I want to view only known subsets of the search results I am presented. For example, in a search for “Lincoln,” it may be helpful to see clickable filters beside the results that let me choose to search only the Automotive category, or the History category. It may also be helpful to limit my search to items published within the last year.
Faceted search is often used in a variety of web applications, but is especially useful for large collections of news and information, product reviews, and local search (entertainment guides, business directories). The more structured your metadata, the more likely faceted navigation is helpful to you. With free services like Calais popping up, getting better metadata on even unstructured content is easier to automate.
Unlike navigating using taxonomies within controlled vocabularies, there is no distinct linear hierarchy to this. Applied in Plone, most applications choosing to use this pattern will treat each index as a facet and rely on the portal_catalog machinery to return the results. It is a useful oversimplification to say such a navigation strategy is the iterative set-intersection of multiple results by end-users. The same is true for faceted navigation of search results, or simple collection navigation.
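To make the "iterative set-intersection" idea concrete, here is a minimal sketch in plain Python. It does not use the real portal_catalog; the `INDEXES` dictionary, its record ids, and the `search` function are made-up stand-ins for per-index result sets.

```python
# Toy stand-in for catalog indexes: each index maps a value to the set of
# record ids (integers here) that carry that value. All data is made up.
INDEXES = {
    "category": {"Automotive": {1, 2, 3, 4}, "History": {5, 6, 7}},
    "year": {"2008": {2, 3, 5}, "2007": {1, 4, 6, 7}},
    "text": {"lincoln": {1, 2, 3, 5, 6}},
}


def search(filters):
    """Intersect the record-id sets for every (index, value) filter."""
    result = None
    for index_name, value in filters.items():
        matches = INDEXES[index_name].get(value, set())
        result = set(matches) if result is None else result & matches
    return result if result is not None else set()


# Each click adds one more filter, narrowing the previous result set.
step1 = search({"text": "lincoln"})
step2 = search({"text": "lincoln", "category": "Automotive"})
step3 = search({"text": "lincoln", "category": "Automotive", "year": "2008"})
```

The order in which the user clicks the filters does not change the final set, which is why there is no single linear hierarchy to walk.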
What might this look like? Figure 1 shows possible click behavior and permutations of three different search criteria.
Figure 1 - faceted navigation results in a need for result set intersections. We might want to cache these…
Faceted navigation is more like browsing than searching, most of the time. Clickable filters are easy. But with the ability to hyperlink comes the ability to abuse it. And with good metadata often comes large vocabularies of possible choices. Two good UI guidelines apply here:
- Do not show links to facets/filters that have no results, even if the filter in question is a popular term in your vocabulary of possible choices. If it is not germane to the results already presented to a user, omit the useless link from your navigation.
- Related: show the number of results in each facet link in the navigation. If you are searching automobiles and have a facet for color, and the results page the user is already on lists seven red cars, thirty-two blue cars, and three black cars, show those numbers within the respective links. If there are zero purple cars in the result set, just omit the link labeled “purple” per the first guideline.
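Both guidelines come down to one operation per facet value: intersect the current result set with the ids indexed under that value, then keep only the non-empty counts. A minimal sketch with made-up data follows; `facet_counts`, `results`, and `color_index` are hypothetical names, not part of any Plone API.

```python
def facet_counts(result_ids, facet_index):
    """Count how many current results fall under each facet value.

    `facet_index` maps a facet value to the set of record ids carrying it.
    Values with no overlap are dropped so no dead-end link is rendered.
    """
    counts = {}
    for value, ids in facet_index.items():
        n = len(result_ids & ids)
        if n:
            counts[value] = n
    return counts


# Made-up data: 42 cars in the current results, indexed by color.
results = set(range(42))
color_index = {
    "red": set(range(0, 7)),
    "blue": set(range(7, 39)),
    "black": set(range(39, 42)),
    "purple": {100, 101},  # in the vocabulary, but not in these results
}

links = facet_counts(results, color_index)
# links == {"red": 7, "blue": 32, "black": 3}; "purple" is omitted
```

Note that every facet value costs one intersection against the full result set, so the work grows with both the vocabulary size and the number of results.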
Figure 2 - faceted navigation. This particular example is an application some folks I work with built using Django and PostgreSQL, with stored procedures used to offload some of the work of creating link set counts from the application.
Both of these common-sense user interface constraints pose challenges. Running a single search that intersects several indexes is one thing (not a problem), but it turns out that getting those counts for all the various permutations of facet choices is tricky and expensive. Getting result counts means querying the catalog, and if you have dozens of clickable facet links on your pages, and you are filtering through, say, 100,000 results, you have a problem. Actually you have dozens of problems, each as big as the result set they contain. There’s one good reason one might need a memory watchdog.
To add insult to injury, what if all this expense tied up resources on your application server, sharing such burden with the threads rendering your pages? Such collapsing of burdens is hard to escape given the way Zope is designed (index operations happen in the same thread and on the same hardware as the rest of the application’s execution). A possible solution is an external catalog, much like the one Nuxeo used with CPS via NXLucene, and possibly like what will soon be possible with Plone and Solr (collective.solr and collective.indexing).
This isn’t meant to sound bleak. I built a system like this that scaled okay with 30,000 items (throwing hardware at the problem). Trying to make the same system work for nearly 200,000 items (local yellow pages listings) never made it into production. I write this because I’ve seen what does (to some extent) and does not work, and have a few ideas on how the situation can be improved. And if those do not work, we can steal^W borrow ideas from Solr.
This problem does not have easy answers, but there are ways to improve this, including caching the metadata for each facet permutation (read: set intersections) using an out-of-band cache-warmer (possibly an external thread talking to a cache that is not thread-local?). Asynchronous invalidation notifications to such a cache could happen within writes to Zope-based content (read: event subscribers and a message queue). Such solutions help partition the work away from an in-process bottleneck inside a Plone-based system.
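One way the caching idea might look, as a sketch under several assumptions: counts are cached per facet permutation, a warmer fills the cache outside the request path, and content writes post invalidation messages naming the (index, value) pairs they touched. The class `FacetCountCache`, the `slow_count` placeholder, and the use of an in-process `queue.Queue` in place of a real message queue are all hypothetical.

```python
import queue


class FacetCountCache:
    """Cache of result counts keyed by a facet permutation (hypothetical)."""

    def __init__(self, compute):
        self._compute = compute  # callable: frozenset of filters -> int
        self._counts = {}
        self.invalidations = queue.Queue()  # stands in for a message queue

    @staticmethod
    def key(filters):
        # Filter order does not matter for an intersection, so normalize it.
        return frozenset(filters.items())

    def get(self, filters):
        k = self.key(filters)
        if k not in self._counts:
            self._counts[k] = self._compute(k)
        return self._counts[k]

    def warm(self, permutations):
        """Precompute counts; meant to run outside the request threads."""
        for filters in permutations:
            self.get(filters)

    def process_invalidations(self):
        """Drop cached entries that use any (index, value) pair that changed."""
        while True:
            try:
                changed = self.invalidations.get_nowait()
            except queue.Empty:
                break
            stale = [k for k in self._counts if changed in k]
            for k in stale:
                del self._counts[k]


calls = []


def slow_count(key):
    calls.append(key)
    return len(key)  # placeholder for a real, expensive catalog query


cache = FacetCountCache(slow_count)
cache.warm([
    {"category": "Automotive"},
    {"category": "Automotive", "year": "2008"},
])
cache.get({"year": "2008", "category": "Automotive"})  # served from cache
cache.invalidations.put(("year", "2008"))  # e.g. sent by an event subscriber
cache.process_invalidations()
```

After the invalidation, only the permutation that involves the changed pair is recomputed on the next request; the other cached count survives. A real deployment would also need to decide which pairs a given content change affects, which this sketch leaves open.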
In a follow-up post, I will detail some more specific ideas I have about addressing this area. There is more potential than downside; we just need to be careful of the “gotchas.”