lucene-java-3.5.0/lucene/contrib/facet/docs/userguide.html

   1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
   2 <!--
   3  Licensed to the Apache Software Foundation (ASF) under one or more
   4  contributor license agreements.  See the NOTICE file distributed with
   5  this work for additional information regarding copyright ownership.
   6  The ASF licenses this file to You under the Apache License, Version 2.0
   7  (the "License"); you may not use this file except in compliance with
   8  the License.  You may obtain a copy of the License at
   9
  10      http://www.apache.org/licenses/LICENSE-2.0
  11
  12  Unless required by applicable law or agreed to in writing, software
  13  distributed under the License is distributed on an "AS IS" BASIS,
  14  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  15  See the License for the specific language governing permissions and
  16  limitations under the License.
  17 -->
  18 <html>
  19 <title>Facet Userguide</title>
  20
  21 <!-- load stylesheet + javascript in development mode -->
  22 <link rel="stylesheet" type="text/css" href="../../../src/tools/prettify/prettify.css">
  23 <script src="../../../src/tools/prettify/prettify.js" type="text/javascript"></script>
  24
  25 <!-- load stylesheet + javascript in distribution mode -->
  26 <link rel="stylesheet" type="text/css" href="prettify.css">
  27 <script src="prettify.js" type="text/javascript"></script>
  28
  29 <script language="javascript">
  30         window.onload=function() {
  31                 prettyPrint();
  32         }
  33 </script>
  34
  35 <style>
  36 body {
  37   margin-left: 20%;
  38   width: 60%;
  39   counter-reset: section;
  40   text-align: left;
  41 }
  42
  43 h1.title {
  44   text-align: center;
  45   margin-top: 30px;
  46   font-size: 5em;
  47   line-height: 150%;
  48 }
  49
  50 h1.section {
  51   margin-top: 50px;
  52   font-size: 2.5em;
  53   counter-reset: subsection;
  54   border: 1px solid black;
  55   background-color: #D8D8D8;
  56   padding-left: 5px;
  57 }
  58
  59 h2.subsection {
  60   font-size: 2em;
  61   border: 1px solid black;
  62   background-color: #D8D8D8;
  63   padding-left: 5px;
  64 }
  65
  66 /* auto-generated heading numbers */
  67 h1.section:before {
  68 counter-increment: section;
  69 content: counter(section) ". ";
  70 }
  71
  72 h2.subsection:before  {
  73 counter-increment: subsection;
  74 content: counter(section) "." counter(subsection) " ";
  75 }
  76
  77 /* override from prettify.css - add shadow, padding etc. */
  78 pre.prettyprint {
  79   margin-left: 2%;
  80   width: 80%;
  81   padding: 5px 3px 5px 3px;
  82   /* shadow */
  83   -moz-box-shadow: 5px 5px 2px #888;
  84   -webkit-box-shadow: 5px 5px 2px #888;
  85   box-shadow: 5px 5px 2px #888;
  86 }
  87
  88 /* override from prettify.css - make keywords appear in bold */
  89 span.kwd {
  90   font-weight: bold;
  91 }
  92
  93 ol.toc a {
  94   text-decoration: none;
  95   color: blue;
  96 }
  97
  98 li.toc_first {
  99   /* margin-top: 10px; */
 100   font-size: 16px;
 101   color: blue;
 102 }
 103
 104 li.toc_second {
 105   font-size: 14px;
 106   margin-left: 15px;
 107   color: blue;
 108 }
 109
 110 /* reset style from prettify.css, so that line numbers appear in each line */
 111 li.L0,li.L1,li.L2,li.L3,li.L5,li.L6,li.L7,li.L8 {
 112   list-style-type:decimal
 113 }
 114
 115 table.code_description td {
 116   vertical-align: top;
 117 }
 118
 119 </style>
 120
 121 <body>
 122 <h1 class="title">
 123         Apache Lucene<br>
 124         Faceted Search<br>
 125         User's Guide</h1>
 126
 127 <div class="toc">
 128 <h1 class="toc">Table of Contents</h1>
 129 <ol class="toc">
 130 <li class="toc_first"><a href="#intro">Introduction</a></li>
 131 <li class="toc_first"><a href="#facet_features">Facet Features</a></li>
 132 <li class="toc_first"><a href="#facet_indexing">Indexing Categories Illustrated</a></li>
 133 <li class="toc_first"><a href="#facet_accumulation">Accumulating Facets Illustrated</a></li>
 134 <li class="toc_first"><a href="#indexed_facet_info">Indexed Facet Information</a></li>
 135 <li class="toc_first"><a href="#taxonomy_index">Taxonomy Index</a></li>
 136 <li class="toc_first"><a href="#facet_params">Facet Parameters</a></li>
 137 <li class="toc_first"><a href="#advanced">Advanced Faceted Examples</a></li>
 138 <li class="toc_first"><a href="#optimizations">Optimizations</a></li>
 139 <li class="toc_first"><a href="#concurrent_indexing_search">Concurrent Indexing and Search</a></li>
 140 </ol>
 141
 142 <h1 class="section"><a name="intro">Introduction</a></h1>
 143 <p>
 144 A category is an aspect of indexed documents which can be used to classify the
 145 documents. For example, in a collection of books at an online bookstore, categories of
 146 a book can be its price, author, publication date, binding type, and so on.
 147 <p>
 148 In faceted search, in addition to the standard set of search results, we also get facet
 149 results, which are lists of subcategories for certain categories. For example, for the
 150 price facet, we get a list of relevant price ranges; for the author facet, we get a list of
 151 relevant authors; and so on. In most UIs, when users click one of these subcategories,
 152 the search is narrowed, or drilled down, and a new search limited to this subcategory
 153 (e.g., to a specific price range or author) is performed.
 154 <p>
 155 Note that faceted search is more than just the ordinary fielded search. In fielded
 156 search, users can add search keywords like price:10 or author:"Mark
 157 Twain" to the query to narrow the search, but this requires knowledge of which
 158 fields are available, and which values are worth trying. This is where faceted search
 159 comes in: it provides a list of useful subcategories, which ensures that the user only
 160 drills down into useful subcategories and never into a category for which there are no
 161 results. In essence, faceted search makes it easy to navigate through the search results.
 162 The list of subcategories provided for each facet is also useful to the user in itself,
 163 even when the user never drills down. This list allows the user to see at one glance
 164 some statistics on the search results, e.g., what price ranges and which authors are
 165 most relevant to the given query.
 166 <p>
 167 In recent years, faceted search has become a very common UI feature in search
 168 engines, especially in e-commerce websites. Faceted search makes it easy for
 169 untrained users to find the specific item they are interested in, whereas manually
 170 adding search keywords (as in the examples above) proved too cumbersome for
 171 ordinary users, and required too much guesswork, trial-and-error, or the reading of
 172 lengthy help pages.
 173 <p>
 174 See <a href="http://en.wikipedia.org/wiki/Faceted_search">http://en.wikipedia.org/wiki/Faceted_search</a> for more information on faceted
 175 search.
 176
 177 <h1 class="section"><a name="facet_features">Facet Features</a></h1>
  178 The first faceted search capability that comes to mind is counting, but in fact
 179 faceted search is more than facet counting. We now briefly discuss the available
 180 faceted search features.
 181
 182 <h2 class="subsection">Facet Counting</h2>
 183 <p>
 184 Which of the available subcategories of a facet should a UI display? A query in a
 185 book store might yield books by a hundred different authors, but normally we'd want
  186 to display only, say, ten of those.
 187 <p>
 188 Most available faceted search implementations use counts to determine the
 189 importance of each subcategory. These implementations go over all search results for
 190 the given query, and count how many results are in each subcategory. Finally, the
 191 subcategories with the most results can be displayed. So the user sees the price ranges,
 192 authors, and so on, for which there are most results. Often, the count is displayed next
  193 to the subcategory name, in parentheses, telling the user how many results they can
  194 expect to see if they drill down into this subcategory.
 195 <p>
 196 The main API for obtaining facet counting is <code>CountFacetRequest</code>, as in the
 197 following code snippet:
 198 <pre class="prettyprint lang-java">
  199 new CountFacetRequest(new CategoryPath("author"), 10);
 200 </pre>
 201 A detailed code example using count facet requests is shown below - see
 202 <a href="#facet_accumulation">Accumulating Facets</a>.
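<p>
The counting idea described above can be sketched in plain Java, independently of the
Lucene API. This is only an illustration under stated assumptions: the class
<code>FacetCountSketch</code> and its method <code>topAuthors</code> are hypothetical names,
and each search result is represented simply by the name of its author.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the Lucene facet API.
class FacetCountSketch {

    // Counts how many results fall under each author and returns the
    // n most frequent authors, formatted as "name (count)".
    static List<String> topAuthors(List<String> authorOfEachResult, int n) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String author : authorOfEachResult) {
            Integer old = counts.get(author);
            counts.put(author, old == null ? 1 : old + 1);
        }
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        // Highest count first; ties are broken by name so the order is deterministic.
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                int byCount = b.getValue().compareTo(a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            }
        });
        List<String> top = new ArrayList<String>();
        for (int i = 0; i < entries.size() && i < n; i++) {
            top.add(entries.get(i).getKey() + " (" + entries.get(i).getValue() + ")");
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList("Mark Twain", "Jane Austen", "Mark Twain",
            "Herman Melville", "Mark Twain", "Jane Austen");
        // prints [Mark Twain (3), Jane Austen (2)]
        System.out.println(topAuthors(results, 2));
    }
}
```

<p>
Note that the sketch goes over all results, not just the top ones shown to the user,
which mirrors the description above.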
 203
 204 <h2 class="subsection"><a name="facet_association">Facet Associations</a></h2>
 205 <p>
 206 So far we've discussed categories as binary features, where a document either belongs
 207 to a category, or not.
 208 <p>
 209 While counts are useful in most situations, they are sometimes not sufficiently
 210 informative for the user, with respect to deciding which subcategory is more
 211 important to display.
 212 <p>
  213 For this, the facets package allows associating a value with a category. The search
 214 time interpretation of the associated value is application dependent. For example, a
 215 possible interpretation is as a <i>match level</i> (e.g., confidence level). This value can
 216 then be used so that a document that is very weakly associated with a certain category
 217 will only contribute little to this category's aggregated weight.
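<p>
To make the match-level interpretation concrete, here is a minimal sketch in plain Java. It
does not use the facets package: the class <code>AssociationSketch</code>, its method
<code>aggregate</code> and the sample category names are hypothetical, and summing the values
is just one possible interpretation of an association.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the Lucene facet API.
class AssociationSketch {

    // Sums the association values (here: confidence levels) per category.
    // categories[i] and confidences[i] together describe one
    // document-to-category association.
    static Map<String, Double> aggregate(String[] categories, double[] confidences) {
        Map<String, Double> weights = new HashMap<String, Double>();
        for (int i = 0; i < categories.length; i++) {
            Double old = weights.get(categories[i]);
            weights.put(categories[i], (old == null ? 0.0 : old) + confidences[i]);
        }
        return weights;
    }

    public static void main(String[] args) {
        String[] categories = {"genre/satire", "genre/satire",
            "genre/travel", "genre/travel", "genre/travel"};
        double[] confidences = {0.75, 0.5, 0.25, 0.25, 0.25};
        System.out.println(aggregate(categories, confidences));
    }
}
```

<p>
In this sample, three documents are weakly associated with <code>genre/travel</code> (total
weight 0.75) while only two are associated with <code>genre/satire</code> (total weight 1.25),
so ranking by aggregated weight and ranking by count would order the two categories differently.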
 218
 219 <h2 class="subsection"><a name="multiple_requests">Multiple Facet Requests</a></h2>
 220 <p>
 221 A single faceted accumulation is capable of servicing multiple facet requests.
  222 Programmatically, this is quite simple - wrap all the facet requests of interest into the
  223 facet-search-parameters which are passed to a facets accumulator/collector (more on
  224 these objects below). The results will consist of as many facet results as there
 225 were facet requests.
 226 <p>
  227 However, there is a delicate <b>limitation</b>: all facets maintained in the same location in
  228 the index are required to be treated the same. See the section on <a href="#indexing_params">Indexing Parameters</a>
  229 for an explanation of maintaining certain facets at certain locations.
 230
 231 <h2 class="subsection"><a name="facet_labels">Facet Labels at Search Time</a></h2>
 232 <p>
  233 Facet results always contain the facet (internal) ID and (accumulated) value. Some of
  234 the results also contain the facet label, AKA the category name. We mention this here
  235 since computing the label is a time-consuming task, and hence applications can
  236 specify with a facet request to return the top 1000 facets but to compute the label only for
  237 the top 10 facets. Computing labels for more of the facet results later does not
  238 require performing accumulation again.
 239 <p>
 240 See <code>FacetRequest.getNumResults()</code>, <code>FacetRequest.getNumLabel()</code> and
 241 <code>FacetResultNode.getLabel(TaxonomyReader)</code>.
 242
 243 <h1 class="section"><a name="facet_indexing">Indexing Categories Illustrated</a></h1>
 244 <p>
 245 In order to find facets at search time they must first be added to the index at indexing
 246 time. Recall that Lucene documents are made of fields for textual search. The addition
 247 of categories is performed by an appropriate <code>DocumentBuilder</code> - or
 248 <code>CategoryDocumentBuilder</code> in our case.
 249 <p>
 250 Indexing therefore usually goes like this:
 251 <ul>
 252 <li>For each input document:
 253 <ul>
 254 <li>Create a fresh (empty) Lucene Document</li>
 255 <li>Parse input text and add appropriate text search fields</li>
 256 <li><b>Gather all input categories associated with the document and create
 257 a CategoryDocumentBuilder with the list of categories</b></li>
 258 <li><b>Build the document - this actually adds the categories to the
 259 Lucene document.</b></li>
 260 <li>Add the document to the index</li>
 261 </ul></li>
 262 </ul>
 263 Following is a code snippet for indexing categories. The complete example can be
  264 found in class <code>org.apache.lucene.facet.example.simple.SimpleIndexer</code>.
 265 <pre class="prettyprint lang-java linenums">
 266 IndexWriter writer = ...
 267 TaxonomyWriter taxo = new DirectoryTaxonomyWriter(taxoDir, OpenMode.CREATE);
 268 ...
 269 Document doc = new Document();
 270 doc.add(new Field("title", titleText, Store.YES, Index.ANALYZED));
 271 ...
 272 List&lt;CategoryPath&gt; categories = new ArrayList&lt;CategoryPath&gt;();
 273 categories.add(new CategoryPath("author", "Mark Twain"));
 274 categories.add(new CategoryPath("year", "2010"));
 275 ...
 276 DocumentBuilder categoryDocBuilder = new CategoryDocumentBuilder(taxo);
 277 categoryDocBuilder.setCategoryPaths(categories);
 278 categoryDocBuilder.build(doc);
 279 writer.addDocument(doc);
 280 </pre>
 281 <p>
 282 We now explain the steps above, following the code line numbers:
 283 <table class="code_description">
 284 <tr>
 285         <td>(4)</td>
 286         <td>Document contains not only text search fields but also facet search
 287 information.</td>
 288 </tr>
 289 <tr>
 290         <td>(7)</td>
 291         <td>Prepare a container for document categories.</td>
 292 </tr>
 293 <tr>
 294         <td>(8)</td>
 295         <td>Categories that should be added to the document are accumulated in the
 296 categories list.</td>
 297 </tr>
 298 <tr>
 299         <td>(11)</td>
 300         <td>A <code>CategoryDocumentBuilder</code> is created, set with the appropriate list
 301 of categories, and invoked to build - that is, to populate the document
 302 with categories. It is in this step that the taxonomy is updated to contain the
 303 newly added categories (if not already there) - see more on this in the
 304 section about the <a href="#taxonomy_index">Taxonomy Index</a> below. This line could be made more
 305 compact: one can create a single <code>CategoryDocumentBuilder cBuilder</code> and reuse it like this:
 306 <pre class="prettyprint lang-java linenums">
 307 DocumentBuilder cBuilder = new CategoryDocumentBuilder(taxo);
 308 cBuilder.setCategoryPaths(categories).build(doc);
 309 </pre>
 310         </td>
 311 </tr>
 312 <tr>
 313         <td>(14)</td>
 314         <td>Add the document to the index. As a result, category info is saved also in
 315 the regular search index, for supporting facet aggregation at search time
 316 (e.g. facet counting) as well as facet drill-down. For more information on
 317 indexed facet information see below the section <a href="#indexed_facet_info">Indexed Facet Information</a>.</td>
 318 </tr>
 319 </table>
 320
 321 <h1 class="section"><a name="facet_accumulation">Accumulating Facets Illustrated</a></h1>
 322 <p>
 323 Facets accumulation reflects a set of documents over some facet requests:
 324 <ul>
 325 <li><code>Document set</code> - a subset of the index documents, usually documents
 326 matching a user query.</li>
 327 <li><code>Facet requests</code> - facet accumulation specification, e.g. count a certain facet
 328 <i>dimension</i>.</li>
 329 </ul>
 330 <p>
 331 <code>FacetRequest</code> is a basic component in faceted search - it describes the facet
 332 information need. Every facet <b>request</b> is made of at least two fields:
 333 <ul>
 334 <li><code>CategoryPath</code> - root category of the facet request. The categories that
 335 are returned as a result of the request will all be descendants of this root</li>
 336 <li><code>Number of Results</code> - number of sub-categories to return (at most).</li>
 337 </ul>
 338 <p>
  339 There are other parameters to a facet request, such as <i>how many facet results to
  340 label</i>, <i>how <b>deep</b> to go from the request root when serving the facet request</i> and
  341 more - see the API Javadocs for <code>FacetRequest</code> and its subclasses for more
 342 information on these parameters. For labels in particular, see the section <a href="#facet_labels">Facet Labels
 343 at Search Time</a>.
 344 <p>
  345 <code>FacetRequest</code> is an abstract class, open for extensions, and users may add their
 346 own requests. The most often used request is <code>CountFacetRequest</code> - used for
 347 counting facets.
 348 <p>
 349 Facets accumulation is - not surprisingly - driven by a <code>FacetsAccumulator</code>. The
 350 most used one is <code>StandardFacetsAccumulator</code>, however there are also accumulators
 351 that support sampling - to be used in huge collections, and there's an adaptive facets
 352 accumulator which applies sampling conditionally on the statistics of the data. While
  353 facets accumulators are very extensible and powerful, they might be too
 354 overwhelming for beginners. For this reason, the code offers a higher level interface
 355 for facets accumulating: the <code>FacetsCollector</code>. It extends <code>Collector</code>, and as such
 356 can be passed to the search() method of Lucene's <code>IndexSearcher</code>. In case the
 357 application also needs to collect documents (in addition to accumulating/collecting
 358 facets), it can wrap multiple collectors with <code>MultiCollector</code>. Most code samples
  359 below use <code>FacetsCollector</code> due to its simple interface. It is quite likely that
  360 <code>FacetsCollector</code> will suffice for the needs of most applications; we therefore
  361 recommend starting with it, and turning to facets accumulators directly only when
  362 more flexibility is needed.
 363 <p>
 364 Following is a code snippet from the example code - the complete example can be
  365 found under <code>org.apache.lucene.facet.example.simple.SimpleSearcher</code>:
 366 <pre class="prettyprint lang-java linenums">
 367 IndexReader indexReader = IndexReader.open(indexDir);
 368 Searcher searcher = new IndexSearcher(indexReader);
 369 TaxonomyReader taxo = new DirectoryTaxonomyReader(taxoDir);
 370 ...
 371 Query q = new TermQuery(new Term(SimpleUtils.TEXT, "white"));
  372 TopScoreDocCollector topDocsCollector = TopScoreDocCollector.create(10, true);
 373 ...
 374 FacetSearchParams facetSearchParams = new FacetSearchParams();
 375 facetSearchParams.addFacetRequest(new CountFacetRequest(
 376     new CategoryPath("author"), 10));
 377 ...
 378 FacetsCollector facetsCollector = new FacetsCollector(facetSearchParams, indexReader, taxo);
 379 searcher.search(q, MultiCollector.wrap(topDocsCollector, facetsCollector));
 380 List&lt;FacetResult&gt; res = facetsCollector.getFacetResults();
 381 </pre>
 382 <p>
 383 We now explain the steps above, following the code line numbers:
 384 <table class="code_description">
 385 <tr>
 386         <td>(1)</td>
 387         <td>Index reader and Searcher are initialized as usual.</td>
 388 </tr>
 389 <tr>
 390         <td>(3)</td>
 391         <td>A taxonomy reader is opened - it provides access to the facet information
 392 which was stored by the Taxonomy Writer at indexing time.</td>
 393 </tr>
 394 <tr>
 395         <td>(5)</td>
 396         <td>Regular text query is created to find the documents matching user need, and
 397 a collector for collecting the top matching documents is created.</td>
 398 </tr>
 399 <tr>
 400         <td>(8)</td>
 401         <td>Facet-search-params is a container for facet requests.</td>
 402 </tr>
 403 <tr>
 404         <td>(10)</td>
 405         <td>A single facet-request - namely a count facet request - is created and added
 406 to the facet search params. The request should return top 10 Author
 407 subcategory counts.</td>
 408 </tr>
 409 <tr>
 410         <td>(12)</td>
 411         <td>Facets-Collector is the simplest interface for facets accumulation (counting
 412 in this example).</td>
 413 </tr>
 414 <tr>
 415         <td>(13)</td>
  416         <td>Lucene search takes both collectors - facets-collector and top-docs-collector,
 417 both wrapped by a multi-collector. This way, a single search
 418 operation finds both top documents and top facets. Note however that facets
 419 aggregation takes place not only over the top documents, but rather over all
 420 documents matching the query.</td>
 421 </tr>
 422 <tr>
 423         <td>(14)</td>
  424         <td>Once search completes, facet-results can be obtained from the facets-collector.</td>
 425 </tr>
 426 </table>
 427
 428 <p>
 429 Returned facet results are organized in a list, conveniently ordered the same as the
 430 facet-requests in the facet-search-params. Each result however contains the request
  431 for which it was created.
 432 <p>
 433 Here is the (recursive) structure of the facet result:
 434 <ul>
 435 <li><b>Facet Result</b>
 436 <ul>
 437 <li><b>Facet Request</b> - the request for which this result was obtained.</li>
 438 <li><b>Valid Descendants</b> - how many valid descendants were encountered
 439 over the set of matching documents (some of which might have been
 440 filtered out because e.g. only top 10 results were requested).</li>
 441 <li><b>Root Result Node</b> - root facet result for the request
 442 <ul>
 443 <li><b>Ordinal</b> - unique internal ID of the facet</li>
 444 <li><b>Label</b> - full label of the facet (possibly null)</li>
 445 <li><b>Value</b> - facet value, e.g. count</li>
 446 <li><b>Sub-results-nodes</b> - child result nodes (possibly null)</li>
 447 </ul></li>
 448 </ul></li>
 449 </ul>
 450 <p>
  451 Note that sub-result nodes are not always present - this depends on the
  452 requested result mode:
 453 <ul>
 454 <li><b>PER_NODE_IN_TREE</b> - a tree, and so there may be sub results.</li>
 455 <li><b>GLOBAL_FLAT</b> - here the results tree would be rather flat, with only (at
 456 most) leaves below the root result node.</li>
 457 </ul>
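<p>
The recursive shape of a result node can be sketched as a small plain-Java class. This is a
simplified stand-in rather than the actual Lucene class: the name <code>ResultNodeSketch</code>
and its <code>render</code> helper are hypothetical, and only the four fields listed above are
modelled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; a simplified stand-in for a facet result node,
// not the actual Lucene class.
class ResultNodeSketch {
    final int ordinal;      // unique internal ID of the facet
    final String label;     // full label, or null if it was not computed
    final double value;     // facet value, e.g. a count
    final List<ResultNodeSketch> subResults = new ArrayList<ResultNodeSketch>();

    ResultNodeSketch(int ordinal, String label, double value) {
        this.ordinal = ordinal;
        this.label = label;
        this.value = value;
    }

    // Attaches a child node and returns this node, to allow chaining.
    ResultNodeSketch add(ResultNodeSketch child) {
        subResults.add(child);
        return this;
    }

    // Renders this node and, recursively, its sub-results, one node per line,
    // indented by two spaces per level. Unlabeled nodes are shown by ordinal.
    static String render(ResultNodeSketch node, int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(node.label != null ? node.label : "#" + node.ordinal);
        sb.append(" (").append((int) node.value).append(")\n");
        for (ResultNodeSketch child : node.subResults) {
            sb.append(render(child, depth + 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ResultNodeSketch root = new ResultNodeSketch(1, "author", 5)
            .add(new ResultNodeSketch(2, "author/Mark Twain", 3))
            .add(new ResultNodeSketch(7, null, 2));
        System.out.print(render(root, 0));
    }
}
```

<p>
A flat result corresponds to a root whose children have no children of their own, while a
tree result may nest to any depth; the same recursive traversal handles both.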
 458
 459 <h1 class="section"><a name="indexed_facet_info">Indexed Facet Information</a></h1>
 460 <p>
 461 When indexing a document to which categories were added, information on these
 462 categories is added to the search index, in two locations:
 463 <ul>
 464 <li><i>Category Tokens</i> are added to the document for each category attached to
 465 that document. These categories can be used at search time for drill-down.</li>
 466 <li>A special <i>Category List Token</i> is added to each document containing
 467 information on all the categories that were added to this document. This can
 468 be used at search time for facet accumulation, e.g. facet counting.</li>
 469 </ul>
 470 <p>
 471 When a category is added to the index (that is, when a document containing a
 472 category is indexed), all its parent categories are added as well. For example, indexing
 473 a document with the category <code>&lt;<span style="color: blue">"author"</span>,
  474 <span style="color: blue">"American"</span>, <span style="color: blue">"Mark Twain"</span>&gt;</code> results in
 475 creating three tokens: <code>"/author"</code>, <code>"/author/American"</code>, and
 476 <code>"/author/American/Mark Twain"</code> (the character <code>'/'</code> here is just a human
 477 readable separator - there's no such element in the actual index). This allows drilling down
 478 and counting any category in the taxonomy, and not just leaf nodes, enabling a
 479 UI application to show either how many books have authors, or how many books
 480 have American authors, or how many books have Mark Twain as their (American)
 481 author.
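<p>
The expansion of a category into one token per ancestor can be sketched as follows. The class
<code>CategoryTokensSketch</code> is a hypothetical name, and the <code>'/'</code> separator is
used only for readability, as in the text above; it is not how tokens are encoded in the index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the Lucene facet API.
class CategoryTokensSketch {

    // Returns one token for every prefix of the given category path,
    // from the top-level component down to the full path.
    static List<String> tokens(String... components) {
        List<String> result = new ArrayList<String>();
        StringBuilder prefix = new StringBuilder();
        for (String component : components) {
            prefix.append('/').append(component);
            result.add(prefix.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [/author, /author/American, /author/American/Mark Twain]
        System.out.println(tokens("author", "American", "Mark Twain"));
    }
}
```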
 482 <p>
  483 Similarly, drill-down is possible in this way for non-leaf (node) categories as well.
 484 <p>
  485 In order to keep the counting list compact, it is built using category ordinals - an
 486 ordinal is an integer number attached to a category when it is added for the first time
 487 into the taxonomy.
 488 <p>
  489 For ways to further alter the facet index, see the section below on <a href="#indexing_params">Facet Indexing
 490 Parameters</a>.
 491
 492 <h1 class="section"><a name="taxonomy_index">Taxonomy Index</a></h1>
 493 <p>
 494 The taxonomy is an auxiliary data-structure maintained side-by-side with the regular
 495 index to support faceted search operations. It contains information about all the
  496 categories that ever existed in any document in the index. Its API is open and allows
  497 simple usage, or more advanced usage for interested users.
 498 <p>
 499 When a category is added to a document, a corresponding node is added to the
 500 taxonomy (unless already there). In fact, sometimes more than one node is added -
 501 each parent category is added as well, so that the taxonomy is maintained as a Tree,
 502 with a virtual root.
 503 <p>
  504 So, for the above example, adding the category <code>&lt;<span style="color: blue">"author"</span>,
  505 <span style="color: blue">"American"</span>, <span style="color: blue">"Mark Twain"</span>&gt;</code>
  506 actually adds three nodes: one for <code>"/author"</code>, one for <code>"/author/American"</code> and one for
 507 <code>"/author/American/Mark Twain"</code>.
 508 <p>
  509 An integer number, called an ordinal, is attached to each category the first time the
 510 category is added to the taxonomy. This allows for a compact representation of
 511 category list tokens in the index, for facets accumulation.
 512 <p>
 513 One interesting fact about the taxonomy index is worth knowing: once a category
 514 is added to the taxonomy, it is never removed, even if all related documents are
 515 removed. This differs from a regular index, where if all documents containing a
 516 certain term are removed, and their segments are merged, the term will also be
  517 removed. This might cause a performance issue: a large taxonomy means large ordinal
  518 numbers for categories, and hence large category value arrays would be maintained
  519 during accumulation. It is probably not a real problem for most applications, but be
  520 aware of this. If, for example, an application at a certain point in time removes an
  521 index entirely in order to recreate it, or if it removes all the documents from the index
  522 in order to re-populate it, it also makes sense at this opportunity to remove the
  523 taxonomy index and create a new, fresh one, without the unused categories.
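<p>
The way ordinals are handed out, and the fact that the taxonomy only grows, can be sketched
with a plain map. This is not the taxonomy writer itself: the class <code>TaxonomySketch</code>
is hypothetical, and giving the virtual root the ordinal 0 is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the actual taxonomy implementation.
class TaxonomySketch {
    private final Map<String, Integer> ordinals = new LinkedHashMap<String, Integer>();

    TaxonomySketch() {
        // Virtual root; using ordinal 0 for it is an assumption of this sketch.
        ordinals.put("/", 0);
    }

    // Adds the category and any missing ancestors, and returns the ordinal
    // of the category itself. Existing categories keep their ordinals.
    int addCategory(String... components) {
        StringBuilder path = new StringBuilder();
        int ordinal = 0;
        for (String component : components) {
            path.append('/').append(component);
            String key = path.toString();
            Integer existing = ordinals.get(key);
            if (existing == null) {
                existing = ordinals.size(); // next free ordinal
                ordinals.put(key, existing);
            }
            ordinal = existing;
        }
        return ordinal;
    }

    // Number of categories ever added, including the virtual root.
    // There is deliberately no remove operation: the taxonomy never shrinks.
    int size() {
        return ordinals.size();
    }

    public static void main(String[] args) {
        TaxonomySketch taxo = new TaxonomySketch();
        System.out.println(taxo.addCategory("author", "American", "Mark Twain"));
        System.out.println(taxo.addCategory("author", "American"));
        System.out.println(taxo.size());
    }
}
```

<p>
Since accumulation keeps an array indexed by ordinal, the value returned by
<code>size()</code> in this sketch corresponds to the array length that a single-partition
accumulation would need.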
 524
 525 <h1 class="section"><a name="facet_params">Facet Parameters</a></h1>
 526 <p>
 527 Facet parameters control how categories and facets are indexed and searched. Apart
 528 from specifying facet requests within facet search parameters, under default settings it
 529 is not required to provide any parameters, as there are ready to use working defaults
 530 for everything.
 531 <p>
  532 However, many aspects are configurable and can be modified by providing altered
 533 facet parameters for either search or indexing.
 534
 535 <h2 class="subsection"><a name="indexing_params">Facet Indexing Parameters</a></h2>
 536 <p>
  537 Facet Indexing Parameters are consulted during indexing. Among the several
  538 parameters they define, the following two are likely to interest many applications:
 539 <ul>
 540 <li><b>Category list definitions</b> - in the index, facets are maintained in two
 541 forms: category-tokens (for drill-down) and category-list-tokens (for
  542 accumulation). This parameter allows specifying, for each category, the
  543 Lucene term used for maintaining the category-list-tokens for that category.
  544 The default implementation in <code>DefaultFacetIndexingParams</code> maintains
  545 this information for all categories under the same special dedicated term.
  546 One case where two categories need to be maintained in separate category
  547 lists is when it is known that at search time different types of accumulation
  548 logic will be required for each, but within the same accumulation
  549 call.</li>
 550 <li><b>Partition size</b> - category lists can be maintained in a partitioned way. If,
 551 for example, the partition size is set to 1000, a distinct sub-term is used for
 552 maintaining each 1000 categories, e.g. term1 for categories 0 to 999, term2
 553 for categories 1000 to 1999, etc. The default implementation in
 554 <code>DefaultFacetIndexingParams</code> maintains category lists in a single
  555 partition, hence it defines the partition size as <code>Integer.MAX_VALUE</code>. This
  556 parameter is important because it allows handling very large
  557 taxonomies without exhausting RAM resources. This is because at facet
  558 accumulation time, facet values arrays are maintained in the size of the
  559 partition. With a single partition, the size of these arrays equals the size of the
  560 taxonomy, which might be OK for most applications. Limited partition
  561 sizes allow performing the accumulation with less RAM, but with some
 562 runtime overhead, as the matching documents are processed for each of the
 563 partitions.</li>
 564 </ul>
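<p>
The trade-off behind the partition size can be sketched in plain Java. The class
<code>PartitionSketch</code> and its methods are hypothetical, and mapping an ordinal to a
partition by integer division is an assumption that simply follows the 0 to 999, 1000 to 1999
example above.

```java
// Hypothetical illustration only; not part of the Lucene facet API.
class PartitionSketch {

    // Index of the partition that holds the given category ordinal.
    static int partitionOf(int ordinal, int partitionSize) {
        return ordinal / partitionSize;
    }

    // Counts only the ordinals that belong to one partition. The counts array
    // has partitionSize entries, regardless of how large the taxonomy is.
    static int[] countPartition(int[] matchingOrdinals, int partition, int partitionSize) {
        int[] counts = new int[partitionSize];
        for (int ordinal : matchingOrdinals) {
            if (partitionOf(ordinal, partitionSize) == partition) {
                counts[ordinal % partitionSize]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] matchingOrdinals = {5, 1005, 1005, 2500};
        // The matching ordinals are scanned once per partition: less RAM, more passes.
        for (int partition = 0; partition < 3; partition++) {
            int[] counts = countPartition(matchingOrdinals, partition, 1000);
            System.out.println("partition " + partition + ": array length " + counts.length);
        }
    }
}
```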
 565 <p>
 566 See the API Javadocs of <code>FacetIndexingParams</code> for additional configuration
 567 capabilities which were not discussed here.
 568
 569 <h2 class="subsection"><a name="search_params">Facet Search Parameters</a></h2>
 570 <p>
 571 Facet Search Parameters, consulted at search time (during facets accumulation) are
 572 rather plain, providing the following:
 573 <ul>
 574 <li><b>Facet indexing parameters</b> - which were in effect at indexing time -
 575 allowing facets accumulation to understand how facets are maintained in
 576 the index.</li>
 577 <li><b>Container of facet requests</b> - the requests which should be accumulated.</li>
 578 </ul>
 579
 580 <h2 class="subsection"><a name="category_lists_multiple_dimensions">Category Lists, Multiple Dimensions</a></h2>
 581 <p>
 582 Category list parameters which are accessible through the facet indexing parameters
 583 provide the information about:
 584 <ul>
 585 <li>Lucene Term under which category information is maintained in the index.</li>
 586 <li>Encoding (and decoding) used for writing and reading the categories
 587 information in the index.</li>
 588 </ul>
 589 <p>
  590 For cases when certain categories should be maintained in a different location than
  591 others, use <code>PerDimensionIndexingParams</code>, which returns a different
  592 <code>CategoryListParams</code> object for each <i>dimension</i>. This is a good opportunity to
  593 explain dimensions. A dimension is just a notion: the top element - or first element - in
  594 a category path is denoted as the dimension of that category. Indeed, the dimension
  595 stands out as the most important part of the category path, such as <code>"Location"</code> for the
 596 category <code>"Location/Europe/France/Paris"</code>.
 597
 598 <h1 class="section"><a name="advanced">Advanced Faceted Examples</a></h1>
 599 <p>
 600 We now provide examples for more advanced facet indexing and search, such as
 601 drilling-down on facet values and multiple category lists.
 602
 603 <h2 class="subsection"><a name="drill_down">Drill-Down with Regular Facets</a></h2>
 604 <p>
 605 Drill-down allows users to focus on part of the results. Assume a commercial sport
 606 equipment site where a user is searching for a tennis racquet. The user issues the
  607 query <i>tennis racquet</i> and as a result is shown a page with 10 tennis racquets, by
  608 various providers, of various types and prices. In addition, the site UI shows the
  609 user a breakdown of all available racquets by price and make. The user now decides
  610 to focus on racquets made by <i>Head</i>, and will now be shown a new page, with 10
  611 Head racquets, and a new breakdown of the results into racquet types and prices.
  612 Additionally, the application can choose to display a new breakdown, by racquet
  613 weights. This step of moving from results (and facet statistics) of the entire (or larger)
  614 data set into a portion of it by specifying a certain category, is what we call <i>Drill-down</i>.
 615 We now show the required code lines for implementing such a drill-down.
 616 <pre class="prettyprint lang-java linenums">
 617 Query baseQuery = queryParser.parse("tennis racquet");
  618 Query q2 = DrillDown.query(baseQuery, new CategoryPath("make", "head"));
 619 </pre>
 620 <p>
 621 In line 1 the original user query is created and then used to obtain information on
 622 all tennis racquets.
 623 <p>
 624 In line 2, a specific category from within the facet results was selected by the user,
 625 and is hence used for creating the drill-down query.
 626 <p>
 627 Please refer to <code>SimpleSearcher.searchWithDrillDown()</code> for a more detailed
 628 code example performing drill-down.
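<p>
To make the flow above concrete, here is a toy in-memory model of drill-down. It does
not use the Lucene API: the <code>Doc</code> class and the methods <code>search</code>,
<code>drillDownOnMake</code> and <code>countByType</code> are hypothetical names invented
for this sketch, and matching is a plain substring test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: none of these names belong to the Lucene facet API.
final class DrillDownSketch {
    /** A document with free text and one category in each of two dimensions. */
    static final class Doc {
        final String text;
        final String make;
        final String type;

        Doc(String text, String make, String type) {
            this.text = text;
            this.make = make;
            this.type = type;
        }
    }

    /** Base query: keep the documents whose text contains the query string. */
    static List<Doc> search(List<Doc> docs, String query) {
        List<Doc> hits = new ArrayList<>();
        for (Doc d : docs) {
            if (d.text.contains(query)) {
                hits.add(d);
            }
        }
        return hits;
    }

    /** Drill-down: narrow the current hits to a single category of the "make" dimension. */
    static List<Doc> drillDownOnMake(List<Doc> hits, String make) {
        List<Doc> narrowed = new ArrayList<>();
        for (Doc d : hits) {
            if (d.make.equals(make)) {
                narrowed.add(d);
            }
        }
        return narrowed;
    }

    /** Facet counts for the "type" dimension over the given hits. */
    static Map<String, Integer> countByType(List<Doc> hits) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Doc d : hits) {
            Integer c = counts.get(d.type);
            counts.put(d.type, c == null ? 1 : c + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc("tennis racquet pro", "head", "control"),
                new Doc("tennis racquet lite", "head", "power"),
                new Doc("tennis racquet tour", "wilson", "control"),
                new Doc("squash racquet", "head", "power"));
        List<Doc> hits = search(docs, "tennis racquet");
        List<Doc> headOnly = drillDownOnMake(hits, "head");
        System.out.println("all tennis racquets by type: " + countByType(hits));
        System.out.println("Head racquets by type: " + countByType(headOnly));
    }
}
```

The point of the sketch is that the facet counts are recomputed over the narrowed
result set, which is what the user sees after drilling down.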
 629
 630 <h2 class="subsection"><a name="multi-category_list">Multiple Category Lists</a></h2>
 631 <p>
 632 By default, all category information is maintained in a single list. While this
 633 suits most applications, in some situations an application may wish to use multiple
 634 category lists. One example is when the distribution of some category values differs
 635 from that of other categories and calls for a different encoding, one that is more
 636 efficient for that distribution. Another example is when most facets are rarely used
 637 while a few are used very heavily: an application may opt to keep the latter in
 638 memory, and to keep the memory footprint low it helps to maintain only those
 639 heavily used facets in a separate category list.
 640 <p>
 641 First we define indexing parameters with multiple category lists:
 642 <pre class="prettyprint lang-java linenums">
 643 PerDimensionIndexingParams iParams = new PerDimensionIndexingParams();
 644 iParams.addCategoryListParams(new CategoryPath("Author"),
 645     new CategoryListParams(new Term("$RarelyUsed", "Facets")));
 646 iParams.addCategoryListParams(new CategoryPath("Language"),
 647     new CategoryListParams(new Term("$HeavilyUsed", "Ones")));
 648 </pre>
 649 <p>
 650 This will cause the Language categories to be maintained in one category list, and
 651 Author facets to be maintained in another category list. Note that any other category,
 652 if encountered, will still be maintained in the default category list.
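<p>
The lookup rule described above (choose the category list by dimension, fall back to
the default list otherwise) can be sketched in plain Java. This is an illustration
only, not the implementation of <code>PerDimensionIndexingParams</code>: the class name,
its methods, and the <code>"$DefaultList"</code> placeholder are assumptions made for
the sketch, and category paths are written as '/'-delimited strings.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a plain-Java model of per-dimension lookup with a default.
final class PerDimensionLookupSketch {
    // Placeholder name for the default category list; not taken from Lucene.
    static final String DEFAULT_LIST = "$DefaultList";

    private final Map<String, String> listByDimension = new HashMap<>();

    /** Registers a dedicated category list for one dimension. */
    void addCategoryList(String dimension, String listName) {
        listByDimension.put(dimension, listName);
    }

    /** The dimension is the first component of a '/'-delimited category path. */
    static String dimensionOf(String categoryPath) {
        int slash = categoryPath.indexOf('/');
        return slash < 0 ? categoryPath : categoryPath.substring(0, slash);
    }

    /** Returns the list registered for the path's dimension, or the default list. */
    String listFor(String categoryPath) {
        String list = listByDimension.get(dimensionOf(categoryPath));
        return list != null ? list : DEFAULT_LIST;
    }

    public static void main(String[] args) {
        PerDimensionLookupSketch params = new PerDimensionLookupSketch();
        params.addCategoryList("Author", "$RarelyUsed");
        params.addCategoryList("Language", "$HeavilyUsed");
        System.out.println(params.listFor("Author/Mark Twain"));
        System.out.println(params.listFor("Language/English"));
        System.out.println(params.listFor("Location/Europe/France/Paris"));
    }
}
```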
 653 <p>
 654 These non-default indexing parameters should now be used both at indexing and
 655 search time. As depicted below, at indexing time this is done when creating the
 656 category document builder, while at search time this is done when creating the search
 657 parameters. Other than that the faceted search code is unmodified.
 658 <pre class="prettyprint lang-java linenums">
 659 DocumentBuilder categoryDocBuilder = new CategoryDocumentBuilder(taxo, iParams);
 660 ...
 661 FacetSearchParams facetSearchParams = new FacetSearchParams(iParams);
 662 </pre>
 663 <p>
 664 A complete simple example can be found in package <code>org.apache.lucene.facet.example.multiCL</code>
 665 under the example code.
 666
 667 <h1 class="section"><a name="optimizations">Optimizations</a></h1>
 668 <p>
 669 Faceted search over a large collection of documents, with a large total number of
 670 facets and/or a large number of facets per document, is challenging performance-wise,
 671 in CPU, RAM, or both. A few ready-to-use optimizations exist to tackle
 672 these challenges.
 673
 674 <h2 class="subsection"><a name="sampling">Sampling</a></h2>
 675 <p>
 676 Facet sampling accumulates facets over a sample of the matching
 677 document set. In many cases, once the top facets are found over the sample set, exact
 678 accumulations are computed for those facets only, this time over the entire matching
 679 document set.
 680 <p>
 681 Two kinds of sampling exist: complete support and wrapping support. Complete
 682 support is provided by <code>SamplingAccumulator</code>, which extends
 683 <code>StandardFacetsAccumulator</code> and so has the benefit of automatically applying other
 684 optimizations, such as <a href="#complements">Complements</a>. Wrapping support is provided by
 685 <code>SamplingWrapper</code>, which can wrap any accumulator and as such provides more
 686 freedom for applications.
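<p>
The two-phase idea behind sampling (find the top facets on a sample, then count only
those exactly) can be sketched in plain Java as follows. This is a conceptual model and
not how <code>SamplingAccumulator</code> or <code>SamplingWrapper</code> are written:
all names are hypothetical, documents are modeled as lists of category strings, and the
sample is taken deterministically as every <code>step</code>-th document, which is a
simplification of whatever sampling strategy the library actually uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Conceptual model only: none of these names belong to the Lucene facet API.
final class SamplingSketch {
    /** Counts categories over docs; if 'only' is non-null, counts just those categories. */
    static Map<String, Integer> count(List<List<String>> docs, Set<String> only) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> doc : docs) {
            for (String category : doc) {
                if (only == null || only.contains(category)) {
                    Integer c = counts.get(category);
                    counts.put(category, c == null ? 1 : c + 1);
                }
            }
        }
        return counts;
    }

    /** Deterministic sample: every step-th matching document (step must be at least 1). */
    static List<List<String>> sample(List<List<String>> docs, int step) {
        List<List<String>> sampled = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += step) {
            sampled.add(docs.get(i));
        }
        return sampled;
    }

    /** Phase 1: pick the top k categories on the sample. Phase 2: count them exactly. */
    static Map<String, Integer> topKExact(List<List<String>> docs, int step, int k) {
        Map<String, Integer> sampledCounts = count(sample(docs, step), null);
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(sampledCounts.entrySet());
        // Highest sampled count first; ties broken by category name.
        entries.sort((x, y) -> {
            int byCount = y.getValue().compareTo(x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        Set<String> top = new LinkedHashSet<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return count(docs, top);
    }

    public static void main(String[] args) {
        List<List<String>> matchingDocs = Arrays.asList(
                Arrays.asList("a"), Arrays.asList("a", "b"), Arrays.asList("a"),
                Arrays.asList("b"), Arrays.asList("a", "c"), Arrays.asList("a"));
        System.out.println(topKExact(matchingDocs, 2, 1));
    }
}
```

Only the categories that survive the first phase are counted over the full matching
set, so the expensive exact pass touches far fewer counters.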
 687
 688 <h2 class="subsection"><a name="complements">Complements</a></h2>
 689 <p>
 690 When accumulating facets over a very large matching document set, possibly
 691 almost as large as the entire collection, it is possible to speed up accumulation by
 692 looking at the complement set of documents, and then obtaining the actual results by
 693 subtracting from the total results. It should be noted that this is available only for
 694 count requests, and that the first invocation that involves this optimization might take
 695 longer because the total counts have to be computed.
 696 <p>
 697 This optimization is applied automatically by <code>StandardFacetsAccumulator</code>.
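<p>
The arithmetic behind complements can be sketched in plain Java. This is a conceptual
model, not the code of <code>StandardFacetsAccumulator</code>: the names are
hypothetical, each document is modeled as an array of category ordinals, and the
"more than half of the collection" switch-over point is an arbitrary choice made for
the sketch.

```java
import java.util.Arrays;

// Conceptual model only: none of these names belong to the Lucene facet API.
final class ComplementSketch {
    /** Counts category ordinals over the documents whose match flag equals 'wanted'. */
    static int[] count(int[][] docOrdinals, boolean[] matches, boolean wanted, int numCategories) {
        int[] counts = new int[numCategories];
        for (int doc = 0; doc < docOrdinals.length; doc++) {
            if (matches[doc] == wanted) {
                for (int ordinal : docOrdinals[doc]) {
                    counts[ordinal]++;
                }
            }
        }
        return counts;
    }

    /**
     * Counts facets for the matching documents. When most documents match, it is
     * cheaper to count the non-matching ones and subtract from the precomputed totals.
     */
    static int[] countMatching(int[][] docOrdinals, boolean[] matches, int[] totalCounts) {
        int matching = 0;
        for (boolean m : matches) {
            if (m) {
                matching++;
            }
        }
        if (2 * matching <= matches.length) {
            // Small result set: count it directly.
            return count(docOrdinals, matches, true, totalCounts.length);
        }
        // Large result set: count the complement and subtract from the totals.
        int[] complement = count(docOrdinals, matches, false, totalCounts.length);
        int[] result = new int[totalCounts.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = totalCounts[i] - complement[i];
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] docOrdinals = {{0}, {0, 1}, {1}, {0, 2}};
        int[] totalCounts = {3, 2, 1}; // counts over the whole collection, computed once
        boolean[] matches = {true, true, true, false};
        System.out.println(Arrays.toString(countMatching(docOrdinals, matches, totalCounts)));
    }
}
```

The total counts must exist before the subtraction can happen, which is why the first
request that uses this optimization pays the extra cost of computing them.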
 698
 699 <h2 class="subsection"><a name="partitions">Partitions</a></h2>
 700 <p>
 701 Partitions are also discussed in the section about <a href="#indexing_params">Facet Indexing parameters</a>.
 702 <p>
 703 Facets are internally accumulated by first accumulating all facets and later on
 704 extracting the results for the requested facets. During this process, accumulation
 705 arrays the size of the taxonomy are maintained. For a very large taxonomy, with
 706 multiple simultaneous faceted search operations, this might lead to an excessive memory
 707 footprint. Partitioning the faceted information reduces memory usage by
 708 maintaining the category lists in several partitions and processing one partition at
 709 a time. This is done automatically by <code>StandardFacetsAccumulator</code>. However, the
 710 default partition size is <code>Integer.MAX_VALUE</code>, which in practice means a single
 711 partition, i.e. no partitioning at all.
 712 <p>
 713 The decision to override this behavior and use multiple partitions must be made at
 714 indexing time. Once the index has been created and already contains category lists, it is
 715 too late to change it.
 716 <p>
 717 See <code>FacetIndexingParams.getPartitionSize()</code> for the API that controls this
 718 default behavior.
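<p>
To illustrate why partitions bound memory, the following plain-Java sketch finds the
most frequent category while holding a counting array of only one partition at a time.
It is a conceptual model rather than the accumulator's actual code: the class and
method names are hypothetical, and documents are modeled as arrays of category
ordinals in the range 0 to <code>taxonomySize - 1</code>.

```java
import java.util.Arrays;

// Conceptual model only: none of these names belong to the Lucene facet API.
final class PartitionSketch {
    /**
     * Returns the ordinal with the highest count (or -1 if nothing was counted),
     * using a counting array of at most partitionSize entries.
     */
    static int topOrdinal(int[][] docOrdinals, int taxonomySize, int partitionSize) {
        int bestOrdinal = -1;
        int bestCount = 0;
        int[] counts = new int[Math.min(partitionSize, taxonomySize)];
        // 'start' is a long so that start + partitionSize cannot overflow.
        for (long start = 0; start < taxonomySize; start += partitionSize) {
            int lo = (int) start;
            int hi = (int) Math.min(start + partitionSize, (long) taxonomySize);
            Arrays.fill(counts, 0);
            // Accumulate only the ordinals that fall inside the current partition.
            for (int[] doc : docOrdinals) {
                for (int ordinal : doc) {
                    if (ordinal >= lo && ordinal < hi) {
                        counts[ordinal - lo]++;
                    }
                }
            }
            // Merge this partition's best candidate into the running result.
            for (int i = 0; i < hi - lo; i++) {
                if (counts[i] > bestCount) {
                    bestCount = counts[i];
                    bestOrdinal = lo + i;
                }
            }
        }
        return bestOrdinal;
    }

    public static void main(String[] args) {
        int[][] docOrdinals = {{0, 5}, {5}, {3, 5}, {3}};
        System.out.println(topOrdinal(docOrdinals, 6, 2)); // three partitions of size 2
        System.out.println(topOrdinal(docOrdinals, 6, 6)); // a single partition
    }
}
```

Both calls give the same answer; the partitioned call trades extra passes over the
documents for a smaller counting array.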
 719
 720 <h1 class="section"><a name="concurrent_indexing_search">Concurrent Indexing and Search</a></h1>
 721 <p>
 722 Sometimes, indexing is done once, and when the index is fully prepared, searching
 723 starts. However, in most real applications indexing is <i>incremental</i> (new data comes in
 724 once in a while, and needs to be indexed), and indexing often needs to happen while
 725 searching is continuing at full steam.
 726 <p>
 727 Luckily, Lucene supports multiprocessing - one process writing to an index while
 728 another is reading from it. One of the key insights behind how Lucene allows multiprocessing
 729 is <i>Point In Time</i> semantics. The idea is that when an <code>IndexReader</code> is opened,
 730 it gets a view of the index at the <i>point in time</i> it was opened. If an <code>IndexWriter</code>
 731 in a different process or thread modifies the index, the reader does not know about it until a new
 732 <code>IndexReader</code> is opened (or the reopen() method of an existing <code>IndexReader</code> is called).
 733 <p>
 734 In faceted search, we complicate things somewhat by adding a second index - the
 735 taxonomy index. The taxonomy API also follows point-in-time semantics, but this is
 736 not quite enough. Some attention must be paid by the user to keep those two indexes
 737 consistently in sync:
 738 <p>
 739 The main index refers to category numbers defined in the taxonomy index.
 740 Therefore, it is important that we open the <code>TaxonomyReader</code> <i>after</i> opening the
 741 <code>IndexReader</code>. Moreover, every time an <code>IndexReader</code> is reopened with reopen(), the
 742 <code>TaxonomyReader</code> needs to be refreshed with refresh() as well.
 743 <p>
 744 One more caution: whenever the application deems it has written
 745 enough information to be worth a commit, it must <b>first</b> call commit() for the
 746 <code>TaxonomyWriter</code> and only <b>after</b> that call commit() for the <code>IndexWriter</code>.
 747 Closing the indices should also be done in this order - <b>first</b> close the taxonomy, and only <b>after</b>
 748 that close the index.
 749 <p>
 750 To summarize, if you're writing a faceted search application where searching and
 751 indexing happen concurrently, please follow these guidelines (in addition to the usual
 752 guidelines on how to use Lucene correctly in the concurrent case):
 753 <ul>
 754 <li>In the indexing process:
 755 <ol>
 756 <li>Before a writer commit()s the IndexWriter, it must commit() the
 757 TaxonomyWriter. Nothing should be added to the index between these
 758 two commit()s.</li>
 759 <li>Similarly, before a writer close()s the IndexWriter, it must close() the
 760 TaxonomyWriter.</li>
 761 </ol></li>
 762 <li>In the searching process:
 763 <ol>
 764 <li>Open the IndexReader first, and then the TaxonomyReader.</li>
 765 <li>After a reopen() on the IndexReader, refresh() the TaxonomyReader.
 766 No search should be performed on the new IndexReader until refresh()
 767 has finished.</li>
 768 </ol></li>
 769 </ul>
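<p>
The writer-side ordering in the guidelines above can be captured in a small helper.
The sketch below deliberately avoids the Lucene classes: <code>Writer</code> is a
hypothetical stand-in interface for the commit() and close() operations of
<code>TaxonomyWriter</code> and <code>IndexWriter</code>, and <code>Recording</code> is
a test double that only logs the order of calls.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: 'Writer' is a stand-in interface, not a Lucene type.
final class CommitOrderSketch {
    interface Writer {
        void commit();
        void close();
    }

    /** Commits the taxonomy first and the main index second. */
    static void commitBoth(Writer taxonomyWriter, Writer indexWriter) {
        taxonomyWriter.commit();
        // Nothing should be added to the index between these two commits.
        indexWriter.commit();
    }

    /** Closes the taxonomy first and the main index second. */
    static void closeBoth(Writer taxonomyWriter, Writer indexWriter) {
        taxonomyWriter.close();
        indexWriter.close();
    }

    /** Test double that records which operation ran on which writer. */
    static final class Recording implements Writer {
        private final String name;
        private final List<String> log;

        Recording(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void commit() {
            log.add(name + ".commit");
        }

        @Override
        public void close() {
            log.add(name + ".close");
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        Writer taxonomy = new Recording("taxonomy", log);
        Writer index = new Recording("index", log);
        commitBoth(taxonomy, index);
        closeBoth(taxonomy, index);
        System.out.println(log);
    }
}
```

Routing every commit and close through one helper of this kind keeps the ordering rule
in a single place instead of relying on each call site to remember it.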
 770 <p>
 771 Note that the above discussion assumes that the underlying file-system on which
 772 the index and the taxonomy are stored respects ordering: if index A is written before
 773 index B, then any reader finding a modified index B will also see a modified index A.
 774 <p>
 775 <b>Note:</b> <code>TaxonomyReader</code>'s refresh() is simpler than <code>IndexReader</code>'s reopen().
 776 While the latter keeps both the old and new reader open, the former keeps only the new reader. The reason
 777 is that a new <code>IndexReader</code> might have modified old information (old documents deleted, for
 778 example) so a thread which is in the middle of a search needs to continue using the old information. With
 779 <code>TaxonomyReader</code>, however, we are guaranteed that existing categories are never deleted or modified -
 780 the only thing that can happen is that new categories are added. Since search threads do not care if new categories
 781 are added in the middle of a search, there is no reason to keep around the old object, and the new one suffices.
 782 <br><b>However</b>, if the taxonomy index was recreated since the <code>TaxonomyReader</code> was opened or
 783 refreshed, this assumption (that categories are forever) no longer holds, and <code>refresh()</code> will
 784 throw an <code>InconsistentTaxonomyException</code>, guiding the application to open
 785 a new <code>TaxonomyReader</code> for up-to-date taxonomy data. (The old one can
 786 be closed as soon as it is no longer in use.)
 787
 788
 789 </body>
 790 </html>