<p>This module enables search result grouping with Lucene, where hits
with the same value in the specified single-valued group field are
grouped together. For example, if you group by the <code>author</code>
field, then all documents with the same value in the <code>author</code>
field fall into a single group.</p>
<p>Grouping requires a number of inputs:</p>

<ul>

  <li> <code>groupField</code>: the field used for grouping.
    For example, if you use the <code>author</code> field then each
    group has all books by the same author. Documents that don't
    have this field are grouped under a single group with
    a <code>null</code> group value.

  <li> <code>groupSort</code>: how the groups are sorted. For sorting
    purposes, each group is "represented" by the highest-sorted
    document according to the <code>groupSort</code> within it. For
    example, if you specify "price" (ascending) then the first group
    is the one with the lowest-priced book within it. Or if you
    specify relevance group sort, then the first group is the one
    containing the highest-scoring book.

  <li> <code>topNGroups</code>: how many top groups to keep. For
    example, 10 means the top 10 groups are computed.

  <li> <code>groupOffset</code>: which "slice" of top groups you want to
    retrieve. For example, 3 means you'll get 7 groups back
    (assuming <code>topNGroups</code> is 10). This is useful for
    paging, where you might show 5 groups per page.

  <li> <code>withinGroupSort</code>: how the documents within each group
    are sorted. This can be different from the group sort.

  <li> <code>maxDocsPerGroup</code>: how many top documents within each
    group are kept.

  <li> <code>withinGroupOffset</code>: which "slice" of top
    documents you want to retrieve from each group.

</ul>
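<p>The offset inputs combine with the "top N" inputs by simple addition: a
collector must gather <code>offset + topN</code> entries so the requested slice
can be cut out afterwards. A self-contained sketch of that arithmetic (the
helper names are illustrative only, not part of this module's API):</p>

```java
// Sketch of the paging arithmetic behind the grouping inputs above.
// Helper names are illustrative, not part of the Lucene API.
public class GroupPaging {

  /** Number of groups the first pass must collect to serve this slice. */
  static int firstPassTopN(int groupOffset, int topNGroups) {
    return groupOffset + topNGroups;
  }

  /** Number of documents per group the second pass must collect. */
  static int secondPassTopN(int withinGroupOffset, int maxDocsPerGroup) {
    return withinGroupOffset + maxDocsPerGroup;
  }

  public static void main(String[] args) {
    // Page 3 (0-based) showing 5 groups per page: skip 15 groups, keep 5 more.
    int groupOffset = 3 * 5;
    System.out.println(firstPassTopN(groupOffset, 5));  // 20
    System.out.println(secondPassTopN(2, 10));          // 12
  }
}
```

These are exactly the values passed to the collector constructors in the usage
examples below, e.g. <code>groupOffset+topNGroups</code> for the first pass.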
<p>The implementation is two-pass: the first pass ({@link
org.apache.lucene.search.grouping.TermFirstPassGroupingCollector})
gathers the top groups, and the second pass ({@link
org.apache.lucene.search.grouping.TermSecondPassGroupingCollector})
gathers documents within those groups. If the search is costly to
run you may want to use the {@link
org.apache.lucene.search.CachingCollector} class, which
caches hits and can (quickly) replay them for the second pass. This
way you only run the query once, but you pay a RAM cost to (briefly)
hold all hits. Results are returned as a {@link
org.apache.lucene.search.grouping.TopGroups} instance.</p>
<p>This module abstracts away what defines a group and how it is
collected. All grouping collectors are abstract and currently have
term-based implementations. One can implement collectors that, for
example, group on multiple fields.</p>
<p>Known limitations:</p>

<ul>

  <li> For the two-pass grouping collector, the group field must be a
    single-valued indexed field.
    {@link org.apache.lucene.search.FieldCache} is used to load the
    {@link org.apache.lucene.search.FieldCache.StringIndex} for this field.

  <li> Although Solr supports grouping by function, and this module
    abstracts what a group is, there are currently only
    implementations for grouping based on terms.

  <li> Sharding is not directly supported, though it is not too
    difficult if you can merge the top groups and top documents per
    shard yourself.

</ul>
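<p>A minimal, self-contained sketch of the per-shard merge the last limitation
alludes to: each shard reports its top groups with a sort value (here, the best
score per group), and the merger keeps the global top N. The class and method
names are illustrative, not part of this module:</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeShardGroups {

  /** Merge per-shard (groupKey -> bestScore) maps, keeping the top N keys
      by their best score across all shards. Illustrative only. */
  static List<String> mergeTopGroups(List<Map<String, Double>> shards, int topN) {
    Map<String, Double> best = new HashMap<>();
    for (Map<String, Double> shard : shards) {
      for (Map.Entry<String, Double> e : shard.entrySet()) {
        // A group seen on several shards keeps its highest score:
        best.merge(e.getKey(), e.getValue(), Math::max);
      }
    }
    List<String> keys = new ArrayList<>(best.keySet());
    keys.sort((a, b) -> Double.compare(best.get(b), best.get(a)));
    return keys.subList(0, Math.min(topN, keys.size()));
  }

  public static void main(String[] args) {
    List<Map<String, Double>> shards = List.of(
        Map.of("king", 0.9, "tolkien", 0.4),
        Map.of("tolkien", 0.7, "rowling", 0.6));
    System.out.println(mergeTopGroups(shards, 2)); // [king, tolkien]
  }
}
```

A real sharded setup would also merge the top documents within each surviving
group, using the same offset/topN arithmetic as a single index.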
<p>Typical usage for the generic two-pass collector looks like this
(using the {@link org.apache.lucene.search.CachingCollector}):</p>

<pre class="prettyprint">
  TermFirstPassGroupingCollector c1 = new TermFirstPassGroupingCollector("author", groupSort, groupOffset+topNGroups);

  boolean cacheScores = true;
  double maxCacheRAMMB = 4.0;
  CachingCollector cachedCollector = CachingCollector.create(c1, cacheScores, maxCacheRAMMB);
  s.search(new TermQuery(new Term("content", searchTerm)), cachedCollector);

  boolean fillFields = true;
  Collection<SearchGroup<BytesRef>> topGroups = c1.getTopGroups(groupOffset, fillFields);

  if (topGroups == null) {
    // No groups matched the search
    return;
  }

  boolean getScores = true;
  boolean getMaxScores = true;
  TermSecondPassGroupingCollector c2 = new TermSecondPassGroupingCollector("author", topGroups, groupSort, docSort, docOffset+docsPerGroup, getScores, getMaxScores, fillFields);

  // Optionally compute the total group count. MultiCollector.wrap returns a
  // plain Collector, so keep c2 in its own variable for getTopGroups below:
  Collector secondPassCollector = c2;
  TermAllGroupsCollector allGroupsCollector = null;
  if (requiredTotalGroupCount) {
    allGroupsCollector = new TermAllGroupsCollector("author");
    secondPassCollector = MultiCollector.wrap(c2, allGroupsCollector);
  }

  if (cachedCollector.isCached()) {
    // Cache fit within maxCacheRAMMB, so we can replay it:
    cachedCollector.replay(secondPassCollector);
  } else {
    // Cache was too large; must re-execute query:
    s.search(new TermQuery(new Term("content", searchTerm)), secondPassCollector);
  }

  TopGroups<BytesRef> groupsResult = c2.getTopGroups(docOffset);
  if (requiredTotalGroupCount) {
    groupsResult = new TopGroups<BytesRef>(groupsResult, allGroupsCollector.getGroupCount());
  }

  // Render groupsResult...
</pre>
<p>To use the single-pass <code>BlockGroupingCollector</code>,
first, at indexing time, you must ensure all docs in each group
are added as a block, and you have some way to find the last
document of each group. One simple way to do this is to add a
marker binary field:</p>
<pre class="prettyprint">
  // Create Documents from your source:
  List<Document> oneGroup = ...;

  Field groupEndField = new Field("groupEnd", "x", Field.Store.NO, Field.Index.NOT_ANALYZED);
  groupEndField.setOmitTermFreqAndPositions(true);
  groupEndField.setOmitNorms(true);
  oneGroup.get(oneGroup.size()-1).add(groupEndField);

  // You can also use writer.updateDocuments(); just be sure you
  // replace an entire previous doc block with this new one. For
  // example, each group could have a "groupID" field, with the same
  // value for all docs in this group:
  writer.addDocuments(oneGroup);
</pre>
<p>Then, at search time, do this up front:</p>
<pre class="prettyprint">
  // Set this once in your app and save it away for reuse across all queries:
  Filter groupEndDocs = new CachingWrapperFilter(new QueryWrapperFilter(new TermQuery(new Term("groupEnd", "x"))));
</pre>
<p>Finally, do this per search:</p>
<pre class="prettyprint">
  BlockGroupingCollector c = new BlockGroupingCollector(groupSort, groupOffset+topNGroups, needsScores, groupEndDocs);
  s.search(new TermQuery(new Term("content", searchTerm)), c);
  TopGroups groupsResult = c.getTopGroups(withinGroupSort, groupOffset, docOffset, docOffset+docsPerGroup, fillFields);

  // Render groupsResult...
</pre>
<p>Note that the <code>groupValue</code> of each <code>GroupDocs</code>
will be <code>null</code>, so if you need to present this value you'll
have to separately retrieve it (for example using stored
fields, <code>FieldCache</code>, etc.).</p>
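<p>That separate retrieval boils down to a per-document lookup keyed on the top
doc of each group. The sketch below simulates it with a plain array standing in
for a <code>FieldCache</code> or stored-field lookup; the names and data are
illustrative, not the Lucene API:</p>

```java
import java.util.ArrayList;
import java.util.List;

public class ResolveGroupValues {

  /** Look up the group value for each group's top doc. The array stands in
      for a FieldCache / stored-field lookup (illustrative only). */
  static List<String> resolve(String[] valueByDoc, int[] topDocPerGroup) {
    List<String> values = new ArrayList<>();
    for (int doc : topDocPerGroup) {
      values.add(valueByDoc[doc]);
    }
    return values;
  }

  public static void main(String[] args) {
    // Pretend per-doc author values, as FieldCache would expose them:
    String[] authorByDoc = {"king", "king", "rowling", "tolkien", "tolkien"};
    // Pretend the top doc ID of each returned GroupDocs:
    int[] topDocPerGroup = {0, 2, 3};
    System.out.println(resolve(authorByDoc, topDocPerGroup)); // [king, rowling, tolkien]
  }
}
```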
<p>Another collector is the <code>TermAllGroupHeadsCollector</code>, which can be used to retrieve the most
relevant document of each group, also known as the group heads. This can be useful when you want to compute
grouping-based facets or statistics on the complete query result. The collector can be executed during the
first or second phase:</p>
<pre class="prettyprint">
  AbstractAllGroupHeadsCollector c = TermAllGroupHeadsCollector.create(groupField, sortWithinGroup);
  s.search(new TermQuery(new Term("content", searchTerm)), c);
  // Return all group heads as an int array:
  int[] groupHeadsArray = c.retrieveGroupHeads();
  // Return all group heads as a FixedBitSet:
  int maxDoc = s.maxDoc();
  FixedBitSet groupHeadsBitSet = c.retrieveGroupHeads(maxDoc);
</pre>
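<p>The bit set form is convenient when a later pass needs cheap membership
tests, for example counting a facet only over group heads. A self-contained
sketch, with <code>java.util.BitSet</code> standing in for Lucene's
<code>FixedBitSet</code> (the helper is illustrative, not part of this module):</p>

```java
import java.util.BitSet;

public class GroupHeadFacets {

  /** Count how many of the given doc IDs are group heads (illustrative only). */
  static int countHeads(BitSet groupHeads, int[] docs) {
    int count = 0;
    for (int doc : docs) {
      if (groupHeads.get(doc)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Pretend group heads, as retrieveGroupHeads(maxDoc) would return them:
    BitSet groupHeads = new BitSet(8);
    groupHeads.set(0);
    groupHeads.set(3);
    groupHeads.set(6);
    // Of the docs matching some facet value, how many are group heads?
    System.out.println(countHeads(groupHeads, new int[]{0, 1, 3, 5})); // 2
  }
}
```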