
This module enables search result grouping with Lucene, where hits with the same value in the specified single-valued group field are grouped together. For example, if you group by the author field, then all documents with the same value in the author field fall into a single group.
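For instance, in the following hypothetical sketch (the field values and the open IndexWriter named writer are placeholders, not part of this module), grouping on the author field would produce two groups, one holding the first two documents and one holding the third:

  // Hypothetical sketch: doc1 and doc2 share author "lisa" and fall into
  // one group; doc3 has author "frank" and forms a second group.
  Document doc1 = new Document();
  doc1.add(new Field("author", "lisa", Field.Store.YES, Field.Index.NOT_ANALYZED));
  Document doc2 = new Document();
  doc2.add(new Field("author", "lisa", Field.Store.YES, Field.Index.NOT_ANALYZED));
  Document doc3 = new Document();
  doc3.add(new Field("author", "frank", Field.Store.YES, Field.Index.NOT_ANALYZED));
  writer.addDocument(doc1);
  writer.addDocument(doc2);
  writer.addDocument(doc3);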


Grouping requires a number of inputs:

  - groupField: the single-valued field to group on. For example, if you group
    on the author field then each group contains all books by one author.
    Documents without the field are grouped under a single group with a null
    group value.
  - groupSort: how the groups are sorted. For sorting purposes each group is
    "represented" by the highest-sorted document within it according to this
    sort.
  - topNGroups: how many top groups to keep, e.g. 10.
  - groupOffset: which "slice" of the top groups to retrieve; useful for paging
    through the groups.
  - docSort (the within-group sort): how the documents within each group are
    sorted. This can differ from the group sort.
  - docsPerGroup: how many top documents to keep within each group.
  - docOffset: which "slice" of the top documents to retrieve within each group.
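The code examples below share a number of free variables. The following setup is only a hypothetical sketch to make those examples concrete; the names (s, reader, groupSort, searchTerm, and so on) are placeholders used by this documentation, not part of the module's API:

  // Hypothetical setup assumed by the examples below; "reader" is an
  // already-open IndexReader:
  IndexSearcher s = new IndexSearcher(reader);
  Sort groupSort = Sort.RELEVANCE;          // how the groups are sorted
  Sort docSort = Sort.RELEVANCE;            // within-group sort (two-pass example)
  Sort withinGroupSort = Sort.RELEVANCE;    // within-group sort (block example)
  int topNGroups = 10;                      // keep the 10 best groups
  int groupOffset = 0;                      // first "slice" of groups (no paging)
  int docsPerGroup = 5;                     // keep 5 documents per group
  int docOffset = 0;                        // first "slice" of documents per group
  boolean needsScores = true;               // must the collector compute scores?
  boolean requiredTotalGroupCount = false;  // also count all matching groups?
  String searchTerm = "lucene";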

The implementation is two-pass: the first pass ({@link org.apache.lucene.search.grouping.TermFirstPassGroupingCollector}) gathers the top groups, and the second pass ({@link org.apache.lucene.search.grouping.TermSecondPassGroupingCollector}) gathers documents within those groups. If the search is costly to run you may want to use the {@link org.apache.lucene.search.CachingCollector} class, which caches hits and can (quickly) replay them for the second pass. This way you only run the query once, but you pay a RAM cost to (briefly) hold all hits. Results are returned as a {@link org.apache.lucene.search.grouping.TopGroups} instance.


This module abstracts away what defines a group and how groups are collected. All grouping collectors are abstract; currently only term-based implementations are provided. One could, for example, implement collectors that group on multiple fields.
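Until such an implementation exists, one possible workaround (a sketch only; the field names and variables are hypothetical) is to index an extra single-valued field that concatenates the values you want to group on, and then group on that combined field with the term-based collectors:

  // Hypothetical workaround for multi-field grouping: flatten two fields
  // into one single-valued key at indexing time, then group on "author_year":
  doc.add(new Field("author", author, Field.Store.YES, Field.Index.NOT_ANALYZED));
  doc.add(new Field("year", year, Field.Store.YES, Field.Index.NOT_ANALYZED));
  doc.add(new Field("author_year", author + "|" + year, Field.Store.NO, Field.Index.NOT_ANALYZED));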


Known limitations:

  - For the two-pass grouping collectors, the group field must be a
    single-valued indexed field; {@link org.apache.lucene.search.FieldCache}
    is used to load the group values for this field.
  - Grouping implementations are currently term-based only; grouping by, for
    example, a function or an arbitrary query is not supported.
  - Sharding is not directly supported; however, if you can merge the top
    groups and the top documents per group yourself, it is not too difficult
    to implement on top of this module.

Typical usage for the generic two-pass collector looks like this (using the {@link org.apache.lucene.search.CachingCollector}):

  TermFirstPassGroupingCollector c1 = new TermFirstPassGroupingCollector("author", groupSort, groupOffset+topNGroups);

  boolean cacheScores = true;
  double maxCacheRAMMB = 4.0;
  CachingCollector cachedCollector = CachingCollector.create(c1, cacheScores, maxCacheRAMMB);
  s.search(new TermQuery(new Term("content", searchTerm)), cachedCollector);

  boolean fillFields = true;
  Collection<SearchGroup<BytesRef>> topGroups = c1.getTopGroups(groupOffset, fillFields);

  if (topGroups == null) {
    // No groups matched
    return;
  }

  boolean getScores = true;
  boolean getMaxScores = true;
  TermSecondPassGroupingCollector c2 = new TermSecondPassGroupingCollector("author", topGroups, groupSort, docSort, docOffset+docsPerGroup, getScores, getMaxScores, fillFields);

  // Optionally compute the total group count as well; MultiCollector
  // feeds the same hits to both collectors:
  TermAllGroupsCollector allGroupsCollector = null;
  Collector secondPassCollector = c2;
  if (requiredTotalGroupCount) {
    allGroupsCollector = new TermAllGroupsCollector("author");
    secondPassCollector = MultiCollector.wrap(c2, allGroupsCollector);
  }

  if (cachedCollector.isCached()) {
    // Cache fit within maxCacheRAMMB, so we can replay it:
    cachedCollector.replay(secondPassCollector);
  } else {
    // Cache was too large; must re-execute query:
    s.search(new TermQuery(new Term("content", searchTerm)), secondPassCollector);
  }

  TopGroups<BytesRef> groupsResult = c2.getTopGroups(docOffset);
  if (requiredTotalGroupCount) {
    groupsResult = new TopGroups<BytesRef>(groupsResult, allGroupsCollector.getGroupCount());
  }

  // Render groupsResult...

To use the single-pass BlockGroupingCollector, you must first ensure, at indexing time, that all documents in each group are added as a block, and that you have some way to identify the last document of each group. One simple way to do this is to add a marker field:

  // Create Documents from your source:
  List<Document> oneGroup = ...;

  Field groupEndField = new Field("groupEnd", "x", Field.Store.NO, Field.Index.NOT_ANALYZED);
  groupEndField.setOmitTermFreqAndPositions(true);
  groupEndField.setOmitNorms(true);
  oneGroup.get(oneGroup.size()-1).add(groupEndField);

  // You can also use writer.updateDocuments(); just be sure you
  // replace an entire previous doc block with this new one.  For
  // example, each group could have a "groupID" field, with the same
  // value for all docs in this group:
  writer.addDocuments(oneGroup);
Then, at search time, do this up front:
  // Set this once in your app & save away for reusing across all queries:
  Filter groupEndDocs = new CachingWrapperFilter(new QueryWrapperFilter(new TermQuery(new Term("groupEnd", "x"))));
Finally, do this per search:
  // Per search:
  BlockGroupingCollector c = new BlockGroupingCollector(groupSort, groupOffset+topNGroups, needsScores, groupEndDocs);
  s.search(new TermQuery(new Term("content", searchTerm)), c);
  TopGroups groupsResult = c.getTopGroups(withinGroupSort, groupOffset, docOffset, docOffset+docsPerGroup, fillFields);

  // Render groupsResult...
Note that the groupValue of each GroupDocs will be null, so if you need to present this value you'll have to retrieve it separately (for example using stored fields, FieldCache, etc.).
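For example, one possible way to do this (a sketch only; it assumes the group field is author, a top-level reader named reader, and at least one document in each group's slice) is to look the value up in the FieldCache using each group's top document:

  // Sketch: recover each group's value via the FieldCache, using the
  // group's top document:
  String[] authors = FieldCache.DEFAULT.getStrings(reader, "author");
  for (GroupDocs group : groupsResult.groups) {
    String author = authors[group.scoreDocs[0].doc];
    // Render author as this group's value...
  }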

Another collector is TermAllGroupHeadsCollector, which retrieves the most relevant document of each group, also known as the group head. This can be useful when you want to compute grouping-based facets or statistics over the complete query result. The collector can be executed during either the first or the second pass:

  AbstractAllGroupHeadsCollector c = TermAllGroupHeadsCollector.create(groupField, sortWithinGroup);
  s.search(new TermQuery(new Term("content", searchTerm)), c);
  // Return all group heads as an int array:
  int[] groupHeadsArray = c.retrieveGroupHeads();
  // Or return all group heads as a FixedBitSet:
  int maxDoc = s.maxDoc();
  FixedBitSet groupHeadsBitSet = c.retrieveGroupHeads(maxDoc);