<html>
<body>
<p>This module enables search result grouping with Lucene, where hits
with the same value in the specified single-valued group field are
grouped together.  For example, if you group by the <code>author</code>
field, then all documents with the same value in the <code>author</code>
field fall into a single group.</p>
<p>Grouping requires a number of inputs:</p>

<ul>
  <li> <code>groupField</code>: this is the field used for grouping.
    For example, if you use the <code>author</code> field then each
    group has all books by the same author.  Documents that don't
    have this field are grouped under a single group with
    a <code>null</code> group value.

  <li> <code>groupSort</code>: how the groups are sorted.  For sorting
    purposes, each group is "represented" by the highest-sorted
    document according to the <code>groupSort</code> within it.  For
    example, if you specify "price" (ascending) then the first group
    is the one with the lowest-priced book within it.  Or if you
    specify relevance group sort, then the first group is the one
    containing the highest-scoring book.

  <li> <code>topNGroups</code>: how many top groups to keep.  For
    example, 10 means the top 10 groups are computed.

  <li> <code>groupOffset</code>: which "slice" of top groups you want to
    retrieve.  For example, 3 means you'll get 7 groups back
    (assuming <code>topNGroups</code> is 10).  This is useful for
    paging, where you might show 5 groups per page.

  <li> <code>withinGroupSort</code>: how the documents within each group
    are sorted.  This can be different from the group sort.

  <li> <code>maxDocsPerGroup</code>: how many top documents within each
    group to keep.

  <li> <code>withinGroupOffset</code>: which "slice" of top
    documents you want to retrieve from each group.
</ul>
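<p>To illustrate how these inputs interact, here is a plain-Java sketch
(this is not the Lucene API; the <code>Doc</code> record and
<code>topGroups</code> method are hypothetical, made up for the example).
Each group is represented by its best document under the group sort, the
groups are ordered by that representative, and the slice from
<code>groupOffset</code> to <code>topNGroups</code> is returned:</p>

```java
import java.util.*;

// Hypothetical in-memory sketch (not part of the Lucene API) of the
// grouping semantics described above.
public class GroupingSketch {

  // A document: its group value and an ascending sort key (e.g. price).
  record Doc(String group, double sortKey) {}

  static List<String> topGroups(List<Doc> docs, int groupOffset, int topNGroups) {
    // Best (lowest) sort key seen per group: the group's "representative".
    Map<String, Double> best = new HashMap<>();
    for (Doc d : docs) {
      best.merge(d.group(), d.sortKey(), Math::min);
    }
    // Sort groups by their representative document, ascending.
    List<String> sorted = new ArrayList<>(best.keySet());
    sorted.sort(Comparator.comparingDouble(best::get));
    // Keep the requested slice: groupOffset..topNGroups.
    return sorted.subList(Math.min(groupOffset, sorted.size()),
                          Math.min(topNGroups, sorted.size()));
  }
}
```

<p>With <code>topNGroups</code> of 3 and <code>groupOffset</code> of 1,
the sketch returns 2 groups, matching the slicing arithmetic described
for <code>groupOffset</code> above.</p>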
<p>The implementation is two-pass: the first pass ({@link
  org.apache.lucene.search.grouping.TermFirstPassGroupingCollector})
  gathers the top groups, and the second pass ({@link
  org.apache.lucene.search.grouping.TermSecondPassGroupingCollector})
  gathers documents within those groups.  If the search is costly to
  run you may want to use the {@link
  org.apache.lucene.search.CachingCollector} class, which
  caches hits and can (quickly) replay them for the second pass.  This
  way you only run the query once, but you pay a RAM cost to (briefly)
  hold all hits.  Results are returned as a {@link
  org.apache.lucene.search.grouping.TopGroups} instance.</p>
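<p>The cache-and-replay trade-off can be sketched in plain Java (the
<code>CachingConsumer</code> class below is hypothetical, not the Lucene
<code>CachingCollector</code>): the first pass records each hit as it
sees it, so the second pass can be fed the same hits without re-running
the query, at the cost of holding them in RAM:</p>

```java
import java.util.*;
import java.util.function.IntConsumer;

// Hypothetical illustration of cache-and-replay (not the Lucene classes).
public class ReplaySketch {

  // Forwards doc IDs to a first-pass consumer while remembering them,
  // so they can later be replayed to a second-pass consumer.
  static class CachingConsumer implements IntConsumer {
    private final IntConsumer firstPass;
    private final List<Integer> cached = new ArrayList<>();

    CachingConsumer(IntConsumer firstPass) { this.firstPass = firstPass; }

    @Override public void accept(int docID) {
      firstPass.accept(docID);  // the first pass sees the hit immediately
      cached.add(docID);        // ...and we remember it (the RAM cost)
    }

    void replay(IntConsumer secondPass) {
      for (int docID : cached) secondPass.accept(docID);
    }
  }
}
```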
<p>This module abstracts away what defines a group and how it is
  collected.  All grouping collectors are abstract and currently have
  term-based implementations.  One can implement collectors that, for
  example, group on multiple fields.</p>
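<p>For instance, a custom collector grouping on two fields could use a
composite group value.  A minimal sketch (hypothetical class, not part
of this module) showing the <code>equals</code>/<code>hashCode</code>
contract such a key needs so it can index per-group state in a hash map:</p>

```java
import java.util.*;

// Hypothetical composite group value for grouping on two fields at once,
// e.g. (author, year).  Not part of the Lucene API.
public class PairGroupKey {
  private final String author;
  private final int year;

  public PairGroupKey(String author, int year) {
    this.author = author;
    this.year = year;
  }

  // equals/hashCode make equal field pairs land in the same hash bucket,
  // which is what lets a collector accumulate per-group state in a map.
  @Override public boolean equals(Object o) {
    if (!(o instanceof PairGroupKey)) return false;
    PairGroupKey other = (PairGroupKey) o;
    return year == other.year && Objects.equals(author, other.author);
  }

  @Override public int hashCode() {
    return Objects.hash(author, year);
  }
}
```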
<p>Known limitations:</p>
<ul>
  <li> For the two-pass grouping collectors, the group field must be a
    single-valued indexed field; {@link org.apache.lucene.search.FieldCache}
    is used to load the group values.
  <li> Sharding is not directly supported, though it is not too hard to
    merge the top groups and top documents per group yourself.
</ul>
<p>Typical usage for the generic two-pass collector looks like this
  (using the {@link org.apache.lucene.search.CachingCollector}):</p>
<pre class="prettyprint">
  TermFirstPassGroupingCollector c1 = new TermFirstPassGroupingCollector("author", groupSort, groupOffset+topNGroups);

  boolean cacheScores = true;
  double maxCacheRAMMB = 4.0;
  CachingCollector cachedCollector = CachingCollector.create(c1, cacheScores, maxCacheRAMMB);
  s.search(new TermQuery(new Term("content", searchTerm)), cachedCollector);

  boolean fillFields = true;
  Collection&lt;SearchGroup&lt;BytesRef&gt;&gt; topGroups = c1.getTopGroups(groupOffset, fillFields);

  if (topGroups == null) {
    // No groups matched
    return;
  }

  boolean getScores = true;
  boolean getMaxScores = true;
  TermSecondPassGroupingCollector c2 = new TermSecondPassGroupingCollector("author", topGroups, groupSort, docSort, docOffset+docsPerGroup, getScores, getMaxScores, fillFields);

  // Optionally compute the total group count; MultiCollector.wrap returns
  // a plain Collector, so keep a separate reference to c2:
  Collector secondPassCollector = c2;
  TermAllGroupsCollector allGroupsCollector = null;
  if (requiredTotalGroupCount) {
    allGroupsCollector = new TermAllGroupsCollector("author");
    secondPassCollector = MultiCollector.wrap(c2, allGroupsCollector);
  }

  if (cachedCollector.isCached()) {
    // Cache fit within maxCacheRAMMB, so we can replay it:
    cachedCollector.replay(secondPassCollector);
  } else {
    // Cache was too large; must re-execute query:
    s.search(new TermQuery(new Term("content", searchTerm)), secondPassCollector);
  }

  TopGroups&lt;BytesRef&gt; groupsResult = c2.getTopGroups(docOffset);
  if (requiredTotalGroupCount) {
    groupsResult = new TopGroups&lt;BytesRef&gt;(groupsResult, allGroupsCollector.getGroupCount());
  }

  // Render groupsResult...
</pre>
<p>To use the single-pass <code>BlockGroupingCollector</code>,
first, at indexing time, you must ensure all docs in each group
are added as a block, and you have some way to find the last
document of each group.  One simple way to do this is to add a
marker binary field:</p>
<pre class="prettyprint">
  // Create Documents from your source:
  List&lt;Document&gt; oneGroup = ...;

  Field groupEndField = new Field("groupEnd", "x", Field.Store.NO, Field.Index.NOT_ANALYZED);
  groupEndField.setOmitTermFreqAndPositions(true);
  groupEndField.setOmitNorms(true);
  oneGroup.get(oneGroup.size()-1).add(groupEndField);

  // You can also use writer.updateDocuments(); just be sure you
  // replace an entire previous doc block with this new one.  For
  // example, each group could have a "groupID" field, with the same
  // value for all docs in this group:
  writer.addDocuments(oneGroup);
</pre>

<p>Then, at search time, do this up front:</p>
<pre class="prettyprint">
  // Set this once in your app &amp; save away for reusing across all queries:
  Filter groupEndDocs = new CachingWrapperFilter(new QueryWrapperFilter(new TermQuery(new Term("groupEnd", "x"))));
</pre>

<p>Finally, do this per search:</p>
<pre class="prettyprint">
  // Per search:
  BlockGroupingCollector c = new BlockGroupingCollector(groupSort, groupOffset+topNGroups, needsScores, groupEndDocs);
  s.search(new TermQuery(new Term("content", searchTerm)), c);
  TopGroups groupsResult = c.getTopGroups(withinGroupSort, groupOffset, docOffset, docOffset+docsPerGroup, fillFields);

  // Render groupsResult...
</pre>

<p>Note that the <code>groupValue</code> of each <code>GroupDocs</code>
will be <code>null</code>, so if you need to present this value you'll
have to separately retrieve it (for example using stored
fields, <code>FieldCache</code>, etc.).</p>
<p>Another collector is the <code>TermAllGroupHeadsCollector</code>, which
  can be used to retrieve the most relevant document of each group, also
  known as the group heads.  This can be useful when you want to compute
  grouping-based facets or statistics over the complete query result.  The
  collector can be executed during the first or second pass.</p>
<pre class="prettyprint">
  AbstractAllGroupHeadsCollector c = TermAllGroupHeadsCollector.create(groupField, sortWithinGroup);
  s.search(new TermQuery(new Term("content", searchTerm)), c);

  // Retrieve all group heads as an int array:
  int[] groupHeadsArray = c.retrieveGroupHeads();

  // Or retrieve all group heads as a FixedBitSet:
  int maxDoc = s.maxDoc();
  FixedBitSet groupHeadsBitSet = c.retrieveGroupHeads(maxDoc);
</pre>

</body>
</html>