Lucene Benchmark Contrib Change Log

The Benchmark contrib package contains code for benchmarking Lucene in a variety of ways.

For more information on past and future Lucene versions, please see:
http://s.apache.org/luceneversions

LUCENE-3262: Facet benchmarking - Benchmark tasks and sources were added for indexing
with facets, demonstrated in facets.alg. (Gilad Barkai, Doron Cohen)

LUCENE-3457: Upgrade commons-compress to 1.2 (and undo LUCENE-2980's workaround).

LUCENE-3137: ExtractReuters supports out-dir param suffixed by a slash. (Doron Cohen)

LUCENE-2977: WriteLineDocTask now automatically chooses the output format
(GZip, BZip2 or plain text) according to the output file extension.
Property bzip.compression of WriteLineDocTask was removed. (Doron Cohen)
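
The extension-based choice described in LUCENE-2977 can be pictured with a small sketch. This is not the WriteLineDocTask code: the function name, the exact suffixes (".gz", ".bz2") and the case-insensitive match are assumptions made for illustration only.

```python
import os

# Hypothetical suffix table: the entry above names the formats,
# not the exact file suffixes that select them.
_SUFFIX_TO_FORMAT = {".gz": "gzip", ".bz2": "bzip2"}


def detect_output_format(path: str) -> str:
    """Return "gzip", "bzip2" or "plain" based on the file extension."""
    # Lower-casing the suffix is an assumption, not documented behavior.
    suffix = os.path.splitext(path)[1].lower()
    # Anything that is not a known compressed suffix is treated as plain text.
    return _SUFFIX_TO_FORMAT.get(suffix, "plain")
```

With this sketch, "docs.txt.gz" maps to gzip and a file without a recognized suffix falls back to plain text.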

LUCENE-2980: Benchmark's ContentSource no longer requires lower-case file suffixes
for detecting file type (gzip/bzip2/text). As part of this fix, worked around an
issue with gzip input streams that remained open (see COMPRESS-127).

LUCENE-2978: Upgrade benchmark's commons-compress from 1.0 to 1.1 as
the move of gzip decompression in LUCENE-1540 from Java's GZIPInputStream
to commons-compress 1.0 made it 15 times slower. In 1.1 no such slow-down
is observed. (Doron Cohen)

LUCENE-2958: WriteLineDocTask improvements - allow emitting line docs for empty
docs as well, and be flexible about which fields are added to the line file. For this,
a header line was added to the line file. That header is examined by LineDocSource.
Old line files which have no header line are handled as before, imposing the default
header. (Doron Cohen, Shai Erera, Mike McCandless)

LUCENE-2964: Allow benchmark tasks from alternative packages,
specified through a new property "alt.tasks.packages".
(Doron Cohen, Shai Erera)

LUCENE-2963: Easier way to run benchmark, by calling Benchmark.exec(alg-file).

LUCENE-2961: Removed lib/xml-apis.jar, since JVM 1.5+ already contains the
JAXP 1.3 interface classes it provides.

LUCENE-1540: Improvements to contrib.benchmark for TREC collections.
ContentSource can now process plain text files, gzip files, and bzip2 files.
TREC doc parsing now handles the TREC gov2 collection and TREC disks 4&5-CR
collection (both used by many TREC tasks). (Shai Erera, Doron Cohen)

LUCENE-1591: Roll back to xerces-2.9.1-patched-XERCESJ-1257.jar to work around
XERCESJ-1257, which we hit on current Wikipedia XML export
(ENWIKI-20110115-pages-articles.xml) with xerces-2.10.0.jar. (Mike McCandless)

LUCENE-929: ExtractReuters first extracts to a tmp dir and then renames. That
way, if a previous extract attempt failed, "ant extract-reuters" will still
extract the files. (Shai Erera, Doron Cohen, Grant Ingersoll)

LUCENE-2885: Add WaitForMerges task (calls IndexWriter.waitForMerges()).

The locally built patched version of the Xerces-J jar introduced
as part of LUCENE-1591 is no longer required, because Xerces
2.10.0, which contains a fix for XERCESJ-1257 (see
http://svn.apache.org/viewvc?view=revision&revision=554069),
was released last year. Upgraded
xerces-2.9.1-patched-XERCESJ-1257.jar and xml-apis-2.9.0.jar
to xercesImpl-2.10.0.jar and xml-apis-2.10.0.jar. (Steven Rowe)

LUCENE-2416: WriteLineDocTask now supports multi-threading. Also,
StringBufferReader was renamed to StringBuilderReader and works on
StringBuilder now. In addition, LongToEnglishContentSource starts from 0
(instead of Long.MIN_VALUE+10) and wraps around to Long.MIN_VALUE (if you
ever hit Long.MAX_VALUE). (Shai Erera)

LUCENE-2377: Enable the use of NoMergePolicy and NoMergeScheduler by
CreateIndexTask. (Shai Erera)

LUCENE-2353: Fixed bug in Config where Windows absolute path property values
were incorrectly handled. (Shai Erera)

LUCENE-2343: Added support for benchmarking collectors. (Grant Ingersoll, Shai Erera)

LUCENE-2254: Add support to the quality package for running
experiments with any combination of Title, Description, and Narrative.

LUCENE-2223: Add a benchmark for ShingleFilter. You can wrap any
analyzer with ShingleAnalyzerWrapper and specify shingle parameters
with the NewShingleAnalyzer task. (Steven Rowe via Robert Muir)

LUCENE-2210: TrecTopicsReader now properly reads descriptions and
narratives from trec topics files. (Robert Muir)

LUCENE-2181: Add a benchmark for collation. This adds NewLocaleTask,
which sets a Locale in the run data for collation to use, and can be
used in the future for benchmarking localized range queries and sorts.
Also add NewCollationAnalyzerTask, which works with both JDK and ICU
Collator implementations. Fix ReadTokensTask to not tokenize fields
unless they should be tokenized according to DocMaker config. The
easiest way to run the benchmark is to run 'ant collation'.
(Steven Rowe via Robert Muir)

LUCENE-2178: Allow adding multiple locations to the class path with
-Dbenchmark.ext.classpath=... when running "ant run-task". (Steven
Rowe via Mike McCandless)

LUCENE-2168: Allow negative relative thread priority for BG tasks

LUCENE-2106: ReadTask does not close its Reader when
OpenReader/CloseReader are not used. (Mark Miller)

LUCENE-2079: Allow specifying delta thread priority after the "&";
added log.time.step.msec to print per-time-period counts; fixed
NearRealTimeTask to print reopen times (in msec) of each reopen, at
the end. (Mike McCandless)

LUCENE-2050: Added ability to run tasks within a serial sequence in
the background, by appending "&". The tasks are stopped & joined at
the end of the sequence. Also added Wait and RollbackIndex tasks.
Genericized NearRealTimeReaderTask to only reopen the reader
(previously it spawned its own thread, and also did searching).
Also changed the API of PerfRunData.getIndexReader: it now returns a
reference, and it's your job to decRef the reader when you're done
using it. (Mike McCandless)

LUCENE-2059: allow TrecContentSource not to change the docname.
Previously, it would always append the iteration # to the docname.
With the new option content.source.excludeIteration, you can disable this.
The resulting index can then be used with the quality package to measure
relevance. (Robert Muir)

LUCENE-2058: specify trec_eval submission output from the command line.
Previously, 4 arguments were required, but the third was unused. The
third argument is now the desired location of submission.txt (Robert Muir)

LUCENE-2044: Added delete.percent.rand.seed to seed the Random instance
used by DeleteByPercentTask. (Mike McCandless)

LUCENE-2043: Fix CommitIndexTask to also commit pending IndexReader
changes (Mike McCandless)

LUCENE-2042: Added print.hits.field, to print each hit from the
Search* tasks. (Mike McCandless)

LUCENE-2029: Added doc.body.stored and doc.body.tokenized; each
falls back to the non-body variant as its default. (Mike McCandless)

LUCENE-1994: Fix thread safety of EnwikiContentSource and DocMaker
when doc.reuse.fields is false. Also made doc.reuse.fields=true
thread safe. (Mark Miller, Shai Erera, Mike McCandless)

LUCENE-1770: Add EnwikiQueryMaker (Mark Miller)

LUCENE-1773: Add FastVectorHighlighter tasks. This change is a
non-backwards compatible change in how subclasses of ReadTask define
a highlighter. The methods doHighlight, isMergeContiguousFragments,
maxNumFragments and getHighlighter are no longer used and have been
marked deprecated and made package-private, so subclasses that still
use them fail at compile time. Instead, the new getBenchmarkHighlighter
method should return an appropriate highlighter for the task. The
configuration of the highlighter tasks (maxFrags, mergeContiguous, etc.)
is now accepted as params to the task. (Koji Sekiguchi via Mike McCandless)

LUCENE-1778: Add support for log.step setting per task type. Previously, if
you included a log.step line in the .alg file, it applied to all
tasks. Now, you can include a log.step.AddDoc, or log.step.DeleteDoc (for
example) to control logging for just these tasks. If you want to omit logging
for any other task, include log.step=-1. The syntax is "log.step." together
with the Task's 'short' name (i.e., without the 'Task' part).
(Shai Erera via Mark Miller)
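
The lookup order described in LUCENE-1778 (task-specific key first, then the global log.step) can be sketched as follows. This is an illustration only, not the PerfTask code: the function names and the fallback default of 1000 are hypothetical.

```python
def short_task_name(class_name: str) -> str:
    """Strip a trailing 'Task' from a class name, e.g. AddDocTask -> AddDoc."""
    if class_name.endswith("Task"):
        return class_name[: -len("Task")]
    return class_name


def resolve_log_step(props: dict, class_name: str, default: int = 1000) -> int:
    """Prefer log.step.<ShortName>, then log.step, then a hypothetical default."""
    key = "log.step." + short_task_name(class_name)
    raw = props.get(key, props.get("log.step", default))
    return int(raw)


def should_log(step: int, count: int) -> bool:
    """A non-positive step (such as -1) disables logging for the task."""
    return step > 0 and count % step == 0
```

For example, with log.step=-1 and log.step.AddDoc=500, only AddDoc would log, once every 500 invocations.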

LUCENE-1595: Deprecate LineDocMaker and EnwikiDocMaker in favor of
using DocMaker directly, with content.source = LineDocSource or
EnwikiContentSource. NOTE: with this change, the "id" field from
the Wikipedia XML export is now indexed as the "docname" field
(previously it was indexed as "docid"). Additionally, the
SearchWithSort task now accepts all types that SortField can accept
and no longer falls back to SortField.AUTO, which has been
deprecated. (Mike McCandless)

LUCENE-1755: Fix WriteLineDocTask to output a document if it contains either
a title or body (or both). (Shai Erera via Mark Miller)

LUCENE-1725: Fix the example Sort algorithm - auto is now deprecated and no longer works
with Benchmark. Benchmark will now throw an exception if you specify sort fields without
a type. The example sort algorithm is now typed. (Mark Miller)

LUCENE-1730: Fix TrecContentSource to use ISO-8859-1 when reading the TREC files,
unless a different encoding is specified. Additionally, ContentSource now supports
a content.source.encoding parameter in the configuration file.
(Shai Erera via Mark Miller)

LUCENE-1716: Added the following support:
doc.tokenized.norms: specifies whether to store norms
doc.body.tokenized.norms: special attribute for the body field
doc.index.props: specifies whether DocMaker should index the properties set on
writer.info.stream: specifies the info stream to set on IndexWriter (supported
values are: SystemOut, SystemErr and a file name). (Shai Erera via Mike McCandless)

LUCENE-1714: WriteLineDocTask incorrectly normalized text, by replacing only
occurrences of "\t" with a space. It now replaces "\r\n" in addition to that,
so that LineDocMaker won't fail. (Shai Erera via Michael McCandless)
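
The normalization described in LUCENE-1714 matters because a line file holds one document per line. A minimal sketch, assuming each tab, carriage return and newline character is replaced individually by one space (the entry does not say whether "\r\n" collapses to a single space), and using a hypothetical function name:

```python
# Characters that would break a one-document-per-line file.
_TO_SPACE = str.maketrans({"\t": " ", "\r": " ", "\n": " "})


def normalize(text: str) -> str:
    """Replace every tab, carriage return and newline with a single space."""
    return text.translate(_TO_SPACE)
```

After this step a field value can no longer introduce a stray separator or line break into the line file.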

LUCENE-1595: This issue breaks previous external algorithms. DocMaker has been
replaced with a concrete class which accepts a ContentSource for iterating over
a content source's documents. Most of the old DocMakers were changed to a
ContentSource implementation, and DocMaker is now a default document creation
implementation that provides an easy way for reusing fields. When [doc.maker] is
not defined in an algorithm, the new DocMaker is the default. If you have .alg
files which specify a DocMaker (like ReutersDocMaker), you should change the
[doc.maker] line to:
[content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource]

doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
becomes
content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource

doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
becomes
content.source=org.apache.lucene.benchmark.byTask.feeds.SingleDocSource

Also, PerfTask now logs a message in tearDown() rather than each Task doing its
own logging. A new setting called [log.step] is consulted to determine how often
to log. [doc.add.log.step] is no longer a valid setting. For easy migration of
current .alg files, rename [doc.add.log.step] to [log.step] and [doc.delete.log.step]
to [delete.log.step].

Additionally, [doc.maker.forever] should be changed to [content.source.forever].
(Shai Erera via Mark Miller)
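
The renames listed in the LUCENE-1595 entry above could be applied to an existing .alg file with a small script. The sketch below is hypothetical and not part of the benchmark package: it assumes property lines have the simple form key=value, covers only the two DocMaker classes and three keys named in the entry, and leaves every other line untouched.

```python
FEEDS = "org.apache.lucene.benchmark.byTask.feeds."

# Key renames taken from the changelog entry.
RENAMED_KEYS = {
    "doc.add.log.step": "log.step",
    "doc.delete.log.step": "delete.log.step",
    "doc.maker.forever": "content.source.forever",
}

# DocMaker classes that became ContentSource implementations.
DOC_MAKER_TO_SOURCE = {
    FEEDS + "ReutersDocMaker": FEEDS + "ReutersContentSource",
    FEEDS + "SimpleDocMaker": FEEDS + "SingleDocSource",
}


def migrate_line(line: str) -> str:
    """Rewrite one key=value line; return other lines unchanged."""
    key, sep, value = line.partition("=")
    if not sep:
        return line  # not a property line (e.g. a task sequence)
    key, value = key.strip(), value.strip()
    if key in RENAMED_KEYS:
        return f"{RENAMED_KEYS[key]}={value}"
    if key == "doc.maker" and value in DOC_MAKER_TO_SOURCE:
        return f"content.source={DOC_MAKER_TO_SOURCE[value]}"
    return line


def migrate_alg(text: str) -> str:
    """Apply migrate_line to every line of an .alg file's text."""
    return "\n".join(migrate_line(line) for line in text.split("\n"))
```

A doc.maker line naming a class outside the table is left as-is, so such files still need a manual review.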

LUCENE-1539: Added DeleteByPercentTask which enables deleting a
percentage of documents and searching on them. Changed CommitIndex
to optionally accept a label (recorded as userData=<label> in the
commit point). Added FlushReaderTask, and modified OpenReaderTask
to also optionally take a label referencing a commit point to open.
Also changed default autoCommit (when IndexWriter is opened) to
true. (Jason Rutherglen via Mike McCandless)

LUCENE-1495: Allow task sequence to run for specified number of seconds by adding ": 2.7s" (for example).

LUCENE-1493: Stop using deprecated Hits API for searching; add new
param search.num.hits to set top N docs to collect.

LUCENE-1492: Added optional readOnly param (default true) to OpenReader task.

LUCENE-1243: Added new sorting benchmark capabilities. Also added Reopen and Commit tasks. (Mark Miller via Grant Ingersoll)

LUCENE-1090: remove relative path assumptions from benchmark code.
Only build.xml was modified: work-dir definition must remain so
benchmark tests can run from both trunk-home and benchmark-home.

LUCENE-1209: Fixed DocMaker settings by round. Prior to this fix, DocMaker settings of
first round were used in all rounds. (E.g. term vectors.)
(Mark Miller via Doron Cohen)

LUCENE-1156: Fixed redirect problem in EnwikiDocMaker. Refactored ExtractWikipedia to use EnwikiDocMaker.
Added property to EnwikiDocMaker to allow for skipping image-only documents.

LUCENE-1136: add ability to not count sub-task doLogic increment

LUCENE-1129: ReadTask properly uses the traversalSize value
LUCENE-1128: Added support for benchmarking the highlighter

LUCENE-1139: various fixes
- add merge.scheduler, merge.policy config properties
- refactor Open/CreateIndexTask to share setting config on IndexWriter
- added doc.reuse.fields=true|false for LineDocMaker
- OptimizeTask now takes int param to call optimize(int maxNumSegments)
- CloseIndexTask now takes bool param to call close(false) (abort running merges)

LUCENE-1116: quality package improvements:
- add MRR computation;
- allow control of max #queries to run;
- verify log & report are flushed;
- add TREC query reader for the 1MQ track.

LUCENE-1102: EnwikiDocMaker now indexes the docid field, so results might not be comparable
with results prior to this change, although it is doubtful that this one small field makes
much difference.

LUCENE-1086: DocMakers setup for the "docs.dir" property
fixed to properly handle absolute paths. (Shai Erera via Doron Cohen)

LUCENE-941: infinite loop for alg: {[AddDoc(4000)]: 4} : *
ResetInputsTask fixed to work also after exhaustion.
All Reset Tasks now subclass ResetInputsTask.

LUCENE-971: Change enwiki tasks to a doc maker (extending
LineDocMaker) that directly processes the Wikipedia XML and produces
documents. Intermediate files (one per document) are no longer

LUCENE-967: Add "ReadTokensTask" to allow for benchmarking just tokenization.

LUCENE-836: Add support for search quality benchmarking: running
a set of queries against a searcher, optionally producing a submission
report and, if query judgements are available, computing quality measures:
recall, precision_at_N, average_precision, MAP. TREC specific Judge (based
on TREC QRels) and TREC Topics reader are included in o.a.l.benchmark.quality.trec
but any other format of queries and judgements can be implemented and used.

LUCENE-947: Add support for creating and indexing "one document per
line" from a large text file, which reduces per-document overhead of
opening a single file for each document.

LUCENE-848: Added support for Wikipedia benchmarking.

- LUCENE-940: Multi-threaded issues fixed: SimpleDateFormat; logging for addDoc/deleteDoc tasks.
- LUCENE-945: tests fail to find data dirs. Added sys-prop benchmark.work.dir and cfg-prop work.dir.

- LUCENE-863: Deprecated StandardBenchmarker in favour of byTask code.

Better error handling and javadocs around "exhaustive" doc making.

1. Which HTML Parser is used is configurable with html.parser property.
2. External classes added to classpath with -Dbenchmark.ext.classpath=path.
3. '*' as repeating number now means "exhaust doc maker - no repetitions".

- Moved withRetrieve() call out of the loop in ReadTask
- Added SearchTravRetLoadFieldSelectorTask to help benchmark some of the FieldSelector capabilities
- Added options to store content bytes on the Reuters Doc (and others, but Reuters is the only one with it enabled)

Tests (for benchmarking code correctness) were added - LUCENE-840.
To be invoked by "ant test" from contrib/benchmark. (Doron Cohen)

1. Introduced an AbstractQueryMaker to hold common QueryMaker code. (GSI)
2. Added traversalSize parameter to SearchTravRetTask and SearchTravTask. Changed SearchTravRetTask to extend SearchTravTask. (GSI)
3. Added FileBasedQueryMaker to run queries from a File or resource. (GSI)
4. Modified query-maker generation for read related tasks to make further read tasks addition simpler and safer. (DC)
5. Changed Tasks' setParams() to throw UnsupportedOperationException if that task does not support command line param. (DC)
6. Improved javadoc to specify all properties and command line params currently supported. (DC)
7. Refactored ReportTasks so that it is easy/possible now to create new report tasks. (DC)

1. Committed Doron Cohen's benchmarking contribution, which provides an easily expandable task-based approach to benchmarking. See the javadocs for information. (Doron Cohen via Grant Ingersoll)

3. 2/11/07: LUCENE-790 and 788: Fixed Locale issue with date formatter. Fixed some minor issues with benchmarking by task. Added a dependency
on the Lucene demo to the build classpath. (Doron Cohen, Grant Ingersoll)

4. 2/13/07: LUCENE-801: build.xml now builds Lucene core and Demo first and has classpath dependencies on the output of that build. (Doron Cohen, Grant Ingersoll)