1 Lucene Benchmark Contrib Change Log
3 The Benchmark contrib package contains code for benchmarking Lucene in a variety of ways.
5 For more information on past and future Lucene versions, please see:
6 http://s.apache.org/luceneversions
9 LUCENE-3137: ExtractReuters supports out-dir param suffixed by a slash. (Doron Cohen)
12 LUCENE-2977: WriteLineDocTask now automatically detects how to write -
13 GZip or BZip2 or Plain-text - according to the output file extension.
14 The bzip.compression property of WriteLineDocTask was removed. (Doron Cohen)
17 LUCENE-2980: Benchmark's ContentSource no longer requires lower case file suffixes
18 for detecting file type (gzip/bzip2/text). As part of this fix, worked around an
19 issue where gzip input streams remained open (see COMPRESS-127).
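The extension-based detection described in LUCENE-2977 and LUCENE-2980 can be sketched roughly as follows. This is only an illustration: the class name CompressionByExtension and the ".gz" / ".bz2" suffixes are assumptions made for the example, not the actual benchmark code.

```java
import java.util.Locale;

/** Hypothetical helper: picks a compression format from a file name's extension. */
final class CompressionByExtension {
    enum Format { GZIP, BZIP2, PLAIN }

    // The ".gz" and ".bz2" suffixes are assumptions made for this sketch.
    static Format detect(String fileName) {
        // Lower-case with a fixed locale so "DOCS.GZ" and "docs.gz" are treated alike.
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".gz")) {
            return Format.GZIP;
        }
        if (name.endsWith(".bz2")) {
            return Format.BZIP2;
        }
        return Format.PLAIN; // anything else is read or written as plain text
    }

    public static void main(String[] args) {
        for (String f : new String[] {"docs.txt.gz", "DOCS.BZ2", "docs.txt"}) {
            System.out.println(f + " -> " + detect(f));
        }
    }
}
```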
23 LUCENE-2978: Upgrade benchmark's commons-compress from 1.0 to 1.1 as
24 the move of gzip decompression in LUCENE-1540 from Java's GZipInputStream
25 to commons-compress 1.0 made it 15 times slower. In 1.1 no such slow-down
26 is observed. (Doron Cohen)
29 LUCENE-2958: WriteLineDocTask improvements - line docs can now be emitted for empty
30 docs as well, and the fields added to the line file are configurable. For this, a header
31 line was added to the line file. That header is examined by LineDocSource. Old line
32 files which have no header line are handled as before, imposing the default header.
33 (Doron Cohen, Shai Erera, Mike McCandless)
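A minimal sketch of how a reader might treat the optional header line. The marker string, the tab separator and the default field names (title, date, body) are assumptions chosen for illustration; the real values live in WriteLineDocTask and LineDocSource.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical reader-side handling of the optional line-file header. */
final class LineFileHeader {
    // Marker, separator and default field names are assumptions of this sketch.
    static final String HEADER_MARKER = "FIELDS_HEADER_INDICATOR###";
    static final String SEP = "\t";
    static final List<String> DEFAULT_FIELDS = Arrays.asList("title", "date", "body");

    /** Returns the fields named by the first line, or the defaults if it is not a header. */
    static List<String> fieldsFor(String firstLine) {
        if (firstLine == null || !firstLine.startsWith(HEADER_MARKER)) {
            // Old-style file: the first line is already a document, so impose the defaults.
            return DEFAULT_FIELDS;
        }
        String rest = firstLine.substring(HEADER_MARKER.length());
        if (rest.startsWith(SEP)) {
            rest = rest.substring(SEP.length());
        }
        return rest.isEmpty() ? DEFAULT_FIELDS : Arrays.asList(rest.split(SEP));
    }

    public static void main(String[] args) {
        System.out.println(fieldsFor(HEADER_MARKER + "\tname\tbody"));
        System.out.println(fieldsFor("Some title\t2011-01-01\tSome body"));
    }
}
```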
36 LUCENE-2964: Allow benchmark tasks from alternative packages,
37 specified through a new property "alt.tasks.packages".
38 (Doron Cohen, Shai Erera)
41 LUCENE-2963: Easier way to run benchmark, by calling Benchmark.exec(alg-file).
45 LUCENE-2961: Removed lib/xml-apis.jar, since JVM 1.5+ already contains the
46 JAXP 1.3 interface classes it provides.
49 LUCENE-1540: Improvements to contrib.benchmark for TREC collections.
50 ContentSource can now process plain text files, gzip files, and bzip2 files.
51 TREC doc parsing now handles the TREC gov2 collection and TREC disks 4&5-CR
52 collection (both used by many TREC tasks). (Shai Erera, Doron Cohen)
55 LUCENE-1591: Rollback to xerces-2.9.1-patched-XERCESJ-1257.jar to work around
56 XERCESJ-1257, which we hit on current Wikipedia XML export
57 (ENWIKI-20110115-pages-articles.xml) with xerces-2.10.0.jar. (Mike McCandless)
60 LUCENE-929: ExtractReuters first extracts to a tmp dir and then renames. That
61 way, if a previous extract attempt failed, "ant extract-reuters" will still
62 extract the files. (Shai Erera, Doron Cohen, Grant Ingersoll)
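The extract-then-rename idea can be sketched as below. This is not the ExtractReuters code: the class name, the placeholder file written during "extraction", and the rule that an existing output directory means a completed earlier run are all assumptions of the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch: write into a temp dir, rename to the final dir only on success. */
final class ExtractThenRename {
    static void extract(Path outDir) {
        if (Files.exists(outDir)) {
            return; // assumed convention: outDir only exists after a complete extraction
        }
        try {
            Path parent = outDir.toAbsolutePath().getParent();
            Files.createDirectories(parent);
            // Sibling temp dir, so the final rename stays on the same file system.
            Path tmp = Files.createTempDirectory(parent, "extract-tmp");
            // Stand-in for the real extraction work (placeholder content).
            Files.write(tmp.resolve("doc-0.txt"), "example".getBytes(StandardCharsets.UTF_8));
            // If anything above failed, outDir was never created and a rerun starts over.
            Files.move(tmp, outDir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path out = Paths.get(System.getProperty("java.io.tmpdir"),
            "bench-" + System.nanoTime(), "reuters-out");
        extract(out);
        System.out.println("extracted to " + out);
    }
}
```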
65 LUCENE-2885: Add WaitForMerges task (calls IndexWriter.waitForMerges()).
69 The locally built patched version of the Xerces-J jar introduced
70 as part of LUCENE-1591 is no longer required, because Xerces
71 2.10.0, which contains a fix for XERCESJ-1257 (see
72 http://svn.apache.org/viewvc?view=revision&revision=554069),
73 was released last year. Upgraded
74 xerces-2.9.1-patched-XERCESJ-1257.jar and xml-apis-2.9.0.jar
75 to xercesImpl-2.10.0.jar and xml-apis-2.10.0.jar. (Steven Rowe)
78 LUCENE-2416: WriteLineDocTask now supports multi-threading. Also,
79 StringBufferReader was renamed to StringBuilderReader and works on
80 StringBuilder now. In addition, LongToEnglishContentSource starts from 0
81 (instead of Long.MIN_VALUE+10) and wraps around to Long.MIN_VALUE (if you ever hit
82 Long.MAX_VALUE). (Shai Erera)
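The wrap-around behaviour follows directly from Java's two's-complement long arithmetic. A small sketch, using a hypothetical WrappingCounter class rather than the content source itself:

```java
/** Hypothetical counter that starts at 0 and relies on long overflow to wrap. */
final class WrappingCounter {
    private long next;

    private WrappingCounter(long start) {
        this.next = start;
    }

    static WrappingCounter fromZero() {
        return new WrappingCounter(0L);
    }

    // Only for demonstrating the wrap without counting all the way up.
    static WrappingCounter startingAt(long start) {
        return new WrappingCounter(start);
    }

    /** Returns the current value, then advances; Long.MAX_VALUE + 1 becomes Long.MIN_VALUE. */
    synchronized long nextValue() {
        return next++;
    }

    public static void main(String[] args) {
        WrappingCounter c = startingAt(Long.MAX_VALUE);
        System.out.println(c.nextValue());
        System.out.println(c.nextValue());
    }
}
```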
85 LUCENE-2377: Enable the use of NoMergePolicy and NoMergeScheduler by
86 CreateIndexTask. (Shai Erera)
89 LUCENE-2353: Fixed bug in Config where Windows absolute path property values
90 were incorrectly handled (Shai Erera)
93 LUCENE-2343: Added support for benchmarking collectors. (Grant Ingersoll, Shai Erera)
96 LUCENE-2254: Add support to the quality package for running
97 experiments with any combination of Title, Description, and Narrative.
101 LUCENE-2223: Add a benchmark for ShingleFilter. You can wrap any
102 analyzer with ShingleAnalyzerWrapper and specify shingle parameters
103 with the NewShingleAnalyzer task. (Steven Rowe via Robert Muir)
106 LUCENE-2210: TrecTopicsReader now properly reads descriptions and
107 narratives from trec topics files. (Robert Muir)
110 LUCENE-2181: Add a benchmark for collation. This adds NewLocaleTask,
111 which sets a Locale in the run data for collation to use, and can be
112 used in the future for benchmarking localized range queries and sorts.
113 Also add NewCollationAnalyzerTask, which works with both JDK and ICU
114 Collator implementations. Fix ReadTokensTask to not tokenize fields
115 unless they should be tokenized according to DocMaker config. The
116 easiest way to run the benchmark is to run 'ant collation'
117 (Steven Rowe via Robert Muir)
120 LUCENE-2178: Allow multiple locations to add to the class path with
121 -Dbenchmark.ext.classpath=... when running "ant run-task" (Steven
122 Rowe via Mike McCandless)
125 LUCENE-2168: Allow negative relative thread priority for BG tasks
129 LUCENE-2106: ReadTask does not close its Reader when
130 OpenReader/CloseReader are not used. (Mark Miller)
133 LUCENE-2079: Allow specifying delta thread priority after the "&";
134 added log.time.step.msec to print per-time-period counts; fixed
135 NearRealTimeTask to print reopen times (in msec) of each reopen, at
136 the end. (Mike McCandless)
139 LUCENE-2050: Added ability to run tasks within a serial sequence in
140 the background, by appending "&". The tasks are stopped & joined at
141 the end of the sequence. Also added Wait and RollbackIndex tasks.
142 Genericized NearRealTimeReaderTask to only reopen the reader
143 (previously it spawned its own thread, and also did searching).
144 Also changed the API of PerfRunData.getIndexReader: it now returns a
145 reference, and it's your job to decRef the reader when you're done
146 using it. (Mike McCandless)
149 LUCENE-2059: allow TrecContentSource not to change the docname.
150 Previously, it would always append the iteration # to the docname.
151 With the new option content.source.excludeIteration, you can disable this.
152 The resulting index can then be used with the quality package to measure
153 relevance. (Robert Muir)
156 LUCENE-2058: specify trec_eval submission output from the command line.
157 Previously, 4 arguments were required, but the third was unused. The
158 third argument is now the desired location of submission.txt (Robert Muir)
161 LUCENE-2044: Added delete.percent.rand.seed to seed the Random instance
162 used by DeleteByPercentTask. (Mike McCandless)
165 LUCENE-2043: Fix CommitIndexTask to also commit pending IndexReader
166 changes (Mike McCandless)
169 LUCENE-2042: Added print.hits.field, to print each hit from the
170 Search* tasks. (Mike McCandless)
173 LUCENE-2029: Added doc.body.stored and doc.body.tokenized; each
174 falls back to the non-body variant as its default. (Mike McCandless)
177 LUCENE-1994: Fix thread safety of EnwikiContentSource and DocMaker
178 when doc.reuse.fields is false. Also made doc.reuse.fields=true
179 thread safe. (Mark Miller, Shai Erera, Mike McCandless)
182 LUCENE-1770: Add EnwikiQueryMaker (Mark Miller)
185 LUCENE-1773: Add FastVectorHighlighter tasks. This change is a
186 non-backwards compatible change in how subclasses of ReadTask define
187 a highlighter. The methods doHighlight, isMergeContiguousFragments,
188 maxNumFragments and getHighlighter are no longer used and have been
189 marked deprecated and made package private, so there's a compile
190 time error. Instead, the new getBenchmarkHighlighter method should
191 return an appropriate highlighter for the task. The configuration of
192 the highlighter tasks (maxFrags, mergeContiguous, etc.) is now
193 accepted as params to the task. (Koji Sekiguchi via Mike McCandless)
196 LUCENE-1778: Add support for log.step setting per task type. Previously, if
197 you included a log.step line in the .alg file, it was applied to all
198 tasks. Now, you can include a log.step.AddDoc, or log.step.DeleteDoc (for
199 example) to control logging for just these tasks. If you want to omit logging
200 for any other task, include log.step=-1. The syntax is "log.step." together
201 with the Task's 'short' name (i.e., without the 'Task' part).
202 (Shai Erera via Mark Miller)
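One way to picture the lookup order is the sketch below. It is an illustration only: LogStepResolver is a hypothetical class, and the default of 1000 used when no property is set is an assumption.

```java
import java.util.Properties;

/** Hypothetical resolver: "log.step.<ShortName>" overrides the global "log.step". */
final class LogStepResolver {
    static final int DEFAULT_LOG_STEP = 1000; // assumed default for this sketch

    /** "AddDocTask" -> "AddDoc": the task's short name, without the "Task" part. */
    static String shortName(String taskClassName) {
        return taskClassName.endsWith("Task")
            ? taskClassName.substring(0, taskClassName.length() - "Task".length())
            : taskClassName;
    }

    static int logStep(Properties config, String taskClassName) {
        String specific = config.getProperty("log.step." + shortName(taskClassName));
        String value = specific != null ? specific : config.getProperty("log.step");
        return value == null ? DEFAULT_LOG_STEP : Integer.parseInt(value.trim());
    }

    /** A step of zero or less (such as -1) turns logging off. */
    static boolean shouldLog(int logStep, int count) {
        return logStep > 0 && count % logStep == 0;
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("log.step", "-1");          // silence every other task
        p.setProperty("log.step.AddDoc", "500");  // but log AddDoc every 500 docs
        System.out.println(logStep(p, "AddDocTask"));
        System.out.println(logStep(p, "DeleteDocTask"));
    }
}
```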
205 LUCENE-1595: Deprecate LineDocMaker and EnwikiDocMaker in favor of
206 using DocMaker directly, with content.source = LineDocSource or
207 EnwikiContentSource. NOTE: with this change, the "id" field from
208 the Wikipedia XML export is now indexed as the "docname" field
209 (previously it was indexed as "docid"). Additionally, the
210 SearchWithSort task now accepts all types that SortField can accept
211 and no longer falls back to SortField.AUTO, which has been
212 deprecated. (Mike McCandless)
215 LUCENE-1755: Fix WriteLineDocTask to output a document if it contains either
216 a title or body (or both). (Shai Erera via Mark Miller)
219 LUCENE-1725: Fix the example Sort algorithm - auto is now deprecated and no longer works
220 with Benchmark. Benchmark will now throw an exception if you specify sort fields without
221 a type. The example sort algorithm is now typed. (Mark Miller)
224 LUCENE-1730: Fix TrecContentSource to use ISO-8859-1 when reading the TREC files,
225 unless a different encoding is specified. Additionally, ContentSource now supports
226 a content.source.encoding parameter in the configuration file.
227 (Shai Erera via Mark Miller)
230 LUCENE-1716: Added the following support:
231 doc.tokenized.norms: specifies whether to store norms
232 doc.body.tokenized.norms: special attribute for the body field
233 doc.index.props: specifies whether DocMaker should index the properties set on
235 writer.info.stream: specifies the info stream to set on IndexWriter (supported
236 values are: SystemOut, SystemErr and a file name). (Shai Erera via Mike McCandless)
239 LUCENE-1714: WriteLineDocTask incorrectly normalized text, by replacing only
240 occurrences of "\t" with a space. It now replaces "\r\n" in addition to that,
241 so that LineDocMaker won't fail. (Shai Erera via Michael McCandless)
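A line file keeps one document per line with tab-separated fields, so any tab or line break inside a field would corrupt it. A minimal sketch of such a normalization, assuming a single regular expression that collapses each run of these characters into one space (the class name and the collapsing are assumptions, not necessarily what WriteLineDocTask does):

```java
import java.util.regex.Pattern;

/** Hypothetical normalizer for text written into a one-document-per-line file. */
final class LineDocText {
    // Tabs separate fields and line breaks separate documents, so neither may appear in a field.
    private static final Pattern LINE_BREAKERS = Pattern.compile("[\t\r\n]+");

    /** Replaces each run of tab, CR or LF characters with a single space. */
    static String normalize(String text) {
        return LINE_BREAKERS.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("title\twith tab\r\nand a line break"));
    }
}
```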
244 LUCENE-1595: This issue breaks previous external algorithms. DocMaker has been
245 replaced with a concrete class which accepts a ContentSource for iterating over
246 a content source's documents. Most of the old DocMakers were changed to a
247 ContentSource implementation, and DocMaker is now a default document creation impl
248 that provides an easy way for reusing fields. When [doc.maker] is not defined in
249 an algorithm, the new DocMaker is the default. If you have .alg files which
250 specify a DocMaker (like ReutersDocMaker), you should change the [doc.maker] line to:
251 [content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource]
254 That is, doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
256 becomes content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource, and
258 doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
260 becomes content.source=org.apache.lucene.benchmark.byTask.feeds.SingleDocSource.
262 Also, PerfTask now logs a message in tearDown() rather than each Task doing its
263 own logging. A new setting called [log.step] is consulted to determine how often
264 to log. [doc.add.log.step] is no longer a valid setting. For easy migration of
265 current .alg files, rename [doc.add.log.step] to [log.step] and [doc.delete.log.step]
266 to [delete.log.step].
268 Additionally, [doc.maker.forever] should be changed to [content.source.forever].
269 (Shai Erera via Mark Miller)
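The renames listed in this entry can be applied mechanically to the property lines of an existing .alg file. The following is a rough sketch of such a migration helper; AlgMigration is hypothetical and covers only the settings and the two DocMaker classes named above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that rewrites pre-LUCENE-1595 property lines of an .alg file. */
final class AlgMigration {
    private static final String FEEDS = "org.apache.lucene.benchmark.byTask.feeds.";
    private static final Map<String, String> KEYS = new LinkedHashMap<>();
    private static final Map<String, String> SOURCES = new LinkedHashMap<>();

    static {
        KEYS.put("doc.add.log.step", "log.step");
        KEYS.put("doc.delete.log.step", "delete.log.step");
        KEYS.put("doc.maker.forever", "content.source.forever");
        SOURCES.put(FEEDS + "ReutersDocMaker", FEEDS + "ReutersContentSource");
        SOURCES.put(FEEDS + "SimpleDocMaker", FEEDS + "SingleDocSource");
    }

    /** Returns the migrated line, or the line unchanged if no rule applies. */
    static String migrateLine(String line) {
        int eq = line.indexOf('=');
        if (eq < 0) {
            return line; // not a property line (task sequence, comment, blank)
        }
        String key = line.substring(0, eq).trim();
        String value = line.substring(eq + 1).trim();
        if (KEYS.containsKey(key)) {
            return KEYS.get(key) + "=" + value;
        }
        if (key.equals("doc.maker") && SOURCES.containsKey(value)) {
            return "content.source=" + SOURCES.get(value);
        }
        return line; // unknown DocMaker classes are left for manual migration
    }

    public static void main(String[] args) {
        System.out.println(migrateLine("doc.maker=" + FEEDS + "ReutersDocMaker"));
        System.out.println(migrateLine("doc.add.log.step=500"));
        System.out.println(migrateLine("doc.maker.forever=false"));
    }
}
```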
272 LUCENE-1539: Added DeleteByPercentTask which enables deleting a
273 percentage of documents and searching on them. Changed CommitIndex
274 to optionally accept a label (recorded as userData=<label> in the
275 commit point). Added FlushReaderTask, and modified OpenReaderTask
276 to also optionally take a label referencing a commit point to open.
277 Also changed default autoCommit (when IndexWriter is opened) to
278 true. (Jason Rutherglen via Mike McCandless)
281 LUCENE-1495: Allow task sequence to run for specified number of seconds by adding ": 2.7s" (for example).
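To illustrate the distinction between a repetition count and a run time, here is a small sketch that parses the token following the ":" of a sequence. RunLength is a hypothetical class; the benchmark's own algorithm parser is not shown here and may work differently.

```java
/** Hypothetical parse result: either a repetition count or a run time in seconds. */
final class RunLength {
    final int repetitions; // -1 when the sequence is time-based
    final double seconds;  // -1 when the sequence is count-based

    private RunLength(int repetitions, double seconds) {
        this.repetitions = repetitions;
        this.seconds = seconds;
    }

    /** Parses the token after ":", for example "4" (run 4 times) or "2.7s" (run for 2.7 seconds). */
    static RunLength parse(String token) {
        String t = token.trim();
        if (t.endsWith("s")) {
            return new RunLength(-1, Double.parseDouble(t.substring(0, t.length() - 1)));
        }
        return new RunLength(Integer.parseInt(t), -1.0);
    }

    boolean isTimeBased() {
        return seconds >= 0;
    }

    public static void main(String[] args) {
        RunLength r = parse(" 2.7s");
        System.out.println(r.isTimeBased() + " " + r.seconds);
    }
}
```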
284 LUCENE-1493: Stop using deprecated Hits API for searching; add new
285 param search.num.hits to set top N docs to collect.
288 LUCENE-1492: Added optional readOnly param (default true) to OpenReader task.
291 LUCENE-1243: Added new sorting benchmark capabilities. Also Reopen and commit tasks. (Mark Miller via Grant Ingersoll)
294 LUCENE-1090: remove relative path assumptions from benchmark code.
295 Only build.xml was modified: work-dir definition must remain so
296 benchmark tests can run from both trunk-home and benchmark-home.
299 LUCENE-1209: Fixed DocMaker settings by round. Prior to this fix, DocMaker settings of
300 first round were used in all rounds. (E.g. term vectors.)
301 (Mark Miller via Doron Cohen)
304 LUCENE-1156: Fixed redirect problem in EnwikiDocMaker. Refactored ExtractWikipedia to use EnwikiDocMaker. Added property to EnwikiDocMaker to allow
305 skipping image-only documents.
308 LUCENE-1136: add ability to not count sub-task doLogic increment
311 LUCENE-1129: ReadTask properly uses the traversalSize value
312 LUCENE-1128: Added support for benchmarking the highlighter
315 LUCENE-1139: various fixes
316 - add merge.scheduler, merge.policy config properties
317 - refactor Open/CreateIndexTask to share setting config on IndexWriter
318 - added doc.reuse.fields=true|false for LineDocMaker
319 - OptimizeTask now takes int param to call optimize(int maxNumSegments)
320 - CloseIndexTask now takes bool param to call close(false) (abort running merges)
324 LUCENE-1116: quality package improvements:
325 - add MRR computation;
326 - allow control of max #queries to run;
327 - verify log & report are flushed.
328 - add TREC query reader for the 1MQ track.
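For reference, MRR (mean reciprocal rank) averages, over all queries, the reciprocal of the rank at which the first relevant document appears. The sketch below uses the textbook definition with hypothetical names; it is not taken from the quality package and its handling of queries without a relevant hit (score 0) is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Textbook MRR sketch; class and method names are hypothetical. */
final class MeanReciprocalRank {
    /** 1 / (rank of the first relevant document), or 0 if none was returned. */
    static double reciprocalRank(List<String> ranked, Set<String> relevant) {
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                return 1.0 / (i + 1);
            }
        }
        return 0.0;
    }

    /** Mean of the per-query reciprocal ranks; both lists are indexed by query. */
    static double mrr(List<List<String>> rankedPerQuery, List<Set<String>> relevantPerQuery) {
        if (rankedPerQuery.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (int q = 0; q < rankedPerQuery.size(); q++) {
            sum += reciprocalRank(rankedPerQuery.get(q), relevantPerQuery.get(q));
        }
        return sum / rankedPerQuery.size();
    }

    public static void main(String[] args) {
        List<String> ranked = Arrays.asList("a", "b", "c");
        Set<String> relevant = new HashSet<>(Arrays.asList("b"));
        System.out.println(reciprocalRank(ranked, relevant));
    }
}
```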
331 LUCENE-1102: EnwikiDocMaker now indexes the docid field, so results might not be comparable with results prior to this change, although
332 this one small field is unlikely to make much difference.
335 LUCENE-1086: DocMakers setup for the "docs.dir" property
336 fixed to properly handle absolute paths. (Shai Erera via Doron Cohen)
339 LUCENE-941: infinite loop for alg: {[AddDoc(4000)]: 4} : *
340 ResetInputsTask fixed to work also after exhaustion.
341 All Reset Tasks now subclass ResetInputsTask.
344 LUCENE-971: Change enwiki tasks to a doc maker (extending
345 LineDocMaker) that directly processes the Wikipedia XML and produces
346 documents. Intermediate files (one per document) are no longer
350 LUCENE-967: Add "ReadTokensTask" to allow for benchmarking just tokenization.
353 LUCENE-836: Add support for search quality benchmarking, running
354 a set of queries against a searcher, and, optionally produce a submission
355 report, and, if query judgements are available, compute quality measures:
356 recall, precision_at_N, average_precision, MAP. TREC specific Judge (based
357 on TREC QRels) and TREC Topics reader are included in o.a.l.benchmark.quality.trec
358 but any other format of queries and judgements can be implemented and used.
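As a worked illustration of the measures named above, the sketch below computes precision at N, recall and average precision for one query using the standard IR definitions; MAP is then the mean of the average precision over all queries. The class and method names are hypothetical and the quality package's exact formulas may differ in detail.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Standard-definition sketch; assumes the ranked list contains no duplicate documents. */
final class QualityMeasures {
    /** Fraction of the top n results that are relevant. */
    static double precisionAt(int n, List<String> ranked, Set<String> relevant) {
        if (n <= 0) {
            return 0.0;
        }
        int hits = 0;
        for (int i = 0; i < Math.min(n, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
        }
        return (double) hits / n;
    }

    /** Fraction of all relevant documents that were returned. */
    static double recall(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (String doc : ranked) {
            if (relevant.contains(doc)) {
                hits++;
            }
        }
        return (double) hits / relevant.size();
    }

    /** Mean of the precision values at each rank where a relevant document appears. */
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        int hits = 0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return sum / relevant.size();
    }

    public static void main(String[] args) {
        List<String> ranked = Arrays.asList("d1", "d2", "d3", "d4");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3"));
        System.out.println("P@2 = " + precisionAt(2, ranked, relevant));
        System.out.println("recall = " + recall(ranked, relevant));
        System.out.println("AP = " + averagePrecision(ranked, relevant));
    }
}
```

With relevant documents at ranks 1 and 3, the average precision is (1/1 + 2/3) / 2, roughly 0.83.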
361 LUCENE-947: Add support for creating and indexing "one document per
362 line" from a large text file, which reduces per-document overhead of
363 opening a single file for each document.
366 LUCENE-848: Added support for Wikipedia benchmarking.
369 - LUCENE-940: Multi-threaded issues fixed: SimpleDateFormat; logging for addDoc/deleteDoc tasks.
370 - LUCENE-945: tests fail to find data dirs. Added sys-prop benchmark.work.dir and cfg-prop work.dir.
374 - LUCENE-863: Deprecated StandardBenchmarker in favour of byTask code.
379 Better error handling and javadocs around "exhaustive" doc making.
384 1. which HTML Parser is used is configurable with html.parser property.
385 2. External classes added to classpath with -Dbenchmark.ext.classpath=path.
386 3. '*' as repeating number now means "exhaust doc maker - no repetitions".
390 -Moved withRetrieve() call out of the loop in ReadTask
391 -Added SearchTravRetLoadFieldSelectorTask to help benchmark some of the FieldSelector capabilities
392 -Added options to store content bytes on the Reuters Doc (and others, but Reuters is the only one w/ it enabled)
396 Tests (for benchmarking code correctness) were added - LUCENE-840.
397 To be invoked by "ant test" from contrib/benchmark. (Doron Cohen)
401 1. Introduced an AbstractQueryMaker to hold common QueryMaker code. (GSI)
402 2. Added traversalSize parameter to SearchTravRetTask and SearchTravTask. Changed SearchTravRetTask to extend SearchTravTask. (GSI)
403 3. Added FileBasedQueryMaker to run queries from a File or resource. (GSI)
404 4. Modified query-maker generation for read related tasks to make further read tasks addition simpler and safer. (DC)
405 5. Changed Task's setParams() to throw UnsupportedOperationException if that task does not support a command line param. (DC)
406 6. Improved javadoc to specify all properties and command line params currently supported. (DC)
407 7. Refactored ReportTasks so that it is easy/possible now to create new report tasks. (DC)
411 1. Committed Doron Cohen's benchmarking contribution, which provides an easily expandable task based approach to benchmarking. See the javadocs for information. (Doron Cohen via Grant Ingersoll)
415 3. 2/11/07: LUCENE-790 and 788: Fixed Locale issue with date formatter. Fixed some minor issues with benchmarking by task. Added a dependency
416 on the Lucene demo to the build classpath. (Doron Cohen, Grant Ingersoll)
418 4. 2/13/07: LUCENE-801: build.xml now builds Lucene core and Demo first and has classpath dependencies on the output of that build. (Doron Cohen, Grant Ingersoll)