lucene-java-3.4.0/lucene/contrib/benchmark/CHANGES.txt

   1 Lucene Benchmark Contrib Change Log
   2
   3 The Benchmark contrib package contains code for benchmarking Lucene in a variety of ways.
   4
   5 For more information on past and future Lucene versions, please see:
   6 http://s.apache.org/luceneversions
   7
   8 05/25/2011
   9   LUCENE-3137: ExtractReuters supports out-dir param suffixed by a slash. (Doron Cohen)
  10
  11 03/24/2011
  12   LUCENE-2977: WriteLineDocTask now automatically detects how to write -
  13   GZip or BZip2 or Plain-text - according to the output file extension.
  14   Property bzip.compression of WriteLineDocTask was removed. (Doron Cohen)
  15
  16 03/23/2011
  17   LUCENE-2980: Benchmark's ContentSource no longer requires lower case file suffixes
  18   for detecting file type (gzip/bzip2/text). As part of this fix, worked around an
  19   issue with gzip input streams that remained open (see COMPRESS-127).
  20   (Doron Cohen)
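
A minimal Java sketch of case-insensitive file type detection by suffix, as described in LUCENE-2980 and LUCENE-2977. The class name, the enum and the exact suffix list (.gz, .bz2) are assumptions made for illustration; this is not the ContentSource implementation.

```java
import java.util.Locale;

/** Hypothetical helper, for illustration only; not Lucene code. */
final class CompressionBySuffix {

    /** Hypothetical file kinds matching the three cases named in the entry. */
    enum Kind { GZIP, BZIP2, PLAIN }

    /** Picks a kind from the file name suffix, ignoring letter case. */
    static Kind detect(String fileName) {
        // Lower-case with a fixed locale so the result does not depend on the default locale.
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".gz")) {   // assumed gzip suffix
            return Kind.GZIP;
        }
        if (lower.endsWith(".bz2")) {  // assumed bzip2 suffix
            return Kind.BZIP2;
        }
        return Kind.PLAIN;             // anything else is treated as plain text
    }

    public static void main(String[] args) {
        for (String name : new String[] {"docs.txt.GZ", "docs.Bz2", "docs.txt"}) {
            System.out.println(name + " -> " + detect(name));
        }
    }
}
```

Lower-casing once before comparing keeps the suffix checks simple and makes "docs.txt.GZ" and "docs.txt.gz" behave the same.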
  21
  22 03/22/2011
  23   LUCENE-2978: Upgrade benchmark's commons-compress from 1.0 to 1.1 as
  24   the move of gzip decompression in LUCENE-1540 from Java's GZIPInputStream
  25   to commons-compress 1.0 made it 15 times slower. In 1.1 no such slow-down
  26   is observed. (Doron Cohen)
  27
  28 03/21/2011
  29   LUCENE-2958: WriteLineDocTask improvements - allow emitting line docs for empty
  30   docs too, and be flexible about which fields are added to the line file. For this, a header
  31   line was added to the line file. That header is examined by LineDocSource. Old line
  32   files which have no header line are handled as before, imposing the default header.
  33   (Doron Cohen, Shai Erera, Mike McCandless)
  34
  35 03/21/2011
  36   LUCENE-2964: Allow benchmark tasks from alternative packages,
  37   specified through a new property "alt.tasks.packages".
  38   (Doron Cohen, Shai Erera)
  39
  40 03/20/2011
  41   LUCENE-2963: Easier way to run benchmark, by calling Benchmark.exec(alg-file).
  42   (Doron Cohen)
  43
  44 03/10/2011
  45   LUCENE-2961: Removed lib/xml-apis.jar, since JVM 1.5+ already contains the
  46   JAXP 1.3 interface classes it provides.
  47
  48 02/03/2011
  49   LUCENE-1540: Improvements to contrib.benchmark for TREC collections.
  50   ContentSource can now process plain text files, gzip files, and bzip2 files.
  51   TREC doc parsing now handles the TREC gov2 collection and TREC disks 4&5-CR
  52   collection (both used by many TREC tasks). (Shai Erera, Doron Cohen)
  53
  54 01/31/2011
  55   LUCENE-1591: Roll back to xerces-2.9.1-patched-XERCESJ-1257.jar to work around
  56   XERCESJ-1257, which we hit on current Wikipedia XML export
  57   (ENWIKI-20110115-pages-articles.xml) with xerces-2.10.0.jar.  (Mike McCandless)
  58
  59 01/26/2011
  60   LUCENE-929: ExtractReuters first extracts to a tmp dir and then renames. That
  61   way, if a previous extract attempt failed, "ant extract-reuters" will still
  62   extract the files. (Shai Erera, Doron Cohen, Grant Ingersoll)
  63
  64 01/24/2011
  65   LUCENE-2885: Add WaitForMerges task (calls IndexWriter.waitForMerges()).
  66   (Mike McCandless)
  67
  68 10/10/2010
  69   The locally built patched version of the Xerces-J jar introduced
  70   as part of LUCENE-1591 is no longer required, because Xerces
  71   2.10.0, which contains a fix for XERCESJ-1257 (see
  72   http://svn.apache.org/viewvc?view=revision&revision=554069),
  73   was released last year.  Upgraded
  74   xerces-2.9.1-patched-XERCESJ-1257.jar and xml-apis-2.9.0.jar
  75   to xercesImpl-2.10.0.jar and xml-apis-2.10.0.jar. (Steven Rowe)
  76
  77 4/27/2010
  78   LUCENE-2416: WriteLineDocTask now supports multi-threading. Also,
  79   StringBufferReader was renamed to StringBuilderReader and works on
  80   StringBuilder now. In addition, LongToEnglishContentSource starts from 0
  81   (instead of Long.MIN_VALUE+10) and wraps around to Long.MIN_VALUE (if you ever hit
  82   Long.MAX_VALUE). (Shai Erera)
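
A minimal Java sketch of the wrap-around described in LUCENE-2416. It relies only on Java's defined overflow behaviour for long values. The class WrappingCounter is hypothetical and is not the LongToEnglishContentSource code.

```java
/** Hypothetical counter, for illustration only; not Lucene code. */
final class WrappingCounter {
    private long next;

    WrappingCounter(long start) {
        this.next = start;
    }

    /**
     * Returns the current value, then advances by one.
     * Incrementing Long.MAX_VALUE overflows to Long.MIN_VALUE.
     */
    synchronized long nextValue() {
        return next++;
    }

    public static void main(String[] args) {
        // Start at the maximum to make the wrap-around visible immediately.
        WrappingCounter counter = new WrappingCounter(Long.MAX_VALUE);
        System.out.println(counter.nextValue());
        System.out.println(counter.nextValue());
    }
}
```

A counter that starts at 0 needs Long.MAX_VALUE + 1 calls before it wraps, so in practice the wrap is unlikely to be observed.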
  83
  84 4/07/2010
  85   LUCENE-2377: Enable the use of NoMergePolicy and NoMergeScheduler by
  86   CreateIndexTask. (Shai Erera)
  87
  88 3/28/2010
  89   LUCENE-2353: Fixed bug in Config where Windows absolute path property values
  90   were incorrectly handled (Shai Erera)
  91
  92 3/24/2010
  93   LUCENE-2343: Added support for benchmarking collectors. (Grant Ingersoll, Shai Erera)
  94
  95 2/21/2010
  96   LUCENE-2254: Add support to the quality package for running
  97   experiments with any combination of Title, Description, and Narrative.
  98   (Robert Muir)
  99
 100 1/28/2010
 101   LUCENE-2223: Add a benchmark for ShingleFilter. You can wrap any
 102   analyzer with ShingleAnalyzerWrapper and specify shingle parameters
 103   with the NewShingleAnalyzer task.  (Steven Rowe via Robert Muir)
 104
 105 1/14/2010
 106   LUCENE-2210: TrecTopicsReader now properly reads descriptions and
 107   narratives from trec topics files.  (Robert Muir)
 108
 109 1/11/2010
 110   LUCENE-2181: Add a benchmark for collation. This adds NewLocaleTask,
 111   which sets a Locale in the run data for collation to use, and can be
 112   used in the future for benchmarking localized range queries and sorts.
 113   Also add NewCollationAnalyzerTask, which works with both JDK and ICU
 114   Collator implementations. Fix ReadTokensTask to not tokenize fields
 115   unless they should be tokenized according to DocMaker config. The
 116   easiest way to run the benchmark is to run 'ant collation'
 117   (Steven Rowe via Robert Muir)
 118
 119 12/22/2009
 120   LUCENE-2178: Allow multiple locations to add to the class path with
 121   -Dbenchmark.ext.classpath=... when running "ant run-task" (Steven
 122   Rowe via Mike McCandless)
 123
 124 12/17/2009
 125   LUCENE-2168: Allow negative relative thread priority for BG tasks
 126   (Mike McCandless)
 127
 128 12/07/2009
 129   LUCENE-2106: ReadTask does not close its Reader when
 130   OpenReader/CloseReader are not used. (Mark Miller)
 131
 132 11/17/2009
 133   LUCENE-2079: Allow specifying delta thread priority after the "&";
 134   added log.time.step.msec to print per-time-period counts; fixed
 135   NearRealTimeTask to print reopen times (in msec) of each reopen, at
 136   the end.  (Mike McCandless)
 137
 138 11/13/2009
 139   LUCENE-2050: Added ability to run tasks within a serial sequence in
 140   the background, by appending "&".  The tasks are stopped & joined at
 141   the end of the sequence.  Also added Wait and RollbackIndex tasks.
 142   Genericized NearRealTimeReaderTask to only reopen the reader
 143   (previously it spawned its own thread, and also did searching).
 144   Also changed the API of PerfRunData.getIndexReader: it now returns a
 145   reference, and it's your job to decRef the reader when you're done
 146   using it.  (Mike McCandless)
 147
 148 11/12/2009
 149   LUCENE-2059: allow TrecContentSource not to change the docname.
 150   Previously, it would always append the iteration # to the docname.
 151   With the new option content.source.excludeIteration, you can disable this.
 152   The resulting index can then be used with the quality package to measure
 153   relevance. (Robert Muir)
 154
 155 11/12/2009
 156   LUCENE-2058: specify trec_eval submission output from the command line.
 157   Previously, 4 arguments were required, but the third was unused. The
 158   third argument is now the desired location of submission.txt  (Robert Muir)
 159
 160 11/08/2009
 161   LUCENE-2044: Added delete.percent.rand.seed to seed the Random instance
 162   used by DeleteByPercentTask.  (Mike McCandless)
 163
 164 11/07/2009
 165   LUCENE-2043: Fix CommitIndexTask to also commit pending IndexReader
 166   changes (Mike McCandless)
 167
 168 11/07/2009
 169   LUCENE-2042: Added print.hits.field, to print each hit from the
 170   Search* tasks.  (Mike McCandless)
 171
 172 11/04/2009
 173   LUCENE-2029: Added doc.body.stored and doc.body.tokenized; each
 174   falls back to the non-body variant as its default.  (Mike McCandless)
 175
 176 10/28/2009
 177   LUCENE-1994: Fix thread safety of EnwikiContentSource and DocMaker
 178   when doc.reuse.fields is false.  Also made doc.reuse.fields=true
 179   thread safe.  (Mark Miller, Shai Erera, Mike McCandless)
 180
 181 8/4/2009
 182   LUCENE-1770: Add EnwikiQueryMaker (Mark Miller)
 183
 184 8/04/2009
 185   LUCENE-1773: Add FastVectorHighlighter tasks.  This change is a
 186   non-backwards compatible change in how subclasses of ReadTask define
 187   a highlighter.  The methods doHighlight, isMergeContiguousFragments,
 188   maxNumFragments and getHighlighter are no longer used and have been
 189   marked deprecated and package-private, so there's a compile
 190   time error.  Instead, the new getBenchmarkHighlighter method should
 191   return an appropriate highlighter for the task. The configuration of
 192   the highlighter tasks (maxFrags, mergeContiguous, etc.) is now
 193   accepted as params to the task.  (Koji Sekiguchi via Mike McCandless)
 194
 195 8/03/2009
 196   LUCENE-1778: Add support for log.step setting per task type. Previously, if
 197   you included a log.step line in the .alg file, it was applied to all
 198   tasks. Now, you can include a log.step.AddDoc, or log.step.DeleteDoc (for
 199   example) to control logging for just these tasks. If you want to omit logging
 200   for any other task, include log.step=-1. The syntax is "log.step." together
 201   with the Task's 'short' name (i.e., without the 'Task' part).
 202   (Shai Erera via Mark Miller)
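
A minimal Java sketch of the lookup order described in LUCENE-1778: a per-task key built from "log.step." plus the task's short name, falling back to the general log.step value. The class name, the method names and the default of 1000 are assumptions for illustration; this is not the PerfTask implementation.

```java
import java.util.Properties;

/** Hypothetical lookup helper, for illustration only; not Lucene code. */
final class LogStepLookup {

    /** Assumed fallback when neither key is set. */
    static final String ASSUMED_DEFAULT = "1000";

    /** Drops a trailing "Task" from the class name, e.g. AddDocTask becomes AddDoc. */
    static String shortName(String taskClassSimpleName) {
        return taskClassSimpleName.endsWith("Task")
                ? taskClassSimpleName.substring(0, taskClassSimpleName.length() - "Task".length())
                : taskClassSimpleName;
    }

    /** Prefers log.step.ShortName, then log.step, then the assumed default. */
    static int logStep(Properties props, String taskClassSimpleName) {
        String general = props.getProperty("log.step", ASSUMED_DEFAULT);
        String specific = props.getProperty("log.step." + shortName(taskClassSimpleName), general);
        return Integer.parseInt(specific.trim());
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("log.step", "-1");          // no logging for other tasks
        props.setProperty("log.step.AddDoc", "500");  // log every 500 AddDoc calls
        System.out.println(logStep(props, "AddDocTask"));
        System.out.println(logStep(props, "DeleteDocTask"));
    }
}
```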
 203
 204 7/24/2009
 205   LUCENE-1595: Deprecate LineDocMaker and EnwikiDocMaker in favor of
 206   using DocMaker directly, with content.source = LineDocSource or
 207   EnwikiContentSource.  NOTE: with this change, the "id" field from
 208   the Wikipedia XML export is now indexed as the "docname" field
 209   (previously it was indexed as "docid").  Additionally, the
 210   SearchWithSort task now accepts all types that SortField can accept
 211   and no longer falls back to SortField.AUTO, which has been
 212   deprecated. (Mike McCandless)
 213
 214 7/20/2009
 215   LUCENE-1755: Fix WriteLineDocTask to output a document if it contains either
 216   a title or body (or both).  (Shai Erera via Mark Miller)
 217
 218 7/14/2009
 219   LUCENE-1725: Fix the example Sort algorithm - auto is now deprecated and no longer works
 220   with Benchmark. Benchmark will now throw an exception if you specify sort fields without
 221   a type. The example sort algorithm is now typed.  (Mark Miller)
 222
 223 7/6/2009
 224   LUCENE-1730: Fix TrecContentSource to use ISO-8859-1 when reading the TREC files,
 225   unless a different encoding is specified. Additionally, ContentSource now supports
 226   a content.source.encoding parameter in the configuration file.
 227   (Shai Erera via Mark Miller)
 228
 229 6/26/2009
 230   LUCENE-1716: Added the following support:
 231   doc.tokenized.norms: specifies whether to store norms
 232   doc.body.tokenized.norms: special attribute for the body field
 233   doc.index.props: specifies whether DocMaker should index the properties set on
 234   DocData
 235   writer.info.stream: specifies the info stream to set on IndexWriter (supported
 236   values are: SystemOut, SystemErr and a file name). (Shai Erera via Mike McCandless)
 237
 238 6/23/09
 239   LUCENE-1714: WriteLineDocTask incorrectly normalized text, by replacing only
 240   occurrences of "\t" with a space. It now replaces "\r\n" in addition to that,
 241   so that LineDocMaker won't fail. (Shai Erera via Michael McCandless)
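
A minimal Java sketch of the normalization described in LUCENE-1714: tabs, carriage returns and newlines are replaced by a space so that a document fits on one tab-separated line. Collapsing a run of such characters into a single space is an assumption of this sketch, and the class is hypothetical; it is not the WriteLineDocTask code.

```java
/** Hypothetical normalizer, for illustration only; not Lucene code. */
final class LineNormalizer {

    /** Replaces each run of tab, CR or LF characters with one space (assumed behaviour). */
    static String normalize(String text) {
        return text.replaceAll("[\t\r\n]+", " ");
    }

    public static void main(String[] args) {
        // The tab is the field separator in a line file, and a line break ends the document.
        System.out.println(normalize("title\twith tab\r\nand line break"));
    }
}
```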
 242
 243 6/17/09
 244   LUCENE-1595: This issue breaks previous external algorithms. DocMaker has been
 245   replaced with a concrete class which accepts a ContentSource for iterating over
 246   a content source's documents. Most of the old DocMakers were changed to a
 247   ContentSource implementation, and DocMaker is now a default document creation impl
 248   that provides an easy way for reusing fields. When [doc.maker] is not defined in
 249   an algorithm, the new DocMaker is the default. If you have .alg files which
 250   specify a DocMaker (like ReutersDocMaker), you should change the [doc.maker] line to:
 251   [content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource]
 252
 253   i.e.
 254   doc.maker=org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker
 255   becomes
 256   content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
 257
 258   doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker
 259   becomes
 260   content.source=org.apache.lucene.benchmark.byTask.feeds.SingleDocSource
 261
 262   Also, PerfTask now logs a message in tearDown() rather than each Task doing its
 263   own logging. A new setting called [log.step] is consulted to determine how often
 264   to log. [doc.add.log.step] is no longer a valid setting. For easy migration of
 265   current .alg files, rename [doc.add.log.step] to [log.step] and [doc.delete.log.step]
 266   to [delete.log.step].
 267
 268   Additionally, [doc.maker.forever] should be changed to [content.source.forever].
 269   (Shai Erera via Mark Miller)
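
A minimal Java sketch of the property renames listed in LUCENE-1595, applied to a java.util.Properties object. The helper class is hypothetical and covers only the names mentioned in the entry; a real .alg file also contains an algorithm section that this sketch does not handle.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical migration helper, for illustration only; not Lucene code. */
final class AlgPropertyMigration {
    private static final Map<String, String> RENAMED_KEYS = new LinkedHashMap<>();
    private static final Map<String, String> DOC_MAKER_TO_SOURCE = new LinkedHashMap<>();

    static {
        // Key renames named in the entry.
        RENAMED_KEYS.put("doc.add.log.step", "log.step");
        RENAMED_KEYS.put("doc.delete.log.step", "delete.log.step");
        RENAMED_KEYS.put("doc.maker.forever", "content.source.forever");
        // doc.maker values that become content.source values.
        String pkg = "org.apache.lucene.benchmark.byTask.feeds.";
        DOC_MAKER_TO_SOURCE.put(pkg + "ReutersDocMaker", pkg + "ReutersContentSource");
        DOC_MAKER_TO_SOURCE.put(pkg + "SimpleDocMaker", pkg + "SingleDocSource");
    }

    /** Returns a new Properties object with the renames applied; the input is left untouched. */
    static Properties migrate(Properties old) {
        Properties out = new Properties();
        for (String key : old.stringPropertyNames()) {
            String value = old.getProperty(key);
            if (key.equals("doc.maker") && DOC_MAKER_TO_SOURCE.containsKey(value)) {
                out.setProperty("content.source", DOC_MAKER_TO_SOURCE.get(value));
            } else {
                // Unknown keys, and doc.maker values not listed above, are copied as-is.
                out.setProperty(RENAMED_KEYS.getOrDefault(key, key), value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Properties old = new Properties();
        old.setProperty("doc.maker", "org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker");
        old.setProperty("doc.add.log.step", "500");
        migrate(old).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```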
 270
 271 6/12/09
 272   LUCENE-1539: Added DeleteByPercentTask which enables deleting a
 273   percentage of documents and searching on them.  Changed CommitIndex
 274   to optionally accept a label (recorded as userData=<label> in the
 275   commit point).  Added FlushReaderTask, and modified OpenReaderTask
 276   to also optionally take a label referencing a commit point to open.
 277   Also changed default autoCommit (when IndexWriter is opened) to
 278   true. (Jason Rutherglen via Mike McCandless)
 279
 280 12/20/08
 281   LUCENE-1495: Allow task sequence to run for a specified number of seconds by adding ": 2.7s" (for example).
 282
 283 12/16/08
 284   LUCENE-1493: Stop using deprecated Hits API for searching; add new
 285   param search.num.hits to set top N docs to collect.
 286
 287 12/16/08
 288   LUCENE-1492: Added optional readOnly param (default true) to OpenReader task.
 289
 290 9/9/08
 291   LUCENE-1243: Added new sorting benchmark capabilities.  Also added Reopen and commit tasks.  (Mark Miller via Grant Ingersoll)
 292
 293 5/10/08
 294   LUCENE-1090: remove relative path assumptions from benchmark code.
 295   Only build.xml was modified: work-dir definition must remain so
 296   benchmark tests can run from both trunk-home and benchmark-home.
 297
 298 3/9/08
 299   LUCENE-1209: Fixed DocMaker settings by round. Prior to this fix, DocMaker settings of
 300   first round were used in all rounds.  (E.g. term vectors.)
 301   (Mark Miller via Doron Cohen)
 302
 303 1/30/08
 304   LUCENE-1156: Fixed redirect problem in EnwikiDocMaker.  Refactored ExtractWikipedia to use EnwikiDocMaker.  Added property to EnwikiDocMaker to allow
 305   for skipping image only documents.
 306
 307 1/24/2008
 308   LUCENE-1136: add ability to not count sub-task doLogic increment
 309
 310 1/23/2008
 311   LUCENE-1129: ReadTask properly uses the traversalSize value
 312   LUCENE-1128: Added support for benchmarking the highlighter
 313
 314 01/20/08
 315   LUCENE-1139: various fixes
 316   - add merge.scheduler, merge.policy config properties
 317   - refactor Open/CreateIndexTask to share setting config on IndexWriter
 318   - added doc.reuse.fields=true|false for LineDocMaker
 319   - OptimizeTask now takes int param to call optimize(int maxNumSegments)
 320   - CloseIndexTask now takes bool param to call close(false) (abort running merges)
 321
 322
 323 01/03/08
 324   LUCENE-1116: quality package improvements:
 325   - add MRR computation;
 326   - allow control of max #queries to run;
 327   - verify log & report are flushed.
 328   - add TREC query reader for the 1MQ track.
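
A minimal Java sketch of the standard mean reciprocal rank (MRR) measure mentioned in LUCENE-1116: the average over queries of 1/rank of the first relevant hit, counting 0 for a query with no relevant hit. The class and its input representation are assumptions for illustration; this is not the quality package's code.

```java
/** Hypothetical MRR helper, for illustration only; not Lucene code. */
final class MeanReciprocalRank {

    /**
     * @param firstRelevantRanks one entry per query: the 1-based rank of the first
     *        relevant hit, or 0 if the query returned no relevant hit
     * @return the mean of the reciprocal ranks, or 0 for an empty input
     */
    static double mrr(int[] firstRelevantRanks) {
        if (firstRelevantRanks.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (int rank : firstRelevantRanks) {
            if (rank > 0) {
                sum += 1.0 / rank;  // queries without a relevant hit contribute 0
            }
        }
        return sum / firstRelevantRanks.length;
    }

    public static void main(String[] args) {
        // Four queries: first relevant hit at ranks 1, 2, none, and 4.
        System.out.println(mrr(new int[] {1, 2, 0, 4}));
    }
}
```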
 329
 330 12/31/07
 331   LUCENE-1102: EnwikiDocMaker now indexes the docid field, so results might not be comparable with results prior to this change, although
 332   it is doubted that this one small field makes much difference.
 333
 334 12/13/07
 335   LUCENE-1086: DocMakers setup for the "docs.dir" property
 336   fixed to properly handle absolute paths. (Shai Erera via Doron Cohen)
 337
 338 9/18/07
 339   LUCENE-941: infinite loop for alg: {[AddDoc(4000)]: 4} : *
 340   ResetInputsTask fixed to work also after exhaustion.
 341   All Reset Tasks now subclass ResetInputsTask.
 342
 343 8/9/07
 344   LUCENE-971: Change enwiki tasks to a doc maker (extending
 345   LineDocMaker) that directly processes the Wikipedia XML and produces
 346   documents.  Intermediate files (one per document) are no longer
 347   created.
 348
 349 8/1/07
 350   LUCENE-967: Add "ReadTokensTask" to allow for benchmarking just tokenization.
 351
 352 7/27/07
 353   LUCENE-836: Add support for search quality benchmarking, running
 354   a set of queries against a searcher, and, optionally produce a submission
 355   report, and, if query judgements are available, compute quality measures:
 356   recall, precision_at_N, average_precision, MAP. TREC specific Judge (based
 357   on TREC QRels) and TREC Topics reader are included in o.a.l.benchmark.quality.trec
 358   but any other format of queries and judgements can be implemented and used.
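
A minimal Java sketch of two of the measures named in LUCENE-836, using their standard textbook definitions: precision at N, and average precision for a single query (MAP is the mean of average precision over all queries). The class and the boolean-array input are assumptions for illustration; this is not the quality package's code.

```java
/** Hypothetical ranking measures, for illustration only; not Lucene code. */
final class RankingMeasures {

    /** Fraction of the top n results that are relevant; relevant[i] describes rank i + 1. */
    static double precisionAt(boolean[] relevant, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        int hits = 0;
        for (int i = 0; i < n && i < relevant.length; i++) {
            if (relevant[i]) {
                hits++;
            }
        }
        return (double) hits / n;
    }

    /**
     * Sum of the precision at each relevant result, divided by the total number of
     * relevant documents known from the judgements.
     */
    static double averagePrecision(boolean[] relevant, int totalRelevant) {
        if (totalRelevant <= 0) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < relevant.length; i++) {
            if (relevant[i]) {
                hits++;
                sum += (double) hits / (i + 1);  // precision at this rank
            }
        }
        return sum / totalRelevant;
    }

    public static void main(String[] args) {
        boolean[] ranking = {true, false, true, false};
        System.out.println(precisionAt(ranking, 2));
        System.out.println(averagePrecision(ranking, 2));
    }
}
```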
 359
 360 7/24/07
 361   LUCENE-947: Add support for creating and indexing "one document per
 362   line" from a large text file, which reduces per-document overhead of
 363   opening a single file for each document.
 364
 365 6/30/07
 366   LUCENE-848: Added support for Wikipedia benchmarking.
 367
 368 6/25/07
 369 - LUCENE-940: Multi-threaded issues fixed: SimpleDateFormat; logging for addDoc/deleteDoc tasks.
 370 - LUCENE-945: tests fail to find data dirs. Added sys-prop benchmark.work.dir and cfg-prop work.dir.
 371 (Doron Cohen)
 372
 373 4/17/07
 374 - LUCENE-863: Deprecated StandardBenchmarker in favour of byTask code.
 375   (Otis Gospodnetic)
 376
 377 4/13/07
 378
 379 Better error handling and javadocs around "exhaustive" doc making.
 380
 381 3/25/07
 382
 383 LUCENE-849:
 384 1. which HTML Parser is used is configurable with html.parser property.
 385 2. External classes added to classpath with -Dbenchmark.ext.classpath=path.
 386 3. '*' as repeating number now means "exhaust doc maker - no repetitions".
 387
 388 3/22/07
 389
 390 -Moved withRetrieve() call out of the loop in ReadTask
 391 -Added SearchTravRetLoadFieldSelectorTask to help benchmark some of the FieldSelector capabilities
 392 -Added options to store content bytes on the Reuters Doc (and others, but Reuters is the only one w/ it enabled)
 393
 394 3/21/07
 395
 396 Tests (for benchmarking code correctness) were added - LUCENE-840.
 397 To be invoked by "ant test" from contrib/benchmark. (Doron Cohen)
 398
 399 3/19/07
 400
 401 1. Introduced an AbstractQueryMaker to hold common QueryMaker code. (GSI)
 402 2. Added traversalSize parameter to SearchTravRetTask and SearchTravTask.  Changed SearchTravRetTask to extend SearchTravTask. (GSI)
 403 3. Added FileBasedQueryMaker to run queries from a File or resource. (GSI)
 404 4. Modified query-maker generation for read related tasks to make further read tasks addition simpler and safer. (DC)
 405 5. Changed Tasks' setParams() to throw UnsupportedOperationException if that task does not support a command line param. (DC)
 406 6. Improved javadoc to specify all properties and command line params currently supported. (DC)
 407 7. Refactored ReportTasks so that it is easy/possible now to create new report tasks. (DC)
 408
 409 01/09/07
 410
 411 1. Committed Doron Cohen's benchmarking contribution, which provides an easily expandable task based approach to benchmarking.  See the javadocs for information. (Doron Cohen via Grant Ingersoll)
 412
 413 2. Added this file.
 414
 415 3. 2/11/07: LUCENE-790 and 788:  Fixed Locale issue with date formatter. Fixed some minor issues with benchmarking by task.  Added a dependency
 416  on the Lucene demo to the build classpath.  (Doron Cohen, Grant Ingersoll)
 417
 418 4. 2/13/07: LUCENE-801: build.xml now builds Lucene core and Demo first and has classpath dependencies on the output of that build.  (Doron Cohen, Grant Ingersoll)