Benchmarking Lucene By Tasks

This package provides "task based" performance benchmarking of Lucene. One can use the predefined benchmarks, or create new ones.

Contained packages:

Package         Description
-------         -----------
stats           Statistics maintained when running benchmark tasks.
tasks           Benchmark tasks.
feeds           Sources for benchmark inputs: documents and queries.
utils           Utilities used for the benchmark and for the reports.
programmatic    Sample performance test written programmatically.

Table Of Contents

1. Benchmarking By Tasks
2. How to use
3. Benchmark "algorithm"
4. Supported tasks/commands
5. Benchmark properties
6. Example input algorithm and the result benchmark report
7. Results record counting clarified

Benchmarking By Tasks

Benchmark Lucene using task primitives.

A benchmark is composed of some predefined tasks that allow creating an index, adding documents, optimizing, searching, generating reports, and more. A benchmark run takes an "algorithm" file that contains a description of the sequence of tasks making up the run, and some properties defining a few additional characteristics of the benchmark run.


How to use

The easiest way to run a benchmark is by using the predefined ant task.

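For example, to run the sample algorithm shipped under conf (the same command appears with the sample report below):

  ant run-task -Dtask.alg=conf/sample.alg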

You may find the existing tasks sufficient for defining the benchmark you need; otherwise, you can extend the framework to meet your needs, as explained herein.

Each benchmark run has a DocMaker and a QueryMaker. These two should usually match, so that "meaningful" queries are used for a certain collection. Properties set at the header of the alg file define which "makers" should be used. You can also specify your own makers, extending DocMaker and implementing QueryMaker.

Note: since 2.9, DocMaker is a concrete class which accepts a ContentSource. In most cases, you can use the DocMaker class to create Documents, while providing your own ContentSource implementation. For example, the current Benchmark package includes ContentSource implementations for TREC, Enwiki and Reuters collections, as well as others like LineDocSource which reads a 'line' file produced by WriteLineDocTask.
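A minimal sketch of such a header (the feeds classes named here are from this package, but the exact property names and pairing should be treated as an assumption and checked against the conf examples):

  content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.DocMaker
  query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker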

The benchmark .alg file contains the benchmark "algorithm". The syntax is described below. Within the algorithm, you can specify groups of commands, assign them names, specify commands that should be repeated, run commands in serial or in parallel, and also control the speed of "firing" the commands.

This allows, for instance, specifying that an index should be opened for update; that documents should be added to it one by one, but no faster than 20 docs a minute; and that, in parallel with this, some N queries should be searched against that index, again at no more than 2 queries a second. You can have the searches all share an index reader, or have each of them open its own reader and close it afterwards.

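A sketch of that scenario in algorithm syntax (untested; the counts and rates are illustrative):

  OpenIndex
  [
      { AddDoc } : 100 : 20/min
      { Search } : 50 : 2/sec
  ]
  CloseIndex

Here the two serial sequences run in parallel threads, each throttled by its own rate.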
If the commands available for use in the algorithm do not meet your needs, you can add commands by adding a new task under org.apache.lucene.benchmark.byTask.tasks - you should extend the PerfTask abstract class. Make sure that your new task class name is suffixed by Task. Assume you added the class "WonderfulTask" - doing so also enables the command "Wonderful" to be used in the algorithm.

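For example, once the (hypothetical) class org.apache.lucene.benchmark.byTask.tasks.WonderfulTask exists, the new command can be used like any built-in one:

  { "TryWonderful" Wonderful } : 10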
External classes: It is sometimes useful to invoke the benchmark package with your own external alg file that configures the use of your own doc/query maker and/or html parser. You can do this without modifying the benchmark package code, by passing your class path with the benchmark.ext.classpath property.

External tasks: When writing your own tasks under a package other than org.apache.lucene.benchmark.byTask.tasks, specify that package through the alt.tasks.packages property.
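A sketch of both mechanisms (the paths and package name are placeholders):

  ant run-task -Dtask.alg=conf/my-test.alg -Dbenchmark.ext.classpath=/path/to/my/classes

and, in the .alg file header:

  alt.tasks.packages=com.example.benchmark.tasks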

Benchmark "algorithm"

The following is an informal description of the supported syntax. (A combined sketch of several of these elements appears after the list.)
1. Measuring: When a command is executed, statistics for the elapsed execution time and memory consumption are collected. At any time, those statistics can be printed, using one of the available ReportTasks.

2. Comments start with '#'.

3. Serial sequences are enclosed within '{ }'.

4. Parallel sequences are enclosed within '[ ]'.

5. Sequence naming: To name a sequence, put '"name"' just after '{' or '['.
   Example - { "ManyAdds" AddDoc } : 1000000 - would name the sequence of 1M add docs "ManyAdds", and this name would later appear in statistic reports. If you do not specify a name for a sequence, it is given one: you can see it as the algorithm is printed just before benchmark execution starts.

6. Repeating: To repeat sequence tasks N times, add ': N' just after the sequence closing tag - '}' or ']' or '>'.
   Example - [ AddDoc ] : 4 - would do 4 addDoc in parallel, spawning 4 threads at once.
   Example - [ AddDoc AddDoc ] : 4 - would do 8 addDoc in parallel, spawning 8 threads at once.
   Example - { AddDoc } : 30 - would do addDoc 30 times in a row.
   Example - { AddDoc AddDoc } : 30 - would do addDoc 60 times in a row.
   Exhaustive repeating: use '*' instead of a number to repeat exhaustively. This is sometimes useful for adding as many files as a doc maker can create, without iterating over the same file again - especially when the exact number of documents is not known in advance, for instance TREC files extracted from a zip file. Note: when using this, you must also set doc.maker.forever to false.
   Example - { AddDoc } : * - would add docs until the doc maker is "exhausted".

7. Command parameter: a command can optionally take a single parameter. If a certain command does not support a parameter, or if the parameter is of the wrong type, reading the algorithm will fail with an exception and the test will not start. Currently the following tasks take optional parameters:
   - AddDoc takes a numeric parameter, indicating the required size of the added document. Note: if the DocMaker implementation used in the test does not support makeDoc(size), an exception will be thrown and the test will fail.
   - DeleteDoc takes a numeric parameter, indicating the docid to be deleted. The latter is not very useful for loops, since the docid is fixed, so for deletion in loops it is better to use the doc.delete.step property.
   - SetProp takes a mandatory name,value parameter, with ',' used as a separator.
   - SearchTravRetTask and SearchTravTask take a numeric parameter, indicating the required traversal size.
   - SearchTravRetLoadFieldSelectorTask takes a string parameter: a comma separated list of Fields to load.
   - SearchTravRetHighlighterTask takes a string parameter: a comma separated list of parameters to define highlighting. See that task's javadocs for more information.
   Example - AddDoc(2000) - would add a document of size 2000 (~bytes). See conf/task-sample.alg for how this can be used, for instance, to check which is faster: adding many smaller documents, or fewer larger documents. Next candidates for supporting a parameter may be the Search tasks, for controlling the query size.

8. Statistic recording elimination: a sequence can also end with '>', in which case child tasks would not store their statistics. This can be useful to avoid exploding stats data, for adding say 1M docs.
   Example - { "ManyAdds" AddDoc > : 1000000 - would add a million docs, measure that total, but not save stats for each addDoc.
   Notice that the granularity of System.currentTimeMillis() (which is used here) is system dependent, and in some systems an operation that takes 5 ms to complete may show 0 ms latency time in performance measurements. Therefore it is sometimes more accurate to look at the elapsed time of a larger sequence, as demonstrated here.

9. Rate: To set a rate (ops/sec or ops/min) for a sequence, add ': N : R' just after the sequence closing tag. This specifies a repetition of N with a rate of R operations/sec. Use 'R/sec' or 'R/min' to explicitly specify whether the rate is per second or per minute; the default is per second.
   Example - [ AddDoc ] : 400 : 3 - would do 400 addDoc in parallel, starting up to 3 threads per second.
   Example - { AddDoc } : 100 : 200/min - would do 100 addDoc serially, waiting before starting the next add if the rate would otherwise exceed 200 adds/min.

10. Disable counting: Each task executed contributes to the records count. This count is reflected in reports under recs/s and under recsPerRun. Most tasks count 1, some count 0, and some count more. (See Results record counting clarified for more details.) It is possible to disable counting for a task by preceding it with '-'.
    Example - -CreateIndex - would count 0, while the default behavior for CreateIndex is to count 1.

11. Command names: Each class "AnyNameTask" in the package org.apache.lucene.benchmark.byTask.tasks that extends PerfTask is supported as a command "AnyName" that can be used in the benchmark "algorithm" description. This allows adding new commands by simply adding such classes.
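Putting several of these elements together, here is a small illustrative fragment (the commands are real, but the counts and names are arbitrary, and this is a sketch rather than a tested algorithm):

  # counted as 0 records (disabled counting):
  -CreateIndex
  # named serial sequence; child task stats are not stored ('>'):
  { "ManyAdds" AddDoc > : 1000
  # serial adds, throttled to 200 adds/min:
  { AddDoc } : 100 : 200/min
  # two AddDoc tasks repeated twice in parallel - 4 threads at once:
  [ "ParallelAdds" AddDoc AddDoc ] : 2
  CloseIndex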

Supported tasks/commands

Existing tasks can be divided into a few groups: regular index/search work tasks, report tasks, and control tasks. (A short sketch combining some of these appears after the list.)

1. Report tasks: There are a few Report commands for generating reports. Only task runs that were completed are reported. (The 'Report tasks' themselves are not measured and not reported.)
   - RepAll - all (completed) task runs.
   - RepSumByName - all statistics, aggregated by name. So, if AddDoc was executed 2000 times, only 1 report line would be created for it, aggregating all those 2000 statistic records.
   - RepSelectByPref prefixWord - all records for tasks whose name starts with prefixWord.
   - RepSumByPref prefixWord - all records for tasks whose name starts with prefixWord, aggregated by their full task name.
   - RepSumByNameRound - all statistics, aggregated by name and by round. So, if AddDoc was executed 2000 times in each of 3 rounds, 3 report lines would be created for it, aggregating all those 2000 statistic records in each round. See more about rounds in the NewRound command description below.
   - RepSumByPrefRound prefixWord - similar to RepSumByNameRound, except that only tasks whose name starts with prefixWord are included.
   If needed, additional reports can be added by extending the abstract class ReportTask, and by manipulating the statistics data in Points and TaskStats.

2. Control tasks: A few of the tasks control the benchmark algorithm overall:
   - ClearStats - clears the entire statistics. Further reports would only include task runs that start after this call.
   - NewRound - virtually starts a new round of the performance test. Although this command can be placed anywhere, it mostly makes sense at the end of an outermost sequence. It increments a global "round counter". All task runs that start from this point record the new, updated round counter as their round number, and this appears in reports - in particular, see RepSumByNameRound above. An additional effect of NewRound is that numeric and boolean properties defined (at the head of the .alg file) as a sequence of values, e.g. merge.factor=mrg:10:100:10:100, increment (cyclically) to the next value. Note: this is also reflected in the reports, in this case under a column named "mrg".
   - ResetInputs - the DocMaker and the various QueryMakers reset their counters to the start. The way these Maker interfaces work, each call to makeDocument() or makeQuery() creates the next document or query that it "knows" to create. If that pool is "exhausted", the "maker" starts over again. The ResetInputs command therefore allows making the rounds comparable; it is useful to invoke ResetInputs together with NewRound.
   - ResetSystemErase - resets all index and input data and calls gc. Does NOT reset statistics. This includes ResetInputs. All writers/readers are nullified, deleted, closed. The index is erased. The directory is erased. You would have to call CreateIndex once this was called.
   - ResetSystemSoft - resets all index and input data and calls gc. Does NOT reset statistics. This includes ResetInputs. All writers/readers are nullified, closed. The index is NOT erased. The directory is NOT erased. This is useful for testing performance on an existing index, for instance when the construction of a large index took a very long time and you would now like to test its search or update performance.

3. Other existing tasks are quite straightforward and are just briefly described here:
   - CreateIndex and OpenIndex both leave the index open for later update operations. CloseIndex would close it.
   - OpenReader, similarly, would leave an index reader open for later search operations. But this has further semantics: if a Read operation is performed and an open reader exists, it would be used. Otherwise, the read operation would open its own reader and close it when the read operation is done. This allows testing various scenarios - sharing a reader, searching with a "cold" reader, with a "warmed" reader, etc. The read operations affected by this are: Warm, Search, SearchTrav (search and traverse), and SearchTravRet (search and traverse and retrieve). Notice that each of the 3 search task types maintains its own queryMaker instance.
   - CommitIndex and Optimize can be used to commit changes to the index and/or optimize the index created thus far.
   - WriteLineDoc prepares a 'line' file where each line holds a document with title, date and body elements, separated by [TAB]. A line file is useful if one wants to measure pure indexing performance, without the overhead of parsing the data. You can use LineDocSource as a ContentSource over a 'line' file.
   - ConsumeContentSource consumes a ContentSource. Useful, for example, for testing a ContentSource's performance, without the overhead of preparing a Document out of it.
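A minimal sketch tying rounds, resets, and per-round reporting together, modeled on the sample algorithm below (the repeat counts are arbitrary):

  {
      { "Populate"
          CreateIndex
          { AddDoc } : 1000
          CloseIndex
      }
      ResetSystemErase
      NewRound
  } : 3
  RepSumByNameRound

This would produce one "Populate" report line per round, making the three rounds directly comparable.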

Benchmark properties

Properties are read from the header of the .alg file, and define several parameters of the performance test. As mentioned above for the NewRound task, numeric and boolean properties that are defined as a sequence of values, e.g. merge.factor=mrg:10:100:10:100, increment (cyclically) to the next value when NewRound is called, and also appear as a named column in the reports (the column name would be "mrg" in this example).
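For example, with this header line, the value in effect changes at each NewRound (a sketch of the cycling behavior described above):

  merge.factor=mrg:10:100:10:100
  # round 0 -> 10, round 1 -> 100, round 2 -> 10, round 3 -> 100,
  # and then the cycle repeats; reports gain a column named "mrg".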

Some of the currently defined properties are:

1. analyzer - full class name for the analyzer to use. The same analyzer would be used in the entire test.

2. directory - valid values are FSDirectory and RAMDirectory. This tells which directory to use for the performance test.

3. Index work parameters: Multi int/boolean values would be iterated with calls to NewRound. These would also be added as columns in the reports; the first string in the sequence is the column name. (Make sure it is no shorter than any value in the sequence.)
   - max.buffered
     Example: max.buffered=buf:10:10:100:100 - this would define using maxBufferedDocs of 10 in iterations 0 and 1, and 100 in iterations 2 and 3.
   - merge.factor - which merge factor to use.
   - compound - whether the index is using the compound format or not. Valid values are "true" and "false".

Here is a list of currently defined properties:
1. Root directory for data and indexes:
   - work.dir (default is System property "benchmark.work.dir" or "work")

2. Docs and queries creation:
   - analyzer
   - doc.maker
   - doc.maker.forever
   - html.parser
   - doc.stored
   - doc.tokenized
   - doc.term.vector
   - doc.term.vector.positions
   - doc.term.vector.offsets
   - doc.store.body.bytes
   - docs.dir
   - query.maker
   - file.query.maker.file
   - file.query.maker.default.field
   - search.num.hits

3. Logging:
   - log.step
   - log.step.[class name]Task, i.e. log.step.DeleteDoc (e.g. log.step.Wonderful for the WonderfulTask example above)
   - log.queries
   - task.max.depth.log

4. Index writing:
   - compound
   - merge.factor
   - max.buffered
   - directory
   - ram.flush.mb

5. Doc deletion:
   - doc.delete.step

6. Task alternative packages:
   - alt.tasks.packages - comma separated list of additional packages where task classes will be looked for when not found in the default package (that of PerfTask). If the same task class appears in more than one package, the package indicated first in this list will be used.

For sample use of these properties, see the *.alg files under conf.

Example input algorithm and the result benchmark report

The following example is in conf/sample.alg:

# --------------------------------------------------------
#
# Sample: what is the effect of doc size on indexing time?
#
# There are two parts in this test:
# - PopulateShort adds 2N documents of length  L
# - PopulateLong  adds  N documents of length 2L
# Which one would be faster?
# The comparison is done twice.
#
# --------------------------------------------------------

# -------------------------------------------------------------------------------------
# multi val params are iterated by NewRound's, added to reports, start with column name.
merge.factor=mrg:10:20
max.buffered=buf:100:1000
compound=true

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory

doc.stored=true
doc.tokenized=true
doc.term.vector=false
doc.add.log.step=500

docs.dir=reuters-out

doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker

query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker

# task at this depth or less would print when they start
task.max.depth.log=2

log.queries=false
# -------------------------------------------------------------------------------------
{

    { "PopulateShort"
        CreateIndex
        { AddDoc(4000) > : 20000
        Optimize
        CloseIndex
    >

    ResetSystemErase

    { "PopulateLong"
        CreateIndex
        { AddDoc(8000) > : 10000
        Optimize
        CloseIndex
    >

    ResetSystemErase

    NewRound

} : 2

RepSumByName
RepSelectByPref Populate

The command line for running this sample:

  ant run-task -Dtask.alg=conf/sample.alg

The output report from running this test contains the following:

  Operation     round mrg  buf   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
  PopulateShort     0  10  100        1        20003        119.6      167.26    12,959,120     14,241,792
  PopulateLong      0  10  100        1        10003         74.3      134.57    17,085,208     20,635,648
  PopulateShort     1  20 1000        1        20003        143.5      139.39    63,982,040     94,756,864
  PopulateLong      1  20 1000        1        10003         77.0      129.92    87,309,608    100,831,232

Results record counting clarified

Two columns in the results table indicate record counts: records-per-run and records-per-second. What do they mean?

Almost every task gets 1 in this count just for being executed. Task sequences aggregate the counts of their child tasks, plus their own count of 1. So, a task sequence containing 5 other task sequences, each running a single other task 10 times, would have a count of 1 + 5 * (1 + 10) = 56.
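As a sketch of that arithmetic in algorithm form (the task choice is arbitrary):

  # each inner sequence counts 1 + 10 = 11; the outer sequence adds its own 1:
  {
      { AddDoc } : 10
      { AddDoc } : 10
      { AddDoc } : 10
      { AddDoc } : 10
      { AddDoc } : 10
  }
  # total for the outer sequence: 1 + 5 * 11 = 56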

The traverse and retrieve tasks "count" more: a traverse task would add 1 for each traversed result (hit), and a retrieve task would additionally add 1 for each retrieved doc. So, a regular Search would count 1, a SearchTrav that traverses 10 hits would count 11, and a SearchTravRet task that retrieves (and traverses) 10 would count 21.

Confusing? This might help: always examine the elapsedSec column, and always compare "apples to apples", i.e. it is interesting to check how the rec/s changed for the same task (or sequence) between two different runs, but it is not very useful to know how the rec/s differs between Search and SearchTrav tasks. For the latter, elapsedSec would bring more insight.
