1 <?xml version="1.0"?>
2
3 <document>
4     <header>
5         <title>
6             Apache Lucene - Index File Formats
7         </title>
8     </header>
9
10     <body>
11         <section id="Index File Formats"><title>Index File Formats</title>
12
13             <p>
14                 This document defines the index file formats used
15                 in this version of Lucene. If you are using a different
16                 version of Lucene, please consult the copy of
17                 <code>docs/fileformats.html</code>
18                 that was distributed
19                 with the version you are using.
20             </p>
21
22             <p>
23                 Apache Lucene is written in Java, but several
24                 efforts are underway to write
25                 <a href="http://wiki.apache.org/lucene-java/LuceneImplementations">versions
26                     of Lucene in other programming
27                 languages</a>.  If these versions are to remain compatible with Apache
28                 Lucene, then a language-independent definition of the Lucene index
29                 format is required.  This document thus attempts to provide a
30                 complete and independent definition of the Apache Lucene file
31                 formats.
32             </p>
33
34             <p>
35                 As Lucene evolves, this document should evolve.
36                 Versions of Lucene in different programming languages should endeavor
37                 to agree on file formats, and generate new versions of this document.
38             </p>
39
40             <p>
41                 Compatibility notes are provided in this document,
42                 describing how file formats have changed from prior versions.
43             </p>
44
45             <p>
46                 In version 2.1, the file format was changed to allow
47                 lock-less commits (i.e., no more commit lock). The
48                 change is fully backwards compatible: you can open a
49                 pre-2.1 index for searching or for adding and deleting
50                 documents. When the new segments file is saved
51                 (committed), it will be written in the new file format
52                 (meaning no specific "upgrade" process is needed).
53                 Note, however, that once a commit has occurred, pre-2.1
54                 Lucene will not be able to read the index.
55             </p>
56
57             <p>
58                 In version 2.3, the file format was changed to allow
59                 segments to share a single set of doc store (vectors &amp;
60                 stored fields) files.  This allows for faster indexing
61                 in certain cases.  The change is fully backwards
62                 compatible (in the same way as the lock-less commits
63                 change in 2.1).
64             </p>
65
66             <p>
67                 In version 2.4, strings are written as true UTF-8
68                 byte sequences, not Java's modified UTF-8.  See issue
69                 LUCENE-510 for details.
70             </p>
71
72             <p>
73                 In version 2.9, an optional opaque Map&lt;String,String&gt;
74                 CommitUserData may be passed to IndexWriter's commit
75                 methods (and later retrieved); it is recorded in
76                 the segments_N file.  See issue LUCENE-1382 for
77                 details.  Also, diagnostics were added to each segment
78                 written, recording details about why it was written
79                 (due to flush, merge; which OS/JRE was used; etc.).
80                 See issue LUCENE-1654 for details.
81             </p>
82             
83             <p>
84                 In version 3.0, compressed fields are no longer
85                 written to the index (they can still be read, but on
86                 merge the new segment will write them,
87                 uncompressed). See issue LUCENE-1960 for details.
88             </p>
89
90         <p>
91             In version 3.1, segments record the code version
92             that created them. See LUCENE-2720 for details.
93             
94             Additionally, segments explicitly track whether or
95             not they have term vectors. See LUCENE-2811 for details.
96            </p>
97         <p>
98             In version 3.2, numeric fields are written natively
99             to the stored fields file; previously they were stored in
100             text format only.
101            </p>
102         <p>
103             In version 3.4, fields can omit position data while
104             still indexing term frequencies.
105         </p>
106         </section>
107
108         <section id="Definitions"><title>Definitions</title>
109
110             <p>
111                 The fundamental concepts in Lucene are index,
112                 document, field and term.
113             </p>
114
115
116             <p>
117                 An index contains a sequence of documents.
118             </p>
119
120             <ul>
121                 <li>
122                     <p>
123                         A document is a sequence of fields.
124                     </p>
125                 </li>
126
127                 <li>
128                     <p>
129                         A field is a named sequence of terms.
130                     </p>
131                 </li>
132
133                 <li>
134                     A term is a string.
135                 </li>
136             </ul>
137
138             <p>
139                 The same string in two different fields is
140                 considered a different term.  Thus terms are represented as a pair of
141                 strings, the first naming the field, and the second naming text
142                 within the field.
143             </p>
144
145             <section id="Inverted Indexing"><title>Inverted Indexing</title>
146
147                 <p>
148                     The index stores statistics about terms in order
149                     to make term-based search more efficient.  Lucene's
150                     index falls into the family of indexes known as <i>inverted
151                         indexes</i>. This is because it can list, for a term, the documents that contain
152                     it.  This is the inverse of the natural relationship, in which
153                     documents list terms.
154                 </p>
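                <p>
                    The inversion described above can be sketched in a few lines of Java.
                    This is a minimal in-memory illustration only: the class
                    <code>InvertedIndexSketch</code> and its methods are hypothetical
                    names, not part of Lucene, and the sketch says nothing about the
                    on-disk layout defined later in this document. A term is modelled
                    as the pair of field name and text, joined here with a colon for brevity.
                </p>
                <source><![CDATA[
```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not Lucene's API or file format.
class InvertedIndexSketch {
    // Maps a term, written as "field:text", to the numbers of the documents
    // that contain it. (Assumes field names contain no ':' character.)
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDoc = 0;

    // Adds one document, given as field name -> term texts, and returns
    // the document number assigned to it (0, 1, 2, ...).
    int addDocument(Map<String, List<String>> fields) {
        int doc = nextDoc++;
        for (Map.Entry<String, List<String>> field : fields.entrySet()) {
            for (String text : field.getValue()) {
                List<Integer> docs = postings.computeIfAbsent(
                        field.getKey() + ":" + text, k -> new ArrayList<>());
                // Record each document at most once per term.
                if (docs.isEmpty() || docs.get(docs.size() - 1) != doc) {
                    docs.add(doc);
                }
            }
        }
        return doc;
    }

    // The inverted lookup: from a term to the documents that contain it.
    List<Integer> documentsContaining(String field, String text) {
        return postings.getOrDefault(field + ":" + text, Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        Map<String, List<String>> doc = new HashMap<>();
        doc.put("body", Collections.singletonList("lucene"));
        index.addDocument(doc);
        System.out.println(index.documentsContaining("body", "lucene"));
    }
}
```
]]></source>
                <p>
                    Because the field name is part of the key, the same text in two
                    different fields yields two separate entries, matching the
                    definition of a term given earlier.
                </p>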
155             </section>
156             <section id="Types of Fields">
157                 <title>Types of Fields</title>
158                 <p>
159                     In Lucene, fields may be <i>stored</i>, in which
160                     case their text is stored in the index literally, in a non-inverted
161                     manner.  Fields that are inverted are called <i>indexed</i>. A field
162                     may be both stored and indexed.</p>
163
164                 <p>The text of a field may be <i>tokenized</i> into terms to be
165                     indexed, or the text of a field may be used literally as a term to be indexed.
166                     Most fields are
167                     tokenized, but sometimes it is useful for certain identifier fields
168                     to be indexed literally.
169                 </p>
170                 <p>See the <a href="api/core/org/apache/lucene/document/Field.html">Field</a> Javadocs for more information on fields.</p>
171             </section>
172
173             <section id="Segments"><title>Segments</title>
174
175                 <p>
176                     Lucene indexes may be composed of multiple sub-indexes, or
177                     <i>segments</i>. Each segment is a fully independent index, which could be searched
178                     separately. Indexes evolve by:
179                 </p>
180
181                 <ol>
182                     <li>
183                         <p>Creating new segments for newly added documents.</p>
184                     </li>
185                     <li>
186                         <p>Merging existing segments.</p>
187                     </li>
188                 </ol>
189
190                 <p>
191                     Searches may involve multiple segments and/or multiple indexes, each
192                     index potentially composed of a set of segments.
193                 </p>
194             </section>
195
196             <section id="Document Numbers"><title>Document Numbers</title>
197
198                 <p>
199                     Internally, Lucene refers to documents by an integer <i>document
200                         number</i>. The first document added to an index is numbered zero, and each
201                     subsequent document added gets a number one greater than the previous.
202                 </p>
203
207
208                 <p>
209                     Note that a document's number may change, so caution should be taken
210                     when storing these numbers outside of Lucene. In particular, numbers may
211                     change in the following situations:
212                 </p>
213
214
215                 <ul>
216                     <li>
217                         <p>
218                             The
219                             numbers stored in each segment are unique only within the segment,
220                             and must be converted before they can be used in a larger context.
221                             The standard technique is to allocate each segment a range of
222                             values, based on the range of numbers used in that segment.  To
223                             convert a document number from a segment to an external value, the
224                             segment's <i>base</i> document
225                             number is added.  To convert an external value back to a
226                             segment-specific value, the segment is identified by the range that
227                             the external value is in, and the segment's base value is
228                             subtracted.  For example, two five-document segments might be
229                             combined, so that the first segment has a base value of zero, and
230                             the second a base value of five.  Document three from the second segment would
231                             have an external value of eight.
232                         </p>
233                     </li>
234                     <li>
235                         <p>
236                             When documents are deleted, gaps are created
237                             in the numbering. These are eventually removed as the index evolves
238                             through merging. Deleted documents are dropped when segments are
239                             merged. A freshly-merged segment thus has no gaps in its numbering.
240                         </p>
241                     </li>
242                 </ul>
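                <p>
                    The base-value conversion described in the first item can be sketched
                    as follows. The class <code>DocBaseSketch</code> and its method names
                    are hypothetical and chosen for illustration; the sketch assumes the
                    segments are combined in a fixed order and that each segment's
                    document count is known.
                </p>
                <source><![CDATA[
```java
// Hypothetical helper; class and method names are illustrative, not Lucene API.
class DocBaseSketch {
    // bases[i] is the external number of document 0 in segment i.
    private final int[] bases;
    private final int total;

    DocBaseSketch(int[] segmentSizes) {
        bases = new int[segmentSizes.length];
        int next = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            bases[i] = next;
            next += segmentSizes[i];
        }
        total = next;
    }

    // Segment-local number -> external value: add the segment's base.
    int toExternal(int segment, int docInSegment) {
        return bases[segment] + docInSegment;
    }

    // Identify the segment by the range that the external value falls in.
    int segmentOf(int external) {
        if (external < 0 || external >= total) {
            throw new IllegalArgumentException("no such document: " + external);
        }
        int segment = bases.length - 1;
        while (bases[segment] > external) {
            segment--;
        }
        return segment;
    }

    // External value -> segment-local number: subtract the segment's base.
    int toSegmentLocal(int external) {
        return external - bases[segmentOf(external)];
    }

    public static void main(String[] args) {
        // Two five-document segments, as in the example in the text.
        DocBaseSketch index = new DocBaseSketch(new int[] {5, 5});
        System.out.println(index.toExternal(1, 3));
    }
}
```
]]></source>
                <p>
                    With two five-document segments, document three of the second segment
                    maps to the external value eight, and eight maps back to document
                    three of the second segment.
                </p>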
243
244             </section>
245
246         </section>
247
248         <section id="Overview"><title>Overview</title>
249
250             <p>
251                 Each segment index maintains the following:
252             </p>
253             <ul>
254                 <li>
255                     <p>Field names. This
256                         contains the set of field names used in the index.
257
258                     </p>
259                 </li>
260                 <li>
261                     <p>Stored Field
262                         values. This contains, for each document, a list of attribute-value
263                         pairs, where the attributes are field names. These are used to
264                         store auxiliary information about the document, such as its title,
265                         URL, or an identifier to access a
266                         database. The set of stored fields is what is returned for each hit
267                         when searching. This is keyed by document number.
268                     </p>
269                 </li>
270                 <li>
271                     <p>Term dictionary.
272                         A dictionary containing all of the terms used in all of the indexed
273                         fields of all of the documents. The dictionary also contains the
274                         number of documents which contain the term, and pointers to the
275                         term's frequency and proximity data.
276                     </p>
277                 </li>
278
279                 <li>
280                     <p>Term Frequency
281                         data. For each term in the dictionary, the numbers of all the
282                         documents that contain that term, and the frequency of the term in
283                         each of those documents, unless frequencies are omitted (IndexOptions.DOCS_ONLY).
284                     </p>
285                 </li>
286
287                 <li>
288                     <p>Term Proximity
289                         data. For each term in the dictionary, the positions at which the term
290                         occurs in each document.  Note that this will
291                         not exist if all fields in all documents omit position data.
292                     </p>
293                 </li>
294
295                 <li>
296                     <p>Normalization
297                         factors. For each field in each document, a value is stored that is
298                         multiplied into the score for hits on that field.
299                     </p>
300                 </li>
301                 <li>
302                     <p>Term Vectors. For each field in each document, the term vector
303                         (sometimes called document vector) may be stored. A term vector consists
304                         of term text and term frequency. To add term vectors to your index, see the
305                         <a href="api/core/org/apache/lucene/document/Field.html">Field</a>
306                         constructors.
307                     </p>
308                 </li>
309                 <li>
310                     <p>Deleted documents.
311                         An optional file indicating which documents are deleted.
312                     </p>
313                 </li>
314             </ul>
315
316             <p>Details on each of these are provided in subsequent sections.
317             </p>
318         </section>
319
320         <section id="File Naming"><title>File Naming</title>
321
322             <p>
323                 All files belonging to a segment have the same name with varying
324                 extensions. The extensions correspond to the different file formats
325                 described below. When using the Compound File format (default in 1.4 and greater), these files are
326                 collapsed into a single .cfs file (see below for details).
327             </p>
328
329             <p>
330                 Typically, all segments
331                 in an index are stored in a single directory, although this is not
332                 required.
333             </p>
334
335             <p>
336                 As of version 2.1 (lock-less commits), file names are
337                 never re-used (there is one exception, "segments.gen",
338                 see below). That is, when any file is saved to the
339                 Directory it is given a never-before-used file name.
340                 This is achieved using a simple generations approach.
341                 For example, the first segments file is segments_1,
342                 then segments_2, etc. The generation is a sequential
343                 long integer represented in alphanumeric (base 36)
344                 form.
344                 form.
345             </p>
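            <p>
                As a rough sketch of the generations approach, the following Java
                fragment derives a segments file name from a generation number using
                the standard <code>Long.toString(long, int)</code> method with radix 36.
                The class <code>SegmentsFileNameSketch</code> and its methods are
                hypothetical names used for illustration; lower-case digits are an
                assumption that follows from how the JDK formats base-36 numbers.
            </p>
            <source><![CDATA[
```java
// Hypothetical helper; names are illustrative, not Lucene API.
class SegmentsFileNameSketch {
    private static final String PREFIX = "segments_";

    // Formats the generation in base 36 (digits 0-9, then a-z).
    static String segmentsFileName(long generation) {
        if (generation < 1) {
            throw new IllegalArgumentException("generation must be positive");
        }
        return PREFIX + Long.toString(generation, Character.MAX_RADIX);
    }

    // Recovers the generation from a file name produced above.
    static long generationOf(String fileName) {
        if (!fileName.startsWith(PREFIX)) {
            throw new IllegalArgumentException("not a segments_N file: " + fileName);
        }
        return Long.parseLong(fileName.substring(PREFIX.length()), Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        for (long generation = 1; generation <= 3; generation++) {
            System.out.println(segmentsFileName(generation));
        }
    }
}
```
]]></source>
            <p>
                Under this scheme generation 10 yields segments_a and generation 36
                yields segments_10, so names stay short even after many commits.
            </p>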
346
347         </section>
348       <section id="file-names"><title>Summary of File Extensions</title>
349         <p>The following table summarizes the names and extensions of the files in Lucene:
350           <table>
351             <tr>
352               <th>Name</th>
353               <th>Extension</th>
354               <th>Brief Description</th>
355             </tr>
356             <tr>
357               <td><a href="#Segments File">Segments File</a></td>
358               <td>segments.gen, segments_N</td>
359               <td>Stores information about segments</td>
360             </tr>
361             <tr>
362               <td><a href="#Lock File">Lock File</a></td>
363               <td>write.lock</td>
364               <td>The Write lock prevents multiple IndexWriters from writing to the same file.</td>
365             </tr>
366             <tr>
367               <td><a href="#Compound Files">Compound File</a></td>
368               <td>.cfs</td>
369               <td>An optional "virtual" file consisting of all the other index files for systems
370               that frequently run out of file handles.</td>
371             </tr>
372             <tr>
373               <td><a href="#Compound Files">Compound File Entry Table</a></td>
374               <td>.cfe</td>
375               <td>The "virtual" compound file's entry table holding all entries in the corresponding .cfs file (Since 3.4)</td>
376             </tr>
377             <tr>
378               <td><a href="#Fields">Fields</a></td>
379               <td>.fnm</td>
380               <td>Stores information about the fields</td>
381             </tr>
382             <tr>
383               <td><a href="#field_index">Field Index</a></td>
384               <td>.fdx</td>
385               <td>Contains pointers to field data</td>
386             </tr>
387             <tr>
388               <td><a href="#field_data">Field Data</a></td>
389               <td>.fdt</td>
390               <td>The stored fields for documents</td>
391             </tr>
392             <tr>
393               <td><a href="#tis">Term Infos</a></td>
394               <td>.tis</td>
395               <td>Part of the term dictionary, stores term info</td>
396             </tr>
397             <tr>
398               <td><a href="#tii">Term Info Index</a></td>
399               <td>.tii</td>
400               <td>The index into the Term Infos file</td>
401             </tr>
402             <tr>
403               <td><a href="#Frequencies">Frequencies</a></td>
404               <td>.frq</td>
405               <td>Contains the list of docs which contain each term along with frequency</td>
406             </tr>
407             <tr>
408               <td><a href="#Positions">Positions</a></td>
409               <td>.prx</td>
410               <td>Stores position information about where a term occurs in the index</td>
411             </tr>
412             <tr>
413               <td><a href="#Normalization Factors">Norms</a></td>
414               <td>.nrm</td>
415               <td>Encodes length and boost factors for docs and fields</td>
416             </tr>
417             <tr>
418               <td><a href="#tvx">Term Vector Index</a></td>
419               <td>.tvx</td>
420               <td>Stores offset into the document data file</td>
421             </tr>
422             <tr>
423               <td><a href="#tvd">Term Vector Documents</a></td>
424               <td>.tvd</td>
425               <td>Contains information about each document that has term vectors</td>
426             </tr>
427             <tr>
428               <td><a href="#tvf">Term Vector Fields</a></td>
429               <td>.tvf</td>
430               <td>The field level info about term vectors</td>
431             </tr>
432             <tr>
433               <td><a href="#Deleted Documents">Deleted Documents</a></td>
434               <td>.del</td>
435               <td>Info about which documents are deleted</td>
436             </tr>
437           </table>
438
439         </p>
440       </section>
441
442         <section id="Primitive Types"><title>Primitive Types</title>
443
444             <section id="Byte"><title>Byte</title>
445
446                 <p>
447                     The most primitive type
448                     is an eight-bit byte. Files are accessed as sequences of bytes. All
449                     other data types are defined as sequences
450                     of bytes, so file formats are byte-order independent.
451                 </p>
452
453             </section>
454
455             <section id="UInt32"><title>UInt32</title>
456
457                 <p>
458                     32-bit unsigned integers are written as four
459                     bytes, high-order bytes first.
460                 </p>
461                 <p>
462                     UInt32    --&gt; &lt;Byte&gt;<sup>4</sup>
463                 </p>
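                <p>
                    A minimal Java sketch of this layout follows. Java has no unsigned
                    32-bit type, so the sketch assumes the value is carried in a
                    <code>long</code> in the range 0 to 4,294,967,295. The class
                    <code>UInt32Sketch</code> is a hypothetical name, not part of Lucene.
                </p>
                <source><![CDATA[
```java
// Hypothetical helper; names are illustrative, not Lucene API.
class UInt32Sketch {
    // Writes the value as four bytes, high-order byte first.
    static byte[] write(long value) {
        if (value < 0 || value > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("not a 32-bit unsigned value: " + value);
        }
        return new byte[] {
            (byte) (value >>> 24),
            (byte) (value >>> 16),
            (byte) (value >>> 8),
            (byte) value
        };
    }

    // Reads four bytes, high-order byte first, into a non-negative long.
    static long read(byte[] b) {
        if (b.length != 4) {
            throw new IllegalArgumentException("expected exactly four bytes");
        }
        return ((b[0] & 0xFFL) << 24)
             | ((b[1] & 0xFFL) << 16)
             | ((b[2] & 0xFFL) << 8)
             |  (b[3] & 0xFFL);
    }

    public static void main(String[] args) {
        System.out.println(read(write(4294967295L)));
    }
}
```
]]></source>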
464
465             </section>
466
467             <section id="Uint64"><title>UInt64</title>
468
469                 <p>
470                     64-bit unsigned integers are written as eight
471                     bytes, high-order bytes first.
472                 </p>
473
474                 <p>UInt64    --&gt; &lt;Byte&gt;<sup>8</sup>
475                 </p>
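                <p>
                    In Java, one way to obtain this layout is
                    <code>java.nio.ByteBuffer</code>, whose initial byte order is
                    big-endian. The sketch below assumes the 64 bits are carried in a
                    signed <code>long</code> and reinterpreted as unsigned only when
                    printed; <code>UInt64Sketch</code> is a hypothetical name, not part
                    of Lucene.
                </p>
                <source><![CDATA[
```java
import java.nio.ByteBuffer;

// Hypothetical helper; names are illustrative, not Lucene API.
class UInt64Sketch {
    // Writes the 64 bits as eight bytes, high-order byte first.
    // A freshly allocated ByteBuffer uses big-endian order.
    static byte[] write(long bits) {
        return ByteBuffer.allocate(8).putLong(bits).array();
    }

    // Reads eight bytes, high-order byte first, back into a long.
    static long read(byte[] eightBytes) {
        if (eightBytes.length != 8) {
            throw new IllegalArgumentException("expected exactly eight bytes");
        }
        return ByteBuffer.wrap(eightBytes).getLong();
    }

    public static void main(String[] args) {
        long allOnes = read(write(-1L));
        // Interpret the same 64 bits as an unsigned quantity for display.
        System.out.println(Long.toUnsignedString(allOnes));
    }
}
```
]]></source>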
476
477             </section>
478
479             <section id="VInt"><title>VInt</title>
480
481                 <p>
482                     A variable-length format for non-negative integers is
483                     defined where the high-order bit of each byte indicates whether more
484                     bytes remain to be read. The low-order seven bits are appended as
485                     increasingly more significant bits in the resulting integer value.
486                     Thus values from zero to 127 may be stored in a single byte, values
487                     from 128 to 16,383 may be stored in two bytes, and so on.
488                 </p>
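                <p>
                    The encoding just described can be sketched in Java as follows. This
                    is an illustration under stated assumptions rather than Lucene's own
                    code: the class <code>VIntSketch</code> is a hypothetical name, the
                    value is assumed to fit in a non-negative <code>int</code>, and the
                    decoder assumes its input starts with one complete encoded value.
                </p>
                <source><![CDATA[
```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper; names are illustrative, not Lucene API.
class VIntSketch {
    // Emits the low-order seven bits first; the high-order bit of each
    // byte is set when more bytes follow.
    static byte[] encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are outside this sketch");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Appends each byte's low seven bits as increasingly significant bits
    // and stops at the first byte whose high-order bit is clear.
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        for (int value : new int[] {0, 127, 128, 16383, 16384}) {
            System.out.println(value + " takes " + encode(value).length + " byte(s)");
        }
    }
}
```
]]></source>
                <p>
                    For instance, 128 encodes to the two bytes 10000000 00000001: the
                    first byte carries the low seven bits (all zero) with the
                    continuation bit set, and the second carries the remaining bit.
                </p>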
489
490                 <p>
491                     <b>VInt Encoding Example</b>
492                 </p>
493
494                 <table width="100%" border="0" cellpadding="4" cellspacing="0">
495                     <col width="64*"/>
496                     <col width="64*"/>
497                     <col width="64*"/>
498                     <col width="64*"/>
499                     <tr valign="TOP">
500                         <td width="25%">
501                             <p align="RIGHT">
502                                 <b>Value</b>
503                             </p>
504                         </td>
505                         <td width="25%">
506                             <p align="RIGHT">
507                                 <b>First byte</b>
508                             </p>
509                         </td>
510                         <td width="25%">
511                             <p align="RIGHT">
512                                 <b>Second byte</b>
513                             </p>
514                         </td>
515                         <td width="25%">
516                             <p align="RIGHT">
517                                 <b>Third byte</b>
518                             </p>
519                         </td>
520                     </tr>
521                     <tr valign="BOTTOM">
522                         <td width="25%" sdval="0" sdnum="1033;0;#,##0">
523                             <p align="RIGHT">0
524                             </p>
525                         </td>
526                         <td width="25%" sdval="0" sdnum="1033;0;00000000">
527                             <p class="western" align="RIGHT" style="margin-left: 0.11cm;
528                                margin-right: 0.01cm">
529                                 00000000
530                             </p>
531                         </td>
532                         <td width="25%" sdnum="1033;0;00000000">
533                             <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
534                                0.01cm">
535                                 <br/>
536
537                             </p>
538                         </td>
539                         <td width="25%" sdnum="1033;0;00000000">
540                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
541                                0.01cm">
542                                 <br/>
543
544                             </p>
545                         </td>
546                     </tr>
547                     <tr valign="BOTTOM">
548                         <td width="25%" sdval="1" sdnum="1033;0;#,##0">
549                             <p align="RIGHT">1
550                             </p>
551                         </td>
552                         <td width="25%" sdval="1" sdnum="1033;0;00000000">
553                             <p class="western" align="RIGHT" style="margin-left: 0.11cm;
554                                margin-right: 0.01cm">
555                                 00000001
556                             </p>
557                         </td>
558                         <td width="25%" sdnum="1033;0;00000000">
559                             <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
560                                0.01cm">
561                                 <br/>
562
563                             </p>
564                         </td>
565                         <td width="25%" sdnum="1033;0;00000000">
566                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
567                                0.01cm">
568                                 <br/>
569
570                             </p>
571                         </td>
572                     </tr>
573                     <tr valign="BOTTOM">
574                         <td width="25%" sdval="2" sdnum="1033;0;#,##0">
575                             <p align="RIGHT">2
576                             </p>
577                         </td>
578                         <td width="25%" sdval="10" sdnum="1033;0;00000000">
579                             <p class="western" align="RIGHT" style="margin-left: 0.11cm;
580                                margin-right: 0.01cm">
581                                 00000010
582                             </p>
583                         </td>
584                         <td width="25%" sdnum="1033;0;00000000">
585                             <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
586                                0.01cm">
587                                 <br/>
588
589                             </p>
590                         </td>
591                         <td width="25%" sdnum="1033;0;00000000">
592                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
593                                0.01cm">
594                                 <br/>
595
596                             </p>
597                         </td>
598                     </tr>
599                     <tr>
600                         <td width="25%" valign="TOP">
601                             <p align="RIGHT">...
602                             </p>
603                         </td>
604                         <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
605                             <p align="RIGHT" style="margin-left: 0.11cm; margin-right:
606                                0.01cm">
607                                 <br/>
608
609                             </p>
610                         </td>
611                         <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
612                             <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
613                                0.01cm">
614                                 <br/>
615
616                             </p>
617                         </td>
618                         <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
619                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
620                                0.01cm">
621                                 <br/>
622
623                             </p>
624                         </td>
625                     </tr>
626                     <tr valign="BOTTOM">
627                         <td width="25%" sdval="127" sdnum="1033;0;#,##0">
628                             <p align="RIGHT">127
629                             </p>
630                         </td>
631                         <td width="25%" sdval="1111111" sdnum="1033;0;00000000">
632                             <p class="western" align="RIGHT" style="margin-left: 0.11cm;
633                                margin-right: 0.01cm">
634                                 01111111
635                             </p>
636                         </td>
637                         <td width="25%" sdnum="1033;0;00000000">
638                             <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
639                                0.01cm">
640                                 <br/>
641
642                             </p>
643                         </td>
644                         <td width="25%" sdnum="1033;0;00000000">
645                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
646                                0.01cm">
647                                 <br/>
648
649                             </p>
650                         </td>
651                     </tr>
652                     <tr valign="BOTTOM">
653                         <td width="25%" sdval="128" sdnum="1033;0;#,##0">
654                             <p align="RIGHT">128
655                             </p>
656                         </td>
657                         <td width="25%" sdval="10000000" sdnum="1033;0;00000000">
658                             <p class="western" align="RIGHT" style="margin-left: 0.11cm;
659                                margin-right: 0.01cm">
660                                 10000000
661                             </p>
662                         </td>
663                         <td width="25%" sdval="1" sdnum="1033;0;00000000">
664                             <p class="western" align="RIGHT" style="margin-left: -0.07cm;
665                                margin-right: 0.01cm">
666                                 00000001
667                             </p>
668                         </td>
669                         <td width="25%" sdnum="1033;0;00000000">
670                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
671                                0.01cm">
672                                 <br/>
673
674                             </p>
675                         </td>
676                     </tr>
677                     <tr valign="BOTTOM">
678                         <td width="25%" sdval="129" sdnum="1033;0;#,##0">
679                             <p align="RIGHT">129
680                             </p>
681                         </td>
682                         <td width="25%" sdval="10000001" sdnum="1033;0;00000000">
683                             <p class="western" align="RIGHT" style="margin-left: 0.11cm;
684                                margin-right: 0.01cm">
685                                 10000001
686                             </p>
687                         </td>
688                         <td width="25%" sdval="1" sdnum="1033;0;00000000">
689                             <p class="western" align="RIGHT" style="margin-left: -0.07cm;
690                                margin-right: 0.01cm">
691                                 00000001
692                             </p>
693                         </td>
694                         <td width="25%" sdnum="1033;0;00000000">
695                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
696                                0.01cm">
697                                 <br/>
698
699                             </p>
700                         </td>
701                     </tr>
702                     <tr valign="BOTTOM">
703                         <td width="25%" sdval="130" sdnum="1033;0;#,##0">
704                             <p align="RIGHT">130
705                             </p>
706                         </td>
707                         <td width="25%" sdval="10000010" sdnum="1033;0;00000000">
708                             <p class="western" align="RIGHT" style="margin-left: 0.11cm;
709                                margin-right: 0.01cm">
710                                 10000010
711                             </p>
712                         </td>
713                         <td width="25%" sdval="1" sdnum="1033;0;00000000">
714                             <p class="western" align="RIGHT" style="margin-left: -0.07cm;
715                                margin-right: 0.01cm">
716                                 00000001
717                             </p>
718                         </td>
719                         <td width="25%" sdnum="1033;0;00000000">
720                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
721                                0.01cm">
722                                 <br/>
723
724                             </p>
725                         </td>
726                     </tr>
727                     <tr>
728                         <td width="25%" valign="TOP">
729                             <p align="RIGHT">...
730                             </p>
731                         </td>
732                         <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
733                             <p align="RIGHT" style="margin-left: 0.11cm; margin-right:
734                                0.01cm">
735                                 <br/>
736
737                             </p>
738                         </td>
739                         <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
740                             <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
741                                0.01cm">
742                                 <br/>
743
744                             </p>
745                         </td>
746                         <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
747                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
748                                0.01cm">
749                                 <br/>
750
751                             </p>
752                         </td>
753                     </tr>
754                     <tr valign="BOTTOM">
755                         <td width="25%" sdval="16383" sdnum="1033;0;#,##0">
756                             <p align="RIGHT">16,383
757                             </p>
758                         </td>
759                         <td width="25%" sdval="11111111" sdnum="1033;0;00000000">
760                             <p class="western" align="RIGHT" style="margin-left: 0.11cm;
761                                margin-right: 0.01cm">
762                                 11111111
763                             </p>
764                         </td>
765                         <td width="25%" sdval="1111111" sdnum="1033;0;00000000">
766                             <p class="western" align="RIGHT" style="margin-left: -0.07cm;
767                                margin-right: 0.01cm">
768                                 01111111
769                             </p>
770                         </td>
771                         <td width="25%" sdnum="1033;0;00000000">
772                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
773                                0.01cm">
774                                 <br/>
775
776                             </p>
777                         </td>
778                     </tr>
779                     <tr valign="BOTTOM">
780                         <td width="25%" sdval="16384" sdnum="1033;0;#,##0">
781                             <p align="RIGHT">16,384
782                             </p>
783                         </td>
784                         <td width="25%" sdval="10000000" sdnum="1033;0;00000000">
785                             <p class="western" align="RIGHT" style="margin-left: 0.11cm;
786                                margin-right: 0.01cm">
787                                 10000000
788                             </p>
789                         </td>
790                         <td width="25%" sdval="10000000" sdnum="1033;0;00000000">
791                             <p class="western" align="RIGHT" style="margin-left: -0.07cm;
792                                margin-right: 0.01cm">
793                                 10000000
794                             </p>
795                         </td>
796                         <td width="25%" sdval="1" sdnum="1033;0;00000000">
797                             <p class="western" align="RIGHT" style="margin-left: -0.47cm;
798                                margin-right: 0.01cm">
799                                 00000001
800                             </p>
801                         </td>
802                     </tr>
803                     <tr valign="BOTTOM">
804                         <td width="25%" sdval="16385" sdnum="1033;0;#,##0">
805                             <p align="RIGHT">16,385
806                             </p>
807                         </td>
808                         <td width="25%" sdval="10000001" sdnum="1033;0;00000000">
809                             <p class="western" align="RIGHT" style="margin-left: 0.11cm;
810                                margin-right: 0.01cm">
811                                 10000001
812                             </p>
813                         </td>
814                         <td width="25%" sdval="10000000" sdnum="1033;0;00000000">
815                             <p class="western" align="RIGHT" style="margin-left: -0.07cm;
816                                margin-right: 0.01cm">
817                                 10000000
818                             </p>
819                         </td>
820                         <td width="25%" sdval="1" sdnum="1033;0;00000000">
821                             <p class="western" align="RIGHT" style="margin-left: -0.47cm;
822                                margin-right: 0.01cm">
823                                 00000001
824                             </p>
825                         </td>
826                     </tr>
827                     <tr>
828                         <td width="25%" valign="TOP">
829                             <p align="RIGHT">...
830                             </p>
831                         </td>
832                         <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
833                             <p class="western" align="RIGHT" style="margin-left: 0.11cm;
834                                margin-right: 0.01cm">
835                                 <br/>
836
837                             </p>
838                         </td>
839                         <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
840                             <p class="western" align="RIGHT" style="margin-left: -0.07cm;
841                                margin-right: 0.01cm">
842                                 <br/>
843
844                             </p>
845                         </td>
846                         <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
847                             <p class="western" align="RIGHT" style="margin-left: -0.47cm;
848                                margin-right: 0.01cm">
849                                 <br/>
850
851                             </p>
852                         </td>
853                     </tr>
854                 </table>
855
856                 <p>
857                     This provides compression while still being
858                     efficient to decode.
859                 </p>
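                <p>
                    The byte sequences in the table above can be reproduced with a short
                    sketch. The function names <code>encode_vint</code> and
                    <code>decode_vint</code> are illustrative only and are not part of
                    any Lucene API; the sketch assumes non-negative values.
                </p>
<source><![CDATA[
```python
def encode_vint(value):
    """Encode a non-negative integer as VInt bytes, low-order seven bits first."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while value >= 0x80:
        # Emit the low seven bits with the high bit set: more bytes follow.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # the last byte has its high bit clear
    return bytes(out)


def decode_vint(data, pos=0):
    """Decode one VInt starting at data[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7
```
]]></source>
                <p>
                    For example, <code>encode_vint(16384)</code> yields the three bytes
                    10000000 10000000 00000001 shown in the table.
                </p>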
860
861             </section>
862
863             <section id="Chars"><title>Chars</title>
864
865                 <p>
                    Lucene writes Unicode
867                     character sequences as UTF-8 encoded bytes.
868                 </p>
869
870
871             </section>
872
873             <section id="String"><title>String</title>
874
875                 <p>
876                     Lucene writes strings as UTF-8 encoded bytes.
877                     First the length, in bytes, is written as a VInt,
878                     followed by the bytes.
879                 </p>
880
881                 <p>
882                     String --&gt; VInt, Chars
883                 </p>
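                <p>
                    As a minimal sketch of this rule, the helpers below write the UTF-8
                    byte length as a VInt and then the bytes themselves. All function
                    names are hypothetical and chosen for illustration.
                </p>
<source><![CDATA[
```python
def write_vint(value):
    """Encode a non-negative integer as a VInt."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def read_vint(data, pos):
    """Decode one VInt at data[pos]; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


def write_string(text):
    """String --> VInt (length in bytes), Chars (UTF-8 bytes)."""
    raw = text.encode("utf-8")
    return write_vint(len(raw)) + raw


def read_string(data, pos=0):
    """Read one String at data[pos]; return (text, next_pos)."""
    length, pos = read_vint(data, pos)
    end = pos + length
    return data[pos:end].decode("utf-8"), end
```
]]></source>
                <p>
                    Note that the length prefix counts bytes, not characters, so a
                    single non-ASCII character such as U+00E9 is written with a length
                    of 2.
                </p>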
884
885             </section>
886         </section>
887
888         <section id="Compound Types"><title>Compound Types</title>
889             <section id="MapStringString"><title>Map&lt;String,String&gt;</title>
890
891                 <p>
                    In a couple of places, Lucene stores a Map
                    String-&gt;String.
894                 </p>
895
896                 <p>
                    Map&lt;String,String&gt; --&gt; Count, &lt;String,String&gt;<sup>Count</sup>
898                 </p>
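                <p>
                    The sketch below serializes such a map as a count followed by that
                    many key/value String pairs. This section does not state the type of
                    Count; the sketch assumes a 32-bit big-endian integer, which should be
                    checked against the implementation. The function names are
                    hypothetical.
                </p>
<source><![CDATA[
```python
import struct


def write_vint(value):
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def read_vint(data, pos):
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


def write_string(text):
    raw = text.encode("utf-8")
    return write_vint(len(raw)) + raw


def read_string(data, pos):
    length, pos = read_vint(data, pos)
    return data[pos:pos + length].decode("utf-8"), pos + length


def write_map(mapping):
    # ASSUMPTION: Count is written as a 32-bit big-endian integer.
    out = bytearray(struct.pack(">i", len(mapping)))
    for key, value in mapping.items():
        out += write_string(key) + write_string(value)
    return bytes(out)


def read_map(data, pos=0):
    (count,) = struct.unpack_from(">i", data, pos)
    pos += 4
    result = {}
    for _ in range(count):
        key, pos = read_string(data, pos)
        value, pos = read_string(data, pos)
        result[key] = value
    return result, pos
```
]]></source>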
899
900             </section>
901
902         </section>
903
904         <section id="Per-Index Files"><title>Per-Index Files</title>
905
906             <p>
907                 The files in this section exist one-per-index.
908             </p>
909
910             <section id="Segments File"><title>Segments File</title>
911
912                 <p>
913                     The active segments in the index are stored in the
914                     segment info file,
915                     <tt>segments_N</tt>.
916                     There may
917                     be one or more
918                     <tt>segments_N</tt>
919                     files in the
                    index; however, the one with the largest
                    generation is the active one (older
                    segments_N files are present only when they
                    temporarily cannot be deleted, a writer is in
                    the process of committing, or a custom
925                     <a href="api/core/org/apache/lucene/index/IndexDeletionPolicy.html">IndexDeletionPolicy</a>
926                     is in use). This file lists each
927                     segment by name, has details about the separate
928                     norms and deletion files, and also contains the
929                     size of each segment.
930                 </p>
931
932                 <p>
933                     As of 2.1, there is also a file
934                     <tt>segments.gen</tt>.
935                     This file contains the
936                     current generation (the
937                     <tt>_N</tt>
938                     in
939                     <tt>segments_N</tt>)
940                     of the index. This is
941                     used only as a fallback in case the current
942                     generation cannot be accurately determined by
943                     directory listing alone (as is the case for some
944                     NFS clients with time-based directory cache
                    expiration). This file simply contains an Int32
946                     version header (SegmentInfos.FORMAT_LOCKLESS =
947                     -2), followed by the generation recorded as Int64,
948                     written twice.
949                 </p>
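                <p>
                    A reader of <tt>segments.gen</tt> might be sketched as follows. The
                    sketch assumes big-endian (high-order byte first) fixed-width
                    integers and treats a mismatch between the two generation copies as
                    an unusable file; that policy and the function name
                    <code>read_segments_gen</code> are assumptions made for illustration.
                </p>
<source><![CDATA[
```python
import struct

FORMAT_LOCKLESS = -2  # version header value given in the text above


def read_segments_gen(data):
    """Return the generation stored in segments.gen bytes, or None if unusable."""
    if len(data) < 20:  # Int32 header plus two Int64 generations
        return None
    version, gen0, gen1 = struct.unpack(">iqq", data[:20])
    if version != FORMAT_LOCKLESS:
        return None
    # The generation is written twice; only trust it when both copies agree.
    if gen0 != gen1:
        return None
    return gen0
```
]]></source>
                <p>
                    Because this file is only a fallback, a caller would still compare
                    the result with the generation found by listing the directory.
                </p>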
950                 <p>
951                     <b>3.1</b>
952                     Segments --&gt; Format, Version, NameCounter, SegCount, &lt;SegVersion, SegName, SegSize, DelGen, DocStoreOffset, [DocStoreSegment, DocStoreIsCompoundFile], HasSingleNormFile, NumField,
953                     NormGen<sup>NumField</sup>,
954                     IsCompoundFile, DeletionCount, HasProx, Diagnostics, HasVectors&gt;<sup>SegCount</sup>, CommitUserData, Checksum
955                 </p>
956
957                 <p>
958                     Format, NameCounter, SegCount, SegSize, NumField,
959                     DocStoreOffset, DeletionCount --&gt; Int32
960                 </p>
961
962                <p>
963                     Version, DelGen, NormGen, Checksum --&gt; Int64
964                 </p>
965
966                 <p>
967                    SegVersion, SegName, DocStoreSegment --&gt; String
968                 </p>
969
970                 <p>
971                    Diagnostics --&gt; Map&lt;String,String&gt;
972                 </p>
973
974                 <p>
975                     IsCompoundFile, HasSingleNormFile,
976                     DocStoreIsCompoundFile, HasProx, HasVectors --&gt; Int8
977                 </p>
978
979                 <p>
980                     CommitUserData --&gt; Map&lt;String,String&gt;
981                 </p>
982
983                 <p>
984                     Format is -9 (SegmentInfos.FORMAT_DIAGNOSTICS).
985                 </p>
986
987                 <p>
988                     Version counts how often the index has been
989                     changed by adding or deleting documents.
990                 </p>
991
992                 <p>
993                     NameCounter is used to generate names for new segment files.
994                 </p>
995
996                 <p>
997                     SegVersion is the code version that created the segment.
998                 </p>
999
1000                 <p>
1001                     SegName is the name of the segment, and is used as the file name prefix
1002                     for all of the files that compose the segment's index.
1003                 </p>
1004
1005                 <p>
1006                     SegSize is the number of documents contained in the segment index.
1007                 </p>
1008
1009                 <p>
1010                     DelGen is the generation count of the separate
1011                     deletes file. If this is -1, there are no
1012                     separate deletes. If it is 0, this is a pre-2.1
                    segment and you must check the filesystem for the
1014                     existence of _X.del. Anything above zero means
1015                     there are separate deletes (_X_N.del).
1016                 </p>
1017
1018                 <p>
1019                     NumField is the size of the array for NormGen, or
1020                     -1 if there are no NormGens stored.
1021                 </p>
1022
1023                 <p>
1024                     NormGen records the generation of the separate
1025                     norms files. If NumField is -1, there are no
1026                     normGens stored and they are all assumed to be 0
1027                     when the segment file was written pre-2.1 and all
1028                     assumed to be -1 when the segments file is 2.1 or
1029                     above. The generation then has the same meaning
1030                     as delGen (above).
1031                 </p>
1032
1033                 <p>
1034                     IsCompoundFile records whether the segment is
1035                     written as a compound file or not. If this is -1,
1036                     the segment is not a compound file. If it is 1,
                    the segment is a compound file. Otherwise it is 0,
                    which means the filesystem must be checked to see
                    whether _X.cfs exists.
1040                 </p>
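                <p>
                    DelGen and IsCompoundFile are both tri-state values whose 0 case
                    defers to the filesystem. The sketch below spells out the three
                    cases for each field; the function names and returned labels are
                    hypothetical.
                </p>
<source><![CDATA[
```python
def deletes_state(del_gen):
    """Classify a DelGen value as described above."""
    if del_gen == -1:
        return "no separate deletes"
    if del_gen == 0:
        return "pre-2.1 segment: check filesystem for _X.del"
    if del_gen > 0:
        return "separate deletes in _X_N.del"
    raise ValueError("unexpected DelGen: %d" % del_gen)


def compound_state(is_compound_file):
    """Classify an IsCompoundFile value as described above."""
    if is_compound_file == -1:
        return "not a compound file"
    if is_compound_file == 1:
        return "compound file"
    if is_compound_file == 0:
        return "check filesystem for _X.cfs"
    raise ValueError("unexpected IsCompoundFile: %d" % is_compound_file)
```
]]></source>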
1041
1042                 <p>
1043                     If HasSingleNormFile is 1, then the field norms are
1044                     written as a single joined file (with extension
1045                     <tt>.nrm</tt>); if it is 0 then each field's norms
1046                     are stored as separate <tt>.fN</tt> files.  See
1047                     "Normalization Factors" below for details.
1048                 </p>
1049
1050                 <p>
1051                     DocStoreOffset, DocStoreSegment,
1052                     DocStoreIsCompoundFile: If DocStoreOffset is -1,
1053                     this segment has its own doc store (stored fields
1054                     values and term vectors) files and DocStoreSegment
1055                     and DocStoreIsCompoundFile are not stored.  In
1056                     this case all files for stored field values
1057                     (<tt>*.fdt</tt> and <tt>*.fdx</tt>) and term
1058                     vectors (<tt>*.tvf</tt>, <tt>*.tvd</tt> and
1059                     <tt>*.tvx</tt>) will be stored with this segment.
1060                     Otherwise, DocStoreSegment is the name of the
1061                     segment that has the shared doc store files;
1062                     DocStoreIsCompoundFile is 1 if that segment is
1063                     stored in compound file format (as a <tt>.cfx</tt>
1064                     file); and DocStoreOffset is the starting document
1065                     in the shared doc store files where this segment's
1066                     documents begin.  In this case, this segment does
1067                     not store its own doc store files but instead
1068                     shares a single set of these files with other
1069                     segments.
1070                 </p>
1071
1072                 <p>
1073                     Checksum contains the CRC32 checksum of all bytes
1074                     in the segments_N file up until the checksum.
1075                     This is used to verify integrity of the file on
1076                     opening the index.
1077                 </p>
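                <p>
                    The integrity check can be sketched as below. The sketch assumes
                    that the checksum is the common CRC-32 variant provided by zlib and
                    that it occupies the final eight bytes of the file as a big-endian
                    Int64, in line with the type list above. The function names are
                    hypothetical.
                </p>
<source><![CDATA[
```python
import struct
import zlib


def append_checksum(body):
    """Return body followed by its CRC32, stored as a big-endian Int64."""
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return body + struct.pack(">q", crc)


def verify_checksum(data):
    """Check that the trailing Int64 equals the CRC32 of all preceding bytes."""
    if len(data) < 8:
        return False
    body = data[:-8]
    (stored,) = struct.unpack(">q", data[-8:])
    return (zlib.crc32(body) & 0xFFFFFFFF) == stored
```
]]></source>
                <p>
                    A file whose recomputed checksum does not match the stored value
                    would be treated as corrupt or incompletely written.
                </p>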
1078
1079                 <p>
1080                     DeletionCount records the number of deleted
1081                     documents in this segment.
1082                 </p>
1083
1084                 <p>
1085                     HasProx is 1 if any fields in this segment have
1086                     position data (IndexOptions.DOCS_AND_FREQS_AND_POSITIONS); else, it's 0.
1087                 </p>
1088
1089                 <p>
1090                     CommitUserData stores an optional user-supplied
1091                     opaque Map&lt;String,String&gt; that was passed to
1092                     IndexWriter's commit or prepareCommit, or
1093                     IndexReader's flush methods.
1094                 </p>
1095                 <p>
1096                     The Diagnostics Map is privately written by
1097                     IndexWriter, as a debugging aid, for each segment
1098                     it creates.  It includes metadata like the current
1099                     Lucene version, OS, Java version, why the segment
1100                     was created (merge, flush, addIndexes), etc.
1101                 </p>
1102          
                <p>
                    HasVectors is 1 if this segment stores term vectors;
                    else, it's 0.
                </p>
1106
1107             </section>
1108
1109             <section id="Lock File"><title>Lock File</title>
1110
1111                 <p>
1112                     The write lock, which is stored in the index
1113                     directory by default, is named "write.lock".  If
1114                     the lock directory is different from the index
1115                     directory then the write lock will be named
1116                     "XXXX-write.lock" where XXXX is a unique prefix
1117                     derived from the full path to the index directory.
1118                     When this file is present, a writer is currently
1119                     modifying the index (adding or removing
1120                     documents).  This lock file ensures that only one
1121                     writer is modifying the index at a time.
1122                 </p>
1123             </section>
1124
1125             <section id="Deletable File"><title>Deletable File</title>
1126
1127                 <p>
                    No deletable file is written. Instead, a writer
                    dynamically computes which files are
                    deletable.
1131                 </p>
1132
1133             </section>
1134
1135             <section id="Compound Files"><title>Compound Files</title>
1136
                <p>Starting with Lucene 1.4, the compound file format became the default. This
                    is simply a container for all files described in the next section
                    (except for the .del file).</p>
                <p>Compound Entry Table (.cfe) --&gt; Version, FileCount, &lt;FileName, DataOffset, DataLength&gt;
                    <sup>FileCount</sup>
                </p>
1143
1144                 <p>Compound (.cfs) --&gt; FileData <sup>FileCount</sup>
1145                 </p>
1146                 
                <p>Version --&gt; Int</p>
1148                                                                 
1149                 <p>FileCount --&gt; VInt</p>
1150
1151                 <p>DataOffset --&gt; Long</p>
1152                 
1153                 <p>DataLength --&gt; Long</p>
1154
1155                 <p>FileName --&gt; String</p>
1156
1157                 <p>FileData --&gt; raw file data</p>
1158                 <p>The raw file data is the data from the individual files named above.</p>
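                <p>
                    Reading the entry table and then slicing one file out of the
                    compound data might look like the sketch below. It assumes that Int
                    is a 32-bit and Long a 64-bit big-endian integer, and that
                    DataOffset is measured from the start of the .cfs file. All
                    function names are hypothetical.
                </p>
<source><![CDATA[
```python
import struct


def read_vint(data, pos):
    """Decode one VInt at data[pos]; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


def read_string(data, pos):
    """Read a VInt byte length followed by that many UTF-8 bytes."""
    length, pos = read_vint(data, pos)
    return data[pos:pos + length].decode("utf-8"), pos + length


def read_entry_table(cfe):
    """Parse Version, FileCount and the <FileName, DataOffset, DataLength> entries."""
    (version,) = struct.unpack_from(">i", cfe, 0)
    count, pos = read_vint(cfe, 4)
    entries = {}
    for _ in range(count):
        name, pos = read_string(cfe, pos)
        offset, length = struct.unpack_from(">qq", cfe, pos)
        pos += 16
        entries[name] = (offset, length)
    return version, entries


def extract_file(cfs, entries, name):
    """Return the raw bytes of one contained file from the .cfs data."""
    offset, length = entries[name]
    return cfs[offset:offset + length]
```
]]></source>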
1159
1160                 <p>Starting with Lucene 2.3, doc store files (stored
1161                 field values and term vectors) can be shared in a
                single set of files for more than one segment.  When
                the compound file format is enabled, these shared files will be
1164                 added into a single compound file (same format as
1165                 above) but with the extension <tt>.cfx</tt>.
1166                 </p>
1167
1168             </section>
1169
1170         </section>
1171
1172         <section id="Per-Segment Files"><title>Per-Segment Files</title>
1173
1174             <p>
1175                 The remaining files are all per-segment, and are
1176                 thus defined by suffix.
1177             </p>
1178             <section id="Fields"><title>Fields</title>
1179                 <p>
1180                     <br/>
1181                     <b>Field Info</b>
1182                     <br/>
1183                 </p>
1184
1185                 <p>
1186                     Field names are
1187                     stored in the field info file, with suffix .fnm.
1188                 </p>
1189                 <p>
1190                     FieldInfos
1191                     (.fnm) --&gt; FNMVersion,FieldsCount, &lt;FieldName,
1192                     FieldBits&gt;
1193                     <sup>FieldsCount</sup>
1194                 </p>
1195
1196                 <p>
1197                     FNMVersion, FieldsCount --&gt; VInt
1198                 </p>
1199
1200                 <p>
1201                     FieldName --&gt; String
1202                 </p>
1203
1204                 <p>
1205                     FieldBits --&gt; Byte
1206                 </p>
1207
1208                 <p>
1209                     <ul>
1210                         <li>
1211                             The low-order bit is one for
1212                             indexed fields, and zero for non-indexed fields.
1213                         </li>
1214                         <li>
1215                             The second lowest-order
1216                             bit is one for fields that have term vectors stored, and zero for fields
1217                             without term vectors.
1218                         </li>
1219                         <li>If the third lowest-order bit is set (0x04), term positions are stored with the term vectors.</li>
1220                         <li>If the fourth lowest-order bit is set (0x08), term offsets are stored with the term vectors.</li>
1221                         <li>If the fifth lowest-order bit is set (0x10), norms are omitted for the indexed field.</li>
1222                         <li>If the sixth lowest-order bit is set (0x20), payloads are stored for the indexed field.</li>
                        <li>If the seventh lowest-order bit is set (0x40), term frequencies and positions are omitted for the indexed field.</li>
1224                         <li>If the eighth lowest-order bit is set (0x80), positions are omitted for the indexed field.</li>
1225                     </ul>
1226                 </p>
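                <p>
                    The bit assignments listed above can be decoded with a small table
                    of masks. The flag names in this sketch are invented for
                    readability and do not correspond to Lucene identifiers.
                </p>
<source><![CDATA[
```python
# (mask, hypothetical flag name), following the list above from low to high bit.
FIELD_BITS = (
    (0x01, "indexed"),
    (0x02, "term_vectors"),
    (0x04, "term_vector_positions"),
    (0x08, "term_vector_offsets"),
    (0x10, "norms_omitted"),
    (0x20, "payloads"),
    (0x40, "freqs_and_positions_omitted"),
    (0x80, "positions_omitted"),
)


def decode_field_bits(bits):
    """Return the set of flag names whose bit is set in a FieldBits byte."""
    return {name for mask, name in FIELD_BITS if bits & mask}
```
]]></source>
                <p>
                    For instance, a FieldBits value of 0x13 describes an indexed field
                    with term vectors stored and norms omitted.
                </p>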
1227
1228                 <p>
                   FNMVersion (added in 2.9) is -2 for indexes from 2.9 to 3.3, and -3 for indexes from Lucene 3.4 onward.
1230                 </p>
1231
1232                 <p>
1233                     Fields are numbered by their order in this file. Thus field zero is
1234                     the
1235                     first field in the file, field one the next, and so on. Note that,
1236                     like document numbers, field numbers are segment relative.
1237                 </p>
1238
1239
1240
1241                 <p>
1242                     <br/>
1243                     <b>Stored Fields</b>
1244                     <br/>
1245                 </p>
1246
1247                 <p>
1248                     Stored fields are represented by two files:
1249                 </p>
1250
1251                 <ol>
1252                     <li><a name="field_index"/>
1253                         <p>
1254                             The field index, or .fdx file.
1255                         </p>
1256
1257                         <p>
1258                             This contains, for each document, a pointer to
1259                             its field data, as follows:
1260                         </p>
1261
1262                         <p>
1263                             FieldIndex
1264                             (.fdx) --&gt;
1265                             &lt;FieldValuesPosition&gt;
1266                             <sup>SegSize</sup>
1267                         </p>
1268                         <p>FieldValuesPosition
1269                             --&gt; Uint64
1270                         </p>
1271                         <p>This
1272                             is used to find the location within the field data file of the
1273                             fields of a particular document. Because it contains fixed-length
1274                             data, this file may be easily randomly accessed. The position of
                            document <i>n</i>'s field data is the Uint64 at
                            <i>n*8</i> in this file.
1283                         </p>
1284                     </li>
1285                     <li>
1286                         <p><a name="field_data"/>
1287                             The field data, or .fdt file.
1288
1289                         </p>
1290
1291                         <p>
1292                             This contains the stored fields of each document,
1293                             as follows:
1294                         </p>
1295
1296                         <p>
1297                             FieldData (.fdt) --&gt;
1298                             &lt;DocFieldData&gt;
1299                             <sup>SegSize</sup>
1300                         </p>
1301                         <p>DocFieldData --&gt;
1302                             FieldCount, &lt;FieldNum, Bits, Value&gt;
1303                             <sup>FieldCount</sup>
1304                         </p>
1305                         <p>FieldCount --&gt;
1306                             VInt
1307                         </p>
1308                         <p>FieldNum --&gt;
1309                             VInt
1310                         </p>
1311                         <p>Bits --&gt;
1312                             Byte
1313                         </p>
1314                         <p>
1315                             <ul>
1316                                 <li>low order bit is one for tokenized fields</li>
1317                                 <li>second bit is one for fields containing binary data</li>
1318                                 <li>third bit is one for fields with compression option enabled
1319                                     (if compression is enabled, the algorithm used is ZLIB),
1320                                     only available for indexes until Lucene version 2.9.x</li>
1321                                 <li>4th to 6th bit (mask: 0x7&lt;&lt;3) define the type of a
1322                                 numeric field: <ul>
                                  <li>all bits in the mask are cleared if the field is not numeric</li>
                                  <li>1&lt;&lt;3: Value is Int</li>
                                  <li>2&lt;&lt;3: Value is Long</li>
                                  <li>3&lt;&lt;3: Value is a Float stored as an Int (decode with Float.intBitsToFloat)</li>
                                  <li>4&lt;&lt;3: Value is a Double stored as a Long (decode with Double.longBitsToDouble)</li>
1328                                 </ul></li>
1329                             </ul>
1330                         </p>
1331                         <p>Value --&gt;
1332                             String | BinaryValue | Int | Long (depending on Bits)
1333                         </p>
1334                         <p>BinaryValue --&gt;
                            ValueSize, &lt;Byte&gt;<sup>ValueSize</sup>
1336                         </p>
1337                         <p>ValueSize --&gt;
1338                             VInt
1339                         </p>
1340
1341                     </li>
1342                 </ol>
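                <p>
                    The two stored-field files described above might be read as in the
                    following sketch. It takes the <i>n*8</i> offset rule at face
                    value, assumes a big-endian Uint64, and uses hypothetical names for
                    the functions and the decoded flags.
                </p>
<source><![CDATA[
```python
import struct

# Numeric type codes held in bits 4 to 6 of the Bits byte (mask 0x7 << 3).
NUMERIC_TYPES = {0: None, 1: "Int", 2: "Long", 3: "Float", 4: "Double"}


def field_values_position(fdx, n):
    """Return the .fdt position of document n, read from raw .fdx bytes."""
    (position,) = struct.unpack_from(">Q", fdx, n * 8)
    return position


def decode_stored_bits(bits):
    """Interpret the per-field Bits byte of the .fdt file."""
    return {
        "tokenized": bool(bits & 0x01),
        "binary": bool(bits & 0x02),
        "compressed": bool(bits & 0x04),  # only in indexes up to 2.9.x
        "numeric": NUMERIC_TYPES.get((bits >> 3) & 0x7),
    }
```
]]></source>
                <p>
                    Because every .fdx entry has the same width, looking up a document
                    is a single seek followed by an eight-byte read.
                </p>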
1343
1344             </section>
1345             <section id="Term Dictionary"><title>Term Dictionary</title>
1346
1347                 <p>
1348                     The term dictionary is represented as two files:
1349                 </p>
1350                 <ol>
1351                     <li><a name="tis"/>
1352                         <p>
                            The term infos, or .tis file.
1354                         </p>
1355
1356                         <p>
1357                             TermInfoFile (.tis)--&gt;
1358                             TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermInfos
1359                         </p>
1360                         <p>TIVersion --&gt;
1361                             UInt32
1362                         </p>
1363                         <p>TermCount --&gt;
1364                             UInt64
1365                         </p>
1366                         <p>IndexInterval --&gt;
1367                             UInt32
1368                         </p>
1369                         <p>SkipInterval --&gt;
1370                             UInt32
1371                         </p>
1372                         <p>MaxSkipLevels --&gt;
1373                             UInt32
1374                         </p>
1375                         <p>TermInfos --&gt;
1376                             &lt;TermInfo&gt;
1377                             <sup>TermCount</sup>
1378                         </p>
1379                         <p>TermInfo --&gt;
1380                             &lt;Term, DocFreq, FreqDelta, ProxDelta, SkipDelta&gt;
1381                         </p>
1382                         <p>Term --&gt;
1383                             &lt;PrefixLength, Suffix, FieldNum&gt;
1384                         </p>
1385                         <p>Suffix --&gt;
1386                             String
1387                         </p>
1388                         <p>PrefixLength, FieldNum,
1389                             DocFreq, FreqDelta, ProxDelta, SkipDelta
1390                             <br/>
1391                             --&gt; VInt
1392                         </p>
1393                         <p>
1394                             This file is sorted by Term. Terms are
1395                             ordered first lexicographically (by UTF16
1396                             character code) by the term's field name,
1397                             and within that lexicographically (by
1398                             UTF16 character code) by the term's text.
1399                         </p>
1400                         <p>TIVersion names the version of the format
1401                             of this file and is equal to TermInfosWriter.FORMAT_CURRENT.
1402                         </p>
1403                         <p>Term
1404                             text prefixes are shared. The PrefixLength is the number of initial
1405                             characters from the previous term which must be pre-pended to a
1406                             term's suffix in order to form the term's text. Thus, if the
1407                             previous term's text was "bone" and the term is "boy",
1408                             the PrefixLength is two and the suffix is "y".
1409                         </p>
1410                         <p>FieldNum
1411                             determines the term's field, whose name is stored in the .fnm file.
1412                         </p>
1413                         <p>DocFreq
1414                             is the count of documents which contain the term.
1415                         </p>
1416                         <p>FreqDelta
1417                             determines the position of this term's TermFreqs within the .frq
1418                             file. In particular, it is the difference between the position of
1419                             this term's data in that file and the position of the previous
1420                             term's data (or zero, for the first term in the file).
1421                         </p>
1422                         <p>ProxDelta
1423                             determines the position of this term's TermPositions within the .prx
1424                             file. In particular, it is the difference between the position of
1425                             this term's data in that file and the position of the previous
1426                             term's data (or zero, for the first term in the file).  For fields
1427                             that omit position data, this will be 0 since
1428                             prox information is not stored.
1429                         </p>
1430                         <p>SkipDelta determines the position of this
1431                             term's SkipData within the .frq file. In
1432                             particular, it is the number of bytes
1433                             after TermFreqs that the SkipData starts.
1434                             In other words, it is the length of the
1435                             TermFreq data. SkipDelta is only stored 
1436                             if DocFreq is not smaller than SkipInterval.
1437                         </p>
1438                     </li>
1439                     <li>
1440                         <p><a name="tii"/>
1441                             The term info index, or .tii file.
1442                         </p>
1443
1444                         <p>
1445                             This contains every IndexInterval
1446                             <sup>th</sup>
1447                             entry from the .tis
1448                             file, along with its location in the .tis file. This is
1449                             designed to be read entirely into memory and used to provide random
1450                             access to the .tis file.
1451                         </p>
1452
1453                         <p>
1454                             The structure of this file is very similar to the
1455                             .tis file, with the addition of one item per record, the IndexDelta.
1456                         </p>
1457
1458                         <p>
1459                             TermInfoIndex (.tii)--&gt;
1460                             TIVersion, IndexTermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermIndices
1461                         </p>
1462                         <p>TIVersion --&gt;
1463                             UInt32
1464                         </p>
1465                         <p>IndexTermCount --&gt;
1466                             UInt64
1467                         </p>
1468                         <p>IndexInterval --&gt;
1469                             UInt32
1470                         </p>
1471                         <p>SkipInterval, MaxSkipLevels --&gt;
1472                             UInt32
1473                         </p>
1474                         <p>TermIndices --&gt;
1475                             &lt;TermInfo, IndexDelta&gt;
1476                             <sup>IndexTermCount</sup>
1477                         </p>
1478                         <p>IndexDelta --&gt;
1479                             VLong
1480                         </p>
1481                         <p>IndexDelta
1482                             determines the position of this term's TermInfo within the .tis file. In
1483                             particular, it is the difference between the position of this term's
1484                             entry in that file and the position of the previous term's entry.
1485                         </p>
1486                         <p>SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate TermDocs.skipTo(int).
1487                             Larger values result in smaller indexes, greater acceleration, but fewer accelerable cases, while
1488                             smaller values result in bigger indexes, less acceleration (in case of a small value for MaxSkipLevels) and more
1489                             accelerable cases.</p>
1490                         <p>MaxSkipLevels is the maximum number of skip levels stored for each term in the .frq file. A low value results in
1491                            smaller indexes but less acceleration; a larger value results in slightly larger indexes but greater acceleration.
1492                            See the format of the .frq file for more information about skip levels.</p>
1493                     </li>
1494                 </ol>
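                <p>The prefix sharing and delta rules described for the term dictionary can be sketched as follows.
                    This is a minimal illustration in Java, not Lucene's reader code: the class
                    <code>TermPrefixSketch</code> and its methods are hypothetical names, and the terms are assumed
                    to be already parsed into (PrefixLength, Suffix) pairs. The running sum is assumed to apply in the
                    same way to FreqDelta, ProxDelta and IndexDelta.
                </p>
<source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class TermPrefixSketch {

    /** Rebuilds full term texts from (PrefixLength, Suffix) pairs. */
    static List<String> expandTerms(int[] prefixLengths, String[] suffixes) {
        List<String> terms = new ArrayList<String>();
        String previous = "";
        for (int i = 0; i < suffixes.length; i++) {
            // Reuse the first PrefixLength characters of the previous term.
            String term = previous.substring(0, prefixLengths[i]) + suffixes[i];
            terms.add(term);
            previous = term;
        }
        return terms;
    }

    /** Turns per-term deltas into absolute file positions by accumulating them. */
    static long[] absolutePositions(int[] deltas) {
        long[] positions = new long[deltas.length];
        long pointer = 0; // the first term is relative to zero
        for (int i = 0; i < deltas.length; i++) {
            pointer += deltas[i];
            positions[i] = pointer;
        }
        return positions;
    }
}
```
]]></source>
                <p>With the example terms from the text, the pairs (0, "bone") and (2, "y") expand to "bone" and "boy".
                </p>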
1495             </section>
1496
1497             <section id="Frequencies"><title>Frequencies</title>
1498
1499                 <p>
1500                     The .frq file contains the lists of documents
1501                     which contain each term, along with the frequency of the term in that
1502                     document (except when frequencies are omitted: IndexOptions.DOCS_ONLY).
1503                 </p>
1504                 <p>FreqFile (.frq) --&gt;
1505                     &lt;TermFreqs, SkipData&gt;
1506                     <sup>TermCount</sup>
1507                 </p>
1508                 <p>TermFreqs --&gt;
1509                     &lt;TermFreq&gt;
1510                     <sup>DocFreq</sup>
1511                 </p>
1512                 <p>TermFreq --&gt;
1513                     DocDelta[, Freq?]
1514                 </p>
1515                 <p>SkipData --&gt;
1516                     &lt;&lt;SkipLevelLength, SkipLevel&gt;
1517                     <sup>NumSkipLevels-1</sup>, SkipLevel&gt;
1518                     &lt;SkipDatum&gt;
1519                 </p>
1520                 <p>SkipLevel --&gt;
1521                     &lt;SkipDatum&gt;
1522                     <sup>DocFreq/(SkipInterval^(Level + 1))</sup>
1523                 </p>
1524                 <p>SkipDatum --&gt;
1525                     DocSkip,PayloadLength?,FreqSkip,ProxSkip,SkipChildLevelPointer?
1526                 </p>
1527                 <p>DocDelta,Freq,DocSkip,PayloadLength,FreqSkip,ProxSkip --&gt;
1528                     VInt
1529                 </p>
1530                 <p>SkipChildLevelPointer --&gt;
1531                     VLong
1532                 </p>
1533                 <p>TermFreqs
1534                     are ordered by term (the term is implicit, from the .tis file).
1535                 </p>
1536                 <p>TermFreq
1537                     entries are ordered by increasing document number.
1538                 </p>
1539                 <p>DocDelta: if frequencies are indexed, this determines both
1540                     the document number and the frequency. In
1541                     particular, DocDelta/2 is the difference between
1542                     this document number and the previous document
1543                     number (or zero when this is the first document in
1544                     a TermFreqs). When DocDelta is odd, the frequency
1545                     is one. When DocDelta is even, the frequency is
1546                     read as another VInt.  If frequencies are omitted, DocDelta
1547                     contains the gap (not multiplied by 2) between
1548                     document numbers and no frequency information is
1549                     stored.
1550                 </p>
1551                 <p>For example, the TermFreqs for a term which occurs
1552                     once in document seven and three times in document
1553                     eleven, with frequencies indexed, would be the following
1554                     sequence of VInts:
1555                 </p>
1556                 <p>15, 8, 3
1557                 </p>
1558                 <p> If frequencies were omitted (IndexOptions.DOCS_ONLY) it would be this sequence
1559                 of VInts instead:
1560                   </p>
1561                  <p>
1562                    7,4
1563                  </p>
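                <p>The DocDelta rule can be sketched in Java as follows. This is an illustration under stated
                    assumptions, not the writer used by Lucene: <code>TermFreqsSketch</code> is a hypothetical
                    name, the output is a plain list of integers rather than encoded VInt bytes, and the
                    <code>omitFreqs</code> flag stands in for IndexOptions.DOCS_ONLY.
                </p>
<source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class TermFreqsSketch {

    /**
     * Produces the integer sequence for one term's TermFreqs.
     * docs must be in increasing order; freqs[i] is the frequency in docs[i].
     */
    static List<Integer> encode(int[] docs, int[] freqs, boolean omitFreqs) {
        List<Integer> out = new ArrayList<Integer>();
        int previousDoc = 0; // the first gap is measured from zero
        for (int i = 0; i < docs.length; i++) {
            int gap = docs[i] - previousDoc;
            previousDoc = docs[i];
            if (omitFreqs) {
                out.add(gap);             // plain gap, no frequency
            } else if (freqs[i] == 1) {
                out.add(gap * 2 + 1);     // odd value: frequency is one
            } else {
                out.add(gap * 2);         // even value: frequency follows
                out.add(freqs[i]);
            }
        }
        return out;
    }
}
```
]]></source>
                <p>For the example term (once in document seven, three times in document eleven) this yields
                    15, 8, 3 with frequencies and 7, 4 without.
                </p>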
1564                 <p>DocSkip records the document number before every
1565                     SkipInterval
1566                     <sup>th</sup>
1567                     document in TermFreqs.
1568                     If payloads are disabled for the term's field,
1569                     then DocSkip represents the difference from the
1570                     previous value in the sequence.
1571                     If payloads are enabled for the term's field, 
1572                     then DocSkip/2 represents the difference from the
1573                     previous value in the sequence. If payloads are enabled
1574                     and DocSkip is odd,
1575                     then PayloadLength is stored indicating the length 
1576                     of the last payload before the SkipInterval<sup>th</sup>
1577                     document in TermPositions.
1578                     FreqSkip and ProxSkip record the position of every
1579                     SkipInterval
1580                     <sup>th</sup>
1581                     entry in FreqFile and
1582                     ProxFile, respectively. File positions are
1583                     relative to the start of TermFreqs and Positions for the first SkipDatum,
1584                     and to the previous SkipDatum in the sequence thereafter.
1585                 </p>
1586                 <p>For example, if DocFreq=35 and SkipInterval=16,
1587                     then there are two SkipData entries, containing
1588                     the 15
1589                     <sup>th</sup>
1590                     and 31
1591                     <sup>st</sup>
1592                     document
1593                     numbers in TermFreqs. The first FreqSkip names
1594                     the number of bytes after the beginning of
1595                     TermFreqs that the 16
1596                     <sup>th</sup>
1597                     entry
1598                     starts, and the second the number of bytes after
1599                     that that the 32
1600                     <sup>nd</sup>
1601                     starts. The first
1602                     ProxSkip names the number of bytes after the
1603                     beginning of Positions that the 16
1604                     <sup>th</sup>
1605                     entry starts, and the second the number of
1606                     bytes after that that the 32
1607                     <sup>nd</sup>
1608                     starts.
1609                 </p>
1610                 <p>Each term can have multiple skip levels.
1611                    The number of skip levels for a term is NumSkipLevels = Min(MaxSkipLevels, floor(log(DocFreq)/log(SkipInterval))).
1612                    The number of SkipData entries for a skip level is DocFreq/(SkipInterval^(Level + 1)), where the lowest skip
1613                    level is Level=0. <br></br>
1614                    Example: SkipInterval = 4, MaxSkipLevels = 2, DocFreq = 35. Then skip level 0 has 8 SkipData entries,
1615                    containing the 3<sup>rd</sup>, 7<sup>th</sup>, 11<sup>th</sup>, 15<sup>th</sup>, 19<sup>th</sup>, 23<sup>rd</sup>,
1616                    27<sup>th</sup>, and 31<sup>st</sup> document numbers in TermFreqs. Skip level 1 has 2 SkipData entries, containing the
1617                    15<sup>th</sup> and 31<sup>st</sup> document numbers in TermFreqs. <br></br>
1618                    The SkipData entries on all upper levels &gt; 0 contain a SkipChildLevelPointer referencing the corresponding SkipData
1619                    entry in level-1. In the example, entry 15 on level 1 has a pointer to entry 15 on level 0, and entry 31 on level 1 has a pointer
1620                    to entry 31 on level 0.
1621                 </p>
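                <p>The skip level arithmetic can be checked with a short Java sketch. It reads the level count as
                    Min(MaxSkipLevels, floor(log(DocFreq)/log(SkipInterval))), which is the reading that agrees with the
                    worked example, and computes the floor with repeated integer division instead of floating-point
                    logarithms. <code>SkipLevelsSketch</code> and its method names are hypothetical.
                </p>
<source><![CDATA[
```java
// Hypothetical helper, for illustration only.
final class SkipLevelsSketch {

    /** Min(maxSkipLevels, floor(log(docFreq) / log(skipInterval))), via integer division. */
    static int numSkipLevels(int docFreq, int skipInterval, int maxSkipLevels) {
        if (skipInterval < 2) {
            throw new IllegalArgumentException("skipInterval must be at least 2");
        }
        int levels = 0;
        // Each division by skipInterval corresponds to one more full level.
        for (int n = docFreq; n >= skipInterval; n /= skipInterval) {
            levels++;
        }
        return Math.min(maxSkipLevels, levels);
    }

    /** DocFreq / (SkipInterval ^ (level + 1)), rounded down. */
    static int entriesOnLevel(int docFreq, int skipInterval, int level) {
        long divisor = 1;
        for (int i = 0; i <= level; i++) {
            divisor *= skipInterval;
        }
        return (int) (docFreq / divisor);
    }
}
```
]]></source>
                <p>For SkipInterval = 4, MaxSkipLevels = 2 and DocFreq = 35 this gives two levels, with 8 entries on
                    level 0 and 2 entries on level 1.
                </p>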
1622
1623             </section>
1624             <section id="Positions"><title>Positions</title>
1625
1626                 <p>
1627                     The .prx file contains the lists of positions that
1628                     each term occurs at within documents.  Note that
1629                     fields omitting positional data do not store
1630                     anything into this file, and if all fields in the
1631                     index omit positional data then the .prx file will not
1632                     exist.
1633                 </p>
1634                 <p>ProxFile (.prx) --&gt;
1635                     &lt;TermPositions&gt;
1636                     <sup>TermCount</sup>
1637                 </p>
1638                 <p>TermPositions --&gt;
1639                     &lt;Positions&gt;
1640                     <sup>DocFreq</sup>
1641                 </p>
1642                 <p>Positions --&gt;
1643                     &lt;PositionDelta,Payload?&gt;
1644                     <sup>Freq</sup>
1645                 </p>
1646                 <p>Payload --&gt;
1647                     &lt;PayloadLength?,PayloadData&gt;
1648                 </p>
1649                 <p>PositionDelta --&gt;
1650                     VInt
1651                 </p>
1652                 <p>PayloadLength --&gt;
1653                     VInt
1654                 </p>
1655                 <p>PayloadData --&gt;
1656                     byte<sup>PayloadLength</sup>
1657                 </p>
1658                 <p>TermPositions
1659                     are ordered by term (the term is implicit, from the .tis file).
1660                 </p>
1661                 <p>Positions
1662                     entries are ordered by increasing document number (the document
1663                     number is implicit from the .frq file).
1664                 </p>
1665                 <p>PositionDelta
1666                     is, if payloads are disabled for the term's field, the difference 
1667                     between the position of the current occurrence in
1668                     the document and the previous occurrence (or zero, if this is the
1669                     first occurrence in this document).
1670                     If payloads are enabled for the term's field, then PositionDelta/2
1671                     is the difference between the current and the previous position. If
1672                     payloads are enabled and PositionDelta is odd, then PayloadLength is 
1673                     stored, indicating the length of the payload at the current term position.
1674                 </p>
1675                 <p>
1676                     For example, the TermPositions for a
1677                     term which occurs as the fourth term in one document, and as the
1678                     fifth and ninth term in a subsequent document, would be the following
1679                     sequence of VInts (payloads disabled):
1680                 </p>
1681                 <p>4,
1682                     5, 4
1683                 </p>
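                <p>The position example can be reproduced with a small Java sketch. It covers only the case where
                    payloads are disabled, emits plain integers rather than VInt bytes, and uses the hypothetical
                    class name <code>PositionsSketch</code>.
                </p>
<source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (payloads disabled).
final class PositionsSketch {

    /** positionsPerDoc[d] holds the increasing positions of the term in the d-th document. */
    static List<Integer> encode(int[][] positionsPerDoc) {
        List<Integer> out = new ArrayList<Integer>();
        for (int[] positions : positionsPerDoc) {
            int previous = 0; // deltas restart from zero in every document
            for (int position : positions) {
                out.add(position - previous);
                previous = position;
            }
        }
        return out;
    }
}
```
]]></source>
                <p>A term at position 4 in one document and at positions 5 and 9 in the next is written as 4, 5, 4.
                </p>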
1684                 <p>PayloadData
1685                     is metadata associated with the current term position. If PayloadLength
1686                     is stored at the current position, then it indicates the length of this 
1687                     Payload. If PayloadLength is not stored, then this Payload has the same
1688                     length as the Payload at the previous position.
1689                 </p>
1690             </section>
1691             <section id="Normalization Factors"><title>Normalization Factors</title>
1692
1693                                         <p>There's a single .nrm file containing all norms:
1694                 </p>
1695                 <p>AllNorms
1696                     (.nrm) --&gt; NormsHeader,&lt;Norms&gt;
1697                     <sup>NumFieldsWithNorms</sup>
1698                 </p>
1699                 <p>Norms
1700                     --&gt; &lt;Byte&gt;
1701                     <sup>SegSize</sup>
1702                 </p>
1703                 <p>NormsHeader
1704                     --&gt; 'N','R','M',Version
1705                 </p>
1706                 <p>Version
1707                     --&gt; Byte
1708                 </p>
1709                 <p>NormsHeader
1710                     has 4 bytes, the last of which is the format version for this file, currently -1.
1711                 </p>
1712                 <p>Each
1713                     byte encodes a floating point value. Bits 0-2 contain the 3-bit
1714                     mantissa, and bits 3-7 contain the 5-bit exponent.
1715                 </p>
1716                 <p>These
1717                     are converted to an IEEE single float value as follows:
1718                 </p>
1719                 <ol>
1720                     <li>
1721                         <p>If
1722                             the byte is zero, use a zero float.
1723                         </p>
1724                     </li>
1725                     <li>
1726                         <p>Otherwise,
1727                             set the sign bit of the float to zero;
1728                         </p>
1729                     </li>
1730                     <li>
1731                         <p>add
1732                             48 to the exponent and use this as the float's exponent;
1733                         </p>
1734                     </li>
1735                     <li>
1736                         <p>map
1737                             the mantissa to the high-order 3 bits of the float's mantissa; and
1738
1739                         </p>
1740                     </li>
1741                     <li>
1742                         <p>set
1743                             the low-order 21 bits of the float's mantissa to zero.
1744                         </p>
1745                     </li>
1746                 </ol>
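                <p>The conversion steps can be sketched in Java as a single shift and add. This is one reading of the
                    steps, stated as an assumption rather than as the reference implementation: the eight bits of
                    the byte are placed at bits 21 to 28 of the 32-bit float pattern, which leaves the low-order 21
                    bits zero, and the constant 48 is added starting at bit 24. The class and method names are
                    hypothetical.
                </p>
<source><![CDATA[
```java
// Hypothetical helper, for illustration only.
final class NormDecodeSketch {

    /** Decodes one norm byte (3-bit mantissa, 5-bit exponent) into a float. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;                  // a zero byte maps to a zero float
        }
        int bits = (b & 0xFF) << 21;      // assumption: byte occupies bits 21..28
        bits += 48 << 24;                 // assumption: exponent offset added at bit 24
        return Float.intBitsToFloat(bits); // sign bit stays zero
    }
}
```
]]></source>
                <p>Under this reading, the byte value 124 decodes to 1.0 and the byte value 120 decodes to 0.5.
                </p>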
1747                 <p>A separate norm file is created when the norm values of an existing segment are modified. 
1748                                         When field <em>N</em> is modified, a separate norm file <em>.sN</em> 
1749                                         is created, to maintain the norm values for that field.
1750                 </p>
1751                 <p>Separate norm files are created (when needed) for both compound and non-compound segments.
1752                 </p>
1753
1754             </section>
1755             <section id="Term Vectors"><title>Term Vectors</title>
1756                 <p>
1757                   Term Vector support is optional on a field-by-field
1758                   basis. It consists of 3 files.
1759                 </p>
1760                 <ol>
1761                     <li><a name="tvx"/>
1762                         <p>The Document Index or .tvx file.</p>
1763                         <p>For each document, this stores the offset
1764                            into the document data (.tvd) and field
1765                            data (.tvf) files.
1766                         </p>
1767                         <p>DocumentIndex (.tvx) --&gt; TVXVersion&lt;DocumentPosition,FieldPosition&gt;
1768                             <sup>NumDocs</sup>
1769                         </p>
1770                         <p>TVXVersion --&gt; Int (TermVectorsReader.FORMAT_CURRENT)</p>
1771                         <p>DocumentPosition --&gt; UInt64 (offset in
1772                         the .tvd file)</p>
1773                         <p>FieldPosition --&gt; UInt64 (offset in the
1774                         .tvf file)</p>
1775                     </li>
1776                     <li><a name="tvd"/>
1777                         <p>The Document or .tvd file.</p>
1778                         <p>This contains, for each document, the number of fields, a list of the fields with
1779                             term vector info and finally a list of pointers to the field information in the .tvf
1780                             (Term Vector Fields) file.</p>
1781                         <p>
1782                             Document (.tvd) --&gt; TVDVersion&lt;NumFields, FieldNums, FieldPositions&gt;
1783                             <sup>NumDocs</sup>
1784                         </p>
1785                         <p>TVDVersion --&gt; Int (TermVectorsReader.FORMAT_CURRENT)</p>
1786                         <p>NumFields --&gt; VInt</p>
1787                         <p>FieldNums --&gt; &lt;FieldNumDelta&gt;
1788                             <sup>NumFields</sup>
1789                         </p>
1790                         <p>FieldNumDelta --&gt; VInt</p>
1791                         <p>FieldPositions --&gt; &lt;FieldPositionDelta&gt;
1792                             <sup>NumFields-1</sup>
1793                         </p>
1794                         <p>FieldPositionDelta --&gt; VLong</p>
1795                         <p>The .tvd file is used to map out the fields that have term vectors stored and
1796                             where the field information is in the .tvf file.</p>
1797                     </li>
1798                     <li><a name="tvf"/>
1799                         <p>The Field or .tvf file.</p>
1800                         <p>This file contains, for each field that has a term vector stored, a list of
1801                             the terms, their frequencies and, optionally, position and offset information.</p>
1802                         <p>Field (.tvf) --&gt; TVFVersion&lt;NumTerms, Position/Offset, TermFreqs&gt;
1803                             <sup>NumFields</sup>
1804                         </p>
1805                         <p>TVFVersion --&gt; Int (TermVectorsReader.FORMAT_CURRENT)</p>
1806                         <p>NumTerms --&gt; VInt</p>
1807                         <p>Position/Offset --&gt; Byte</p>
1808                         <p>TermFreqs --&gt; &lt;TermText, TermFreq, Positions?, Offsets?&gt;
1809                             <sup>NumTerms</sup>
1810                         </p>
1811                         <p>TermText --&gt; &lt;PrefixLength, Suffix&gt;</p>
1812                         <p>PrefixLength --&gt; VInt</p>
1813                         <p>Suffix --&gt; String</p>
1814                         <p>TermFreq --&gt; VInt</p>
1815                         <p>Positions --&gt; &lt;VInt&gt;<sup>TermFreq</sup></p>
1816                         <p>Offsets --&gt; &lt;VInt, VInt&gt;<sup>TermFreq</sup></p>
1817                         <br/>
1818                         <p>Notes:</p>
1819                         <ul>
1820                             <li>Position/Offset byte stores whether this term vector has position or offset information stored.</li>
1821                             <li>Term
1822                                 text prefixes are shared. The PrefixLength is the number of initial
1823                                 characters from the previous term which must be pre-pended to a
1824                                 term's suffix in order to form the term's text. Thus, if the
1825                                 previous term's text was "bone" and the term is "boy",
1826                                 the PrefixLength is two and the suffix is "y".
1827                             </li>
1828                             <li>Positions are stored as delta encoded VInts. This means only the difference between the current position and the previous position is stored.</li>
1829                             <li>Offsets are stored as delta encoded VInts. The first VInt is the startOffset, the second is the endOffset.</li>
1830                         </ul>
1831
1832
1833                     </li>
1834                 </ol>
1835             </section>
1836
1837             <section id="Deleted Documents"><title>Deleted Documents</title>
1838
1839                 <p>The .del file is
1840                     optional, and only exists when a segment contains deletions.
1841                 </p>
1842
1843                 <p>Although per-segment, this file is maintained exterior to compound segment files.
1844                 </p>
1845                 <p>
1846                 Deletions
1847                     (.del) --&gt; [Format],ByteCount,BitCount, Bits | DGaps (depending on Format)
1848                 </p>
1849
1850                 <p>Format,ByteCount,BitCount --&gt;
1851                     UInt32
1852                 </p>
1853
1854                 <p>Bits --&gt;
1855                     &lt;Byte&gt;
1856                     <sup>ByteCount</sup>
1857                 </p>
1858
1859                                 <p>DGaps --&gt;
1860                     &lt;DGap,NonzeroByte&gt;
1861                     <sup>NonzeroBytesCount</sup>
1862                 </p>
1863
1864                 <p>DGap --&gt;
1865                     VInt
1866                 </p>
1867
1868                 <p>NonzeroByte --&gt;
1869                     Byte
1870                 </p>
1871                                 
1872                 <p>Format
1873                     is optional. A value of -1 indicates DGaps. A non-negative value indicates Bits; in that case Format is absent and the value read is ByteCount.
1874                 </p>
1875
1876                 <p>ByteCount
1877                     indicates the number of bytes in Bits. It is typically
1878                     (SegSize/8)+1.
1879                 </p>
1880
1881                 <p>
1882                     BitCount
1883                     indicates the number of bits that are currently set in Bits.
1884                 </p>
1885
1886                 <p>Bits
1887                     contains one bit for each document indexed. When the bit
1888                     corresponding to a document number is set, that document is marked as
1889                     deleted. Bit ordering is from least to most significant. Thus, if
1890                     Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as
1891                     deleted.
1892                 </p>
1893
1894                                 <p>DGaps
1895                     represents sparse bit-vectors more efficiently than Bits.
1896                     It is made of DGaps on indexes of nonzero bytes in Bits,
1897                     and the nonzero bytes themselves. The number of nonzero bytes
1898                     in Bits (NonzeroBytesCount) is not stored.
1899                 </p>
1900                 <p>For example,
1901                     if there are 8000 bits and only bits 10,12,32 are set,
1902                     DGaps would be used:
1903                 </p>
1904                 <p>
1905                     (VInt) 1, (Byte) 20, (VInt) 3, (Byte) 1
1906                 </p>
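                <p>Both representations can be illustrated with a short Java sketch. It works on an in-memory byte
                    array and produces a list of integers instead of writing a .del file, so the header fields
                    (Format, ByteCount, BitCount) are left out. <code>DeletionsSketch</code> and its methods are
                    hypothetical names.
                </p>
<source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class DeletionsSketch {

    /** Sets the bit for a document; bit order is least to most significant. */
    static void markDeleted(byte[] bits, int doc) {
        bits[doc >> 3] |= 1 << (doc & 7);
    }

    /** Tests the bit for a document. */
    static boolean isDeleted(byte[] bits, int doc) {
        return (bits[doc >> 3] & (1 << (doc & 7))) != 0;
    }

    /** Lists (DGap, NonzeroByte) pairs: the gap between nonzero byte indexes, then the byte. */
    static List<Integer> toDGaps(byte[] bits) {
        List<Integer> out = new ArrayList<Integer>();
        int last = 0; // the first gap is measured from index zero
        for (int i = 0; i < bits.length; i++) {
            if (bits[i] != 0) {
                out.add(i - last);
                out.add(bits[i] & 0xFF);
                last = i;
            }
        }
        return out;
    }
}
```
]]></source>
                <p>With 8000 bits and only bits 10, 12 and 32 set, <code>toDGaps</code> returns 1, 20, 3, 1, matching
                    the example. For the two bytes 0x00 and 0x02, <code>isDeleted</code> reports document 9 as deleted.
                </p>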
1907             </section>
1908         </section>
1909
1910         <section id="Limitations"><title>Limitations</title>
1911
1912             <p>
1913               When referring to term numbers, Lucene's current
1914               implementation uses a Java <code>int</code> to hold the
1915               term index, which means the maximum number of unique
1916               terms in any single index segment is ~2.1 billion times
1917               the term index interval (default 128) = ~274 billion.
1918               This is technically not a limitation of the index file
1919               format, just of Lucene's current implementation.
1920             </p>
1921             <p>
1922               Similarly, Lucene uses a Java <code>int</code> to refer
1923               to document numbers, and the index file format uses an
1924               <code>Int32</code> on-disk to store document numbers.
1925               This is a limitation of both the index file format and
1926               the current implementation.  Eventually these should be
1927               replaced with either <code>UInt64</code> values, or
1928               better yet, <code>VInt</code> values which have no
1929               limit.
1930             </p>
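            <p>The term-count estimate in the first paragraph of this section is plain arithmetic: the largest
                Java <code>int</code> multiplied by the term index interval. A minimal check, using the
                hypothetical name <code>TermLimitSketch</code>:
            </p>
<source><![CDATA[
```java
// Hypothetical helper, for illustration only.
final class TermLimitSketch {

    /** Largest int term index times the index interval, computed as a long to avoid overflow. */
    static long maxTerms(int indexInterval) {
        return (long) Integer.MAX_VALUE * indexInterval;
    }
}
```
]]></source>
            <p>For the default interval of 128 this is 274,877,906,816, or roughly 274 billion.
            </p>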
1931
1932         </section>
1933
1934     </body>
1935
1936 </document>