Inverted Index, Compressing Inverted Index, and Computing Score in a Complete Search System
Chintan Mistry, Mrugank Dalal

Slides: 39
Provided by: chin144

Indexing in a Search Engine
  • User query → linguistic preprocessing → normalized terms
  • Look up the already built inverted index for the documents
    that contain the terms
  • Rank the returned documents according to their relevance
Forward index
  • What is an INVERTED INDEX? First look at the FORWARD
    index
  • Documents → Words
  • Querying the forward index requires iterating
    sequentially through every document and every
    word in it to find the matching documents
  • Too much time, memory, and resources required!

Document 1: Hat, dog, the, cow, is, now
Document 2: Cow, run, away, morning, in, tree
Document 3: What, family, at, some, is, take
What is an inverted index?
  • As opposed to the forward index, it stores the list of
    documents for each word
  • Directly access the set of documents containing the word
  • The list of documents for a word is its postings list;
    each entry in that list is one posting
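
The contrast between the two structures can be sketched in a few lines of Python. This is only an illustration: it reuses the three example documents from the forward index slide (lowercased), and the function names `search_forward` and `invert` are hypothetical.

```python
from collections import defaultdict

# Forward index: docID -> words (the three example documents, lowercased).
forward_index = {
    1: ["hat", "dog", "the", "cow", "is", "now"],
    2: ["cow", "run", "away", "morning", "in", "tree"],
    3: ["what", "family", "at", "some", "is", "take"],
}

def search_forward(index, word):
    """Scan every document and every word in it: cost grows with the collection."""
    return sorted(doc_id for doc_id, words in index.items() if word in words)

def invert(index):
    """Build word -> sorted list of docIDs (the postings list for that word)."""
    inverted = defaultdict(set)
    for doc_id, words in index.items():
        for word in words:
            inverted[word].add(doc_id)
    return {word: sorted(doc_ids) for word, doc_ids in inverted.items()}

inverted_index = invert(forward_index)
print(inverted_index["cow"])   # direct lookup, no scan over documents
print(inverted_index["is"])
```

Both lookups return the same documents; the difference is that the inverted index answers with a single dictionary access.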
How to build inverted index? (1/3)
  • Build the index in advance
  • 1. Collect the documents
  • 2. Turn each document into a list of tokens
  • 3. Do linguistic preprocessing, producing a list
    of normalized tokens, which are the indexing terms
  • 4. Index the documents each term occurs in (i.e. the
    postings) for each word (i.e. the dictionary)

How to build inverted index? (2/3)
  • Given two documents
  • Document 1: This is first document. Microsoft's products
    are Office, Visio, and SQL Server.
  • Document 2: This is second document. Google's services
    are Gmail, Google Labs, and Google Code.
How to build inverted index? (3/3)
  • Sort-based indexing
  • 1. Sort the terms alphabetically
  • 2. Instances of the same term are grouped by word
    and then by documentID
  • 3. The terms and documentIDs are then separated
    into a dictionary and postings
  • Reduces storage requirements
  • The dictionary is commonly kept in memory while the
    postings lists are kept on disk
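
The three steps can be sketched as follows on the two example documents. The tokenizer is a hypothetical stand-in for real linguistic preprocessing (lowercasing and keeping runs of letters and apostrophes), so the exact terms it produces are an assumption.

```python
import re
from itertools import groupby

docs = {
    1: "This is first document. Microsoft's products are office, visio, and sql server",
    2: "This is second document. Google's services are gmail, google labs and google code.",
}

def tokenize(text):
    # Hypothetical normalization: lowercase, keep letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

# Collect (term, docID) pairs from every document.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]

# Steps 1 and 2: sort by term, then by docID, so equal terms become adjacent.
pairs.sort()

# Step 3: separate into a dictionary (term -> document frequency) and postings.
dictionary, postings = {}, {}
for term, group in groupby(pairs, key=lambda pair: pair[0]):
    doc_ids = sorted({doc_id for _, doc_id in group})
    dictionary[term] = len(doc_ids)
    postings[term] = doc_ids

print(postings["this"], postings["google"], dictionary["document"])
```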

Blocked sort based indexing
  • Uses termID instead of term
  • When main memory is insufficient to collect all
    termID-docID pairs, we need an external sorting
    algorithm that uses disk
  • Segment the collection into parts of equal size
  • Sort and group the termID-docID pairs of each
    part in memory
  • Store the intermediate results on disk
  • Merge all intermediate results into the final
    index
  • Running time: O(T log T)
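
A minimal sketch of the blocked approach, assuming the input is already a stream of (termID, docID) pairs. The helper names and the pickle-based block files are illustrative choices, and for brevity each block is read back whole during the merge, whereas a real implementation would stream the blocks from disk.

```python
import heapq
import os
import pickle
import tempfile
from itertools import groupby

def write_block(pairs, path):
    pairs.sort()                          # sort one block in memory
    with open(path, "wb") as f:
        pickle.dump(pairs, f)             # store the intermediate result on disk

def read_block(path):
    with open(path, "rb") as f:
        return pickle.load(f)             # simplification: loads the whole block

def bsbi_index(pair_stream, block_size, tmp_dir):
    paths, block = [], []
    for pair in pair_stream:
        block.append(pair)
        if len(block) == block_size:
            paths.append(os.path.join(tmp_dir, f"block{len(paths)}.pkl"))
            write_block(block, paths[-1])
            block = []
    if block:                             # last, possibly smaller, block
        paths.append(os.path.join(tmp_dir, f"block{len(paths)}.pkl"))
        write_block(block, paths[-1])
    # Merge the sorted blocks into one sorted stream, then group by termID.
    merged = heapq.merge(*(read_block(p) for p in paths))
    return {
        term_id: sorted({doc_id for _, doc_id in group})
        for term_id, group in groupby(merged, key=lambda pair: pair[0])
    }

pairs = [(2, 1), (1, 1), (3, 1), (1, 2), (2, 2), (1, 3)]
with tempfile.TemporaryDirectory() as tmp:
    index = bsbi_index(iter(pairs), block_size=2, tmp_dir=tmp)
print(index)
```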

Single-pass in-memory indexing
  • SPIMI uses terms instead of termIDs
  • Writes each block's dictionary to disk, and then
    starts a new dictionary for the next block
  • Assume we have a stream of term-docID pairs
  • Tokens are processed one by one; when a term
    occurs for the first time, it is added to the
    dictionary and a new postings list is created
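
A rough sketch of the idea, under two stated assumptions: yielding a block stands in for writing it to disk, and the hypothetical `max_postings` parameter stands in for "memory is full". DocIDs are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict

def spimi_invert(token_stream, max_postings):
    """Yield one block dictionary (terms sorted) each time the block fills up."""
    dictionary = defaultdict(list)
    count = 0
    for term, doc_id in token_stream:
        postings = dictionary[term]       # first occurrence creates a new postings list
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)       # add the posting directly, no pair sorting
            count += 1
        if count >= max_postings:
            yield dict(sorted(dictionary.items()))
            dictionary, count = defaultdict(list), 0   # new dictionary for next block
    if dictionary:
        yield dict(sorted(dictionary.items()))

def merge_blocks(blocks):
    """Merge the per-block dictionaries into the final index."""
    merged = defaultdict(list)
    for block in blocks:
        for term, postings in block.items():
            merged[term].extend(postings)
    return {term: sorted(set(p)) for term, p in sorted(merged.items())}

stream = [("cow", 1), ("dog", 1), ("cow", 2), ("tree", 2), ("cow", 3)]
blocks = list(spimi_invert(stream, max_postings=2))
index = merge_blocks(blocks)
print(index)
```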

Difference between BSBI and SPIMI
SPIMI
  • Adds postings directly to the postings list
  • Faster than BSBI because no sorting is necessary
  • Saves memory because no termID needs to be stored
  • Time complexity O(T)
BSBI
  • Collects term-docID pairs, sorts them, and then creates the postings lists
  • Slower than SPIMI
  • Requires storing termIDs, so needs more space
  • Time complexity O(T log T)
Distributed Indexing (1/4)
  • For a web-scale collection, index construction cannot
    be performed on a single computer; web search engines
    use distributed indexing algorithms for it
  • Partition the work across several machines
  • Use the MapReduce architecture
  • A general architecture for distributed computing
  • Divide the work into chunks that can easily be
    assigned and reassigned
  • Map and Reduce phases

Distributed Indexing (2/4)
Distributed Indexing (3/4)
  • MAP PHASE
  • Map the splits of the input data to key-value
    pairs
  • The machines that do this are called parsers
  • Each parser writes its output to local segment
    files
  • REDUCE PHASE
  • Partition the keys into j term partitions and
    have the parsers write the key-value pairs for each
    term partition into a separate file
  • Each parser writes the corresponding segment files,
    one for each term partition

Distributed Indexing (4/4)
  • REDUCE PHASE (cont.)
  • Collecting all values (docIDs) for a given key
    (termID) into one list is the task of the inverters
  • The master assigns each term partition to a
    different inverter
  • Finally, the list of values is sorted for each
    key and written to the final sorted postings list
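
The map and reduce phases can be simulated sequentially in one process. This is a sketch, not MapReduce itself: there is no master or real distribution, terms are used as keys instead of termIDs, and the three letter ranges used as term partitions are an assumption made for the example.

```python
from collections import defaultdict

def parse(split):
    """Map phase: a parser turns one split of documents into (term, docID) pairs."""
    return [(term, doc_id) for doc_id, text in split for term in text.lower().split()]

def partition(pairs, boundaries):
    """Route each pair to the segment of its term partition (by first letter)."""
    segments = defaultdict(list)
    for term, doc_id in pairs:
        for part, (low, high) in enumerate(boundaries):
            if low <= term[0] <= high:    # terms outside every range are skipped
                segments[part].append((term, doc_id))
                break
    return segments

def invert(segment_pairs):
    """Reduce phase: an inverter collects and sorts all docIDs for each term."""
    postings = defaultdict(set)
    for term, doc_id in segment_pairs:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

splits = [[(1, "caesar came")], [(2, "caesar conquered rome")]]
boundaries = [("a", "f"), ("g", "p"), ("q", "z")]   # j = 3 term partitions

segment_files = defaultdict(list)        # stands in for the parsers' segment files
for split in splits:
    for part, pairs in partition(parse(split), boundaries).items():
        segment_files[part].extend(pairs)

index = {}
for part in sorted(segment_files):       # one inverter per term partition
    index.update(invert(segment_files[part]))
print(index)
```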

Dynamic indexing
  • Motivation: what we have seen so far was a static
    collection of documents; what if a document is
    added, updated, or deleted?
  • Maintain 2 indexes: main and auxiliary
  • The auxiliary index is kept in memory; searches are
    run across both indexes, and the results are merged
  • When the auxiliary index becomes too large, merge it
    into the main index
  • Deleted documents can be filtered out while
    returning the results
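
A small sketch of the two-index scheme. The class name, the `max_aux_postings` limit, and the use of in-memory dictionaries for both indexes are assumptions for illustration; in practice the main index would live on disk. An update can be modeled as a delete followed by an add under a new docID.

```python
class DynamicIndex:
    def __init__(self, max_aux_postings):
        self.main = {}            # large index (would normally be on disk)
        self.aux = {}             # small in-memory index for new documents
        self.deleted = set()      # docIDs to filter out of results
        self.max_aux_postings = max_aux_postings

    def add(self, doc_id, terms):
        for term in set(terms):
            self.aux.setdefault(term, set()).add(doc_id)
        if sum(len(p) for p in self.aux.values()) > self.max_aux_postings:
            self.merge()          # auxiliary index became too large

    def delete(self, doc_id):
        self.deleted.add(doc_id)

    def merge(self):
        for term, doc_ids in self.aux.items():
            self.main.setdefault(term, set()).update(doc_ids)
        self.aux = {}

    def search(self, term):
        # Run the search across both indexes, merge, then drop deleted docs.
        hits = self.main.get(term, set()) | self.aux.get(term, set())
        return sorted(hits - self.deleted)

idx = DynamicIndex(max_aux_postings=3)
idx.add(1, ["cow", "tree"])
idx.add(2, ["cow"])
idx.add(3, ["cow", "dog"])    # pushes the auxiliary index over the limit
idx.delete(2)
print(idx.search("cow"))
```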

Querying distributed indexes (1/2)
  • Partition by terms
  • Partition the dictionary of index terms into
    subsets, each stored along with the postings lists
    of its terms
  • A query is routed to the nodes holding its terms,
    which allows greater concurrency
  • Sending long lists of postings between nodes
    for merging is very costly, and this cost
    outweighs the greater concurrency
  • Partition by documents
  • Each node contains the index for a subset of all
    documents
  • The query is distributed to all nodes, then the
    results are merged

Querying distributed indexes (2/2)
  • Partition by documents (cont.)
  • Problem: idf must be calculated for the entire
    collection even though the index at a single node
    contains only a subset of the documents
  • The query is broadcast to each of the nodes,
    and the top k results from each node are merged to
    find the top k documents for the query
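
The merge step for document partitioning can be sketched as below. It assumes each node has already scored its own documents (with idf values computed over the whole collection), so a node is represented simply as a hypothetical dictionary from docID to score.

```python
import heapq

def node_top_k(node_scores, k):
    """Each node returns its local top k as (score, docID) pairs."""
    return heapq.nlargest(k, ((score, doc) for doc, score in node_scores.items()))

def global_top_k(nodes, k):
    """Broadcast to every node, then merge the local top-k lists."""
    candidates = [hit for node in nodes for hit in node_top_k(node, k)]
    return [doc for _, doc in heapq.nlargest(k, candidates)]

nodes = [
    {"d1": 0.9, "d2": 0.2},   # documents indexed on node 1
    {"d3": 0.7, "d4": 0.8},   # documents indexed on node 2
]
print(global_top_k(nodes, 2))
```

Taking k results from every node is enough because no document outside a node's local top k can be in the global top k.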

Index compression (1/8)
  • Compression techniques for the dictionary and postings
  • Advantages
  • Less disk space
  • Use of caching: frequently used terms can be
    cached in memory for faster processing, and
    compression allows more terms to be
    stored in memory
  • Faster data transfer from disk to memory: the total
    time to transfer compressed data from disk
    and decompress it is less than the time to transfer
    the uncompressed data

Index compression (2/8)
  • Dictionary compression
  • It is small compared to the postings lists, so why
    compress it?
  • Because when a large part of the dictionary (think of
    millions of terms in it!) is on disk, many
    more disk seeks are necessary
  • The goal is to fit the dictionary into memory for
    fast response time

Index compression (3/8)
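  • 1. Dictionary as an array
  • Can be stored in an array of fixed-width entries
  • For example, we have 400,000 terms in the dictionary
  • 400,000 × (20 + 4 + 4) bytes = 11.2 MB (20 bytes for the
    term, 4 for its document frequency, 4 for the pointer
    to its postings list)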

Index compression (4/8)
  • Any problems in storing the dictionary as an array?
  • 1. The average length of a term in English is
    about eight characters, so on average we waste 12
    characters per term
  • 2. No way of storing terms of more than 20 characters,
    like hydrochlorofluorocarbons
  • 2. Dictionary as a string
  • Store the terms as one long string of characters
  • A term pointer marks the end of the preceding term and
    the beginning of the next

Index compression (5/8)
  • 2. Dictionary as a string (cont.)
  • 400,000 × (4 + 4 + 3 + 8) bytes = 7.6 MB (compared to
    11.2 MB earlier)

Index compression (6/8)
  • 3. Blocked storage
  • Group the terms in the string into blocks of size
    k and keep a term pointer only for the first
    term of each block
  • For k = 4, we save (k − 1) × 3 = 9 bytes for term
    pointers, but need an additional 4 bytes for term
    lengths
  • Saving: 400,000 × (1/4) × 5 bytes = 0.5 MB, giving
    7.1 MB (compared to 7.6 MB)
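
The three dictionary sizes quoted on these slides can be checked with a few lines of arithmetic. The byte breakdown in the comments (20-byte fixed-width term, 4-byte frequency, 4-byte postings pointer, 3-byte term pointer, 8-byte average term) is an assumed reading of the figures, and 1 MB is taken as 10^6 bytes.

```python
TERMS = 400_000

# 1. Fixed-width array: term (20) + frequency (4) + postings pointer (4).
fixed_width = TERMS * (20 + 4 + 4)

# 2. One long string: frequency (4) + postings pointer (4)
#    + term pointer (3) + average term length (8).
as_string = TERMS * (4 + 4 + 3 + 8)

# 3. Blocked storage with k = 4: each block drops (k - 1) term pointers
#    of 3 bytes each but adds k one-byte term lengths.
k = 4
saved_per_block = (k - 1) * 3 - k
blocked = as_string - (TERMS // k) * saved_per_block

for name, size in [("array", fixed_width), ("string", as_string), ("blocked", blocked)]:
    print(f"{name}: {size / 1e6:.1f} MB")
```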

Index compression (7/8)
  • 4. Blocked storage with front coding
  • Common prefixes of consecutive terms are stored only once
  • According to the experiment reported by the author,
    the size is reduced to 5.9 MB (compared to 7.1 MB)
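
A simplified illustration of front coding within one block. Note the swapped-in variant: each term is coded against its predecessor as a (shared prefix length, suffix) pair, which is not necessarily the exact layout the slides have in mind; the function names are hypothetical.

```python
import os

def front_code(block):
    """Encode a sorted block of terms as (shared prefix length, suffix) pairs."""
    encoded = [(0, block[0])]             # first term is stored in full
    for prev, term in zip(block, block[1:]):
        shared = len(os.path.commonprefix([prev, term]))
        encoded.append((shared, term[shared:]))
    return encoded

def front_decode(encoded):
    """Rebuild the terms by reusing the prefix of the previously decoded term."""
    terms = []
    for shared, suffix in encoded:
        prefix = terms[-1][:shared] if terms else ""
        terms.append(prefix + suffix)
    return terms

block = ["automata", "automate", "automatic", "automation"]
encoded = front_code(block)
print(encoded)
print(front_decode(encoded) == block)
```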

Index compression (8/8)
  • Postings file compression
  • By encoding gaps: the gaps between consecutive postings
    are small
  • So we can store the gaps rather than the
    postings themselves
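
Gap encoding itself is a one-line transformation each way. The sketch below covers only the gap step; a real index would additionally store the gaps with a variable-length code, which is not shown here. The postings list is made up for the example.

```python
from itertools import accumulate

def to_gaps(postings):
    """Keep the first docID, then store differences between neighbours."""
    if not postings:
        return []
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Running sum of the gaps restores the original docIDs."""
    return list(accumulate(gaps))

postings = [33, 47, 154, 159, 202]
gaps = to_gaps(postings)
print(gaps)
print(from_gaps(gaps) == postings)
```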

Review: Scoring, term weighting
  • Metadata: information about the document
  • Metadata generally consists of fields
  • E.g. date of creation, authors, title, etc.
  • Zone: similar to a field
  • Difference: a zone is arbitrary free text
  • E.g. abstract, overview

Review: Scoring, term weighting
  • Term frequency ($\mathrm{tf}_{t,d}$): number of occurrences of term t in
    document d
  • Problem: large differences in document size make raw
    counts inappropriate
  • Document frequency ($\mathrm{df}_t$): number of documents in the
    collection that contain term t
  • Inverse document frequency ($\mathrm{idf}_t$)
  • $\mathrm{idf}_t = \log(N / \mathrm{df}_t)$, where N is the total number of
    documents in the collection
  • Significance of idf
  • If low → it is a common term (e.g. a stop word)
  • If high → it is a rare word (e.g. apothecary)

Review: Scoring, term weighting
  • Tf-idf weighting
  • $\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$
  • High when the term occurs many times in a small number of
    documents
  • Lower when it occurs fewer times in a document, or
    when it occurs in many documents
  • Lowest when the term is in almost all documents
  • Score of a document
  • $\mathrm{Score}(q, d) = \sum_{t \in q} \text{tf-idf}_{t,d}$
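
The weighting and the score can be sketched directly from these definitions. The three tiny documents are invented for the example, and the base-10 logarithm is an assumption, since the slides do not state the base.

```python
import math
from collections import Counter

docs = {
    "d1": "the cow is in the tree".split(),
    "d2": "the dog is now in the morning".split(),
    "d3": "the apothecary is away".split(),
}
N = len(docs)                                           # total number of documents
tf = {d: Counter(words) for d, words in docs.items()}   # tf[d][t]
df = Counter(t for words in docs.values() for t in set(words))

def idf(term):
    # Assumption: base-10 logarithm; unseen terms get weight 0.
    return math.log10(N / df[term]) if df[term] else 0.0

def score(query, doc_id):
    """Score(q, d): sum of tf-idf over the query terms."""
    return sum(tf[doc_id][t] * idf(t) for t in query)

query = ["apothecary", "the"]
for doc_id in docs:
    print(doc_id, round(score(query, doc_id), 3))
```

Here "the" occurs in every document, so its idf is 0 and it contributes nothing, while the rare term "apothecary" decides the ranking.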

Computing score in complete search system
Inexact top K document retrieval
  • Motivation: to reduce the cost of calculating the
    score for all N documents
  • We calculate scores ONLY for documents
    whose scores are likely to be high w.r.t. the given
    query, and return the top K of those
  • How
  • Find a set A of documents that are contenders
  • where $K < |A| \ll N$
  • Return the K top-scoring docs from A

Index Elimination
  • Idf preset threshold
  • Only traverse postings for terms with high idf
  • Benefit: low-idf postings lists are long, so we remove
    them from
  • the score computation
  • Include all terms
  • Only traverse documents with many query terms in
    them
  • Danger: we may end up with fewer than K docs at
    the end

Champion lists
  • Champion list = fancy list = top docs
  • A set of r documents for each term t in the dictionary,
    which is precomputed
  • The documents whose weights for t are highest
  • How to create set A
  • Take the union of the champion lists for each term in
    the query
  • Compute scores only for docs that are in the union
  • How and when to decide r
  • Highly application dependent
  • Create the lists at the time of indexing the documents
  • Problem?
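
A sketch of building and using champion lists. The `weights` structure (term to docID to weight), the function names, and the numbers are hypothetical; the weights stand in for precomputed tf-idf values.

```python
import heapq

def build_champion_lists(weights, r):
    """At indexing time: keep the r highest-weight docs for each term."""
    return {
        term: set(heapq.nlargest(r, doc_weights, key=doc_weights.get))
        for term, doc_weights in weights.items()
    }

def top_k(query, weights, champions, k):
    # Set A: union of the champion lists of the query terms.
    candidates = set().union(*(champions.get(t, set()) for t in query))
    # Scores are computed only for docs in A.
    scores = {
        d: sum(weights.get(t, {}).get(d, 0.0) for t in query) for d in candidates
    }
    return heapq.nlargest(k, scores, key=scores.get)

weights = {
    "cow": {1: 0.9, 2: 0.1, 3: 0.5},
    "tree": {2: 0.7, 3: 0.6, 4: 0.3},
}
champions = build_champion_lists(weights, r=2)
print(champions)
print(top_k(["cow", "tree"], weights, champions, k=2))
```

Document 4 is never scored because it is in no champion list, which is where the savings come from.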

Static quality scores and ordering
  • In many search engines we have
  • A measure of quality g(d) for each document
  • The net score is calculated as
  • a combination of g(d) and the tf-idf score
  • How to achieve this
  • Each postings list is in decreasing order of
    g(d)
  • So we only traverse the first few documents in each list
  • Global champion list
  • Choose the r documents with the highest combined value of
    g(d) and tf-idf

Cluster pruning (1/2)
  • We cluster the documents in a preprocessing step
  • Pick √N documents and call them leaders
  • For each document that is not a leader, we compute its
    nearest leader
  • Followers: docs that are not leaders
  • Each leader has approximately √N followers

Cluster pruning (2/2)
  • How does it help
  • Given a query q, find the leader L nearest to q
  • i.e. comparing q with only the √N leaders
  • Set A contains leader L with its √N followers
  • i.e. calculating the score for only about √N more docs
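
Both steps can be sketched as follows, assuming documents and queries are plain numeric vectors compared by cosine similarity. Picking leaders with a seeded random sample, and the function names, are assumptions for the example.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_clusters(vectors, seed=0):
    """Preprocessing: pick about sqrt(N) leaders, attach followers to the nearest."""
    rng = random.Random(seed)
    n_leaders = max(1, round(math.sqrt(len(vectors))))
    leaders = rng.sample(sorted(vectors), n_leaders)
    clusters = {leader: [leader] for leader in leaders}
    for doc_id, vec in vectors.items():
        if doc_id not in clusters:
            nearest = max(leaders, key=lambda l: cosine(vec, vectors[l]))
            clusters[nearest].append(doc_id)
    return clusters

def search(query_vec, vectors, clusters, k):
    """Query time: find the nearest leader, then score only its cluster (set A)."""
    leader = max(clusters, key=lambda l: cosine(query_vec, vectors[l]))
    candidates = clusters[leader]
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, vectors[d]), reverse=True)
    return ranked[:k]

vectors = {"d1": (1.0, 0.0), "d2": (0.9, 0.1), "d3": (0.0, 1.0), "d4": (0.1, 0.9)}
clusters = build_clusters(vectors)
print(clusters)
print(search((1.0, 0.0), vectors, clusters, k=1))
```

The result is inexact in general: a good document that follows a different leader is never scored.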

Tiered indexes
[Figure: tiered index over Doc 1 to Doc 4, with Tier 1 (preset threshold value set to 20) and Tier 2 (preset threshold value set to 10)]
  • Addresses the issue of getting a set A of
    contenders with fewer than K documents
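
A sketch of how tiers could be built and queried, assuming the thresholds 20 and 10 apply to term frequency and that a posting goes into the highest tier whose threshold it exceeds. The postings data is invented, and documents below the lowest threshold are simply dropped here for brevity.

```python
def build_tiers(tf_postings, thresholds):
    """tf_postings: term -> {docID: tf}. thresholds run from high to low."""
    tiers = [dict() for _ in thresholds]
    for term, doc_tfs in tf_postings.items():
        for doc_id, tf in doc_tfs.items():
            for level, threshold in enumerate(thresholds):
                if tf > threshold:            # highest tier that accepts the posting
                    tiers[level].setdefault(term, []).append(doc_id)
                    break
    return tiers

def candidates(term, tiers, k):
    """Use tier 1 first; fall through to lower tiers only if fewer than k docs."""
    found = []
    for tier in tiers:
        found.extend(tier.get(term, []))
        if len(found) >= k:
            break
    return found

tf_postings = {"car": {1: 12, 2: 25, 3: 4, 4: 15}}
tiers = build_tiers(tf_postings, thresholds=[20, 10])
print(tiers)
print(candidates("car", tiers, k=1))
print(candidates("car", tiers, k=2))
```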
A complete search system
[Figure: components of a complete search system: parsing and linguistics; user query; free text query parser; spell correction; scoring and ranking; result page; documents cache; indexes (metadata in zone and field indexes, inexact top K retrieval, tiered inverted positional index, k-gram); training set; scoring parameters; MLR]
  • Questions?
  • Thank you