Title: Index Terms
1. Index Terms
2. An Index
- Is a data structure allowing fast searching over large volumes of text
- Advantages of using an index
  - Easier and more efficient than pattern matching
- Disadvantages of using an index
  - Time and space costs, which are amortized by querying the retrieval system many times
3. Index Terms
- Index terms are mainly nouns
- Not all terms in a document are equally useful for describing its contents
- Capturing the importance of a term for summarizing the content of a document is critical
- Term importance can be captured through the assignment of numerical weights to each index term of a document
4. Index Terms (II)
- Let t be the number of index terms in the system and k_i be a generic index term. K = {k_1, ..., k_t} is the set of all index terms.
- A weight w_i,j > 0 is associated with each index term k_i of a document d_j
- For an index term which does not appear in the document text, w_i,j = 0
- Occurrences of index terms in a document may not be uncorrelated
  - The appearance of one term may attract the appearance of another
5. Index Terms (III)
- Simplification
  - Index terms are usually assumed to be mutually independent
  - Mutual independence simplifies the computation of index term weights and allows fast ranking computation
  - Experimental results do not show advantages of using correlation in general collections
6. Document Preprocessing
- Preprocessing of the documents can be viewed as a process of controlling the size of the vocabulary
- It is expected that the use of a controlled vocabulary leads to an improvement in retrieval performance
- Sometimes it creates a negative effect
  - Some unexpected documents are retrieved and some documents are missing
7. Document Preprocessing (II)
- Can be divided into five text operations
- Lexical analysis
- Elimination of stopwords
- Stemming
- Selection of index terms
- Construction of term categorization structures
8. Lexical Analysis of Text
- Identifies words in the text
- Cases
  - Recognizing spaces as word separators
  - Numbers (1910, 510 B.C., credit card numbers)
  - Hyphens (state-of-the-art, gilt-edge, B-49)
  - Punctuation (.)
  - Upper/lower case
- No clear solution to these problems
- All these text operations can be implemented easily, but careful thought should be given to each one of them
  - They have a profound impact at document retrieval time
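As a rough illustration of these choices, the sketch below assumes one possible policy: whitespace and punctuation separate tokens, digits and internally hyphenated words stay intact, and case is normalized. Real lexical analyzers decide each case per application.

```python
import re

def tokenize(text):
    """Minimal lexical analysis: split on separators, keep numbers and
    internally hyphenated words as single tokens, normalize case."""
    # Letters and digits form tokens; internal hyphens are preserved
    # (state-of-the-art, B-49); everything else acts as a separator.
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    return [t.lower() for t in tokens]

print(tokenize("The STATE-OF-THE-ART index, built in 1910."))
# -> ['the', 'state-of-the-art', 'index', 'built', 'in', '1910']
```

Note how a different policy (e.g., splitting on hyphens) would produce different index terms for the same text, which is why each case deserves careful thought.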
9. Elimination of Stopwords
- Stopwords
  - Words which are too frequent among the documents (say, appearing in 80% of them)
  - Not good discriminators
  - Articles, prepositions, conjunctions
  - Some systems extend the list to include some verbs, adverbs, and adjectives
    - e.g., a list of 425 stopwords has been compiled
- Benefit
  - Reduces the size of the indexing structure considerably
- Drawback
  - Might reduce recall (e.g., a query like "to be or not to be")
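Stopword elimination is just a set-membership filter. The tiny stopword list below is an illustrative assumption, not the 425-word list mentioned above:

```python
# A tiny illustrative stopword list (real lists are much larger).
STOPWORDS = {"a", "an", "the", "of", "to", "in", "and", "or", "not", "be"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "index", "of", "a", "document"]))
# -> ['index', 'document']

# The recall drawback in action: every word of the query is a stopword.
print(remove_stopwords("to be or not to be".split()))
# -> []
```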
11. Stemming
- A word may have a number of syntactic variants
  - Plural, gerund forms, past tense forms
- It is possible that only a variant of the queried word is present in a relevant document
- This problem can be partially overcome by substituting words with their respective stems
- A stem
  - The portion of a word which is left after the removal of its affixes (prefixes and suffixes)
- Example
  - Connected, connecting, connection, connections -> connect
12. Stemming (II)
- There is controversy on the benefit of stemming for retrieval effectiveness
  - Many Web search engines do not adopt any stemming algorithm whatsoever
- Frakes distinguishes four types of stemming strategies
  - Affix removal
    - Intuitive, simple, can be implemented efficiently
  - Table lookup
    - Looking up the stem of a word in a table (large and impractical)
  - Successor variety
    - Determines morpheme boundaries (more complex than affix removal)
  - n-grams
    - Based on the identification of digrams and trigrams
13. Suffix Removal
- Most variants of a word are generated by the introduction of suffixes -> suffix removal
- The most popular algorithm is Porter's
  - Simple, with results comparable to those of more sophisticated algorithms
- The Porter algorithm
  - Applies a series of rules to the suffixes of the words
  - By separating the rules into five distinct phases, the Porter algorithm is able to provide effective stemming while running fast
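To convey the flavor of suffix removal, here is a toy stemmer. It is NOT the Porter algorithm (which applies five phases of context-sensitive rules); it simply strips a few hand-picked English suffixes, longest first, as a sketch of the affix-removal idea:

```python
def stem(word):
    """Toy suffix-removal stemmer (a sketch, not the Porter algorithm):
    strip the first matching suffix from a small hand-picked list,
    checked longest first, leaving a stem of at least 3 characters."""
    for suffix in ("ations", "ation", "ions", "ion", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["connected", "connecting", "connection", "connections"]:
    print(w, "->", stem(w))
# All four variants map to the stem "connect".
```

A real stemmer needs many more rules (and recoding steps) to avoid over- and under-stemming; this is exactly the complexity the Porter algorithm's phases manage.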
14. Index Term Selection
- Sometimes we use a selected set of terms as indices
- In the area of bibliographic science, such a selection is usually done by a specialist
- Automatic approaches
  - A good one is the identification of noun groups (used in the Inquery system)
15. Index Term Selection (II)
- A sentence is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives
- Argument
  - Most of the semantics is carried by the nouns
- An intuitively promising strategy for selecting index terms is to use the nouns in the text
  - Eliminate verbs, adjectives, adverbs, connectives, articles, and pronouns
16. Index Term Selection (III)
- It is common to combine two or three nouns in a single component (e.g., computer science)
- It makes sense to cluster nouns which appear nearby in the text into a single indexing component (or concept)
- A noun group
  - A set of nouns whose semantic distance in the text does not exceed a predefined threshold
  - Measured in terms of the number of words between two nouns, say 3
- When adopting noun groups as index terms, we obtain a conceptual view of the documents in terms of sets of non-elementary index terms
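The clustering step above can be sketched directly. The code below assumes the part-of-speech tagging has already been done elsewhere (the input is a list of hypothetical (word, is_noun) pairs); it only groups nouns whose word distance stays within the threshold:

```python
def noun_groups(tagged, threshold=3):
    """Cluster nouns whose distance (number of intervening words) does not
    exceed `threshold` into a single indexing component. `tagged` is a list
    of (word, is_noun) pairs; POS tagging is assumed to be done elsewhere."""
    groups, current, last = [], [], None
    for pos, (word, is_noun) in enumerate(tagged):
        if not is_noun:
            continue
        if last is not None and pos - last - 1 > threshold:
            groups.append(current)   # too far from the previous noun: close the group
            current = []
        current.append(word)
        last = pos
    if current:
        groups.append(current)
    return groups

tagged = [("computer", True), ("science", True), ("is", False), ("widely", False),
          ("studied", False), ("across", False), ("many", False), ("universities", True)]
print(noun_groups(tagged))
# -> [['computer', 'science'], ['universities']]
```

"computer" and "science" are adjacent, so they form one concept; "universities" is more than 3 words away and becomes a separate index component.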
17. Thesaurus
- Thesaurus refers to a treasury of words
- In its simplest form
  - A precompiled list of important words in a given domain of knowledge
  - For each word, a set of related words (mostly synonyms)
  - e.g., Roget's thesaurus
18. Thesaurus (II)
- Roget's thesaurus is generic in nature
- A thesaurus can be specific to a certain domain of knowledge
  - e.g., the Thesaurus of Engineering and Scientific Terms
- The main purposes of a thesaurus are
  - To provide a standard vocabulary for indexing and searching
  - To assist users with locating terms for proper query formulation
  - To provide classified hierarchies that allow the broadening and narrowing of the current query request according to the needs of the user
19. Thesaurus (III)
- The motivation for building a thesaurus is based on the idea of using a controlled vocabulary for indexing and searching
- Advantages of a controlled vocabulary
  - Normalization of indexing concepts
  - Reduction of noise
  - Index terms with a clear semantic meaning
  - Retrieval based on concepts rather than words
- These advantages are important in specific domains, e.g., medical
- For general domains, a well-known body of knowledge might not exist (as on the Web)
  - Yahoo! presents the user with a term classification hierarchy that can be used to reduce the search space
20. Thesaurus (IV)
- It is too early to reach a consensus on the advantages of a thesaurus for the Web
  - Many search engines simply use all the words in all the documents as index terms
- The set of terms related to a given thesaurus term is mostly composed of synonyms and near-synonyms
- Relationships include
  - Broader, narrower, and related terms
21. Document Preprocessing
22. Classical Retrieval Models
23. IR Models
- A retrieval model specifies the details of
  - Document representation
  - Query representation
  - Retrieval function (the notion of relevance)
- Three classic IR models
  - Boolean
  - Vector
  - Probabilistic
24. Boolean Model
- Based on set theory and Boolean algebra
- Considers that index terms are either present or absent in a document
- A document is represented as a set of keywords
- A query is a Boolean expression of keywords, connected by operators such as AND, OR, and NOT
  - e.g., cat AND dog OR ant
- Output: a document is either relevant or not relevant (no partial matching or ranking)
25. Boolean Model (II)
- Query: goat AND (ink OR zebra)
- Document 1: "Ant bird cat. Dog elephant fish goat. Horse ink."
- Ranking function: 1 (the document satisfies the query)
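Because the model is set-theoretic, the query above can be evaluated with plain set operations over an inverted index. A minimal sketch (the index-building and tokenization details are simplifying assumptions):

```python
def boolean_index(docs):
    """Build an inverted index: each term maps to the set of documents
    containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().replace(".", "").split():
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {1: "Ant bird cat. Dog elephant fish goat. Horse ink."}
idx = boolean_index(docs)

# goat AND (ink OR zebra) becomes intersection and union of posting sets
result = idx.get("goat", set()) & (idx.get("ink", set()) | idx.get("zebra", set()))
print(result)  # -> {1}: document 1 satisfies the query; the output is binary
```

AND maps to set intersection and OR to set union, which is why every matched document is "equally relevant": there is no score to rank by.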
26. Boolean Model (III)
- Popular in the past because
  - Easy to understand (for simple queries)
  - Neat formalism
- Drawbacks
  - Exact match (very rigid): AND means all, OR means any
  - Difficult to express complex queries
  - Difficult to control the number of documents retrieved
    - All matched documents will be returned
  - Difficult to rank the output
    - All matched documents logically satisfy the query
- It is recognized that term weighting can lead to a substantial improvement in retrieval performance
  - The Boolean model can be extended to include ranking
27. Vector Model
- Proposed as a framework in which partial matching is possible
- Assigns non-binary weights to index terms in queries and in documents
- These weights are used to compute the degree of similarity between each document and the query
- The main resultant effect is that the ranked document answer set is considered to be more relevant than the document answer set retrieved by the Boolean model
28. Vector Model (II)
- A document d_j and a user query q are represented as t-dimensional vectors (where t is the number of unique index terms)
  - d_j = (w_1,j, w_2,j, ..., w_t,j), w_i,j >= 0
  - q = (w_1,q, w_2,q, ..., w_t,q), w_i,q >= 0
  - where w_i,j is a real-valued weight for term i in document j, and
  - w_i,q is a real-valued weight for term i in query q
29. Vector Model (III)
- Basis
  - Given t distinct terms in the collection, each called an index term (collectively called the vocabulary)
- Document representation
  - A t-dimensional vector
- Query representation
  - A t-dimensional vector
- Ranking function
  - The correlation between the document vector and the query vector
- Example: an 8-dimensional vector space
  - Vocabulary: 1 ant, 2 bird, 3 cat, 4 dog, 5 fish, 6 ink, 7 king, 8 zebra
  - d1: "Ant bird cat. Dog fish."
  - d2: "Bird zebra. Ink king. Cat."
  - Query: "ant bird"
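The example above can be made concrete by mapping each text onto the 8-term vocabulary. A minimal sketch using raw term frequencies as the weights (one of several possible weighting choices):

```python
# The vocabulary from the example, in its stated order.
VOCAB = ["ant", "bird", "cat", "dog", "fish", "ink", "king", "zebra"]

def to_vector(text):
    """Represent a text as a t-dimensional raw term-frequency vector."""
    words = text.lower().replace(".", "").split()
    return [words.count(term) for term in VOCAB]

d1 = to_vector("Ant bird cat. Dog fish.")
d2 = to_vector("Bird zebra. Ink king. Cat.")
q  = to_vector("ant bird")
print(d1)  # -> [1, 1, 1, 1, 1, 0, 0, 0]
print(d2)  # -> [0, 1, 1, 0, 0, 1, 1, 1]
print(q)   # -> [1, 1, 0, 0, 0, 0, 0, 0]
```

Documents and the query now live in the same t-dimensional space, so their correlation can be computed with any vector similarity measure.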
30. Vector Model (IV)
- The IR problem can be reduced to a clustering problem
  - Given a query, we try to determine which documents will be in the result set (one cluster) and which ones will not (the other cluster)
- Two measures are needed
  - Intra-cluster similarity
  - Inter-cluster dissimilarity
31. Vector Model (V)
- In the vector model, intra-cluster similarity is quantified by the tf factor (term frequency)
  - The raw frequency of a term k_i inside a document d_j
  - Idea: measure how well the term describes the document contents
- Inter-cluster dissimilarity is quantified by the idf factor (inverse document frequency)
  - The inverse of the frequency of a term k_i among the documents in the collection
  - Idea: terms appearing in many documents are not very useful for distinguishing a relevant document from a non-relevant one
32. Vector Model (VI)
- The most effective term-weighting schemes try to balance these two effects: tf x idf
- There are many variations of the tf and idf factors; popular ones are
  - tf_i,j = freq_i,j / max_l freq_l,j (frequency normalized by the most frequent term in d_j)
  - idf_i = log(N / n_i)
  - where N is the total number of documents and n_i is the number of documents containing k_i
- The best known term-weighting scheme is
  - w_i,j = tf_i,j x log(N / n_i)
33. Vector Model (VII)
- For query term weights, Salton and Buckley suggest
  - w_i,q = (0.5 + 0.5 x freq_i,q / max_l freq_l,q) x log(N / n_i)
34. Vector Model (VIII)
- A similarity measure is a function that computes the degree of similarity between two vectors
- Using a similarity measure between the query and each document
  - It is possible to rank the retrieved documents
  - It is possible to specify a threshold on similarity so that the size of the result set can be controlled
35. Vector Model (IX)
- The degree of similarity of d_j and q can be computed as the cosine of the angle between these two vectors
  - sim(d_j, q) = (d_j . q) / (|d_j| x |q|)
- Other measures are possible
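The cosine measure is a short computation: the dot product of the two vectors divided by the product of their norms. A sketch, reusing the term-frequency vectors from the earlier 8-term example:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors:
    (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

d1 = [1, 1, 1, 1, 1, 0, 0, 0]   # "Ant bird cat. Dog fish."
d2 = [0, 1, 1, 0, 0, 1, 1, 1]   # "Bird zebra. Ink king. Cat."
q  = [1, 1, 0, 0, 0, 0, 0, 0]   # query "ant bird"
print(cosine(d1, q) > cosine(d2, q))
# -> True: d1 shares both query terms, d2 only one, so d1 ranks above d2
```

Because the norms appear in the denominator, document length is factored out: a long document is not favored merely for repeating many terms.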
36. Vector Model (X)
37. Vector Model (XI)
38. Vector Model (XII)
- Advantages
  - Simple, mathematically based approach
  - Ranks the documents according to their degree of similarity to the query
  - Provides partial matching and ranked results
  - Works well in practice
  - Allows efficient implementation for large document collections
39. Vector Model (XIII)
- Disadvantages
  - Index terms are assumed to be mutually independent
  - Missing syntactic information (e.g., phrase structure, word order, proximity information)
  - Missing semantic information (e.g., word sense)
40. Probabilistic Model
- Attempts to capture the IR problem within a probabilistic framework
- Concept
  - Given a user query, there is an ideal answer set
    - A set of documents which contains exactly the relevant documents and no others
  - Searching
    - A process of identifying/describing (specifying the properties of) the ideal answer set
  - Problem
    - We do not know exactly what those properties are
41. Probabilistic Model (II)
- All we know is that there are index terms
  - Index term semantics should be used to characterize these properties
- Process
  - Since these properties are not known at query time, we should initially guess what they could be
  - The initial guess allows us to generate a preliminary probabilistic description of the ideal answer set and retrieve a first set of documents
  - An interaction with the user is carried out to improve the probabilistic description
    - The user looks at the retrieved documents and decides which ones are relevant and which are not
    - The system uses this information to refine the description
  - By repeating this process many times, the description will evolve and become closer to the real description of the ideal answer set
42. Probabilistic Model (III)
- Model description
  - Given a user query q and a document d_j
  - The model tries to estimate the probability that the user will find the document d_j relevant
- Assumptions
  - The probability of relevance depends on the query and the document representations only
  - There is a subset R of all documents which the user prefers as the ideal answer set for q
43. Probabilistic Model (IV)
- Problem with the model description
  - It does not state explicitly how to compute the probabilities of relevance
- An approach
  - Assign to each document d_j a measure of its similarity to the query
  - Compute the odds of the document d_j being relevant to q: sim(d_j, q) = P(R | d_j) / P(R_bar | d_j)
44. Probabilistic Model (V)
- Definitions
  - The index term weight variables are all binary: w_i,j in {0, 1}
  - Query q is a subset of index terms
  - R: the set of guessed relevant documents
  - R_bar: the complement of R
  - P(R | d_j): the probability that d_j is relevant to q
  - P(R_bar | d_j): the probability that d_j is non-relevant to q
45Probabilistic Models (VI)
are the same for all the documents in the
collection
Assuming independence of index terms
46. Probabilistic Model (VII)
- We do not know the set R at the beginning, so we need initial estimates
  - P(k_i | R) = 0.5 (assume it is constant for all index terms k_i)
  - P(k_i | R_bar) = n_i / N (assume the distribution of index terms among non-relevant documents can be approximated by their distribution among all the documents in the collection; n_i is the number of documents containing k_i, and N is the total number of documents)
47. Probabilistic Model (VIII)
- After retrieving an initial set of documents according to the initial guess, the ranking is improved as follows
  - V: the subset of documents initially retrieved
  - V_i: the subset of V composed of the documents in V which contain the term k_i
- Improve our guesses
  - P(k_i | R) = |V_i| / |V| (approximated by the distribution of k_i among the documents retrieved so far)
  - P(k_i | R_bar) = (n_i - |V_i|) / (N - |V|) (consider all the documents not retrieved to be non-relevant)
- Repeat the process recursively
  - If we take V to be the top r ranked documents (r previously defined), we can improve the guesses without any assistance from a human subject
- Further improvements are discussed in the textbook on pages 33-34
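One round of this guess-rank-re-estimate loop can be sketched as follows. Documents are modeled as sets of terms (binary weights), and the +0.5/+1 adjustment factors are a standard smoothing, one of the further improvements the textbook mentions; without them, |V_i|/|V| can hit 0 or 1 and the log terms blow up:

```python
import math

def prob_rank(docs, query, r=2):
    """One round of ranking plus re-estimation in the classic probabilistic
    model: score with the initial guesses, take the top r documents as the
    guessed relevant set V, re-estimate P(k|R) and P(k|R_bar), re-rank.
    `docs` is a list of term sets; a sketch, not a full implementation."""
    N = len(docs)
    n = {k: sum(1 for d in docs if k in d) for k in query}  # n_i per query term

    def rank(p, q):
        def score(d):  # sum of log-odds contributions over query terms in d
            return sum(math.log(p[k] / (1 - p[k])) + math.log((1 - q[k]) / q[k])
                       for k in query if k in d)
        return sorted(range(N), key=lambda j: score(docs[j]), reverse=True)

    # Initial guesses: P(k_i|R) = 0.5, P(k_i|R_bar) = n_i / N
    p = {k: 0.5 for k in query}
    q = {k: n[k] / N for k in query}
    top = rank(p, q)[:r]                      # pseudo feedback: V = top r documents
    Vi = {k: sum(1 for j in top if k in docs[j]) for k in query}
    p = {k: (Vi[k] + 0.5) / (r + 1) for k in query}              # |V_i|/|V|, smoothed
    q = {k: (n[k] - Vi[k] + 0.5) / (N - r + 1) for k in query}   # (n_i-|V_i|)/(N-|V|), smoothed
    return rank(p, q)

docs = [{"ant", "bird"}, {"bird", "cat"}, {"cat", "dog"}, {"dog"}]
print(prob_rank(docs, {"ant", "bird"}))
# The document containing both query terms (doc 0) ranks first.
```

Taking V to be the top r ranked documents, as here, is exactly the no-human-in-the-loop variant described above; repeating the last three steps iterates the refinement.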
48. Probabilistic Model (IX)
- Advantages
  - Documents are ranked in decreasing order of their probability of being relevant
- Disadvantages
  - The need to guess the initial separation of documents into relevant and non-relevant sets
  - The method does not take into account the frequency with which a term occurs inside a document
  - The adoption of the independence assumption for index terms
    - As in the vector model, it is not clear that independence of terms is a bad assumption in practice