Index Terms - PowerPoint PPT Presentation

Title:

Index Terms

Slides: 49
Provided by: ohmso

Transcript and Presenter's Notes


1
Index Terms
2
An Index
  • Is a data structure allowing fast searching over
    large volumes of text
  • Advantages of using an index
  • Easier and more efficient than pattern matching
  • Disadvantages of using an index
  • Building it costs time and space, which are
    amortized by querying the retrieval system many times

3
Index Terms
  • Index terms are mainly nouns
  • Not all terms in a document are equally useful
    for describing its contents
  • Capturing the importance of a term for
    summarizing the content of a document is critical
  • Term importance can be captured through
    assignment of numerical weights to each index
    term of a document

4
Index Terms (II)
  • Let t be the number of index terms in the system
    and k_i be a generic index term. $K = \{k_1, \ldots,
    k_t\}$ is the set of all index terms.
  • A weight $w_{i,j} > 0$ is associated with each index
    term k_i of a document d_j
  • For an index term which does not appear in the
    document text, $w_{i,j} = 0$
  • Occurrences of index terms in a document may be
    correlated
  • The appearance of one term may attract the
    appearance of another

5
Index Terms (III)
  • Simplification
  • Index terms are usually assumed to be mutually
    independent
  • Mutual independence simplifies the computation of
    index term weights and allows fast ranking
    computation
  • Experimental results do not show advantages of
    using correlation in general collections

6
Document Preprocessing
  • Preprocessing of the documents can be viewed as a
    process of controlling the size of the vocabulary
  • The use of a controlled vocabulary is expected to
    improve retrieval performance
  • Sometimes it has a negative effect
  • Some unexpected documents are retrieved and some
    relevant documents are missed

7
Document Preprocessing (II)
  • Can be divided into five text operations
  • Lexical analysis
  • Elimination of stopwords
  • Stemming
  • Selection of index terms
  • Construction of term categorization structures

8
Lexical Analysis of Text
  • Identifies words in the text
  • Cases
  • Recognize spaces as word separators
  • Numbers (1910, 510 B.C., credit card number)
  • Hyphens (state-of-the-art, gilt-edge, B-49)
  • Punctuation (.)
  • Upper/lower cases
  • No clear solution to this problem
  • All these text operations can be implemented
    easily, but careful thought should be given to
    each one of them
  • They have a profound impact at document retrieval time
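
The cases listed above can be made concrete with a small tokenizer. This is a minimal sketch, not a recommended policy: the function name `tokenize` and its two flags are assumptions introduced here, and they only illustrate how case folding and hyphen handling change the tokens that reach the index.

```python
import re

def tokenize(text, lowercase=True, keep_hyphens=True):
    """Split text into word tokens; the flags expose two policy choices."""
    if lowercase:
        text = text.lower()
    if keep_hyphens:
        # "state-of-the-art" stays a single token
        pattern = r"[a-z0-9]+(?:-[a-z0-9]+)*"
    else:
        # hyphens act as separators, like spaces and punctuation
        pattern = r"[a-z0-9]+"
    return re.findall(pattern, text, flags=re.IGNORECASE)

sample = "A state-of-the-art index, built in 1910."
print(tokenize(sample))
print(tokenize(sample, keep_hyphens=False))
```

Whichever policy is chosen has to be applied identically to documents and queries, otherwise a query token may never match its indexed form.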

9
Elimination of Stopwords
  • Stopwords
  • Words which are too frequent among the documents
    (say, occurring in 80% of them)
  • Not good discriminators
  • Articles, prepositions, conjunctions
  • Some systems extend to include some verbs,
    adverbs, and adjectives
  • e.g., a list of 425 stopwords is compiled
  • Benefit
  • Reduces the size of indexing structure
    considerably
  • Drawback
  • Might reduce recall (e.g., for a query like "to be or
    not to be")
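
The benefit and the drawback can both be seen in a few lines. The sketch below assumes a tiny illustrative stopword set (`STOPWORDS` is made up for this example and is not the 425-word list mentioned above).

```python
# Illustrative subset only; a real list would be much longer.
STOPWORDS = {"a", "an", "the", "of", "in", "to", "be", "or", "not", "and"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

# Benefit: fewer terms to index.
print(remove_stopwords("the size of the index".split()))
# Drawback: this query consists only of stopwords, so nothing is left to match.
print(remove_stopwords("to be or not to be".split()))
```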

10
(No Transcript)
11
Stemming
  • A word may have a number of syntactic variants
  • Plural, gerund forms, past tense form
  • Possible that only a variant of the queried word
    is present in a relevant document
  • This problem can be partially overcome with the
    substitution of the words by their respective
    stems
  • A stem
  • The portion of a word which is left after the
    removal of its affixes (prefixes and suffixes)
  • Example
  • Connected, connecting, connection, connections
    all share the stem connect

12
Stemming (II)
  • Controversy on the benefit of stemming for
    retrieval effectiveness
  • Many Web search engines do not adopt any stemming
    algorithm whatsoever
  • Frakes distinguishes 4 types of stemming
    strategies
  • Affix removal
  • Intuitive, simple, can be implemented efficiently
  • Table lookup
  • Looking up the stem of a word in a table (large
    and impractical)
  • Successor variety
  • Determine morpheme boundaries (more complex than
    affix removal)
  • n-grams
  • Based on the identification of digrams and
    trigrams

13
Suffix Removal
  • Most variants of a word are generated by the
    introduction of suffixes, hence suffix removal
  • The most popular algorithm is Porter's
  • Simple, with results comparable to more sophisticated
    ones
  • Porter algorithm
  • Applies a series of rules to the suffixes of the
    words
  • By separating the rules into five distinct phases, the
    Porter algorithm is able to provide effective
    stemming while running fast
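
To illustrate the idea of rule-based suffix removal, here is a deliberately naive sketch. It is not the Porter algorithm: the suffix list `SUFFIXES`, the function `strip_suffix`, and the minimum stem length are all assumptions chosen so that the "connect" example works.

```python
# Toy rule set, ordered so that longer suffixes are tried first.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")

def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["connected", "connecting", "connection", "connections"]:
    print(w, "->", strip_suffix(w))
```

A real stemmer needs conditions on the remaining stem and several rule phases to avoid over-stemming; the length check here is only a crude stand-in for that.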

14
Index Terms Selection
  • Sometimes we use a selected set of terms as
    indices
  • In the area of bibliographic science, such a
    selection is usually done by a specialist
  • Automatic approaches
  • A good one is the identification of noun groups
    (used in Inquery system)

15
Index Terms Selection (II)
  • A sentence is usually composed of nouns,
    pronouns, articles, verbs, adjectives, adverbs,
    and connectives
  • Argument
  • Most of the semantics is carried by nouns
  • An intuitively promising strategy for selecting
    index terms is to use nouns in the text
  • Eliminate verbs, adjectives, adverbs,
    connectives, articles, and pronouns

16
Index Terms Selection (III)
  • It is common to combine two or three nouns in a
    single component (e.g., computer science)
  • It makes sense to cluster nouns which appear
    nearby in the text into a single indexing
    component (or concept)
  • A noun group
  • A set of nouns whose semantic distance in the
    text does not exceed a predefined threshold
  • Measured in terms of number of words between two
    nouns, say 3
  • When adopting noun groups as index terms, we
    obtain a conceptual view of the documents in
    terms of sets of non-elementary index terms
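
A noun group can be sketched as follows. The example assumes the words have already been tagged as noun or non-noun (here by hand, through the made-up set `NOUNS`), and it interprets the threshold as the maximum number of words allowed between two consecutive nouns of the same group. Both choices are assumptions for illustration, not a description of how the Inquery system works.

```python
def noun_groups(tagged, max_gap=3):
    """tagged: list of (word, is_noun) pairs in text order.

    Consecutive nouns separated by at most max_gap words end up
    in the same group.
    """
    groups, current, last_pos = [], [], None
    for pos, (word, is_noun) in enumerate(tagged):
        if not is_noun:
            continue
        # Number of words strictly between this noun and the previous one.
        if last_pos is not None and pos - last_pos - 1 <= max_gap:
            current.append(word)
        else:
            if current:
                groups.append(current)
            current = [word]
        last_pos = pos
    if current:
        groups.append(current)
    return groups

sentence = ("computer science studies the design of algorithms "
            "and is widely taught in universities").split()
NOUNS = {"computer", "science", "design", "algorithms", "universities"}  # hand-tagged
tagged = [(w, w in NOUNS) for w in sentence]
print(noun_groups(tagged))
```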

17
Thesaurus
  • Thesaurus refers to a treasury of words
  • Its simplest form
  • A precompiled list of important words in a given
    domain of knowledge
  • For each word, a set of related words (mostly
    synonyms)
  • Roget's thesaurus

18
Thesaurus (II)
  • Roget's thesaurus is generic in nature
  • A thesaurus can be specific to a certain domain
    of knowledge
  • Thesaurus of Engineering and Scientific Terms
  • The main purposes of a thesaurus are
  • To provide a standard vocabulary for indexing and
    searching
  • To assist users with locating terms for proper
    query formulation
  • To provide classified hierarchies that allow the
    broadening and narrowing of the current query
    request according to the needs of the user

19
Thesaurus (III)
  • The motivation for building a thesaurus is based
    on the idea of using a controlled vocabulary for
    indexing and searching
  • Advantages of controlled vocabulary
  • Normalization of indexing concepts
  • Reduction of noise
  • Index terms with a clear semantic meaning
  • Retrieval based on concepts rather than words
  • These advantages are important in specific
    domains, e.g., medical
  • For general domains, a well known body of
    knowledge might not exist (like the Web)
  • Yahoo! presents the user with a term
    classification hierarchy that can be used to
    reduce the search space

20
Thesaurus (IV)
  • It is too early to reach a consensus on the
    advantages of a thesaurus for the Web
  • Many search engines simply use all the words in
    all the documents as index terms
  • The set of terms related to a given thesaurus
    term is mostly composed of synonyms and
    near-synonyms
  • Relationships include
  • Broader, narrower, related terms

21
Document Preprocessing
22
Classical Retrieval Models
23
IR Models
  • A retrieval model specifies the details of
  • Document representation
  • Query representation
  • Retrieval function (the notion of relevance)
  • Three Classic IR models
  • Boolean
  • Vector
  • Probabilistic

24
Boolean Model
  • Based on set theory and Boolean algebra
  • Considers that index terms are present or absent
    in a document
  • A document is represented as a set of keywords
  • A query is a Boolean expression of keywords,
    connected by operators such as AND, OR, and NOT
  • e.g., cat AND (dog OR ant)
  • Output: a document is either relevant or not
    relevant (no partial match or ranking)

25
Boolean Model (II)
  • Query: goat AND (ink OR zebra)
  • Document 1: Ant bird cat. Dog elephant fish goat.
    Horse ink.
  • Ranking function: 1
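
The example above can be checked mechanically. The sketch below represents a query as a nested tuple and a document as a set of keywords; the tuple encoding and the function names `terms` and `evaluate` are assumptions made for this illustration.

```python
import re

def terms(text):
    """Represent a document as the set of its lowercased words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def evaluate(query, doc_terms):
    """query is a term (str) or a tuple: ("AND", a, b), ("OR", a, b), ("NOT", a)."""
    if isinstance(query, str):
        return query in doc_terms
    op, *args = query
    if op == "AND":
        return all(evaluate(a, doc_terms) for a in args)
    if op == "OR":
        return any(evaluate(a, doc_terms) for a in args)
    if op == "NOT":
        return not evaluate(args[0], doc_terms)
    raise ValueError(f"unknown operator: {op}")

doc1 = terms("Ant bird cat. Dog elephant fish goat. Horse ink.")
query = ("AND", "goat", ("OR", "ink", "zebra"))
print(int(evaluate(query, doc1)))  # 1: the document satisfies the query
```

The result is always 0 or 1, which is why the Boolean model cannot rank the documents it returns.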
26
Boolean Model (III)
  • Popular in the past because
  • Easy to understand (for simple queries)
  • Neat formalism
  • Drawbacks
  • Exact match (very rigid): AND means all, OR means
    any
  • Difficult to express complex queries
  • Difficult to control the number of documents
    retrieved
  • All matched documents will be returned
  • Difficult to rank output
  • All matched documents logically satisfy the query
  • It is recognized that term weighting can lead to
    a substantial improvement in retrieval
    performance
  • Boolean model can be extended to include ranking

27
Vector Model
  • Proposed as a framework in which partial match is
    possible
  • Assigning non-binary weights to index terms in
    queries and in documents
  • These weights are used to compute the degree of
    similarity between each document and the query
  • The main effect is that the ranked answer set
    matches the user's information need more closely
    than the answer set retrieved by the Boolean
    model

28
Vector Model (II)
  • A document dj and a user query q are represented
    as t-dimensional vectors (where t is the number
    of unique index terms)
  • $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$, $w_{i,j} \ge 0$
  • $q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$, $w_{i,q} \ge 0$
  • where $w_{i,j}$ is a real-valued weight for term i in
    document j, and
  • $w_{i,q}$ is a real-valued weight for term i in
    query q

29
Vector Model (III)
  • Basis
  • Given t distinct terms in the collection, each
    called an index term (collectively called the
    vocabulary)
  • Document representation
  • A t-dimensional vector
  • Query representation
  • A t-dimensional vector
  • Ranking function
  • The correlation between the document vector d_j and
    the query vector q

[Figure: an 8-dimensional vector space (axes shown: ant, bird, king, zebra) with the document vectors d1 and d2]
Vocabulary: 1 ant, 2 bird, 3 cat, 4 dog, 5 fish, 6 ink, 7 king, 8 zebra
Documents: "Ant bird cat. Dog fish." and "Bird zebra. Ink king. Cat."
Query: ant bird
30
Vector Model (IV)
  • The IR problem can be viewed as a clustering problem
  • Given a query, we try to determine which
    documents will be in the results set (a cluster)
    and which ones will not (the other cluster)
  • Two measures are needed
  • Intra-cluster similarity
  • Inter-cluster dissimilarity

31
Vector Model (V)
  • In the vector model, intra-cluster similarity is
    quantified by tf factor (term frequency)
  • The raw frequency of a term ki inside a document
    dj
  • Idea: measures how well that term describes the
    document contents
  • Inter-cluster dissimilarity is quantified by idf
    factor (inverse document frequency)
  • The inverse of the frequency of a term ki among
    the documents in the collection
  • Idea: terms appearing in many documents are not
    very useful for distinguishing a relevant
    document from a non-relevant one

32
Vector Model (VI)
  • The most effective term-weighting schemes try to
    balance these two effects: tf x idf
  • There are many variations of the tf and idf factors
  • The best known term-weighting scheme multiplies a
    normalized tf factor by an idf factor
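
A commonly used formulation of these factors is given below as a sketch. The symbols are assumptions introduced here: $N$ is the total number of documents, $n_i$ the number of documents containing term $k_i$, and $freq_{i,j}$ the raw frequency of $k_i$ in document $d_j$.

```latex
f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}, \qquad
idf_i = \log \frac{N}{n_i}, \qquad
w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}
```

The normalization by the most frequent term in the document keeps long documents from dominating purely because of their length.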

33
Vector Model (VII)
  • For query term weights, Salton and Buckley
    suggest
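
The weighting usually attributed to Salton and Buckley for query terms has the following form. It is given here as a sketch under the assumption that $freq_{i,q}$ is the raw frequency of term $k_i$ in the query text, $N$ the total number of documents, and $n_i$ the number of documents containing $k_i$.

```latex
w_{i,q} = \left( 0.5 + \frac{0.5 \, freq_{i,q}}{\max_l freq_{l,q}} \right) \times \log \frac{N}{n_i}
```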

34
Vector Model (VIII)
  • A similarity measure is a function that computes
    the degree of similarity between two vectors
  • Using a similarity measure between the query and
    each document
  • It is possible to rank the retrieved documents
  • It is possible to specify a threshold for
    similarity so that the size of the result set can
    be controlled

35
Vector Model (IX)
  • Degree of similarity of dj and q can be computed
    as the cosine of the angle between these two
    vectors
  • Other measures are possible
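
The cosine measure is sim(d_j, q) = (d_j · q) / (|d_j| |q|). The sketch below computes it for two short documents and the query "ant bird" over an 8-term vocabulary. It uses raw term frequencies as weights, which is a simplifying assumption; tf x idf weights could be substituted without changing the `cosine` function. The names `to_vector`, `cosine`, and the document labels are made up for this example.

```python
import math
import re
from collections import Counter

VOCABULARY = ["ant", "bird", "cat", "dog", "fish", "ink", "king", "zebra"]

def to_vector(text):
    """Map a text to a t-dimensional vector of raw term frequencies."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in VOCABULARY]

def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = {"first": "Ant bird cat. Dog fish.", "second": "Bird zebra. Ink king. Cat."}
query = to_vector("ant bird")
ranking = sorted(docs, key=lambda name: cosine(to_vector(docs[name]), query),
                 reverse=True)
for name in ranking:
    print(name, round(cosine(to_vector(docs[name]), query), 3))
```

The second document matches only one of the two query terms, so it is still retrieved but ranked lower, which is the partial matching the Boolean model lacks.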

36
Vector Model (X)
37
Vector Model (XI)
38
Vector Model (XII)
  • Advantages
  • Simple, mathematical-based approach
  • Ranks the documents according to their degree of
    similarity to the query
  • Provides partial matching and ranked results
  • Works well in practice
  • Allows efficient implementation for large document
    collections

39
Vector Model (XIII)
  • Disadvantages
  • Index terms are assumed to be mutually
    independent
  • Missing syntactic information (e.g., phrase
    structure, word order, proximity information)
  • Missing semantic information (e.g., word sense)

40
Probabilistic Model
  • Attempts to capture the IR problem within a
    probabilistic framework
  • Concept
  • Given a user query, there is the ideal answer set
  • A set of documents which contains exactly the
    relevant documents and no other
  • Searching
  • A process of identifying/describing (specifying
    the properties of) the ideal answer set
  • Problem
  • We do not know exactly what those properties are

41
Probabilistic Model (II)
  • All we know is that there are index terms
  • Index term semantics should be used to
    characterize these properties
  • Process
  • Since these properties are not known at query time,
    we should guess initially what they could be
  • The initial guess allows us to generate a preliminary
    probabilistic description of the ideal answer set
    and retrieve a first set of documents
  • An interaction with the user is carried out to
    improve the probabilistic description
  • The user looks at the retrieved documents and decides
    which ones are relevant and which are not
  • The system uses this information to refine the
    description
  • Repeating this process many times, such a
    description will evolve and become closer to the
    real description of the set

42
Probabilistic Models (III)
  • Model description
  • Given a user query q and a document dj
  • The model tries to estimate the probability that
    the user will find the document dj relevant
  • Assumptions
  • Probability of relevance depends on the query and
    the document representations only
  • There is a subset R of all documents which the
    user prefers as the ideal answer set for q

43
Probabilistic Models (IV)
  • Problems with the model description
  • Does not state explicitly how to compute the
    probabilities of relevance
  • An approach
  • Assign to each document dj a measure of its
    similarity to the query
  • It computes the odds of the document d_j being
    relevant to q

44
Probabilistic Models (V)
  • Definition
  • The index term weight variables are all binary
  • Query q is a subset of index terms
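  • R: the set of documents guessed to be relevant
  • $\bar{R}$: the complement of R (the non-relevant documents)
  • $P(R \mid d_j)$: the probability that d_j is relevant to q
  • $P(\bar{R} \mid d_j)$: the probability that d_j is non-relevant to q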

45
Probabilistic Models (VI)
  • $P(R)$ and $P(\bar{R})$ are the same for all the
    documents in the collection, so they can be dropped
    when ranking
  • Index terms are assumed to be independent
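
The derivation behind these two remarks can be sketched as follows. This is the standard form of the binary independence model, stated under the assumption that $w_{i,j}$ and $w_{i,q}$ are binary weights and that $P(k_i \mid R)$ is the probability that term $k_i$ occurs in a document drawn from R.

```latex
sim(d_j, q) = \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)}
            = \frac{P(d_j \mid R)\, P(R)}{P(d_j \mid \bar{R})\, P(\bar{R})}
            \sim \frac{P(d_j \mid R)}{P(d_j \mid \bar{R})}

sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q}\, w_{i,j}
  \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)}
       + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)
```

The first line uses Bayes' rule and drops the factor that is constant across documents; the second follows from term independence after taking logarithms and discarding terms that do not depend on the document.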
46
Probabilistic Models (V)
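  • We do not know the set R at the beginning, so we need
    initial estimates of $P(k_i \mid R)$ and $P(k_i \mid \bar{R})$
  • $P(k_i \mid R) = 0.5$ (assume it is constant for all
    index terms k_i)
  • $P(k_i \mid \bar{R}) = n_i / N$ (assume the distribution
    of index terms among non-relevant documents can be
    approximated by the distribution of index terms among
    all the documents in the collection), where n_i is the
    number of documents containing k_i and N is the total
    number of documents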
47
Probabilistic Models (VI)
After retrieving the initial set of documents
according to the initial guess, the initial
ranking is improved as follows
  • V subset of documents initially retrieved
  • V_i subset of V composed of documents in V
    which contain term k_i
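  • Improve our guesses for $P(k_i \mid R)$ and $P(k_i \mid \bar{R})$
  • $P(k_i \mid R) = V_i / V$ (approximated by the
    distribution of k_i among the documents retrieved so far)
  • $P(k_i \mid \bar{R}) = (n_i - V_i) / (N - V)$ (consider
    that all the non-retrieved documents are not relevant)
  • Here V and V_i also stand for the number of documents
    in the respective sets, n_i for the number of documents
    containing k_i, and N for the total number of documents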
  • Repeat the process recursively
  • If we consider V to be the top r ranked documents
    (r is previously defined),
  • we can improve the guesses without any
    assistance from a human subject

Further improvements are discussed in the
textbook on pages 33-34
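
The whole loop (initial guess, first ranking, refinement from the top r documents) can be sketched in a few lines. Everything named here is an assumption for illustration: the toy collection `docs`, the functions `estimate` and `rank`, and the choice r = 2. The refinement step adds 0.5 to the numerators and 1 to the denominators, a common adjustment that keeps the estimates strictly between 0 and 1 for small V and V_i; it is used instead of the plain ratios so that the logarithms stay defined.

```python
import math

def estimate(docs, query, retrieved=None):
    """Return (P(k_i|R), P(k_i|not R)) for each usable query term.

    docs: dict name -> set of terms. retrieved=None gives the initial guess;
    otherwise the estimates are refined from the retrieved set V.
    """
    N = len(docs)
    p_rel, p_nonrel = {}, {}
    for k in query:
        n_i = sum(k in terms for terms in docs.values())
        if n_i == 0 or n_i == N:
            continue  # term cannot discriminate between documents; skip it
        if retrieved is None:
            p_rel[k] = 0.5
            p_nonrel[k] = n_i / N
        else:
            V = len(retrieved)
            V_i = sum(k in docs[name] for name in retrieved)
            p_rel[k] = (V_i + 0.5) / (V + 1)              # smoothed V_i / V
            p_nonrel[k] = (n_i - V_i + 0.5) / (N - V + 1)  # smoothed (n_i - V_i) / (N - V)
    return p_rel, p_nonrel

def rank(docs, query, p_rel, p_nonrel):
    """Sort document names by decreasing sum of log-odds term weights."""
    def score(terms):
        return sum(
            math.log(p_rel[k] / (1 - p_rel[k]))
            + math.log((1 - p_nonrel[k]) / p_nonrel[k])
            for k in query
            if k in terms and k in p_rel
        )
    return sorted(docs, key=lambda name: score(docs[name]), reverse=True)

docs = {
    "d1": {"index", "term", "weight"},
    "d2": {"index", "stem"},
    "d3": {"boolean", "query"},
    "d4": {"term", "rank"},
    "d5": {"thesaurus"},
}
query = ["index", "term"]

p_rel, p_nonrel = estimate(docs, query)               # initial guess
first_ranking = rank(docs, query, p_rel, p_nonrel)
top_r = first_ranking[:2]                             # V = top r documents, r = 2
p_rel, p_nonrel = estimate(docs, query, retrieved=top_r)
print(rank(docs, query, p_rel, p_nonrel))
```

Because the top r documents are simply assumed relevant, no human feedback is needed, at the cost of reinforcing whatever the initial ranking got wrong.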
48
Probabilistic Models (VII)
  • Advantages
  • Documents are ranked in decreasing order of their
    probability of being relevant
  • Disadvantages
  • Need to guess the initial separation of documents
    into relevant and non-relevant sets
  • The method does not take into account the
    frequency with which a term occurs inside a
    document
  • Adoption of the independence assumption for index
    terms
  • As in the Vector model, it is not clear that
    independence of terms is a bad assumption in
    practice