Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web

1
Semantic Similarity Methods in WordNet and Their
Application to Information Retrieval on the Web
  • Giannis Varelas
  • Epimenidis Voutsakis
  • Paraskevi Raftopoulou
  • Euripides G.M. Petrakis
  • Evangelos Milios

2
Semantic Similarity
  • Semantic Similarity relates to computing the
    conceptual similarity between terms which are not
    lexicographically similar
  • e.g., car and automobile
  • Map two terms to an ontology and compute their
    relationship in that ontology

3
Objectives
  • We investigate several Semantic Similarity
    Methods and we evaluate their performance
  • http://www.ece.tuc.gr/similarity
  • We propose the Semantic Similarity Retrieval
    Model (SSRM) for computing similarity between
    documents containing semantically similar but not
    necessarily lexicographically similar terms
  • http://www.ece.tuc.gr/intellisearch

4
Ontologies
  • Tools of information representation on a subject
  • Hierarchical categorization of terms from general
    to most specific terms
  • object → artifact → construction → stadium
  • Domain ontologies: representing knowledge of a
    domain
  • e.g., MeSH medical ontology
  • General ontologies: representing common-sense
    knowledge about the world
  • e.g., WordNet

5
WordNet
  • A vocabulary and a thesaurus offering a
    hierarchical categorization of natural language
    terms
  • More than 100,000 terms
  • An ontology of natural language terms
  • Nouns, verbs, adjectives and adverbs are grouped
    into synonym sets (synsets)
  • Synsets represent terms or concepts
  • e.g., {stadium, bowl, arena, sports stadium}: a large
    structure for open-air sports or entertainments

6
WordNet Hierarchies
  • The synsets are also organized into senses
  • Senses: different meanings of the same term
  • The synsets are related to other synsets higher
    or lower in the hierarchy by different types of
    relationships, e.g.,
  • Hyponym/Hypernym (Is-A relationships)
  • Meronym/Holonym (Part-Of relationships)
  • Nine noun and several verb Is-A hierarchies

7
A Fragment of the WordNet Is-A Hierarchy
9
Semantic Similarity Methods
  • Map terms to an ontology and compute their
    relationship in that ontology
  • Four main categories of methods:
  • Edge counting: path length between terms
  • Information content: a function of the terms'
    probability of occurrence in a corpus
  • Feature based: similarity between their
    properties (e.g., definitions) or based on their
    relationships to other similar terms
  • Hybrid: combine the above ideas
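
The edge-counting idea can be sketched on a small hand-built Is-A fragment. The hierarchy below is a toy assumption for illustration, not data extracted from WordNet, and the function names are hypothetical. The second measure uses the commonly cited Wu and Palmer form, 2·depth(subsumer) / (depth(a) + depth(b)), which may differ in detail from the variant evaluated in this presentation.

```python
# Toy Is-A fragment as child -> parent links (an assumption, not WordNet data).
PARENT = {
    "object": None,
    "artifact": "object",
    "construction": "artifact",
    "stadium": "construction",
    "instrumentality": "artifact",
    "conveyance": "instrumentality",
    "ceramic": "instrumentality",
    "vehicle": "conveyance",
}

def ancestors(term):
    """Return the chain [term, parent, ..., root]."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain

def common_subsumer(a, b):
    """Most specific concept that lies above (or equals) both terms."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("terms share no ancestor")

def path_length(a, b):
    """Edge counting: number of Is-A edges on the path from a to b."""
    lcs = common_subsumer(a, b)
    return ancestors(a).index(lcs) + ancestors(b).index(lcs)

def depth(term):
    """Depth in nodes, with the root at depth 1."""
    return len(ancestors(term))

def wu_palmer(a, b):
    """Depth-aware similarity: deeper common subsumers score higher."""
    lcs = common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))
```

In this toy fragment, `path_length("conveyance", "ceramic")` is 2 because both terms sit directly under the same parent, while `wu_palmer` also rewards pairs whose common subsumer lies deep in the hierarchy.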

10
Example
  • The edge-counting distance between conveyance and
    ceramic is 2
  • An information content method would associate
    the two terms with their common subsumer and with
    their probabilities of occurrence in a corpus
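
An information content measure can be sketched in the same spirit. The probabilities below are invented for illustration (they are not corpus statistics), the tiny hierarchy is an assumption, and the formulas are the commonly cited Resnik and Lin forms rather than a confirmed copy of what was evaluated here.

```python
import math

# Toy hierarchy and made-up occurrence probabilities (assumptions).
# A parent is at least as probable as any of its children.
PARENT = {
    "artifact": None,
    "instrumentality": "artifact",
    "conveyance": "instrumentality",
    "ceramic": "instrumentality",
}
PROB = {
    "artifact": 1.0,
    "instrumentality": 0.25,
    "conveyance": 0.0625,
    "ceramic": 0.03125,
}

def ancestors(term):
    """Return the chain [term, parent, ..., root]."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain

def common_subsumer(a, b):
    """Most specific concept that lies above (or equals) both terms."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("terms share no ancestor")

def information_content(term):
    """IC(c) = -log2 p(c): rarer concepts carry more information."""
    return -math.log2(PROB[term])

def resnik(a, b):
    """Similarity is the information content of the common subsumer."""
    return information_content(common_subsumer(a, b))

def lin(a, b):
    """Resnik's value normalised by the information content of both terms."""
    shared = information_content(common_subsumer(a, b))
    return 2 * shared / (information_content(a) + information_content(b))
```

With these invented numbers the common subsumer of conveyance and ceramic is instrumentality, so the Resnik score is its information content and the Lin score scales that by how specific the two terms themselves are.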

11
Semantic Similarity on WordNet
  • The most popular methods are evaluated
  • All methods are applied to a set of 38 term pairs
  • Their similarity values are correlated with
    scores obtained by humans
  • The higher the correlation of a method, the better
    the method is
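
The evaluation step can be sketched as a correlation between a method's scores and human ratings for the same term pairs. The slides do not say which correlation coefficient was used; the sketch assumes Pearson's, and the scores are invented placeholders, not the 38 pairs of the study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical ratings for four term pairs (not data from the study).
human_scores = [3.9, 3.1, 1.2, 0.4]
method_scores = [0.95, 0.70, 0.35, 0.10]
correlation = pearson(human_scores, method_scores)
```

A method whose scores rise and fall with the human ratings gets a value close to 1, which is how the figures in the evaluation table should be read.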

12
Evaluation
Method            Type             Correlation
Rada 1989         Edge Counting    0.59
Wu 1994           Edge Counting    0.74
Li 2003           Edge Counting    0.82
Leacock 1998      Edge Counting    0.82
Richardson 1994   Edge Counting    0.63
Resnik 1999       Info. Content    0.79
Lin 1993          Info. Content    0.82
Lord 2003         Info. Content    0.79
Jiang 1998        Info. Content    0.83
Tversky 1977      Feature Based    0.73
Rodriguez 2003    Hybrid           0.71
13
Observations
  • Edge counting/Info. Content methods work by
    exploiting structure information
  • Good methods take the position of the terms into
    account
  • Higher similarity for terms which are close
    together but lower in the hierarchy, e.g., Li
    et al. 2003
  • Information content is measured on WordNet rather
    than on a corpus (Seco 2002)
  • Similarity only for nouns and verbs
  • No taxonomic structure for other parts of speech
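
Measuring information content on the hierarchy itself, rather than on a corpus, can be sketched as follows. The formula is the form usually attributed to Seco et al., IC(c) = 1 - log(hypo(c) + 1) / log(N), where hypo(c) is the number of concepts below c and N is the total number of concepts. That attribution and the toy hierarchy are both assumptions, as the slides do not spell the formula out.

```python
import math

# Toy Is-A fragment as child -> parent links (an assumption, not WordNet data).
PARENT = {
    "object": None,
    "artifact": "object",
    "construction": "artifact",
    "stadium": "construction",
    "instrumentality": "artifact",
    "conveyance": "instrumentality",
    "ceramic": "instrumentality",
    "vehicle": "conveyance",
}

def hyponym_count(concept):
    """Number of concepts strictly below `concept` in the toy hierarchy."""
    count = 0
    for node in PARENT:
        ancestor = PARENT[node]
        while ancestor is not None:
            if ancestor == concept:
                count += 1
                break
            ancestor = PARENT[ancestor]
    return count

def intrinsic_ic(concept):
    """Leaves get 1.0, the root gets 0.0; no corpus counts are needed."""
    total = len(PARENT)
    return 1.0 - math.log(hyponym_count(concept) + 1) / math.log(total)
```

The design choice is that a concept with many hyponyms is treated as general, and therefore uninformative, without consulting any text collection.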

14
http://www.ece.tuc.gr/similarity
15
Semantic Similarity Retrieval Model (SSRM)
  • Classic retrieval models retrieve documents with
    the same query terms
  • SSRM will retrieve documents which also contain
    semantically similar terms
  • Queries and documents are initially assigned
    tf × idf weights
  • q = (q1, q2, ..., qN), d = (d1, d2, ..., dN)
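
The initial weighting can be sketched with one common tf × idf variant, weight = tf · log(N / df). The slides do not state which variant was used, so the formula, the function name and the toy documents are all assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

# Hypothetical two-document collection.
docs = [["car", "car", "road"], ["automobile", "road"]]
vectors = tfidf_vectors(docs)
```

Note that "car" and "automobile" end up in different vector dimensions, so a purely lexical model sees no overlap between them; this is the gap that semantic similarity is meant to close.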

16
SSRM
  • Query term re-weighting
  • similar terms reinforce each other
  • Query term expansion with synonyms and similar
    terms
  • Document similarity
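
The three steps can be sketched as follows. The slides name the steps but give no formulas, so everything below is one plausible reading under stated assumptions: the similarity table is invented, the cut-off parameter `min_sim` is a hypothetical name for a similarity threshold, and the document similarity is taken to be a sum of query-weight × document-weight × term-similarity products, normalised by the same sum without the similarity factor.

```python
# Hypothetical term-to-term similarities in [0, 1] (not WordNet output).
SIM = {
    ("car", "automobile"): 1.0,
    ("car", "vehicle"): 0.8,
    ("automobile", "vehicle"): 0.8,
}

def sim(a, b):
    """Symmetric lookup; identical terms score 1, unknown pairs score 0."""
    if a == b:
        return 1.0
    return SIM.get((a, b), SIM.get((b, a), 0.0))

def reweight(query, min_sim=0.5):
    """Step 1: similar query terms reinforce each other's weights."""
    return {
        a: w + sum(query[b] * sim(a, b)
                   for b in query if b != a and sim(a, b) >= min_sim)
        for a, w in query.items()
    }

def expand(query, vocabulary, min_sim=0.7):
    """Step 2: add vocabulary terms that are similar enough to a query term."""
    expanded = dict(query)
    for term in vocabulary:
        if term in query:
            continue
        gain = sum(w * sim(term, a)
                   for a, w in query.items() if sim(term, a) >= min_sim)
        if gain > 0:
            expanded[term] = gain
    return expanded

def document_similarity(query, doc):
    """Step 3: every query term is compared with every document term."""
    num = sum(qw * dw * sim(a, b)
              for a, qw in query.items() for b, dw in doc.items())
    den = sum(qw * dw for qw in query.values() for dw in doc.values())
    return num / den if den else 0.0
```

Under these assumptions a query containing only "car" matches a document containing only "automobile" with similarity 1.0, whereas a model that requires identical terms would score the pair 0.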

17
Query Term Expansion
18
Observations
  • Specification of T?
  • Large T may lead to topic drift
  • Word sense disambiguation for expanding with the
    correct sense
  • Expansion with co-occurring terms?
  • SVD, local/global analysis
  • Semantic similarity between terms of different
    parts of speech?
  • Work with compound terms (phrases)

19
Evaluation of SSRM
  • SSRM is evaluated through intellisearch, a system
    for information retrieval on the WWW
  • 1.5 million Web pages with images
  • Images are described by surrounding text
  • The problem of image retrieval is transformed
    into a problem of text retrieval

20
http://www.ece.tuc.gr/intellisearch
21
Methods
  • Vector Space Model (VSM)
  • SSRM
  • Each method is represented by a precision/recall
    plot
  • Each point is the average precision/recall over
    20 queries
  • 20 queries from the list of the most frequent
    Google image queries
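
How one point of such a plot could be obtained is sketched below: precision and recall are computed per query at a fixed answer-set size and then averaged over the queries. The cut-off sizes, document identifiers and relevance judgments are invented, and the exact averaging procedure used in the experiments is not given in the slides.

```python
def precision_recall(ranked, relevant, k):
    """Precision and recall over the top-k answers of one ranked list."""
    retrieved = ranked[:k]
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def averaged_curve(runs, cutoffs):
    """Average (precision, recall) over all queries at each cutoff k."""
    curve = []
    for k in cutoffs:
        points = [precision_recall(ranked, relevant, k)
                  for ranked, relevant in runs]
        avg_p = sum(p for p, _ in points) / len(points)
        avg_r = sum(r for _, r in points) / len(points)
        curve.append((k, avg_p, avg_r))
    return curve

# Hypothetical results for two queries: (ranked answers, relevant set).
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6", "d7", "d8"], {"d6"}),
]
curve = averaged_curve(runs, cutoffs=[2, 4])
```

Each tuple in `curve` corresponds to one plotted point; a method whose curve lies higher and further to the right retrieves more relevant answers with fewer irrelevant ones.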

22
Experimental Results
23
MeSH and MedLine
  • MeSH: ontology for medical and biological terms
    by the N.L.M.
  • 22,000 terms
  • MedLine: the premier bibliographic medical
    database of the N.L.M.
  • 13 million references

24
Evaluation on MedLine
25
Conclusions
  • Semantic similarity methods approximated the
    human notion of similarity, reaching a correlation
    of up to 0.83
  • SSRM exploits this information for improving the
    performance of retrieval
  • SSRM can work with any semantic similarity method
    and any ontology

26
Future Work
  • Experimentation with more data sets (TREC) and
    ontologies
  • Extend SSRM to work with:
  • Compound terms
  • More parts of speech (e.g., adverbs)
  • Co-occurring terms
  • More term relationships in WordNet
  • More elaborate methods for specification of
    thresholds

27
Try our system on the Web
  • Semantic Similarity System: http://www.ece.tuc.gr/similarity
  • SSRM: http://www.ece.tuc.gr/intellisearch