Title: Semantic Similarity Methods in WordNet and Their Application to Information Retrieval on the Web
1Semantic Similarity Methods in WordNet and Their
Application to Information Retrieval on the Web
- Giannis Varelas
- Epimenidis Voutsakis
- Paraskevi Raftopoulou
- Euripides G.M. Petrakis
- Evangelos Milios
2Semantic Similarity
- Semantic similarity relates to computing the
conceptual similarity between terms which are not
lexicographically similar
  - e.g., car and automobile
- Map two terms to an ontology and compute their
relationship in that ontology
3Objectives
- We investigate several semantic similarity
methods and evaluate their performance
  - http://www.ece.tuc.gr/similarity
- We propose the Semantic Similarity Retrieval
Model (SSRM) for computing similarity between
documents containing semantically similar but not
necessarily lexicographically similar terms
  - http://www.ece.tuc.gr/intellisearch
4Ontologies
- Tools of information representation on a subject
- Hierarchical categorization of terms from general
to most specific terms
  - object → artifact → construction → stadium
- Domain ontologies represent knowledge of a
domain
  - e.g., the MeSH medical ontology
- General ontologies represent common-sense
knowledge about the world
  - e.g., WordNet
5WordNet
- A vocabulary and a thesaurus offering a
hierarchical categorization of natural language
terms
  - More than 100,000 terms
- An ontology of natural language terms
- Nouns, verbs, adjectives and adverbs are grouped
into synonym sets (synsets)
- Synsets represent terms or concepts
  - {stadium, bowl, arena, sports stadium}: a large
    structure for open-air sports or entertainments
6WordNet Hierarchies
- The synsets are also organized into senses
  - Senses: different meanings of the same term
- The synsets are related to other synsets higher
or lower in the hierarchy by different types of
relationships, e.g.,
  - Hyponym/Hypernym (Is-A relationships)
  - Meronym/Holonym (Part-Of relationships)
- Nine noun and several verb Is-A hierarchies
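The Is-A relationship can be pictured as a child-to-parent map. The sketch below is a minimal illustration, not the WordNet API: it hard-codes the object, artifact, construction, stadium chain mentioned on the Ontologies slide and assumes each term has a single hypernym, whereas a real WordNet synset may have several. The names `HYPERNYM` and `hypernym_chain` are hypothetical.

```python
# Toy Is-A fragment (hypothetical). Each term maps to its single hypernym;
# real WordNet synsets may have more than one.
HYPERNYM = {
    "stadium": "construction",
    "construction": "artifact",
    "artifact": "object",
    "object": None,  # root of this toy hierarchy
}


def hypernym_chain(term):
    """Return the terms from `term` up to the root, following Is-A links."""
    chain = []
    while term is not None:
        chain.append(term)
        term = HYPERNYM[term]
    return chain


print(" -> ".join(hypernym_chain("stadium")))
```

Reading the printed chain right to left gives the general-to-specific ordering used on the slides.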
7A Fragment of the WordNet Is-A Hierarchy
9Semantic Similarity Methods
- Map terms to an ontology and compute their
relationship in that ontology
- Four main categories of methods
  - Edge counting: path length between terms
  - Information content: a function of their
    probability of occurrence in a corpus
  - Feature based: similarity between their
    properties (e.g., definitions) or based on their
    relationships to other similar terms
  - Hybrid: combine the above ideas
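As a rough illustration of the edge-counting category, the sketch below computes a plain path length and a Wu-Palmer-style score on a small hypothetical tree. The tree, the function names, and the convention that the root has depth 1 are assumptions made for the example; they are not taken from the slides or from WordNet itself.

```python
# Hypothetical toy Is-A tree: child -> parent.
PARENT = {
    "entity": None,
    "artifact": "entity",
    "vehicle": "artifact",
    "car": "vehicle",
    "bicycle": "vehicle",
    "building": "artifact",
    "stadium": "building",
}


def ancestors(term):
    """Path from `term` up to the root, `term` first."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def lowest_common_subsumer(a, b):
    """Most specific node that is an ancestor of (or equal to) both terms."""
    seen = set(ancestors(a))
    for node in ancestors(b):  # walks upward, so the first hit is the lowest
        if node in seen:
            return node
    raise ValueError("terms share no ancestor")


def depth(term):
    """Number of nodes from the root down to `term` (root has depth 1)."""
    return len(ancestors(term))


def path_length(a, b):
    """Number of Is-A edges on the shortest path between the two terms."""
    lcs = lowest_common_subsumer(a, b)
    return (depth(a) - depth(lcs)) + (depth(b) - depth(lcs))


def wu_palmer(a, b):
    """Wu-Palmer-style score: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


for pair in [("car", "bicycle"), ("car", "stadium")]:
    print(pair, path_length(*pair), wu_palmer(*pair))
```

In this toy tree, car and bicycle are two edges apart while car and stadium are four edges apart, and the Wu-Palmer-style score drops accordingly.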
10Example
- Edge counting: the distance between conveyance and
ceramic is 2
- An information content method would associate
the two terms with their common subsumer and with
their probabilities of occurrence in a corpus
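To make the information-content idea concrete, here is a minimal sketch of Resnik-style and Lin-style scores. It assumes a hypothetical arrangement in which conveyance and ceramic sit directly under a shared subsumer (consistent with the distance of 2 quoted above), and the corpus counts are invented purely for illustration.

```python
import math

# Hypothetical toy hierarchy (child -> parent) and invented corpus counts.
PARENT = {
    "entity": None,
    "organism": "entity",
    "artifact": "entity",
    "conveyance": "artifact",
    "ceramic": "artifact",
}
DIRECT_COUNT = {"entity": 0, "organism": 55, "artifact": 10,
                "conveyance": 30, "ceramic": 5}
TOTAL = sum(DIRECT_COUNT.values())


def subsumed_count(concept):
    """Occurrences of the concept itself plus everything below it."""
    total = DIRECT_COUNT[concept]
    for child, parent in PARENT.items():
        if parent == concept:
            total += subsumed_count(child)
    return total


def information_content(concept):
    """IC(c) = -log p(c), with p(c) estimated from the subsumed counts."""
    return -math.log(subsumed_count(concept) / TOTAL)


def ancestors(term):
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def common_subsumer(a, b):
    """Most specific concept that subsumes both terms."""
    seen = set(ancestors(a))
    return next(node for node in ancestors(b) if node in seen)


def resnik(a, b):
    """Resnik-style similarity: IC of the most specific common subsumer."""
    return information_content(common_subsumer(a, b))


def lin(a, b):
    """Lin-style similarity: shared IC normalised by the terms' own IC."""
    return 2 * resnik(a, b) / (information_content(a) + information_content(b))


print(resnik("conveyance", "ceramic"), lin("conveyance", "ceramic"))
```

A rarer common subsumer carries more information, so two terms that meet only near the root receive a low score.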
11Semantic Similarity on WordNet
- The most popular methods are evaluated
- All methods are applied on a set of 38 term pairs
- Their similarity values are correlated with
scores obtained by humans
- The higher the correlation of a method, the better
the method is
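The evaluation step can be sketched as a correlation between a method's scores and human ratings over the same term pairs. The slides do not say which correlation coefficient was used; the sketch below assumes Pearson's, and the four data points are invented placeholders rather than values from the 38-pair set.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented example: human ratings and one method's scores for four term pairs.
human_ratings = [3.9, 3.1, 0.4, 1.2]
method_scores = [0.95, 0.70, 0.10, 0.35]

print(pearson(human_ratings, method_scores))
```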
12Evaluation
Method           Type            Correlation
Rada 1989        Edge Counting   0.59
Wu 1994          Edge Counting   0.74
Li 2003          Edge Counting   0.82
Leacock 1998     Edge Counting   0.82
Richardson 1994  Edge Counting   0.63
Resnik 1999      Info. Content   0.79
Lin 1993         Info. Content   0.82
Lord 2003        Info. Content   0.79
Jiang 1998       Info. Content   0.83
Tversky 1977     Feature Based   0.73
Rodriguez 2003   Hybrid          0.71
13Observations
- Edge counting and information content methods work by
exploiting structure information
- Good methods take the position of the terms into
account
  - Higher similarity for terms which are close
    together but lower in the hierarchy, e.g., Li
    et al. 2003
- Information content is measured on WordNet rather
than on a corpus (Seco 2002)
- Similarity only for nouns and verbs
  - No taxonomic structure for other parts of speech
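Two of these observations can be illustrated numerically. The sketch below uses formulas as they are commonly stated for a Li et al. style measure (exponential decay in path length multiplied by a tanh of the subsumer depth) and for a Seco-style intrinsic information content computed from hyponym counts. Both formulas and the default parameters `alpha=0.2` and `beta=0.6` are assumptions to be checked against the original papers; the slides do not give them.

```python
import math


def li_similarity(path_length, subsumer_depth, alpha=0.2, beta=0.6):
    """Depth-aware edge counting: exp(-alpha * l) * tanh(beta * h).

    l is the path length between the terms and h the depth of their
    common subsumer; alpha and beta are illustrative defaults.
    """
    return math.exp(-alpha * path_length) * math.tanh(beta * subsumer_depth)


def intrinsic_ic(n_hyponyms, n_concepts):
    """Corpus-free IC: 1 - log(hypo(c) + 1) / log(total number of concepts)."""
    return 1.0 - math.log(n_hyponyms + 1) / math.log(n_concepts)


# Same path length, but the deeper pair scores higher.
print(li_similarity(2, 1), li_similarity(2, 5))

# A leaf concept is maximally informative; a concept subsuming almost
# everything carries almost no information.
print(intrinsic_ic(0, 1000), intrinsic_ic(999, 1000))
```

The first print shows why position matters: two siblings near the leaves are judged more similar than two siblings near the root, even though both pairs are two edges apart.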
14http://www.ece.tuc.gr/similarity
15Semantic Similarity Retrieval Model (SSRM)
- Classic retrieval models retrieve documents with
the same query terms
- SSRM will retrieve documents which also contain
semantically similar terms
- Queries and documents are initially assigned
tf-idf weights
  - q = (q1, q2, ..., qN), d = (d1, d2, ..., dN)
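The initial weighting can be sketched as follows. Many tf-idf variants exist and the slides do not say which one was used; this example assumes term frequency normalised by the most frequent term in the document and idf = log(N / df). The sample documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per non-empty document.

    Assumed variant: tf = count / max count in the document,
    idf = log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        max_count = max(counts.values())
        vectors.append({
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["car engine repair", "automobile engine sale", "stadium engine"]
for vector in tfidf_vectors(docs):
    print(vector)
```

Note that car and automobile end up in different vector positions, so a purely lexical model sees no overlap between the first two documents beyond the shared word engine, whose weight is zero because it occurs everywhere.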
16SSRM
- Query term re-weighting
  - Similar terms reinforce each other
- Query term expansion with synonyms and similar
terms
- Document similarity
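The slides name these steps without giving formulas. The sketch below shows one plausible formulation of re-weighting and document similarity: each query weight is increased by the weights of sufficiently similar query terms, and document similarity is a similarity-weighted sum over all query-document term pairs, normalised by the unweighted sum. The exact formulas, the `threshold` parameter, and the similarity values in `SIM` are assumptions for illustration, not the authors' confirmed definitions.

```python
# Hypothetical pairwise term similarities (symmetric); unknown pairs score 0.
SIM = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("car", "truck")): 0.6,
}


def term_sim(a, b):
    """Similarity of two terms: 1 for identical terms, else a table lookup."""
    if a == b:
        return 1.0
    return SIM.get(frozenset((a, b)), 0.0)


def reweight(query, threshold):
    """Let similar query terms reinforce each other (assumed formulation)."""
    return {
        term: weight + sum(
            other_weight * term_sim(term, other)
            for other, other_weight in query.items()
            if other != term and term_sim(term, other) >= threshold
        )
        for term, weight in query.items()
    }


def document_similarity(query, document):
    """Similarity-weighted match over all query/document term pairs."""
    numerator = sum(
        q_w * d_w * term_sim(q_t, d_t)
        for q_t, q_w in query.items()
        for d_t, d_w in document.items()
    )
    denominator = sum(q_w * d_w for q_w in query.values()
                      for d_w in document.values())
    return numerator / denominator if denominator else 0.0


query = {"car": 1.0, "automobile": 0.5}
print(reweight(query, threshold=0.8))
print(document_similarity({"car": 1.0}, {"automobile": 1.0}))
```

Under this formulation a query for car still matches a document that only mentions automobile, which a model restricted to identical terms would score as zero.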
17Query Term Expansion
18Observations
- Specification of T?
  - A large T may lead to topic drift
- Word sense disambiguation for expanding with the
correct sense
- Expansion with co-occurring terms?
  - SVD, local/global analysis
- Semantic similarity between terms of different
parts of speech?
- Work with compound terms (phrases)
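The transcript does not define T. Purely as an illustration of how an over-generous expansion limit causes topic drift, the sketch below treats the limit as a hypothetical neighbourhood radius (`radius`, counted in Is-A edges) around a query term. This interpretation, the toy hierarchy, and the function names are assumptions.

```python
from collections import deque

# Hypothetical toy Is-A links: child -> parent.
IS_A = {
    "stadium": "construction",
    "building": "construction",
    "construction": "artifact",
    "vehicle": "artifact",
    "artifact": "object",
}


def neighbours(term):
    """Terms one Is-A edge away (the hypernym and any direct hyponyms)."""
    linked = {child for child, parent in IS_A.items() if parent == term}
    if term in IS_A:
        linked.add(IS_A[term])
    return linked


def expand(term, radius):
    """Breadth-first expansion; returns {term: distance} within `radius`."""
    distance = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        if distance[current] == radius:
            continue  # do not expand beyond the allowed radius
        for nxt in neighbours(current):
            if nxt not in distance:
                distance[nxt] = distance[current] + 1
                queue.append(nxt)
    return distance


for radius in (1, 2, 3):
    print(radius, sorted(expand("stadium", radius)))
```

With a radius of 3 the expansion of stadium already reaches vehicle, a term that shares little with the original query.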
19Evaluation of SSRM
- SSRM is evaluated through intellisearch, a system
for information retrieval on the WWW
- 1.5 million Web pages with images
- Images are described by surrounding text
- The problem of image retrieval is transformed
into a problem of text retrieval
20http://www.ece.tuc.gr/intellisearch
21Methods
- Vector Space Model (VSM)
- SSRM
- Each method is represented by a precision/recall
plot
- Each point is the average precision/recall over
20 queries
- 20 queries from the list of the most frequent
Google image queries
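A precision/recall plot of this kind can be produced by measuring both quantities at a series of rank cutoffs and averaging over queries. The sketch below assumes fixed rank cutoffs and binary relevance judgements; the two sample queries and their judgements are invented, and the slides do not state the exact averaging procedure.

```python
def precision_recall_at(ranked, relevant, k):
    """Precision and recall after the top-k results of one ranked list."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k, hits / len(relevant)


def average_curve(runs, cutoffs):
    """Average precision and recall over all queries at each cutoff.

    `runs` is a list of (ranked_list, relevant_set) pairs, one per query.
    """
    curve = []
    for k in cutoffs:
        pairs = [precision_recall_at(ranked, relevant, k)
                 for ranked, relevant in runs]
        avg_precision = sum(p for p, _ in pairs) / len(pairs)
        avg_recall = sum(r for _, r in pairs) / len(pairs)
        curve.append((k, avg_precision, avg_recall))
    return curve


# Invented example with two queries instead of twenty.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y", "z", "w"], {"y"}),
]
for k, precision, recall in average_curve(runs, cutoffs=[1, 2, 4]):
    print(k, precision, recall)
```

Plotting the averaged (recall, precision) pairs for each method on the same axes gives one curve per method.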
22Experimental Results
23MeSH and MedLine
- MeSH: ontology for medical and biological terms
by the N.L.M.
  - 22,000 terms
- MedLine: the premier bibliographic medical
database of the N.L.M.
  - 13 million references
24Evaluation on MedLine
25Conclusions
- Semantic similarity methods approximated the
human notion of similarity, reaching correlation
up to 83%
- SSRM exploits this information for improving the
performance of retrieval
- SSRM can work with any semantic similarity method
and any ontology
26Future Work
- Experimentation with more data sets (TREC) and
ontologies
- Extend SSRM to work with
  - Compound terms
  - More parts of speech (e.g., adverbs)
  - Co-occurring terms
  - More term relationships in WordNet
- More elaborate methods for specification of
thresholds
27Try our system on the Web
- Semantic Similarity System: http://www.ece.tuc.gr/similarity
- SSRM: http://www.ece.tuc.gr/intellisearch