Title: A Hybrid Approach for XML Similarity
1A Hybrid Approach for XML Similarity
- Joe TEKLI
- Richard CHBEIR
- Kokou YETONGNON
2Overview
- Introduction and motivation
- Current Solutions
- Proposal
- Implementation
- Conclusion
3Introduction and motivation
- XML (eXtensible Markup Language)
  - Major means for efficient data representation and management
- An XML document comes down to an Ordered Labeled Tree
- Example
    <Academy>
      <Department>
        <Laboratory>
          <Professor>Martin R.</Professor>
          <Student>Roberts J.</Student>
        </Laboratory>
      </Department>
    </Academy>
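Plan: Introduction, Current Solutions, Proposal, Implementation, Conclusion
[Figure: the example document as an ordered labeled tree. Nodes are numbered in preorder and annotated with their depths: 1 Academy (depth 0), 2 Department (depth 1), 3 Laboratory (depth 2), 4 Professor (depth 3), 5 Student (depth 3).]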
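The view of an XML document as an ordered labeled tree can be illustrated with a minimal Python sketch. It parses the Academy example with the standard library and lists each node with its preorder number and depth, taking the root to be at depth 0. The helper name `preorder_with_depth` is hypothetical and not part of the presentation.

```python
import xml.etree.ElementTree as ET

# The example document from the slide.
XML_DOC = """
<Academy>
  <Department>
    <Laboratory>
      <Professor>Martin R.</Professor>
      <Student>Roberts J.</Student>
    </Laboratory>
  </Department>
</Academy>
""".strip()


def preorder_with_depth(element, depth=0):
    """Yield (label, depth) pairs in document (preorder) order."""
    yield element.tag, depth
    for child in element:  # children are iterated in document order
        yield from preorder_with_depth(child, depth + 1)


root = ET.fromstring(XML_DOC)
nodes = list(preorder_with_depth(root))
for number, (label, depth) in enumerate(nodes, start=1):
    print(number, label, depth)
```

Only element labels and their order are kept here; text values such as "Martin R." are ignored, which matches the structural focus of the comparison.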
4Introduction and motivation
- XML has become indispensable
- Current applications
- Information description, storage and retrieval
- Database information interchange
- Web services interaction
Information destined to be broadcast over the web is now represented using XML
5Introduction and motivation
Emergent need: XML documents comparison, driven by information retrieval over XML data
6Introduction and motivation
- A range of algorithms for comparing semi-structured data, e.g. XML documents, has been proposed
  - Generally exploit the concept of Edit Distance
  - Focus on the structure of XML documents
  - Ignore the semantics involved
However, in the field of information retrieval (IR), estimating semantic similarity between web pages is of key importance to improving search results [1]
[1] Maguitman A. G., Menczer F., Roinestad H. and Vespignani A., Algorithmic Detection of Semantic Similarity. In Proceedings of the 14th International World Wide Web Conference, 107-116, Chiba, Japan, 2005
7Introduction and motivation
XML Document A
    <Academy>
      <Department>
        <Laboratory>
          <Professor></Professor>
          <Student></Student>
        </Laboratory>
      </Department>
    </Academy>
XML Document B
    <College>
      <Department>
        <Laboratory>
          <Lecturer></Lecturer>
        </Laboratory>
      </Department>
    </College>
XML Document C
    <Factory>
      <Department>
        <Laboratory>
          <Supervisor></Supervisor>
        </Laboratory>
      </Department>
    </Factory>
[Figure: the trees of XML Documents A, B and C]
Sim(A, B) ? Sim(A, C)
8Introduction and motivation
Motivation
How can existing XML comparison approaches be enhanced to take into consideration both the structural and the semantic characteristics of XML documents?
9Introduction and motivation
- We consider heterogeneous XML documents, lacking predefined grammars (DTDs or XML Schemas)
  - XML documents published on the web are often found without grammars
Goal
To put forward an improved XML comparison method integrating semantic and structural similarity
10Overview
- Introduction and motivation
- Current solutions
- Proposal
- Implementation
- Conclusion
11Overview
- Introduction and motivation
- Current solutions
- XML structural similarity
- Semantic similarity
- Proposal
- Implementation
- Conclusion
12Current solutions
- XML structural similarity
  - Most algorithms proposed in the literature use dynamic programming techniques to find the Edit Distance between trees
Edit Distance: finding the cheapest sequence of edit operations that can transform one tree into another
13Current solutions
- XML structural similarity
  - Algorithms can be distinguished by
    - The set of edit operations allowed
      - Insert node: insertion of inner/leaf nodes
      - Delete node: deletion of inner/leaf nodes
      - Update node: relabelling nodes
      - Insert tree
      - Delete tree
      - Move tree
    - The overall complexity and performance
      - O(N^2 D^2)
      - O(N^2)
      - O(N log(N))
    - Optimality
[Figure: example labeled trees]
14Overview
- Introduction and motivation
- Current solutions
- XML structural similarity
- Semantic similarity
- Proposal
- Implementation
- Conclusion
15Current Solutions
- Semantic Similarity
  - Knowledge bases (thesauri, taxonomies, ontologies) provide a framework for organizing words into a semantic space
  - Semantic similarity between two words
    - Similarity between corresponding concepts in the knowledge base
[Figure: taxonomy of concepts c1 to c9, with words/expressions attached to concepts]
16Current Solutions
- Semantic similarity
  - Several methods are proposed in the literature
    - Edge-based approaches
    - Node-based approaches
  - Node-based approaches seem more relevant
    - Experimental results yield higher correlation with human judgment
Information content of a concept c: -log p(c)
[Figure: taxonomy of concepts c1 to c9, with words/expressions attached to concepts]
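The information content measure used by node-based approaches can be illustrated with a minimal Python sketch. It assumes that p(c) is estimated from occurrence counts, where a concept's count includes the counts of all its descendants. The toy taxonomy, the counts, and the function names are hypothetical and do not come from the presentation.

```python
import math

# Hypothetical toy taxonomy (child -> parent) and occurrence counts.
PARENT = {
    "institution": "entity",
    "factory": "entity",
    "academy": "institution",
    "college": "institution",
}
OWN_COUNT = {"entity": 0, "institution": 10, "factory": 30,
             "academy": 20, "college": 40}


def subsumed_count(concept):
    """Occurrences of a concept plus those of all its descendants."""
    children = [c for c, parent in PARENT.items() if parent == concept]
    return OWN_COUNT[concept] + sum(subsumed_count(c) for c in children)


def probability(concept):
    """p(c): share of all occurrences that fall under concept c."""
    return subsumed_count(concept) / subsumed_count("entity")


def information_content(concept):
    """Information content of c, i.e. -log p(c)."""
    return -math.log(probability(concept))


for concept in OWN_COUNT:
    print(concept, round(information_content(concept), 3))
```

The root carries no information (p = 1), and information content grows as concepts become more specific.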
17Overview
- Introduction and motivation
- Current solutions
- Proposal
- Implementation
- Conclusion
18Proposal
Edit Distance algorithm (Chawathe [2])
Semantic cost model (Lin [3])
[2] Chawathe S., Comparing Hierarchical Data in Extended Memory. In Proceedings of the Twenty-fifth International Conference on Very Large Data Bases, Edinburgh, Scotland, U.K., p. 90-101, 1999
[3] Lin D., An Information-Theoretic Definition of Similarity. In Proceedings of the 15th International Conference on Machine Learning, 296-304, Morgan Kaufmann Publishers Inc., 1998
19Proposal
- We adopt Chawathe's Edit Distance algorithm [2]
  - It is a direct application of Wagner-Fischer [4]
  - It is among the fastest available
- Edit operations used
  - Insertion of leaf nodes: Ins(x, i, p, λ(x))
  - Deletion of leaf nodes: Del(x, p)
  - Update of internal/leaf nodes: Upd(x, y)
- Complexity
  - O(N^2)
[4] Wagner R. A. and Fischer M. J., The String-to-String Correction Problem. Journal of the Association for Computing Machinery, 21(1):168-173, 1974
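For reference, the Wagner-Fischer dynamic programming scheme on plain strings can be sketched as follows. This is the classical string-to-string case only, not Chawathe's tree algorithm, and the function and parameter names are hypothetical.

```python
def string_edit_distance(source, target, cost_ins=1, cost_del=1, cost_upd=1):
    """Cheapest cost of turning `source` into `target` (Wagner-Fischer)."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j]: cost of transforming source[:i] into target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + cost_del   # delete every symbol
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + cost_ins   # insert every symbol
    for i in range(1, rows):
        for j in range(1, cols):
            update = 0 if source[i - 1] == target[j - 1] else cost_upd
            dist[i][j] = min(
                dist[i - 1][j] + cost_del,       # delete source[i-1]
                dist[i][j - 1] + cost_ins,       # insert target[j-1]
                dist[i - 1][j - 1] + update,     # keep or relabel
            )
    return dist[-1][-1]


print(string_edit_distance("kitten", "sitting"))
```

The table has one cell per pair of prefixes, which is where the quadratic complexity comes from.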
20Proposal
- Intuitive cost model
  - CostIns = 1
  - CostDel = 1
  - CostUpd = 1 when x.l ≠ y.l, otherwise CostUpd = 0
A central question in most edit distance approaches: how to assign edit operation costs?
21Proposal
- Applying Chawathe's approach [2]
[Figure: trees of XML Documents A (Academy), B (College) and C (Factory), with nodes numbered in preorder]
Edit script: Upd(A1, B1), Upd(A4, B4), Del(A5, A3)
Dist(A, B) = Dist(A, C) = 3
Sim = 1 / (1 + Dist)
Sim(A, B) = Sim(A, C) = 0.25
How can Semantic Similarity be taken into account?
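The arithmetic behind this result can be reproduced with a short Python sketch under the intuitive (unit) cost model. The edit script from A to C is assumed to mirror the one from A to B, and the function names are hypothetical.

```python
def unit_update_cost(label_x, label_y):
    """Intuitive cost model: relabelling costs 1 unless the labels match."""
    return 0 if label_x == label_y else 1


UNIT_DELETE_COST = 1  # CostDel = 1


def similarity(distance):
    """Sim = 1 / (1 + Dist)."""
    return 1 / (1 + distance)


# Edit script from A to B: Upd(A1, B1), Upd(A4, B4), Del(A5, A3)
dist_ab = (unit_update_cost("Academy", "College")
           + unit_update_cost("Professor", "Lecturer")
           + UNIT_DELETE_COST)  # deleting the Student node

# Assumed analogous script from A to C.
dist_ac = (unit_update_cost("Academy", "Factory")
           + unit_update_cost("Professor", "Supervisor")
           + UNIT_DELETE_COST)

print(dist_ab, dist_ac, similarity(dist_ab), similarity(dist_ac))
```

With unit costs, any two different labels are equally far apart, so B and C cannot be told apart from A's point of view.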
22Proposal
- Semantic cost model
  - Varying operation costs w.r.t. the semantic relatedness of node labels
    - CostSem_Op(x, y)
  - Varying operation costs w.r.t. the corresponding node depths
    - CostDepth_Op(x)
Solution
We propose to vary edit operation costs according to the semantics of the nodes concerned
CostOp(x, y) = CostSem_Op(x, y) × CostDepth_Op(x) ∈ [0, 1]
23Proposal
- Label semantic similarity cost
- Edit operations
  - CostSem_Upd(x, y) = 1 - SimSem(x.l, y.l)
  - CostSem_Ins(x, i, p, λ(x)) = 1 - SimSem(λ(x), p.l)
  - CostSem_Del(x, p) = 1 - SimSem(x.l, p.l)
CostSem_Op decreases when SimSem increases
CostSem_Op increases when SimSem decreases
24Proposal
- Label semantic similarity cost
  - Semantic similarity measure adopted: Lin [3]
    - SimSem(C1, C2) = 2 log P(C) / (log P(C1) + log P(C2)) ∈ [0, 1]
      with C the lowest common ancestor of C1 and C2 (maximizing their pair-wise similarity value)
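Lin's measure can be sketched in Python on a small single-parent taxonomy. The taxonomy, the probability values, and the function names are hypothetical and chosen only for illustration. With a single parent per concept, the lowest common ancestor is unique, so no maximization over candidate ancestors is needed here.

```python
import math

# Hypothetical taxonomy (child -> parent) and concept probabilities P(c).
PARENT = {"entity": None, "institution": "entity", "factory": "entity",
          "academy": "institution", "college": "institution"}
P = {"entity": 1.0, "institution": 0.7, "factory": 0.3,
     "academy": 0.2, "college": 0.4}


def ancestors(concept):
    """Path from a concept up to the root, the concept itself included."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def lowest_common_ancestor(c1, c2):
    """First concept on c1's path to the root that also subsumes c2."""
    above_c2 = set(ancestors(c2))
    return next(c for c in ancestors(c1) if c in above_c2)


def sim_lin(c1, c2):
    """SimSem(C1, C2) = 2 log P(C) / (log P(C1) + log P(C2))."""
    denominator = math.log(P[c1]) + math.log(P[c2])
    if denominator == 0:  # both concepts are the root
        return 1.0
    common = lowest_common_ancestor(c1, c2)
    return 2 * math.log(P[common]) / denominator


print(sim_lin("academy", "college"), sim_lin("academy", "factory"))
```

In this toy setting Academy and College share the ancestor "institution", while Academy and Factory only meet at the root, so the first pair scores higher.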
25Proposal
- Label semantic similarity cost
- Example
  - CostSem_Upd(A1, B1) = 1 - SimSem(Academy, College)
  - CostSem_Upd(A1, C1) = 1 - SimSem(Academy, Factory)
  - SimSem(Academy, College) > SimSem(Academy, Factory)
  - CostSem_Upd(A1, B1) < CostSem_Upd(A1, C1)
  - Dist(A, B) < Dist(A, C)
Edit script: Upd(A1, B1), Upd(A4, B4), Del(A5, A3)
[Figure: trees of XML Documents A (Academy), B (College) and C (Factory), with nodes numbered in preorder]
Sim(A, B) > Sim(A, C)
26Proposal
- Semantic cost model
  - Varying operation costs w.r.t. the semantic relatedness of node labels
    - CostSem_Op(x, y)
  - Varying operation costs w.r.t. the node depths
    - CostDepth_Op(x)
CostOp(x, y) = CostSem_Op(x, y) × CostDepth_Op(x) ∈ [0, 1]
27Proposal
- Node depth cost
  - CostDepth_Op(x) = 1 / (1 + x.d) ∈ [0, 1]
- Information becomes increasingly specific as one descends in the XML tree hierarchy
  - Its semantic effect on the whole XML document decreases accordingly
- Editing the root node of a document tree
  - CostDepth_Op(root) = 1
- Operation costs decrease when moving downward in the hierarchy
[Figure: tree of XML Document A (Academy, Department, Laboratory, Professor, Student), with the label Hospital shown next to the root Academy]
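The depth cost can be sketched in Python, together with one way of combining it with a semantic cost. The combination by multiplication is an assumption: the operator is not legible in the slide, and a product is the reading that keeps the overall cost within [0, 1]. The function names and the semantic cost value used below are hypothetical.

```python
def depth_cost(depth):
    """CostDepth_Op(x) = 1 / (1 + x.d), with the root at depth 0."""
    return 1.0 / (1 + depth)


def operation_cost(semantic_cost, depth):
    """Assumed combination: semantic cost scaled by the depth cost."""
    return semantic_cost * depth_cost(depth)


# Depths of the nodes of Document A.
DEPTHS = {"Academy": 0, "Department": 1, "Laboratory": 2,
          "Professor": 3, "Student": 3}

for label, depth in DEPTHS.items():
    print(label, round(depth_cost(depth), 3))

# A hypothetical semantic cost of 0.5 weighs less deeper in the tree.
print(operation_cost(0.5, 0), operation_cost(0.5, 1))
```

Editing the root keeps its full cost, while the same edit on a leaf at depth 3 counts for a quarter of it.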
28Proposal
Proposal
- Edit Distance computations (Chawathe [2])
- Semantic similarity evaluation (Lin [3])
Hybrid XML Comparison Approach
29Overview
- Introduction and motivation
- Current solutions
- Proposal
- Implementation
- Conclusion
30Implementation
- Prototype: XS3 (XML Structure and Semantic Similarity)
- XML documents comparison
  - 1/1
  - 1/8: ranking documents according to their similarity degrees
  - 8/8: XML documents classification/clustering
31Implementation
- Synthetic XML documents generator
  - Producing sets of XML documents based on given DTDs
- Taxonomic analyzer
  - Computing semantic similarity values between words in a given knowledge base (taxonomy)
32Implementation
- Experimental results
  - Higher average similarity values, underlining similarities (of semantic nature) that were previously undetected
  - Clear distinction between documents corresponding to different DTDs
  - Capturing semantic affinities between document sets
DTD1
    <!DOCTYPE DTD1 [
      <!ELEMENT Academy (Administrative unit)>
      <!ELEMENT Administrative unit (Branch?)>
      <!ELEMENT Branch (Educator?, Student)>
      <!ELEMENT Educator (#PCDATA)>
      <!ELEMENT Student (#PCDATA)>
    ]>
DTD2
    <!DOCTYPE DTD2 [
      <!ELEMENT School (Administrative unit)>
      <!ELEMENT Administrative unit (Section?)>
      <!ELEMENT Section (Educator?, Scholar)>
      <!ELEMENT Educator (#PCDATA)>
      <!ELEMENT Scholar (#PCDATA)>
    ]>
DTD3
    <!DOCTYPE DTD3 [
      <!ELEMENT Government (Administrative unit)>
      <!ELEMENT Administrative unit (Section?)>
      <!ELEMENT Section (Professional?, Worker)>
      <!ELEMENT Professional (#PCDATA)>
      <!ELEMENT Worker (#PCDATA)>
    ]>
[Figure: similarity values (axis from 0.085 to 0.099) for two series: "Combined structural and semantic similarity" and "Structural similarity"]
33Implementation
- Experimental results
  - Chawathe's classical Edit Distance process [2] is linear in the number of nodes of each tree: O(|A| × |B|)
Our approach is of polynomial complexity
[Figure: timing plots; axes: Number of nodes in each taxonomy, Time (m), Time (s)]
34Overview
- Introduction and motivation
- Current approaches
- Proposal
- Implementation
- Conclusion
35Conclusion
- Goal: developing an integrated semantic and structure based XML similarity approach, for comparing XML documents, taking into account
  - Semantic meaning of XML elements/attributes w.r.t. their labels and depths
  - Structural characteristics of XML documents
- This is the first attempt to combine Edit Distance structural similarity computations with IR semantic similarity assessment, in an XML context
- Experimental results are satisfactory
36Conclusion
- Future work
  - Exploiting semantic similarity to compare, not only the structure of XML documents, but also their information content (values)
    - In such a framework, XML Schemas seem indispensable
  - Studying XML similarity in a multimedia context (MPEG7, SVG, ...)
    - Taking into consideration structural, semantic, as well as multimedia-specific criteria
    <Factory>
      <Department>
        <Laboratory>
          <Product> BMW Z3 </Product>
          <Product> BMW X5 </Product>
        </Laboratory>
      </Department>
    </Factory>
37Thank you
Questions