A Hybrid Approach for XML Similarity

About This Presentation
Title:

A Hybrid Approach for XML Similarity

Slides: 38
Provided by: richard1016

Transcript and Presenter's Notes

Title: A Hybrid Approach for XML Similarity


1
A Hybrid Approach for XML Similarity
  • Joe TEKLI
  • Richard CHBEIR
  • Kokou YETONGNON

2
Overview
  • Introduction and motivation
  • Current Solutions
  • Proposal
  • Implementation
  • Conclusion

3
Introduction and motivation
  • XML (eXtensible Markup Language)
  • Major means for efficient data representation and
    management.
  • An XML document can be represented as an Ordered
    Labeled Tree
  • Example
      <Academy>
        <Department>
          <Laboratory>
            <Professor>Martin R.</Professor>
            <Student>Roberts J.</Student>
          </Laboratory>
        </Department>
      </Academy>

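Plan Introduction Current Solutions Proposal Implementation Conclusion
[Figure: the example document drawn as an ordered labeled tree (Academy, Department, Laboratory, Professor, Student), with nodes numbered 1 to 5 and annotated with their node depths]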
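The tree view of the example document can be sketched in a few lines of Python. This is only an illustration using the standard `xml.etree.ElementTree` module; the helper name `to_labeled_tree` and the choice of a preorder list of (label, depth) pairs are assumptions made here, not something taken from the slides.

```python
import xml.etree.ElementTree as ET

# The example document from the slide.
XML_DOC = """
<Academy>
  <Department>
    <Laboratory>
      <Professor>Martin R.</Professor>
      <Student>Roberts J.</Student>
    </Laboratory>
  </Department>
</Academy>
"""

def to_labeled_tree(element, depth=0):
    """Return the preorder list of (label, depth) pairs of a subtree."""
    nodes = [(element.tag, depth)]
    # ElementTree yields children in document order, which preserves
    # the sibling order of the ordered labeled tree.
    for child in element:
        nodes.extend(to_labeled_tree(child, depth + 1))
    return nodes

tree_nodes = to_labeled_tree(ET.fromstring(XML_DOC.strip()))
for rank, (label, depth) in enumerate(tree_nodes, start=1):
    print(rank, label, depth)
```

The root Academy comes out at depth 0, and Professor and Student share depth 3. Element text values ("Martin R.", "Roberts J.") are ignored on purpose, since only labels and structure are being considered.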
4
Introduction and motivation
  • XML has become inevitable
  • Current applications
  • Information description, storage and retrieval
  • Database information interchange
  • Web services interaction

Information destined to be broadcast over the
web is now represented using XML
5
Introduction and motivation
Emergent need: information retrieval requires XML
document comparison
6
Introduction and motivation
  • A range of algorithms for comparing
    semi-structured data, e.g. XML documents, have
    been proposed
  • Generally exploit the concept of Edit Distance
  • Focus on the structure of XML documents
  • Ignore the semantics involved


However, in the field of information retrieval
(IR), estimating semantic similarity between web
pages is of key importance to improving search
results [1]
[1] Maguitman A. G., Menczer F., Roinestad H. and
Vespignani A., Algorithmic Detection of Semantic
Similarity. In Proceedings of the 14th
International World Wide Web Conference, 107-116,
Chiba, Japan, 2005
7
Introduction and motivation
  • Example

XML Document A
    <Academy>
      <Department>
        <Laboratory>
          <Professor></Professor>
          <Student></Student>
        </Laboratory>
      </Department>
    </Academy>
XML Document B
    <College>
      <Department>
        <Laboratory>
          <Lecturer></Lecturer>
        </Laboratory>
      </Department>
    </College>
XML Document C
    <Factory>
      <Department>
        <Laboratory>
          <Supervisor></Supervisor>
        </Laboratory>
      </Department>
    </Factory>
[Figure: the three documents drawn as labeled trees]
Sim(A, B) ? Sim(A, C)
8
Introduction and motivation
Motivation
How can existing XML comparison approaches be
enhanced to take into account both the structural
and semantic characteristics of XML documents?
9
Introduction and motivation
  • We consider heterogeneous XML documents, lacking
    predefined grammars (DTDs or XML Schemas)
  • XML documents published on the web are often found
    without grammars

Goal: to put forward an improved XML comparison
method integrating semantic and structural similarity
10
Overview
  • Introduction and motivation
  • Current solutions
  • Proposal
  • Implementation
  • Conclusion

11
Overview
  • Introduction and motivation
  • Current solutions
  • XML structural similarity
  • Semantic similarity
  • Proposal
  • Implementation
  • Conclusion

12
Current solutions
  • XML structural similarity
  • Most algorithms proposed in the literature
    use dynamic programming techniques to find
    the Edit Distance between trees

Finding the cheapest sequence of edit operations
that can transform one tree into another
13
Current solutions
  • XML structural similarity
  • Algorithms can be distinguished according to
  • The set of edit operations allowed
  • Insert node: insertion of inner/leaf nodes
  • Delete node: deletion of inner/leaf nodes
  • Update node: relabelling nodes
  • Insert tree
  • Delete tree
  • Move tree
  • The overall complexity and performance
  • O(N² D²)
  • O(N²)
  • O(N log(N))

Optimality
14
Overview
  • Introduction and motivation
  • Current solutions
  • XML structural similarity
  • Semantic similarity
  • Proposal
  • Implementation
  • Conclusion

15
Current Solutions
  • Semantic Similarity
  • Knowledge bases (Thesauri, taxonomies,
    ontologies) provide a framework for organizing
    words into a semantic space
  • Semantic similarity between two words is evaluated
    as the similarity between their corresponding
    concepts in the knowledge base

[Figure: a knowledge base drawn as a hierarchy of concepts (c1 to c9), with words/expressions attached to concepts]
16
Current Solutions
  • Semantic similarity
  • Several methods have been proposed in the literature
  • Edge-based approaches
  • Node-based approaches
  • Node-based approaches seem more relevant
  • Experimental results yield higher correlation
    with human judgment

[Figure: a knowledge base drawn as a hierarchy of concepts (c1 to c9), with words/expressions attached to concepts]
Information content of a concept c: -log p(c)
17
Overview
  • Introduction and motivation
  • Current solutions
  • Proposal
  • Implementation
  • Conclusion

18
Proposal
  • Hybrid approach

Edit Distance algorithm (Chawathe [2])
Semantic cost model (Lin [3])
[2] Chawathe S., Comparing Hierarchical Data in
External Memory. In Proceedings of the
Twenty-fifth International Conference on Very
Large Data Bases, Edinburgh, Scotland, U.K., p.
90-101, 1999
[3] Lin D., An Information-Theoretic Definition
of Similarity. In Proceedings of the 15th
International Conference on Machine Learning,
296-304, Morgan Kaufmann Publishers Inc., 1998
19
Proposal
  • We adopt Chawathe's Edit Distance algorithm [2]
  • It is a direct application of the Wagner-Fischer
    algorithm [4]
  • It is among the fastest available
  • Edit operations used
  • Insertion of leaf nodes - Ins(x, i, p, λ(x))
  • Deletion of leaf nodes - Del(x, p)
  • Update of internal/leaf nodes - Upd(x, y)
  • Complexity
  • O(N²)

[4] Wagner R. and Fischer M., The String-to-String
Correction Problem. Journal of the Association
for Computing Machinery, 21(1):168-173, 1974
20
Proposal
  • Intuitive cost model
  • CostIns = 1
  • CostDel = 1
  • CostUpd = 1 when x.l ≠ y.l, otherwise
    CostUpd = 0

A central question in most edit distance
approaches: how to assign edit operation costs?
21
Proposal
  • Applying Chawathe's approach [2]

[Figure: trees of XML Documents A (Academy, Department, Laboratory, Professor, Student), B (College, Department, Laboratory, Lecturer) and C (Factory, Department, Laboratory, Supervisor), with numbered nodes]
Edit script (A to B): Upd(A1, B1), Upd(A4, B4), Del(A5, A3)
Dist(A, B) = Dist(A, C) = 3
Sim = 1 / (1 + Dist)
Sim(A, B) = Sim(A, C) = 0.25
How can semantic similarity be taken into account?
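The distance and similarity values above can be reproduced with a small dynamic-programming sketch. It is a simplified, depth-constrained variant of the Wagner-Fischer recurrence written for illustration, not a faithful reproduction of Chawathe's algorithm [2]. The (label, depth) encoding, the function names, and the exact conditions under which insertions and deletions are allowed are assumptions made here.

```python
from math import inf

# Each tree is a preorder list of (label, depth) pairs (assumed encoding).
DOC_A = [("Academy", 0), ("Department", 1), ("Laboratory", 2),
         ("Professor", 3), ("Student", 3)]
DOC_B = [("College", 0), ("Department", 1), ("Laboratory", 2),
         ("Lecturer", 3)]
DOC_C = [("Factory", 0), ("Department", 1), ("Laboratory", 2),
         ("Supervisor", 3)]

def unit_upd_cost(x, y):
    """Intuitive cost model: 1 when labels differ, 0 otherwise."""
    return 0.0 if x[0] == y[0] else 1.0

def tree_edit_distance(a, b, cost_upd=unit_upd_cost,
                       cost_ins=lambda y: 1.0, cost_del=lambda x: 1.0):
    """Cheapest edit script cost between two (label, depth) sequences."""
    m, n = len(a), len(b)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + cost_del(a[i - 1])
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + cost_ins(b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            x, y = a[i - 1], b[j - 1]
            best = inf
            # Update only pairs nodes lying at the same depth.
            if x[1] == y[1]:
                best = dist[i - 1][j - 1] + cost_upd(x, y)
            # Delete x only if the next node of b is not deeper than x.
            if j == n or b[j][1] <= x[1]:
                best = min(best, dist[i - 1][j] + cost_del(x))
            # Insert y only if the next node of a is not deeper than y.
            if i == m or a[i][1] <= y[1]:
                best = min(best, dist[i][j - 1] + cost_ins(y))
            dist[i][j] = best
    return dist[m][n]

def similarity(distance):
    """Sim = 1 / (1 + Dist)."""
    return 1.0 / (1.0 + distance)

for name, doc in (("B", DOC_B), ("C", DOC_C)):
    d = tree_edit_distance(DOC_A, doc)
    print(f"Dist(A, {name}) = {d}, Sim(A, {name}) = {similarity(d)}")
```

With unit costs both comparisons give a distance of 3 and a similarity of 0.25, so documents B and C cannot be told apart. The cost functions are passed in as parameters so that a different cost model can be plugged in without touching the recurrence.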
22
Proposal
  • Semantic cost model
  • Varying operations costs w.r.t. the semantic
    relatedness of node labels
  • CostSem_Op(x, y)
  • Varying costs w.r.t. corresponding node depths
  • CostDepth_Op(x)

Solution: we propose to vary edit operation costs
according to the semantics of the nodes concerned
CostOp(x, y) = CostSem_Op(x, y) × CostDepth_Op(x) ∈ [0, 1]
23
Proposal
  • Label semantic similarity cost
  • Edit operations
  • CostSem_Upd(x, y) = 1 - SimSem(x.l, y.l)
  • CostSem_Ins(x, i, p, λ(x)) = 1 - SimSem(λ(x),
    p.l)
  • CostSem_Del(x, p) = 1 - SimSem(x.l, p.l)

CostSem_Op decreases when SimSem increases
CostSem_Op increases when SimSem decreases
24
Proposal
  • Label semantic similarity cost
  • Semantic similarity measure adopted: Lin [3]
  • SimSem(C1, C2) = 2 log P(C) / (log P(C1) + log P(C2)) ∈ [0, 1],
    with C the lowest common ancestor
    of C1 and C2 (maximizing their pair-wise
    similarity value)
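Lin's measure can be illustrated on a toy taxonomy. Everything in the sketch below apart from the formula itself is hypothetical: the concept hierarchy, the occurrence counts and the function names are made up for illustration and do not come from the knowledge base used by the authors.

```python
import math

# Hypothetical toy taxonomy (child -> parent) with made-up occurrence counts.
PARENT = {
    "Organization": "Entity",
    "Person": "Entity",
    "EducationalInstitution": "Organization",
    "Factory": "Organization",
    "Academy": "EducationalInstitution",
    "College": "EducationalInstitution",
}
COUNT = {"Academy": 10, "College": 10, "Factory": 10, "Person": 30}

def ancestors(concept):
    """Return the concept followed by its ancestors up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def probability(concept):
    """P(c): share of occurrences falling on c or one of its descendants."""
    total = sum(COUNT.values())
    covered = sum(n for c, n in COUNT.items() if concept in ancestors(c))
    return covered / total

def lin_similarity(c1, c2):
    """SimSem(C1, C2) = 2 log P(C) / (log P(C1) + log P(C2))."""
    chain2 = set(ancestors(c2))
    # First shared concept on the way up from c1: the lowest common ancestor.
    lca = next(c for c in ancestors(c1) if c in chain2)
    denom = math.log(probability(c1)) + math.log(probability(c2))
    if denom == 0:  # both concepts have probability 1 (the root itself)
        return 1.0
    return 2 * math.log(probability(lca)) / denom

sim_college = lin_similarity("Academy", "College")
sim_factory = lin_similarity("Academy", "Factory")
# Semantic update costs follow as 1 - SimSem.
cost_upd_college = 1 - sim_college
cost_upd_factory = 1 - sim_factory
print(round(sim_college, 3), round(sim_factory, 3))
```

Under these made-up counts, Academy is closer to College than to Factory, so relabelling Academy as College is cheaper than relabelling it as Factory. The absolute values depend entirely on the chosen counts and carry no meaning beyond the ordering.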
25
Proposal
  • Label semantic similarity cost
  • Example
  • CostSem_Upd(A1, B1) = 1 - SimSem(Academy,
    College)
  • CostSem_Upd(A1, C1) = 1 - SimSem(Academy,
    Factory)
  • SimSem(Academy, College) > SimSem(Academy,
    Factory)
  • CostSem_Upd(A1, B1) < CostSem_Upd(A1,
    C1)
  • Dist(A, B) < Dist(A, C)

Edit script: Upd(A1, B1), Upd(A4, B4), Del(A5, A3)
[Figure: trees of XML Documents A, B and C with numbered nodes]
Sim(A, B) > Sim(A, C)
26
Proposal
  • Semantic cost model
  • Varying operations costs w.r.t. the semantic
    relatedness of node labels
  • CostSem_Op(x, y)
  • Varying operations costs w.r.t. the node depths
  • CostDepth_Op(x)

CostOp(x, y) = CostSem_Op(x, y) × CostDepth_Op(x) ∈ [0, 1]
27
Proposal
  • Node depth cost
  • CostDepth_Op(x) = 1 / (1 + x.d) ∈ [0, 1]
  • Information becomes increasingly specific as one
    descends in the XML tree hierarchy
  • Its semantic effect on the whole XML document
    decreases accordingly
  • Editing the root node of a document tree
  • CostDepth_Op(root) = 1
  • Operation costs decrease when
  • moving downward in the hierarchy

[Figure: tree of XML Document A (Academy, Department, Laboratory, Professor, Student), with the label Hospital shown next to the root node Academy]
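The effect of node depth on operation costs can be sketched as follows. The function names are hypothetical, the semantic cost of 0.4 is an arbitrary illustrative value, and the sketch assumes that the semantic and depth factors are combined by multiplication.

```python
def cost_depth(depth):
    """CostDepth_Op(x) = 1 / (1 + x.d): deeper edits weigh less."""
    return 1.0 / (1.0 + depth)

def cost_op(cost_sem, depth):
    """Combined cost, assuming the two factors are multiplied."""
    return cost_sem * cost_depth(depth)

# The same (arbitrary) semantic cost applied at the root and at depth 3.
root_cost = cost_op(0.4, depth=0)
leaf_cost = cost_op(0.4, depth=3)
print(root_cost, leaf_cost)
```

An edit on the root keeps its full semantic cost, while the same edit three levels down counts for a quarter of it. This matches the idea that information becomes more specific, and less influential on the whole document, further down the tree.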
28
Proposal
  • Edit Distance computations (Chawathe [2])
  • Semantic similarity evaluation (Lin [3])
  • Hybrid XML Comparison Approach
29
Overview
  • Introduction and motivation
  • Current solutions
  • Proposal
  • Implementation
  • Conclusion

30
Implementation
  • Prototype: XS3 (XML Structure and Semantic
    Similarity)
  • XML documents comparison
  • 1/1
  • 1/8 ranking documents according to their
    similarity degrees
  • 8/8 XML documents classification/clustering

31
Implementation
  • Synthetic XML documents generator
  • Producing sets of XML documents based on given
    DTDs
  • Taxonomic analyzer
  • Computing semantic similarity values between
    words in a given knowledge base (taxonomy)

32
Implementation
  • Experimental results
  • Higher average similarity values, underlining
    similarities (of semantic nature) that were
    previously undetected
  • Straight distinction between documents
    corresponding to different DTDs
  • Capturing semantic affinities between document
    sets

DTD1
    <!DOCTYPE DTD1 [
      <!ELEMENT Academy (Administrative unit)>
      <!ELEMENT Administrative unit (Branch?)>
      <!ELEMENT Branch (Educator?, Student)>
      <!ELEMENT Educator (#PCDATA)>
      <!ELEMENT Student (#PCDATA)>
    ]>
DTD2
    <!DOCTYPE DTD2 [
      <!ELEMENT School (Administrative unit)>
      <!ELEMENT Administrative unit (Section?)>
      <!ELEMENT Section (Educator?, Scholar)>
      <!ELEMENT Educator (#PCDATA)>
      <!ELEMENT Scholar (#PCDATA)>
    ]>
DTD3
    <!DOCTYPE DTD3 [
      <!ELEMENT Government (Administrative unit)>
      <!ELEMENT Administrative unit (Section?)>
      <!ELEMENT Section (Professional?, Worker)>
      <!ELEMENT Professional (#PCDATA)>
      <!ELEMENT Worker (#PCDATA)>
    ]>
[Figure: average similarity values (axis range 0.085 to 0.099); series: structural similarity, combined structural and semantic similarity]
33
Implementation
  • Experimental results
  • Chawathe's classical Edit Distance process [2]
    is linear in the number of nodes of each tree:
    O(|A| × |B|)

Our approach is of polynomial complexity
[Figure: timing plots; axes: number of nodes in each taxonomy, time (m), time (s)]
34
Overview
  • Introduction and motivation
  • Current approaches
  • Proposal
  • Implementation
  • Conclusion

35
Conclusion
  • Goal: developing an integrated semantic and
    structure-based XML similarity approach for
    comparing XML documents, taking into account
  • Semantic meaning of XML elements/attributes
    w.r.t. their labels and depths
  • Structural characteristics of XML documents
  • This is the first attempt to combine Edit
    Distance structural similarity computations with
    IR semantic similarity assessment, in an XML
    context
  • Experimental results are satisfactory

36
Conclusion
  • Future work
  • Exploiting semantic similarity to compare, not
    only the structure of XML documents, but also
    their information content (values)
  • In such a framework, XML Schemas seem
    unsurpassable
  • Studying XML similarity in a multimedia context
    (MPEG7, SVG, ...)
  • Taking into consideration structural, semantic,
    as well as multimedia-specific criteria

    <Factory>
      <Department>
        <Laboratory>
          <Product> BMW Z3 </Product>
          <Product> BMW X5 </Product>
        </Laboratory>
      </Department>
    </Factory>
37
Thank you
Questions