# A Summary of XISS and Index Fabric


1
A Summary of XISS and Index Fabric
• Ho Wai Shing

2
Contents
• Definition of Terms
• XISS (Li and Moon, VLDB2001)
• Numbering Scheme
• Indices Stored
• Join Algorithms
• Index Fabric (Cooper et al, VLDB2001)
• Patricia
• Balanced Trie
• Raw Path Index

3
Definition of Terms
• Absolute Path Expression (APE)
• a path that starts from the root; each step traverses the child axis or the attribute axis, with no wildcards
• e.g., /, /A/B, /A/@C

4
Definition of Terms
• Regular Path Expression (RPE)
• may or may not start from the root
• may traverse different axes (restricted here to child, descendant-or-self, and attribute, since they are the most commonly used)
• may contain wildcards
• e.g., //, /A//C, /A/*/B, //A/B//C/D/@E

5
XISS
• XISS: XML Indexing and Storage System
• by Li and Moon, published in VLDB 2001 under the title "Indexing and Querying XML Data for Regular Path Expressions"
• decomposes XML documents and stores them in indices
• can answer regular path expressions

6
XISS - General Idea
• solves an RPE by decomposing it into these 5 basic subexpressions
• element retrieval
• attribute retrieval
• steps that involve an element and an attribute
• steps that involve two elements
• a Kleene closure of another subexpression

7
XISS - General Idea
• each subexpression is solved by its own method
• element index lookup
• attribute index lookup
• EA-join
• EE-join
• KC-join
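The mapping from steps to join methods can be illustrated with a small sketch. It is only an assumption about how a decomposition might look: the function name `decompose` and the tuple layout are hypothetical, it ignores a leading axis, and it does not handle wildcards or Kleene closures.

```python
import re

def decompose(rpe):
    """Split a simple path such as A/B//C/@D into (left, axis, right, method) steps."""
    # Tokens alternate between names and axes: A, /, B, //, C, /, @D
    tokens = re.findall(r"//|/|[^/]+", rpe.lstrip("/"))
    names, axes = tokens[0::2], tokens[1::2]
    steps = []
    for left, axis, right in zip(names, axes, names[1:]):
        # A step ending in an attribute uses EA-join; element-element steps use EE-join
        method = "EA-join" if right.startswith("@") else "EE-join"
        steps.append((left, axis, right, method))
    return steps

steps = decompose("A/B//C/@D")
```

Each name would first be resolved by an element or attribute index lookup, and the listed join would then combine the two record lists of a step.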

8
XISS - General Idea
• result lists from the subexpressions are joined
to produce the final result
• to make this decomposition and join efficient, an
efficient method to determine ancestor-descendant
relationship is needed
• XISS uses an extended preorder based numbering
scheme

9
XISS - Numbering Scheme
• number every node with an <order, size> tuple
• order is assigned based on an extended preorder traversal
• size can be thought of as the size of the subtree rooted at that node

10
XISS - Numbering Scheme
• The rules for number assignment
• if x precedes y in the preorder traversal, x.order < y.order (preorder)
• if x and y are siblings, either x.order + x.size < y.order or y.order + y.size < x.order (siblings won't overlap)
• if x is an ancestor of y, x.order < y.order < x.order + x.size (ancestor contains descendant)
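A minimal sketch of these rules, assuming a plain preorder walk that leaves one unused order after every subtree so that the strict inequalities hold. The `Node` class, the gap policy, and the sample tree are illustrative assumptions, not the heuristics XISS itself uses.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    order: int = 0
    size: int = 0

def number(node, next_order=1):
    """Assign <order, size> in preorder; return the next order a sibling may use."""
    node.order = next_order
    nxt = next_order + 1
    for child in node.children:
        nxt = number(child, nxt)
    # The interval (order, order + size) covers every descendant numbered so far
    node.size = nxt - node.order
    # Assumed gap policy: skip one order so that siblings never touch
    return node.order + node.size + 1

def is_ancestor(x, y):
    """Ancestor test from the rules above: x.order < y.order < x.order + x.size."""
    return x.order < y.order < x.order + x.size

# Hypothetical document: A has children B and D; B has child C
c = Node("C")
b = Node("B", [c])
d = Node("D")
a = Node("A", [b, d])
number(a)
```

With this numbering the ancestor-descendant test needs only two integer comparisons and no tree traversal, which is what makes the later joins cheap.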

11
XISS - Numbering Scheme
• Actual Assignment
• uses heuristics to reserve some space between orders
• reserves extra space in the sizes for future node insertions
• attributes are placed before sibling elements

12
XISS - Index Organization
• There are 5 indices
• Name Index
• Element Index
• Attribute Index
• Structure Index
• Value Table

13
XISS - Name Index
• maps an element or attribute name to a name identifier (nid)
• the nid represents that element or attribute in further query evaluation
• reduces string-comparison time in later index lookups
• stored in a B-tree

14
XISS - Name Index
• (figure) a B-tree that maps each Name to its nid

15
XISS - Value Table
• stores all the string values of the XML document

16
XISS - Element Index
• input: nid; output: a list of element records
• implemented as a B-tree
• each leaf points to a list of document IDs (did); each entry in that list points to the list of all elements with the same name in that document

17
XISS - Element Index
• (figure) a B-tree keyed by nid; each leaf leads to a did list, and each did entry leads to an element list
• each element record holds <order, size>, Depth, ParentID

18
XISS - Attribute Index
• very similar to the element index
• each attribute record always has a value identifier (vid)

19
XISS - Structure Index
• input: did; output: an array containing all the elements and attributes in the document
• implemented as a B-tree

20
XISS - Structure Index
• (figure) a B-tree keyed by did; each leaf leads to a record array
• each record holds nid, <order, size>, Parent order, Child order, Sibling order, Attribute order

21
XISS - Indices
• When to use which index?
• first use the Name Index to find the nid of the element/attribute to be queried
• search the Element/Attribute Index for the records
• if values are needed, look them up in the Value Table
• use the Structure Index to rebuild or traverse the XML document tree
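The lookup order above can be sketched with plain dictionaries standing in for the B-trees. All names, identifiers, and records below are made up for illustration, and treating the parent field as the parent's order is an assumption.

```python
from collections import namedtuple

# Record layouts loosely follow the slides: <order, size>, depth, parent (+ vid for attributes)
ElementRec = namedtuple("ElementRec", "order size depth parent")
AttrRec = namedtuple("AttrRec", "order size depth parent vid")

name_index = {"buyer": 1, "name": 2, "id": 3}          # name -> nid
value_table = {101: "b-42"}                             # vid -> string value
element_index = {                                       # nid -> did -> element records
    1: {7: [ElementRec(2, 5, 2, 1)]},
    2: {7: [ElementRec(5, 1, 3, 2)]},
}
attribute_index = {3: {7: [AttrRec(3, 1, 3, 2, 101)]}}  # nid -> did -> attribute records

def lookup_elements(name, did):
    """Name Index first, then the Element Index."""
    nid = name_index.get(name)
    return element_index.get(nid, {}).get(did, [])

def attribute_values(name, did):
    """Name Index, then the Attribute Index, then the Value Table for the strings."""
    nid = name_index.get(name)
    records = attribute_index.get(nid, {}).get(did, [])
    return [value_table[r.vid] for r in records]
```

Resolving names to integer nids once at the start means every later comparison is on integers rather than strings.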

22
XISS - Join Algorithms
• After getting the record lists from each subexpression, we need to find out which records are actually related by the path
• e.g., to find /A/B, we get a record list of all A elements and another list of all B elements, and we have to find out which Bs satisfy A/B

23
XISS - Join Algorithms
• Three join algorithms proposed
• EA-join: merges an element record list and an attribute record list (solves A/@B)
• EE-join: merges two element record lists (solves A/B or A//B)
• KC-join: self-merges an element record list (solves a Kleene closure of E)

24
XISS - EA-Join
• to solve E/@A
• input: an element record list and an attribute record list
• finds the attribute records whose parents are in the element record list
• both lists are sorted by did and then by order

25
XISS - EA-join
• 2-stage sort-merge
• group by did first
• then merge by order
• output criterion: E is the parent of A
• a single scan of both lists is enough
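A minimal sketch of such a single-scan merge. The `Rec` layout is hypothetical, and the sketch assumes each attribute record stores its parent's order and that attributes directly follow their parent element in the numbering, so a list sorted by (did, order) is also sorted by parent.

```python
from collections import namedtuple

Rec = namedtuple("Rec", "did order size parent")  # parent = order of the parent node

def ea_join(elements, attributes):
    """Merge two lists sorted by (did, order); emit (element, attribute) pairs
    where the element is the attribute's parent."""
    out, i = [], 0
    for a in attributes:
        # Advance past elements that sort before this attribute's parent
        while i < len(elements) and (elements[i].did, elements[i].order) < (a.did, a.parent):
            i += 1
        # The element pointer never moves backwards, so each list is scanned once
        if i < len(elements) and (elements[i].did, elements[i].order) == (a.did, a.parent):
            out.append((elements[i], a))
    return out

# Made-up input: two attributes under element 2 of document 1, one unmatched attribute
elements = [Rec(1, 2, 5, 1), Rec(1, 8, 3, 1), Rec(2, 2, 3, 1)]
attributes = [Rec(1, 3, 1, 2), Rec(1, 4, 1, 2), Rec(2, 5, 1, 4)]
pairs = ea_join(elements, attributes)
```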

26
XISS - EE-join
• to solve E/*/E, e.g., E/E, E//E, E/*/E
• input: two element record lists, E and F
• output: (e, f) where e is an ancestor of f
• also uses a 2-stage sort-merge
• however, parts of the lists may need to be scanned multiple times (for special cases, e.g., when the document has /A/A/B/B)
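The rescanning can be seen in a simplified sketch. This is not the algorithm from the paper, only an interval-based merge under the numbering rule x.order < y.order < x.order + x.size; the `Rec` layout and the sample numbers are assumptions.

```python
from collections import namedtuple

Rec = namedtuple("Rec", "did order size")

def ee_join(E, F):
    """Both lists sorted by (did, order); emit (e, f) where e is an ancestor of f."""
    out, start = [], 0
    for e in E:
        # F records at or before e cannot be descendants of e or of any later e
        while start < len(F) and (F[start].did, F[start].order) <= (e.did, e.order):
            start += 1
        j = start
        # Rescan the F records inside e's interval; nested E records revisit them
        while j < len(F) and F[j].did == e.did and F[j].order < e.order + e.size:
            out.append((e, F[j]))
            j += 1
    return out

# Document 1 contains /A/A/B/B: both A elements contain both B elements
E = [Rec(1, 1, 7), Rec(1, 2, 5)]
F = [Rec(1, 3, 3), Rec(1, 4, 1)]
pairs = ee_join(E, F)
```

This yields the ancestor-descendant pairs needed for A//B; for the child step A/B the pairs would additionally be filtered on depth or on the parent field.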

27
XISS - KC-join
• to solve Kleene Closure of a subexpression
• input: a list of element records that fits the base case
• repeatedly applies EE-join to the list and stops when the result list no longer grows
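The stop condition is a fixpoint iteration, which can be sketched as follows. For brevity the join step here is a simple set comprehension over parent-child pairs rather than the sort-merge EE-join, and the `Rec` layout and sample data are assumptions.

```python
from collections import namedtuple

Rec = namedtuple("Rec", "did order parent")  # parent = order of the parent node

def kc_join(base):
    """Return (ancestor, descendant) pairs linked by a chain of one or more
    parent-child steps that stays inside `base`."""
    by_key = {(r.did, r.order): r for r in base}
    # One step: parent-child pairs where both ends are in the base list
    step = {(by_key[(r.did, r.parent)], r) for r in base if (r.did, r.parent) in by_key}
    result = set(step)
    while True:
        # Extend every known chain by one more step
        grown = result | {(a, c) for (a, b) in result for (b2, c) in step if b == b2}
        if grown == result:  # no growth, so the closure is complete
            return result
        result = grown

# Made-up input: a chain of three nested records plus one unrelated record
base = [Rec(1, 1, 0), Rec(1, 2, 1), Rec(1, 3, 2), Rec(1, 9, 5)]
closure = kc_join(base)
```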

28
Index Fabric
• by Cooper et al., published in VLDB 2001 under the title "A Fast Index for Semistructured Data"
• has 2 subtypes: the raw path index and the refined path index
• uses the Patricia technique to compress the index

29
Index Fabric - General Idea
• it is a balanced, disk-based indexing structure built on Patricia tries
• each data node is associated with a key string, and this string is stored in the trie index for retrieval
• the layered approach to building the index ensures that the number of disk pages accessed per query is bounded

30
Index Fabric - General Idea
• the raw path index answers absolute path queries
• the refined path index answers any predefined query
• the difference lies in how the keys are generated

31
Patricia
• Patricia: Practical Algorithm To Retrieve Information Coded in Alphanumeric
• by Morrison, in JACM 1968
• a method to store and retrieve strings in a space-efficient way
• binary, uses bit comparisons, and has a skip value in each internal node

32
Patricia
• an example Patricia trie (figure): edges labelled 0 and 1, leaves holding the keys 101110, 101111, 110000, 110011

33
Patricia
• it's basically a trie with the single-child internal nodes removed
• search is done by
• branching according to the value of the bit at the skip position
• retrieving the string at the leaf
• comparing it with the query string
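The search steps above can be sketched for a static set of keys. This assumes distinct, equal-length bit strings and stores the absolute index of the bit to test in each internal node; the class and function names are hypothetical, and incremental insertion is left out.

```python
class Leaf:
    def __init__(self, key):
        self.key = key

class Inner:
    def __init__(self, skip, zero, one):
        self.skip = skip    # index of the bit to test; bits before it are skipped
        self.zero = zero    # subtree for bit value 0
        self.one = one      # subtree for bit value 1

def build(keys):
    """Build a Patricia-style trie from a non-empty list of distinct bit strings."""
    if len(keys) == 1:
        return Leaf(keys[0])
    # First position where the keys disagree; no node is created for shared bits
    skip = next(i for i in range(len(keys[0])) if len({k[i] for k in keys}) > 1)
    zero = [k for k in keys if k[skip] == "0"]
    one = [k for k in keys if k[skip] == "1"]
    return Inner(skip, build(zero), build(one))

def search(node, query):
    """Branch on the tested bits, then verify the full key at the leaf."""
    while isinstance(node, Inner):
        bit = query[node.skip] if node.skip < len(query) else "0"
        node = node.one if bit == "1" else node.zero
    return node.key == query

root = build(["101110", "101111", "110000", "110011"])
```

The final comparison is essential: skipped bits are never inspected on the way down, so a query such as 100011 reaches the leaf 101111 and is rejected only by the full-key check.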

34
Index Fabric - Balanced Trie
• The number of disk pages accessed per query is bounded by the number of layers in the layered index
• The idea is similar to that of a B-tree: the Patricia trie is decomposed into blocks, and an upper-layer trie is used to navigate among the blocks

35
Index Fabric - Balanced Trie
• e.g. (figure): the example Patricia trie over the keys 101110, 101111, 110000, 110011, drawn in two layers (Layer 0 and Layer 1)

36
Index Fabric - Balanced Trie
• There are 3 types of links in the balanced trie
• far link: crosses layers, a result of branching
• near link: stays within the same block, a result of branching
• direct link: crosses layers, connecting blocks whose root nodes are the same
• Each query accesses one block per layer

37
Index Fabric - Balanced Trie
• increases speed by using traversals in the upper layers to skip nodes of the original trie
• the number of pages accessed is bounded

38
Index Fabric - Raw Path
• each data node is associated with a key
• key = path (encoded in designators) + value
• designators are special characters, each representing a name
• APE queries are translated into key prefixes and submitted to the index trie

39
Index Fabric - Raw Path
• Example
• an invoice/buyer/name element with value HKU is translated to the key IBNHKU (I, B and N are the designators)
• a query for /invoice/buyer/name = HKU is translated to the query string IBNHKU
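The key translation can be sketched as follows. The designator table, the extra "seller" entry, and the sample values are assumptions, and a sorted list with binary search stands in for the index trie. In this simplified form designators are ordinary capital letters, whereas a real index would need them to be distinguishable from value characters.

```python
import bisect

# Hypothetical designator table: one character per element name
DESIGNATORS = {"invoice": "I", "buyer": "B", "seller": "S", "name": "N"}

def make_key(path, value=""):
    """Encode a path such as /invoice/buyer/name as designators, then append the value."""
    steps = [s for s in path.split("/") if s]
    return "".join(DESIGNATORS[s] for s in steps) + value

def prefix_lookup(sorted_keys, prefix):
    """Return every key that starts with prefix, using one binary search."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = lo
    while hi < len(sorted_keys) and sorted_keys[hi].startswith(prefix):
        hi += 1
    return sorted_keys[lo:hi]

keys = sorted([
    make_key("/invoice/buyer/name", "HKU"),
    make_key("/invoice/buyer/name", "CUHK"),
    make_key("/invoice/seller/name", "ABC"),
])
```

A path-only query such as /invoice/buyer/name becomes the prefix IBN, while a path with a value becomes a full key, so both are answered by the same kind of lookup.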

40
Index Fabric - Refined Path
• Special designators can be assigned to special queries (which can be regular path expressions)
• e.g., we define P as the path //buyer/name, and the key PHKU means that some buyer/name in the document has the value HKU
• can answer any predefined RPE very quickly
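A sketch of how such keys might be generated at indexing time, using Python's standard `xml.etree.ElementTree`. The sample document and the function name `refined_keys` are hypothetical; only the designator P and the path //buyer/name come from the example above.

```python
import xml.etree.ElementTree as ET

def refined_keys(xml_text, designator="P"):
    """Emit one key per match of //buyer/name: the designator followed by the value."""
    root = ET.fromstring(xml_text)
    keys = []
    # iter() visits the root and all descendants, so buyers at any depth are found
    for buyer in root.iter("buyer"):
        for name in buyer.findall("name"):
            keys.append(designator + (name.text or ""))
    return keys

# Made-up document: only the buyer's name matches //buyer/name
doc = "<invoice><buyer><name>HKU</name></buyer><seller><name>ABC</name></seller></invoice>"
keys = refined_keys(doc)
```

Because the regular path is evaluated once while building the index, a later query for it reduces to a single key lookup.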

41
Comparison
• XISS
• can solve general RPE
• solves APE by dividing it into steps
• Index Fabric
• solves RPE by compile-time expansion of the RPE or by using a predefined Refined Path Index
• solves APE by a single index lookup