Title: Estimating Rarity and Similarity over Data stream Windows
1Estimating Rarity and Similarity over Data stream
Windows
- Paper written by
- Mayur Datar
- S. Muthukrishnan
- Effi Goldstein
2Agenda
- Introduction
- Motivation of windowed data stream algorithms
- Define the problems
- The impressive results
- Introducing the algorithmic tools we'll use
- Algorithm for estimating rarity and similarity in the unbounded data stream model
- Algorithm for estimating rarity and similarity over windowed data streams
3Introduction - motivation
- The sliding window model
- Often used for observations in telecom networks (packets in routers, telephone calls)
- Retrieving information on the fly (e.g. highway control, stock exchange)
- Important restriction: we are only allowed polylogarithmic (in the window size) storage space.
- This is very difficult: consider the problem of calculating the minimum
- That's why we settle for a good estimation
4Introduction - motivation
- Motivation for rarity and similarity: they extract unique and interesting information from a data stream
- Rarity
- estimate the portion of users who are not satisfied (online stores)
- Indication of a denial-of-service attack.
- Similarity
- Which items commonly appear in a market basket?
- Similarity of the IP addresses visiting two web sites
- All of these examples are well motivated by commercial uses.
5Introduction - the problems
- Recall our work space:
- the window (of size $N$)
- the set of items: $U = \{1, \dots, u\}$.
- Rarity:
- an item $x$ is $\alpha$-rare if $x$ appears precisely $\alpha$ times in the set.
- #$\alpha$-rare: the no. of such items in the set.
- #distinct: the no. of distinct items in the set.
- $\alpha$-rarity: $\rho_\alpha = \#\alpha\text{-rare} \,/\, \#\text{distinct}$
6Introduction - the problems
- Rarity examples: $S = \{2, 3, 2, 4, 3, 1, 2, 4\}$
- D(istinct) $= \{1, 2, 3, 4\}$
- 1-rare $= \{1\}$, 1-rarity $= 1/4$
- 2-rare $= \{3, 4\}$, 2-rarity $= 1/2$
- 3-rare $= \{2\}$, 3-rarity $= 1/4$
- note that 1-rarity is the fraction of items that do not repeat within the window.
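The numbers in this example can be checked by direct counting. The following is a minimal Python sketch (the function name `rarity` is our own, not from the paper); it stores the whole window, which is exactly what the streaming algorithms later in the talk avoid.

```python
from collections import Counter
from fractions import Fraction

def rarity(window, alpha):
    """Exact alpha-rarity: (# items seen exactly alpha times) / (# distinct items)."""
    counts = Counter(window)
    rare = sum(1 for c in counts.values() if c == alpha)
    return Fraction(rare, len(counts))  # assumes a non-empty window

S = [2, 3, 2, 4, 3, 1, 2, 4]
for alpha in (1, 2, 3):
    print(alpha, rarity(S, alpha))  # prints: 1 1/4, then 2 1/2, then 3 1/4
```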
7Introduction - the problems
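- Similarity: here we have two sets, $A$ and $B$
- define $X(t)$ and $Y(t)$ to be their sets of distinct items
- we use the Jaccard coefficient to measure their similarity: $\sigma(t) = |X(t) \cap Y(t)| \,/\, |X(t) \cup Y(t)|$
- similarity example: $A = \{1, 2, 4, 2, 5\}$, $B = \{2, 3, 1, 3, 2, 6\}$
- $X(t) = \{1, 2, 4, 5\}$, $Y(t) = \{2, 3, 1, 6\}$ $\Rightarrow$ similarity $= 2/6$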
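The Jaccard coefficient of this example can be reproduced in a few lines. This is a small Python sketch with a helper name (`jaccard`) of our own choosing; it computes the exact value from the full sets rather than estimating it.

```python
from fractions import Fraction

def jaccard(a, b):
    """Jaccard coefficient of the distinct items of two collections."""
    x, y = set(a), set(b)
    return Fraction(len(x & y), len(x | y))  # assumes the union is non-empty

A = [1, 2, 4, 2, 5]
B = [2, 3, 1, 3, 2, 6]
print(jaccard(A, B))  # prints 1/3, i.e. 2/6 in lowest terms
```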
8Introduction - how good are the results...
- First important result: there is no other known estimation for rarity and similarity in a windowed model!
- This is the reason there are no graphs at the end
- The final algorithm uses only
- $O(\log N \log u)$ space
- $O(\log \log N)$ time
- And it estimates the results $\rho$, $\sigma$ within a factor of $1 \pm \epsilon$, where $\epsilon$ can be reduced to any required constant.
9Algorithmic Tools...
- Min-wise hashing
- let $\pi$ be a random permutation over $U$, and $A \subseteq U$,
- the min-hash value of $A$ for $\pi$ is $h_\pi(A) = \min\{\pi(A)\}$, which is actually the element with the smallest index after permuting the subset.
- The hashing function should be unique-valued (a one-to-one function) on the set $U$, i.e. a permutation
10Algorithmic Tools - min-hash example
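- For example, consider the hash functions $\pi_1 = (1\ 2\ 3\ 4\ 5)$ [$x \bmod 5$], $\pi_2 = (5\ 4\ 3\ 2\ 1)$, $\pi_3 = (3\ 4\ 5\ 1\ 2)$, $\pi_4(x) = 2x + 1 \bmod 5$
- and the sets $A = \{1, 3, 4\}$, $B = \{2, 5\}$, $C = \{1, 2, 4\}$
- Their min-hash values are as follows:
- $h_{\pi_1}(A) = 1$, $h_{\pi_1}(B) = 2$, $h_{\pi_1}(C) = 1$
- $h_{\pi_2}(A) = 4$, $h_{\pi_2}(B) = 5$, $h_{\pi_2}(C) = 4$
- $h_{\pi_3}(A) = 3$, $h_{\pi_3}(B) = 5$, $h_{\pi_3}(C) = 4$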
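The table of min-hash values can be reproduced mechanically. The sketch below assumes each permutation is read as an ordering of the universe, so the min-hash value of a set is its first element in that ordering; the names `min_hash`, `perms` and `sets` are illustrative only.

```python
def min_hash(order, subset):
    """First element of `order` (a permutation of U written as a list) that lies in `subset`."""
    members = set(subset)
    for x in order:
        if x in members:
            return x
    raise ValueError("subset is empty or not contained in the universe")

perms = {"p1": [1, 2, 3, 4, 5], "p2": [5, 4, 3, 2, 1], "p3": [3, 4, 5, 1, 2]}
sets = {"A": {1, 3, 4}, "B": {2, 5}, "C": {1, 2, 4}}

for pname, order in perms.items():
    print(pname, {sname: min_hash(order, s) for sname, s in sets.items()})
```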
11Algorithmic Tools - min-hash power...
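- An important property of min-hash functions, simple to prove yet leading to powerful results: $\Pr[h_\pi(A) = h_\pi(B)] = |A \cap B| \,/\, |A \cup B|$
- Lemma 1: Let $h_1(A), \dots, h_k(A)$ ($h_1(B), \dots, h_k(B)$) be $k$ independent min-hash values for the set $A$ ($B$). Let $\hat{S}(A, B)$ be the fraction of the min-hash values that they agree on. Then, for $k$ large enough in terms of the desired accuracy, $\hat{S}(A, B)$ approximates the similarity of $A$ and $B$.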
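Lemma 1 can be illustrated empirically. The sketch below draws `k` truly random permutations (the ideal family, which is affordable only because the universe here is tiny) and reports the fraction on which the two min-hash values agree. The function name, the seed and the choice of `k` are assumptions made for illustration.

```python
import random

def estimate_similarity(a, b, universe, k, seed=0):
    """Fraction of k random permutations under which a and b share the same min-hash value."""
    rng = random.Random(seed)
    a, b = set(a), set(b)          # both must be non-empty subsets of `universe`
    agree = 0
    for _ in range(k):
        order = list(universe)
        rng.shuffle(order)         # a uniformly random permutation of the universe
        rank = {x: i for i, x in enumerate(order)}
        # The min-hash value is the element with the smallest rank.
        if min(a, key=rank.__getitem__) == min(b, key=rank.__getitem__):
            agree += 1
    return agree / k

A = {1, 2, 4, 5}
B = {2, 3, 1, 6}
print(estimate_similarity(A, B, range(1, 7), k=2000))  # should be close to 2/6
```

With larger `k` the estimate concentrates around the true Jaccard coefficient, which is what lets the later algorithms work with a constant number of hash functions.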
12Algorithmic Tools - min-hash families...
- Thus we will need to find a set of independent min-hash functions.
- The ideal family of min-hash functions is the set of all permutations over $U$. However, it'll require $O(u \log u)$ bits to represent any permutation. We can't afford that. We need to find something else...
13Algorithmic Tools - min-hash families...
- Approximate min-hash family, otherwise known as an $\epsilon$-min-wise-independent hash family.
- They have the property that for any $X \subseteq U$ and $x \in X$ we get $\Pr[\min\{\pi(X)\} = \pi(x)] = (1 \pm \epsilon) / |X|$
- It has been proven that any function from this family can be represented by only $O(\log u \log(1/\epsilon))$ bits, and be computed in $O(\log(1/\epsilon))$ time!
- The mentioned Lemma 1 still holds for this family! We just need to set the value of $k$ appropriately in terms of $\epsilon$, and change the expected error from $\epsilon\rho$ to $\epsilon\rho + \epsilon$.
14Algorithmic Tools - min-hash families...
- To conclude, we will only need $O(\log u \log(1/\epsilon))$ bits for storing hash functions, and $O(k)$ hashes, to get an approximation for the lemma!
15Estimating Rarity - in unbounded window
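- Recall our goal: find $\rho_\alpha(t)$, up to precision $p$, at any time $t$.
- Define:
- $S$: multiset, the actual data stream.
- $D$: the set of distinct items from $S$.
- $R_\alpha$: the set of items that appear exactly $\alpha$ times in $S$ $\Rightarrow \rho_\alpha = |R_\alpha| \,/\, |D|$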
16Estimating Rarity - in unbounded window
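- Note 1: $R_\alpha \subseteq D$, and thus $\rho_\alpha = |R_\alpha| / |D| = |R_\alpha \cap D| \,/\, |R_\alpha \cup D|$
- Note 2: $h_i(R_\alpha) = h_i(D)$ iff the min-hash value of $D$ appears exactly $\alpha$ times in $S$. $\Rightarrow$ Hence, it suffices to maintain min-hash values for $D$ only, as long as we can count the no. of appearances.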
17Estimating Rarity - in unbounded window
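- To summarize: what we want is $\rho_\alpha$,
- which equals $|R_\alpha| / |D|$ by our definition,
- which equals $|R_\alpha \cap D| \,/\, |R_\alpha \cup D|$ (Note 1),
- which in turn is approximated by $|\{l : 1 \le l \le k,\ h_l(R_\alpha) = h_l(D)\}| \,/\, k$ (Lemma 1),
- for which it suffices to count the number of min-hash values of $D$ that are $\alpha$-rare (Note 2).
- These observations lead to the following algorithm.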
18Estimating Rarity - in unbounded window
- The algorithm: choose $k$ min-hash functions $h_1, \dots, h_k$; $k$ will be determined later.
- Maintain:
- $h_i^*(t)$, which is the min-hash value of the window by time $t$.
- $C_i(t)$, a counter of the no. of appearances of $h_i^*(t)$.
- Initialize the min-hash values ($h_i^*$) to $\infty$, and the counters to 0.
- When item $a_{t+1}$ arrives, for each $i$:
- 1) compute $h_i(a_{t+1})$
- 2) if $h_i(a_{t+1}) < h_i^*(t)$, update $h_i^*(t+1) = h_i(a_{t+1})$, $C_i(t+1) = 1$
- 3) if $h_i(a_{t+1}) = h_i^*(t)$, keep $h_i^*(t+1) = h_i^*(t)$ and increment: $C_i(t+1) = C_i(t) + 1$
- 4) otherwise, set $h_i^*(t+1)$ to $h_i^*(t)$ and $C_i(t+1) = C_i(t)$
- Then process the next item $a_{t+2}$.
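The update steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `RarityEstimator` is ours, and the linear hash $(a x + b) \bmod p$ is a simple stand-in for the approximate min-wise family used in the paper, chosen because it is one-to-one on items smaller than the prime `PRIME`.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1; items are assumed to be integers in [0, PRIME)

class RarityEstimator:
    def __init__(self, k, seed=0):
        rng = random.Random(seed)
        # Stand-in hash functions h_i(x) = (a*x + b) mod PRIME with a != 0.
        self.params = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]
        self.min_hash = [float("inf")] * k   # running min-hash value per function
        self.count = [0] * k                 # appearances of the current min-hash value

    def add(self, item):
        for i, (a, b) in enumerate(self.params):
            h = (a * item + b) % PRIME
            if h < self.min_hash[i]:         # new minimum: restart its counter
                self.min_hash[i], self.count[i] = h, 1
            elif h == self.min_hash[i]:      # the current minimum appeared again
                self.count[i] += 1
            # otherwise nothing changes for this hash function

    def estimate(self, alpha):
        """Fraction of hash functions whose minimum has appeared exactly alpha times."""
        return sum(1 for c in self.count if c == alpha) / len(self.count)

est = RarityEstimator(k=200)
for x in [2, 3, 2, 4, 3, 1, 2, 4]:
    est.add(x)
print(est.estimate(1), est.estimate(2), est.estimate(3))
```

The printed values are estimates of the 1-, 2- and 3-rarity of the example stream; their quality depends on `k` and on how close the stand-in hash family is to min-wise independent.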
19Estimating Rarity - in unbounded window
- Now, we merely need to count the $C_i(t)$'s that equal $\alpha$, since from Note 2 and our summary we get $|\{l : 1 \le l \le k,\ h_l(R_\alpha(t)) = h_l(D(t))\}| = |\{l : 1 \le l \le k,\ C_l(t) = \alpha\}|$
- Space complexity: we need $O(k)$ for the min-hash values ($h_i^*$) and the counters ($C_i$), and $O(k)$ seeds for the $\epsilon$-min-hash functions ($h_i$), each of which needs $O(\log u \log(1/\epsilon))$ bits to store. We set $k$ in terms of $\epsilon$ (the desired accuracy), but in any case $k = O(1)$. Finally, we get space complexity $O(\log u \log(1/\epsilon))$!
20Estimating Rarity - in unbounded window
- Time complexity: in each step we need to compute $k$ values of the $\epsilon$-min-hash functions, which takes $O(k \log(1/\epsilon))$, and also compare and sum up $k$ values. Since $k = O(1)$, we get time complexity $O(\log(1/\epsilon))$.
21Estimating Similarity - in unbounded window
- Our goal: given 2 data streams $X$ and $Y$, we want to estimate $\sigma(t) = |X_t \cap Y_t| \,/\, |X_t \cup Y_t|$
- which, by Lemma 1, is approximated by $|\{l : 1 \le l \le k,\ h_l(X_t) = h_l(Y_t)\}| \,/\, k$.
- we actually use an easier version of the rarity algorithm, since now we only need to compare the $h_i^*(t)$ that $X$ and $Y$ produced at time $t$: when item $a_t$ arrives, we compute $h_i(a_t)$ and set $h_i^*(t) = \min\{h_i(a_t), h_i^*(t-1)\}$
- space and time complexity are as before.
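A possible Python rendering of this simpler algorithm keeps one running minimum per hash function for each stream and compares them position by position. As an assumption made for illustration, the hash functions are again linear functions modulo a prime rather than the approximate min-wise family of the paper, and all names are ours.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1; items are assumed to be integers in [0, PRIME)

def make_hashes(k, seed=0):
    """k stand-in hash functions, each encoded as a pair (a, b) for (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]

class MinHashSketch:
    def __init__(self, hashes):
        self.hashes = hashes                     # must be shared by both streams
        self.mins = [float("inf")] * len(hashes)

    def add(self, item):
        for i, (a, b) in enumerate(self.hashes):
            self.mins[i] = min(self.mins[i], (a * item + b) % PRIME)

def similarity(sx, sy):
    """Fraction of hash functions on which the two sketches agree."""
    agree = sum(1 for p, q in zip(sx.mins, sy.mins) if p == q)
    return agree / len(sx.mins)

hashes = make_hashes(k=500)
sx, sy = MinHashSketch(hashes), MinHashSketch(hashes)
for x in [1, 2, 4, 2, 5]:
    sx.add(x)
for y in [2, 3, 1, 3, 2, 6]:
    sy.add(y)
print(similarity(sx, sy))  # an estimate of the Jaccard coefficient 2/6
```

Both sketches must be built from the same list of hash functions, otherwise the comparison is meaningless.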
22Estimating Similarity - in window data streams
- We now consider the window of the last $N$ items
- We want to use a similar approach as in the unbounded window, but maintaining a min-hash value here is difficult.
- instead, we keep a list of possible min-hash values (and prove later that it is short enough)
- we use a domination property of min-hash functions
23Estimating Similarity - in window data streams
- some definitions first
- an active item is an item that is still within the window boundary.
- An active item $a_2$ dominates an active item $a_1$ if it arrived later in the window and $h_i(a_2) < h_i(a_1)$ (it has a smaller hash value). Notice that a dominated item will never get to be the min-hash value of $h_i$ within the window, since there is always a preferred item...
- dominance property example
24Estimating Similarity - in window data streams
Dominating item
25Estimating Similarity - in window data streams
- Note that now $h_i^*(t) = h_i(a_{j_1})$, the first item on the list! ($h_i^*(t)$ is the min-hash value in the window)
- The algorithm for maintaining $L_i(t)$: when item $a_{t+1}$ arrives, we compute $h_i(a_{t+1})$.
- delete all items in the list that have a bigger hash value (they are all dominated)
- if $h_i(a_{t+1})$ equals the last hash value on the list, just update that pair with the latest arrival time.
- else, append the pair $(h_i(a_{t+1}), t+1)$ to the end of the list.
- check whether the first item on the list has expired. If it has, delete it (it is no longer active).
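The list maintenance described above can be sketched for a single hash function as follows. The class name `WindowMinHash` and the use of a Python `deque` in place of a linked list are assumptions made for illustration; the hash function is passed in as a parameter.

```python
from collections import deque

class WindowMinHash:
    """Min-hash value of the last `window_size` items, for one hash function."""

    def __init__(self, hash_fn, window_size):
        self.hash_fn = hash_fn
        self.n = window_size
        self.entries = deque()  # (hash value, latest arrival time), hash values increasing

    def add(self, item, t):
        h = self.hash_fn(item)
        # Entries with a bigger hash value are dominated by the new item.
        while self.entries and self.entries[-1][0] > h:
            self.entries.pop()
        if self.entries and self.entries[-1][0] == h:
            self.entries[-1] = (h, t)          # same value: keep only the latest arrival
        else:
            self.entries.append((h, t))
        # Drop the first entry if it has fallen out of the window.
        while self.entries and self.entries[0][1] <= t - self.n:
            self.entries.popleft()

    def current(self):
        """Min-hash value of the window, or None if nothing has arrived yet."""
        return self.entries[0][0] if self.entries else None

stream = [5, 3, 8, 3, 9, 7, 6, 2, 9, 9, 9, 4]
w = WindowMinHash(hash_fn=lambda x: x, window_size=3)  # identity hash, for readability
for t, item in enumerate(stream):
    w.add(item, t)
    print(t, w.current())
```

With the identity hash the printed value is simply the minimum of the last three items, which makes the behaviour easy to check by hand.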
26Estimating Similarity - in window data streams
Min-hash list
We only have to make sure the list $L_i$ isn't too long. We use...
27Estimating Similarity - in window data streams
- Lemma 2: with high probability, the length of $L_i(t)$ is $\Theta(H_N)$, where $H_N$ is the $N$th harmonic number ($1 + 1/2 + 1/3 + \dots + 1/N$), which is $O(\log N)$.
- Since we now know the min-hash value $h_i^*$ in the window (the first item on the list, $h_i(a_{j_1})$), we now follow the logic we used in the unbounded stream
- We saw that $\sigma(t) \approx |\{l : 1 \le l \le k,\ h_l(X_t) = h_l(Y_t)\}| \,/\, k$ (Lemma 1)
- So just compare the min-hash values of the min-hash family, for both streams $X$ and $Y$.
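A short calculation suggests where the harmonic number in Lemma 2 comes from. It is made under simplifying assumptions that are ours, namely that the $N$ items in the window are distinct and that the hash function is a truly random permutation, and it gives only the expected list length rather than the high-probability bound of the lemma.

```latex
% Number the items in the window by recency: j = 1 is the newest, j = N the oldest.
% Item j is on the list iff no newer item has a smaller hash value, i.e. iff
% its hash value is the smallest among the j most recent items.
\Pr[\text{item } j \text{ is on the list}] = \frac{1}{j},
\qquad
\mathbb{E}\bigl[\,|L_i(t)|\,\bigr] = \sum_{j=1}^{N} \frac{1}{j} = H_N = \Theta(\log N).
```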
28Estimating Similarity - in window data streams
- Space complexity: we use $O(k)$ hash functions, and for each one we keep a linked list of size $O(\log N)$, with elements of size $O(\log u)$ each. Overall, we use space complexity $O((\log N)(\log u))$
- Time complexity: when updating the list $L_i(t)$, we need to search for the appropriate place to insert the new item. Since the list is ordered, it is a simple heap-insertion $\Rightarrow$ we get $O(\log |L_i(t)|) = O(\log \log N)$.
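The search step behind this time bound can be sketched with a binary search over the ordered hash values. This is only an illustration: it assumes the list is stored as two parallel Python lists (`hashes` ascending, `times` alongside) instead of the linked list of the slides, the function name `update` is ours, and expiry of old entries is left out. Only the search is logarithmic in the list length here; truncating a Python list is not.

```python
from bisect import bisect_left

def update(hashes, times, h, t):
    """Insert hash value h with arrival time t, discarding dominated entries."""
    i = bisect_left(hashes, h)   # first position whose hash value is >= h
    # Everything from position i on is either dominated (bigger hash value)
    # or the same hash value with an older arrival time.
    del hashes[i:], times[i:]
    hashes.append(h)
    times.append(t)

hashes, times = [2, 5, 9, 12], [1, 4, 6, 7]
update(hashes, times, 7, 8)
print(hashes, times)  # [2, 5, 7] [1, 4, 8]
```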
29Estimating Rarity - in window data streams
- We use a similar concept to the one we used earlier
- we still want to keep a linked list of dominant min-hash values
- But since now we need to find $\alpha$ instances of an item, we keep several arrival times of the item.
- So now, each entry is the pair (hash value, arrival list), where the arrival list is an ordered list of the latest $\alpha$ arrival times of the item
- So the list now looks like
30Estimating Rarity - in window data streams
- Note that here we store a list of $\alpha$ instances of an item, while previously we stored only the latest arrival time of each item in the list, which is the largest value in the arrival list.
- The algorithm for maintaining $L_i(t)$ resembles the one before: when item $a_{t+1}$ arrives, we compute $h_i(a_{t+1})$.
- delete all items in the list that have a bigger hash value (they are all dominated)
- if $h_i(a_{t+1})$ equals the last hash value on the list, append $t+1$ to its arrival list. If the arrival list now has more than $\alpha$ items, delete the first one.
31Estimating Rarity - in window data streams
- else, append the pair $(h_i(a_{t+1}), \{t+1\})$ to the end of the list, where the arrival list here is a singleton.
- check whether the first arrival time of the first item on the list has expired. If it has, delete it (it is no longer active).
- The list length here is $O(\alpha \log N)$, using Lemma 2 (here we have $\alpha$ elements for each item).
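The windowed rarity list can be sketched for one hash function in the same style. Two points are our own assumptions rather than the slides' wording: the sketch keeps up to $\alpha + 1$ arrival times per entry, so that "exactly $\alpha$" can be told apart from "more than $\alpha$", and an entry is removed only once all of its arrival times have expired. The class name and the `deque` containers are also illustrative.

```python
from collections import deque

class WindowRarityList:
    """Tracks whether the window's min-hash item appears exactly alpha times."""

    def __init__(self, hash_fn, window_size, alpha):
        self.hash_fn, self.n, self.alpha = hash_fn, window_size, alpha
        self.entries = deque()  # (hash value, deque of recent arrival times)

    def add(self, item, t):
        h = self.hash_fn(item)
        while self.entries and self.entries[-1][0] > h:    # dominated entries
            self.entries.pop()
        if self.entries and self.entries[-1][0] == h:
            times = self.entries[-1][1]
            times.append(t)
            if len(times) > self.alpha + 1:                # assumption: keep alpha + 1
                times.popleft()
        else:
            self.entries.append((h, deque([t])))
        # Expire old arrival times of the first entry; drop it once none are left.
        while self.entries:
            times = self.entries[0][1]
            while times and times[0] <= t - self.n:
                times.popleft()
            if times:
                break
            self.entries.popleft()

    def min_is_alpha_rare(self):
        return bool(self.entries) and len(self.entries[0][1]) == self.alpha

stream = [3, 1, 1, 2, 1, 3, 3, 2, 2, 2, 1, 4, 4, 2]
r = WindowRarityList(hash_fn=lambda x: x, window_size=5, alpha=2)
for t, item in enumerate(stream):
    r.add(item, t)
    print(t, r.min_is_alpha_rare())
```

Running one such list per hash function and counting how many report `True` gives the windowed rarity estimate, in the same way as the counters did in the unbounded case.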
32Estimating Rarity - in window data streams
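- And the same logic holds:
- since $R_\alpha \subseteq D$ we get $\rho_\alpha = |R_\alpha \cap D| \,/\, |R_\alpha \cup D|$
- from Lemma 1 we get $\rho_\alpha \approx |\{l : 1 \le l \le k,\ h_l(R_\alpha) = h_l(D)\}| \,/\, k$
- from Note 2 we get $h_l(R_\alpha) = h_l(D)$ iff the min-hash value of $D$ appears $\alpha$ times in the window.
- Thus we only have to count the min-hash values $h_i^*$ ($= h_i(a_{j_1})$) whose arrival-time list is $\alpha$ long!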
33Estimating Rarity - in window data streams
- Space complexity: we use $O(k)$ hash functions, and for each one we keep a linked list of size $O(\alpha \log N)$, with elements of size $O(\log u)$ each. Overall, treating $\alpha$ as a constant, we use space complexity $O((\log N)(\log u))$
- Time complexity: updating the list $L_i(t)$ costs exactly as in the similarity list. We get time complexity $O(\log \log N)$.
34Concluding remarks
- The algorithms presented here are the first solutions for the windowed rarity and similarity problems (the authors claim...)
- Citation from the article: "We expect our technique to find applications in practice"