Estimating Rarity and Similarity over Data Stream Windows

- Paper written by
- Mayur Datar
- S. Muthukrishnan
- Effi Goldstein

Agenda

- Introduction
- Motivation for windowed data stream algorithms
- Definition of the problems
- The impressive results
- The algorithmic tools we'll use
- Algorithm for estimating rarity and similarity in the unbounded data stream model
- Algorithm for estimating rarity and similarity over windowed data streams

Introduction - motivation

- The sliding window model
- Often used for observing telecom networks (packets in routers, telephone calls)
- Retrieving information on the fly (e.g. highway control, stock exchange)
- Important restriction - we are only allowed polylogarithmic (in the window size) storage space.
- This is very restrictive - consider, e.g., the problem of calculating the minimum over the window.
- That's why we settle for a good estimation.

Introduction - motivation

- Motivation for rarity and similarity - they extract unique and interesting information from a data stream.
- Rarity:
- estimate the portion of users who are not satisfied (online stores)
- an indication of a Denial-of-Service attack
- Similarity:
- which items are common to two market baskets
- similarity of the IP addresses visiting two web sites
- All of these examples are well motivated by commercial uses.

Introduction - the problems

- Recall our work space - the window (of size N)
- set of items: U = {1, ..., u}
- Rarity:
- an item x is a-rare if x appears precisely a times in the set
- R_a = the set of a-rare items; |R_a| = no. of such items in the set
- D = the set of distinct items; |D| = no. of distinct items in the set
- a-rarity: r_a = |R_a| / |D|

Introduction - the problems

- Rarity example: S = {2, 3, 2, 4, 3, 1, 2, 4}
- D(istinct) = {1, 2, 3, 4}
- 1-rare = {1}, 1-rarity = 1/4
- 2-rare = {3, 4}, 2-rarity = 1/2
- 3-rare = {2}, 3-rarity = 1/4
- note that 1-rarity is the fraction of items that do not repeat within the window.
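This example can be checked with a few lines of Python (the function name is mine; this is the exact offline computation, not the streaming estimate developed later):

```python
from collections import Counter

def rarity(stream, a):
    """Exact a-rarity: the fraction of distinct items appearing exactly a times."""
    counts = Counter(stream)
    a_rare = [x for x, c in counts.items() if c == a]
    return len(a_rare) / len(counts)

S = [2, 3, 2, 4, 3, 1, 2, 4]
print(rarity(S, 1), rarity(S, 2), rarity(S, 3))  # 0.25 0.5 0.25
```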

Introduction - the problems

- Similarity - here we have two streams A and B
- define X(t) and Y(t) to be their sets of distinct items at time t
- we use the Jaccard coefficient to measure their similarity: s(t) = |X(t) ∩ Y(t)| / |X(t) ∪ Y(t)|
- similarity example: A = 1,2,4,2,5 and B = 2,3,1,3,2,6, so X(t) = {1, 2, 4, 5} and Y(t) = {2, 3, 1, 6} → s(t) = 2/6
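The Jaccard computation on this example, as a quick sketch (the function name is mine):

```python
def jaccard(a, b):
    """Jaccard coefficient of the distinct-item sets of two streams."""
    X, Y = set(a), set(b)
    return len(X & Y) / len(X | Y)

A = [1, 2, 4, 2, 5]
B = [2, 3, 1, 3, 2, 6]
print(jaccard(A, B))  # 2/6 = 0.333...
```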

Introduction - how good are the results...

- First important result: there is no other known estimation for rarity or similarity in a windowed model!
- This is the reason there are no graphs at the end.
- The final algorithm uses only
- O(log N · log u) space
- O(log log N) time
- and estimates the results r, s to within a factor of 1 ± e, where e can be reduced to any required constant.

Algorithmic Tools...

- Min-wise hashing
- let p be a random permutation over U
- the min-hash value of a set A under p, h_p(A), is the element of A with the smallest index after permuting the set, i.e. h_p(A) = arg min_{x in A} p(x)
- The hashing function should be unique-valued (a one-to-one function) on the set U - i.e. a permutation.

Algorithmic Tools - min-hash example

- For example, consider the permutations over U = {1, ..., 5}, each written as the permuted order of U: p1 = (1 2 3 4 5), p2 = (5 4 3 2 1), p3 = (3 4 5 1 2) (a permutation can also be given in closed form, e.g. p4(x) = 2x+1 mod 5), and the sets A = {1,3,4}, B = {2,5}, C = {1,2,4}. Their min-hash values are as follows:
- h_p1(A) = 1, h_p1(B) = 2, h_p1(C) = 1
- h_p2(A) = 4, h_p2(B) = 5, h_p2(C) = 4
- h_p3(A) = 3, h_p3(B) = 5, h_p3(C) = 4
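The table can be reproduced directly; a permutation here is a tuple listing U in permuted order, and the min-hash is the subset element with the smallest index (a minimal sketch, names mine):

```python
def min_hash(perm, subset):
    """Return the element of `subset` with the smallest index in `perm`."""
    rank = {x: i for i, x in enumerate(perm)}
    return min(subset, key=lambda x: rank[x])

p1, p2, p3 = (1, 2, 3, 4, 5), (5, 4, 3, 2, 1), (3, 4, 5, 1, 2)
A, B, C = {1, 3, 4}, {2, 5}, {1, 2, 4}
print([min_hash(p, A) for p in (p1, p2, p3)])  # [1, 4, 3]
```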

Algorithmic Tools - min-hash power...

- An important property of min-hash functions: simple to prove, yet it leads to powerful results.
- Lemma 1: for a random permutation p, Pr[h_p(A) = h_p(B)] = |A ∩ B| / |A ∪ B|. Hence, let h_1, ..., h_k be k independent min-hash functions applied to the sets A and B, and let S^(A, B) be the fraction of the k min-hash values on which they agree; then E[S^(A, B)] = |A ∩ B| / |A ∪ B|, the Jaccard coefficient.
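Lemma 1 is easy to verify empirically with truly random permutations; the sketch below (names mine) compares the agreement fraction against the Jaccard coefficient:

```python
import random

def min_hash(rank, S):
    """Element of S with the smallest rank under the permutation."""
    return min(S, key=rank.get)

def agreement_fraction(A, B, universe, k, seed=0):
    """Fraction of k random permutations on which the min-hashes of A and B
    agree; by Lemma 1 its expectation is |A ∩ B| / |A ∪ B|."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(k):
        perm = list(universe)
        rng.shuffle(perm)
        rank = {x: i for i, x in enumerate(perm)}
        agree += min_hash(rank, A) == min_hash(rank, B)
    return agree / k

# A and B below have Jaccard coefficient 2/6; the estimate converges to it
print(agreement_fraction({1, 2, 3, 4}, {3, 4, 5, 6}, range(1, 11), k=20000))
```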

Algorithmic Tools - min-hash families...

- Thus we will need to find a set of independent min-hash functions.
- The ideal family of min-hash functions is the set of all permutations over U. However, it would require O(u log u) bits to represent an arbitrary permutation. We can't afford that. We need to find something else...

Algorithmic Tools - min-hash families...

- Approximate min-hash families, otherwise known as e-min-wise independent hash families.
- They have the property that for any set A ⊆ U and any x ∈ A, Pr[h(x) = min h(A)] = (1 ± e) / |A|.
- It has been proven that any function from this family can be represented by only O(log u · log(1/e)) bits, and computed in O(log(1/e)) time!
- The mentioned Lemma 1 still holds for this family! We just need to set the value of k appropriately in terms of e, and the expected error grows from e·r to e·r + e.

Algorithmic Tools - min-hash families...

- To conclude, we will only need O(log u · log(1/e)) bits for storing each hash function, and O(k) hashes, to get the approximation of the lemma!
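As a concrete stand-in for such a family in the sketches below, one can use a pairwise-independent linear hash modulo a large prime. To be clear, this is an illustrative assumption of mine - it is not the e-min-wise construction the slides allude to - but it fits the same O(log u)-bits-per-seed budget:

```python
import random

P = 2_147_483_647  # a Mersenne prime, assumed larger than the universe size u

def make_hash(seed):
    """h(x) = (a*x + b) mod P: a random member of a pairwise-independent
    family, used here as a simplified stand-in for an e-min-wise function."""
    rng = random.Random(seed)
    a = rng.randrange(1, P)  # a != 0, so h is a bijection on {0, ..., P-1}
    b = rng.randrange(P)
    return lambda x: (a * x + b) % P
```

A seed of O(log u) bits (the pair a, b) fully determines the function, mirroring the small-seed property the slides rely on.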

Estimating Rarity - in unbounded window

- Recall our goal: find r_a = |R_a| / |D|, up to the desired precision, at any time t.
- Define: S - a multiset, the actual data stream; D - the set of distinct items from S; R_a - the set of items that appear exactly a times in S ⇒ r_a = |R_a| / |D|.

Estimating Rarity - in unbounded window

- Note 1: R_a ⊆ D, and thus r_a = |R_a| / |D| = |R_a ∩ D| / |R_a ∪ D|.
- Note 2: h(D) ∈ R_a iff the min-hash value of D appears exactly a times in S. ⇒ Hence, it suffices to maintain min-hash values for D only, as long as we can count their numbers of appearances.

Estimating Rarity - in unbounded window

- To summarize: what we want is r_a, which by our definition equals |R_a| / |D|, which equals |R_a ∩ D| / |R_a ∪ D| (Note 1), which in turn is estimated by |{l : 1 ≤ l ≤ k, h_l(R_a) = h_l(D)}| / k (Lemma 1); so it suffices to count the min-hash values of D that are a-rare (Note 2). These observations lead to the following algorithm:

Estimating Rarity - in unbounded window

- The Algorithm: choose k min-hash functions h_1, ..., h_k (k will be determined later). Maintain:
- h_i(t) - the min-hash value of the stream seen by time t
- C_i(t) - a counter of the no. of appearances of h_i(t)
- Initialize the min-hash values (h_i) to infinity, and the counters to 0. When item a(t+1) arrives:
- 1) for each i, compute h_i(a(t+1))
- 2) if h_i(a(t+1)) < h_i(t): update h_i(t+1) = h_i(a(t+1)), C_i(t+1) = 1
- 3) if h_i(a(t+1)) = h_i(t): increment the counter, C_i(t+1) = C_i(t) + 1
- 4) else set h_i(t+1) = h_i(t), C_i(t+1) = C_i(t)
- for each i; then process the next item a(t+2).
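The four steps above can be sketched as follows. The class and helper names are mine, and the hash family is a simplified pairwise-independent stand-in (a·x+b mod a prime) rather than the paper's e-min-wise family:

```python
import random

P = 2_147_483_647  # prime assumed larger than the universe size

def make_hash(seed):
    rng = random.Random(seed)
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: (a * x + b) % P  # bijection, so distinct items never tie

class RarityEstimator:
    def __init__(self, k, seed=0):
        self.hs = [make_hash(seed + i) for i in range(k)]
        self.mins = [None] * k   # h_i(t); None plays the role of infinity
        self.counts = [0] * k    # C_i(t)

    def update(self, x):
        for i, h in enumerate(self.hs):
            v = h(x)
            if self.mins[i] is None or v < self.mins[i]:   # step 2
                self.mins[i], self.counts[i] = v, 1
            elif v == self.mins[i]:                        # step 3
                self.counts[i] += 1
            # step 4: otherwise the state is unchanged

    def estimate(self, a):
        """Fraction of hash functions whose min-hash value appeared a times."""
        return sum(c == a for c in self.counts) / len(self.counts)
```

Feeding it a stream in which 5 of 20 distinct items appear once drives estimate(1) toward the true 1-rarity 0.25, up to the sampling error of k.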

Estimating Rarity - in unbounded window

- Now, we merely need to count the C_i(t)'s that equal a, since from Note 2 and our summary we get r_a ≈ |{l : 1 ≤ l ≤ k, h_l(R_a(t)) = h_l(D(t))}| / k = |{i : 1 ≤ i ≤ k, C_i(t) = a}| / k.
- Space complexity - we need O(k) words for the min-hash values (h_i) and the counters (C_i), and O(k) seeds for the e-min-hash functions, each of which needs O(log u · log(1/e)) bits to store. We set k in terms of e (the desired accuracy), but in any case k = O(1). Finally, we get space complexity O(log u · log(1/e))!

Estimating Rarity - in unbounded window

- Time complexity - in each step we need to compute k values of the e-min-hash functions, which takes O(k · log(1/e)) time, and also compare and update k values. Since k = O(1), we get time complexity O(log(1/e)).

Estimating Similarity - in unbounded window

- Our goal: given 2 data streams X and Y, we want to estimate their similarity s(t) = |X(t) ∩ Y(t)| / |X(t) ∪ Y(t)|,
- which, by Lemma 1, is estimated by |{l : 1 ≤ l ≤ k, h_l(X(t)) = h_l(Y(t))}| / k.
- We actually use an easier version of the rarity algorithm - since now we only need to compare the h_i(t) that X and Y produce at time t: when item a(t) arrives, we compute h_i(a(t)) and set h_i(t) = min{h_i(a(t)), h_i(t-1)}.
- Space and time complexity are as before.
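A sketch of this easier variant, under the same assumptions as the rarity sketch (names mine, pairwise-independent hash as a stand-in for the e-min-wise family):

```python
import random

P = 2_147_483_647  # prime assumed larger than the universe size

def make_hash(seed):
    rng = random.Random(seed)
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: (a * x + b) % P

class SimilarityEstimator:
    """Keeps k running min-hash values for each of the two streams X and Y."""
    def __init__(self, k, seed=0):
        self.hs = [make_hash(seed + i) for i in range(k)]
        self.minx = [None] * k   # h_i(t) over stream X
        self.miny = [None] * k   # h_i(t) over stream Y

    def _update(self, mins, x):
        for i, h in enumerate(self.hs):
            v = h(x)
            if mins[i] is None or v < mins[i]:
                mins[i] = v

    def update_x(self, x): self._update(self.minx, x)
    def update_y(self, y): self._update(self.miny, y)

    def estimate(self):
        """Fraction of hash functions on which the two min-hashes agree."""
        return sum(a == b for a, b in zip(self.minx, self.miny)) / len(self.hs)
```

On the earlier example streams A = 1,2,4,2,5 and B = 2,3,1,3,2,6, the estimate converges to the true similarity 2/6 as k grows.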

Estimating Similarity - in window data streams

- We now consider the windowed model.
- We want to use a similar approach as in the unbounded case, but maintaining a min-hash value here is difficult: the current minimum may expire out of the window.
- Instead, we keep a list of possible min-hash values (and prove later that it is short enough).
- We use a domination property of min-hash functions.

Estimating Similarity - in window data streams

- Some definitions first:
- an active item is an item that still lives within the window boundary.
- An active item a2 dominates an active item a1 if a2 arrived later in the window but h_i(a2) < h_i(a1) (has a smaller min-hash value). Notice that a dominated item will never get to be the min-hash value of h_i within the window, since there is always a preferred item that outlives it...
- dominance property example:

Estimating Similarity - in window data streams

- (figure: dominance example over a window of size N = 5, with the dominating item marked)

Estimating Similarity - in window data streams

- Note that now h_i(t) = h_i(a_j1), where a_j1 is the first item on the list! (h_i(t) is the min-hash value in the window.)
- The algorithm for maintaining L_i: when item a(t+1) arrives, we compute h_i(a(t+1)).
- Delete all items in the list that have a bigger hash value (they are all being dominated).
- If h_i(a(t+1)) equals the last hash value on the list, just update that pair with the last arrival time.
- Else, append the pair (h_i(a(t+1)), t+1) to the end of the list.
- Check whether the first item on the list has expired. If it has - delete it (it is no longer active).
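The list maintenance above, for one hash function, can be sketched as follows (class name mine; the demo uses the identity hash for readability, whereas the real algorithm draws h from the e-min-wise family):

```python
from collections import deque

class WindowedMinHash:
    """Dominance list L_i for one hash function h over a window of size N.
    Entries (hash value, latest arrival time) are kept in increasing hash
    order; the head is the window's min-hash value."""
    def __init__(self, h, N):
        self.h, self.N = h, N
        self.lst = deque()

    def update(self, x, t):
        v = self.h(x)
        while self.lst and self.lst[-1][0] > v:  # newly dominated entries
            self.lst.pop()
        if self.lst and self.lst[-1][0] == v:
            self.lst[-1] = (v, t)                # same value: refresh arrival time
        else:
            self.lst.append((v, t))
        # arrival times increase along the list, so only the head can expire
        if self.lst and self.lst[0][1] <= t - self.N:
            self.lst.popleft()

    def min_hash(self):
        return self.lst[0][0] if self.lst else None
```

With the identity hash, a window of size 3, and arrivals 5, 3, 7, 8, 9 at times 1..5: item 3 dominates 5 at time 2, then expires at time 5, leaving 7 as the window's min-hash.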

Estimating Similarity - in window data streams

- Min-hash list example (figure)
- We only have to make sure the list L_i isn't too long. We use...

Estimating Similarity - in window data streams

- Lemma 2 - with high probability, the length of L_i is Θ(H_N), where H_N is the Nth harmonic number (1 + 1/2 + 1/3 + ... + 1/N), which is O(log N).
- Since we now know the min-hash value h_i in the window (the hash of the first item on the list), we follow the logic we used in the unbounded stream:
- s(t) ≈ |{l : 1 ≤ l ≤ k, h_l(X(t)) = h_l(Y(t))}| / k (Lemma 1)
- So just compare the min-hash values of the min-hash family for both streams X and Y.

Estimating Similarity - in window data streams

- Space complexity: we use O(k) hash functions; for each one we keep a linked list of size O(log N), with elements of size O(log u) each. Overall, we use space complexity O((log N)(log u)).
- Time complexity: when updating the list L_i, we need to find the appropriate place for the new item. Since the list is ordered by hash value, a binary search suffices ⇒ we get O(log |L_i|) = O(log log N).

Estimating Rarity - in window data streams

- We use a similar concept to the one we used earlier:
- we still want to keep a linked list of dominant min-hash values,
- but since now we need to find a instances of an item, we keep several arrival times of the item.
- So now each entry is the pair (h_i(x), T_x), where T_x is an ordered list of the latest a time instances of the item.
- So the list now looks like a list of (hash value, arrival-time list) pairs.

Estimating Rarity - in window data streams

- Note that here we store a list of a arrival instances of an item, while previously we stored only the latest arrival time of each item - which is the largest value in that list.
- The algorithm for maintaining L_i resembles the one before: when item a(t+1) arrives, we compute h_i(a(t+1)).
- Delete all items in the list that have a bigger hash value (they are all being dominated).
- If h_i(a(t+1)) equals the last hash value on the list, append t+1 to that entry's arrival-time list. If that list now has more than a items, delete its first (oldest) one.

Estimating Rarity - in window data streams

- Else, append the pair (h_i(a(t+1)), t+1) to the end of the list, where the arrival-time list here is a singleton.
- Check whether the first arrival time of the first item on the list has expired. If it has - delete it (it is no longer active).
- The list length here is O(a · log N), using Lemma 2 (here we have up to a elements for each item).

Estimating Rarity - in window data streams

- And the same logic holds: since r_a = |R_a| / |D|, from Lemma 1 we get r_a ≈ |{l : 1 ≤ l ≤ k, h_l(R_a(t)) = h_l(D(t))}| / k, and from Note 2 we get h(D) ∈ R_a iff the min-hash value of D appears exactly a times in the window.
- Thus we only have to count the min-hash values h_i (= h_i(a_j1)) whose arrival-time list is exactly a long!
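A sketch of this windowed-rarity list for one hash function. One deviation from the slides, flagged as my choice: I keep a+1 arrival times per entry instead of a, so that "exactly a occurrences in the window" can be distinguished from "more than a"; the demo uses the identity hash:

```python
from collections import deque

class WindowedRarityList:
    """Dominance list for one hash function h over a window of size N.
    Each entry pairs a hash value with the latest arrival times of its item.
    Sketch: keeps alpha+1 times (the slides keep alpha) so that 'exactly
    alpha occurrences in the window' is decidable."""
    def __init__(self, h, N, alpha):
        self.h, self.N, self.alpha = h, N, alpha
        self.lst = deque()  # (hash value, deque of arrival times), hash-increasing

    def update(self, x, t):
        v = self.h(x)
        while self.lst and self.lst[-1][0] > v:   # newly dominated entries
            self.lst.pop()
        if self.lst and self.lst[-1][0] == v:     # same item seen again
            times = self.lst[-1][1]
            times.append(t)
            if len(times) > self.alpha + 1:
                times.popleft()
        else:
            self.lst.append((v, deque([t])))
        # expire: drop head arrival times that left the window, and empty heads
        while self.lst:
            head_times = self.lst[0][1]
            while head_times and head_times[0] <= t - self.N:
                head_times.popleft()
            if head_times:
                break
            self.lst.popleft()

    def head_is_alpha_rare(self, t):
        """True iff the window's min-hash item appears exactly alpha times."""
        if not self.lst:
            return False
        active = sum(1 for s in self.lst[0][1] if s > t - self.N)
        return active == self.alpha
```

With N = 5, a = 2, and arrivals 4, 2, 4, 7, 2 at times 1..5, the head item 2 has the two active arrival times 2 and 5, so it counts as 2-rare.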

Estimating Rarity - in window data streams

- Space complexity: we use O(k) hash functions; for each one we keep a linked list of size O(a · log N), with elements of size O(log u) each. Overall (for constant a), we use space complexity O((log N)(log u)).
- Time complexity: updating the list L_i costs exactly as in the similarity case. We get time complexity O(log log N).

Concluding remarks

- The algorithms presented here are the first solutions for the windowed rarity and similarity problems (so the authors claim...).
- Citation from the article: "We expect our technique to find applications in practice."