Estimating Rarity and Similarity over Data Stream Windows - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Estimating Rarity and Similarity over Data Stream Windows


1
Estimating Rarity and Similarity over Data Stream
Windows
  • Paper written by:
  • Mayur Datar
  • S. Muthukrishnan
  • Effi Goldstein

2
Agenda
  • Introduction
  • Motivation for windowed data stream algorithms
  • Defining the problems
  • The impressive results
  • Introducing the algorithmic tools we'll use
  • Algorithm for estimating rarity and similarity in
    the unbounded data stream model
  • Algorithm for estimating rarity and similarity
    over windowed data streams

3
Introduction - motivation
  • The sliding window model
  • Often used for observations of telecom
    networks (packets in routers, telephone calls)
  • Retrieving information on the fly (e.g.
    highway control, stock exchange)
  • Important restriction - we are only allowed
    polylogarithmic (in the window size) storage space.
  • This is very restrictive - consider the problem of
    calculating the minimum over the window.
  • That's why we settle for a good estimation.

4
Introduction - motivation
  • Motivation for rarity and similarity: they extract
    unique and interesting information from a data
    stream
  • Rarity
  • estimate the portion of users who are not
    satisfied (online stores)
  • an indication of a Denial-of-Service attack.
  • Similarity
  • What are the common items in two market baskets?
  • Similarity of IP addresses visiting two web sites
  • All of these examples are well-motivated by
    commercial uses.

5
Introduction - the problems
  • Recall our work space
  • the window (of size N)
  • the set of items - U = {1, ..., u}.
  • Rarity -
  • an item x is a-rare if x appears precisely a
    times in the set.
  • R_a - the set of a-rare items; |R_a| = no. of such items in the set.
  • |D| = no. of distinct items in the set.
  • a-rarity - r_a = |R_a| / |D|

6
Introduction - the problems
  • Rarity examples: S = {2, 3, 2, 4, 3, 1, 2, 4}
    D(istinct) = {1, 2, 3, 4}
    1-rare = {1},    1-rarity = 1/4
    2-rare = {3, 4}, 2-rarity = 1/2
    3-rare = {2},    3-rarity = 1/4
    (a code sketch follows below)
  • note that 1-rarity is the fraction of items that
    do not repeat within the window.

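Below is a minimal Python sketch of the definition above - an exact, non-streaming computation used only to check this example (the helper name a_rarity is ours, not the paper's):

```python
from collections import Counter

def a_rarity(stream, a):
    """Fraction of distinct items that appear exactly `a` times."""
    counts = Counter(stream)             # item -> number of appearances
    a_rare = [x for x, c in counts.items() if c == a]
    return len(a_rare) / len(counts)     # |R_a| / |D|

S = [2, 3, 2, 4, 3, 1, 2, 4]
print(a_rarity(S, 1))   # 0.25  (1-rare = {1})
print(a_rarity(S, 2))   # 0.5   (2-rare = {3, 4})
print(a_rarity(S, 3))   # 0.25  (3-rare = {2})
```
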
7
Introduction - the problems
  • Similarity - here we have two streams A & B
  • define X(t) and Y(t) to be the sets of distinct
    items seen in each stream by time t
  • we use the Jaccard coefficient to measure their
    similarity: |X(t) ∩ Y(t)| / |X(t) ∪ Y(t)|
  • similarity example: A = {1, 2, 4, 2, 5}, B =
    {2, 3, 1, 3, 2, 6}, X(t) = {1, 2, 4, 5}, Y(t) =
    {2, 3, 1, 6} --> similarity = 2/6
    (a code sketch follows below)

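A similarly direct (non-streaming) check of the Jaccard coefficient on the example streams; the helper name jaccard is illustrative:

```python
def jaccard(stream_a, stream_b):
    """Jaccard coefficient |X ∩ Y| / |X ∪ Y| over the distinct items."""
    x, y = set(stream_a), set(stream_b)
    return len(x & y) / len(x | y)

A = [1, 2, 4, 2, 5]
B = [2, 3, 1, 3, 2, 6]
print(jaccard(A, B))   # 2/6 ≈ 0.333  (intersection {1, 2}, union {1,...,6})
```
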
8
Introduction - how good are the results...
  • First important result: there is no other
    known estimation for rarity & similarity in a
    windowed model!
  • This is the reason there are no graphs at the
    end...
  • The final algorithm uses only
  • O(log N log u) space
  • O(log log N) time
  • and estimates the results r, s with approximation
    factor 1±e, where e can be reduced to any required
    constant.

9
Algorithmic Tools...
  • Min-wise hashing
  • let p be a random permutation over U, and A ⊆ U a subset,
  • the min-hash value of A under p, h_p(A), is
    the element of A with the smallest index
    after permuting the subset.
  • The hash function should assign unique values
    (be a one-to-one function) on the set U,
  • i.e. a permutation.

10
Algorithmic Tools - min-hash example
  • For example, consider the permutations (in
    one-line notation) p1 = (1 2 3 4 5),
    p2 = (5 4 3 2 1), p3 = (3 4 5 1 2) over
    U = {1, ..., 5} (such functions can also be given
    in closed form, e.g. p(x) = 2x + 1 mod 5), and the
    sets A = {1, 3, 4}, B = {2, 5}, C = {1, 2, 4}. Their
    min-hash values are as follows:
    hp1(A) = 1, hp1(B) = 2, hp1(C) = 1
    hp2(A) = 4, hp2(B) = 5, hp2(C) = 4
    hp3(A) = 3, hp3(B) = 5, hp3(C) = 4
    (a code sketch follows below)

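A small Python check of this example, assuming the convention stated on the previous slide: h_p(A) is the element of A that appears earliest in the permuted order p:

```python
def min_hash(perm, items):
    """Element of `items` with the smallest position in the permutation
    `perm`, given in one-line notation, e.g. (3, 4, 5, 1, 2)."""
    position = {x: i for i, x in enumerate(perm)}   # element -> index in perm
    return min(items, key=lambda x: position[x])

p1, p2, p3 = (1, 2, 3, 4, 5), (5, 4, 3, 2, 1), (3, 4, 5, 1, 2)
A, B, C = {1, 3, 4}, {2, 5}, {1, 2, 4}

for name, p in [("p1", p1), ("p2", p2), ("p3", p3)]:
    print(name, min_hash(p, A), min_hash(p, B), min_hash(p, C))
# p1 1 2 1
# p2 4 5 4
# p3 3 5 4
```
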
11
Algorithmic Tools - min-hash power...
  • An important property of min-hash
    functions - simple to prove, however it leads to
    powerful results:
  • Lemma 1: Let h_1(A), ..., h_k(A) and h_1(B), ..., h_k(B)
    be k independent min-hash values for the sets A and B. Let
    S(A, B) be the fraction of the min-hash values
    that they agree on. Then the expected value of S(A, B) is
    |A ∩ B| / |A ∪ B|, i.e. S(A, B) estimates the Jaccard similarity.
    (a code sketch follows below)

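A minimal sketch of the estimator behind Lemma 1: draw k permutations, take each set's min-hash signature, and report the fraction of agreeing positions. The universe here is tiny, so true random permutations are affordable; the helper names and k = 200 are illustrative choices, not the paper's:

```python
import random

def minhash_signature(items, perms):
    """Min-hash value of `items` under each permutation in `perms`."""
    sigs = []
    for perm in perms:
        position = {x: i for i, x in enumerate(perm)}
        sigs.append(min(items, key=lambda x: position[x]))
    return sigs

def estimate_jaccard(a, b, universe, k=200, seed=0):
    rng = random.Random(seed)
    perms = [rng.sample(list(universe), len(universe)) for _ in range(k)]
    sa, sb = minhash_signature(a, perms), minhash_signature(b, perms)
    return sum(x == y for x, y in zip(sa, sb)) / k   # fraction of agreements

A, B = {1, 2, 4, 5}, {1, 2, 3, 6}
print(estimate_jaccard(A, B, universe=range(1, 7)))  # ≈ 2/6
```

For a large universe U this ideal family of all permutations is exactly what the next slides say we cannot afford, which is where the approximate (e-min-wise) families come in.
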
12
Algorithmic Tools - min-hash families...
  • Thus we will need to find a set of independent
    min-hash functions.
  • The ideal family of min-hash functions is the set of
    all permutations over U. However, it'll
    require O(u log u) bits to represent an arbitrary
    permutation. We can't afford that. We need to
    find something else...

13
Algorithmic Tools - min-hash families...
  • Approximate min-hash families, otherwise known
    as e-min-wise independent hash families.
  • They have the property that for any X ⊆ U and x ∈ X we get
    Pr[ h(x) = min h(X) ] = (1 ± e) / |X|.
  • It has been proven that any function from such a family
    can be represented by only O(log u log(1/e))
    bits, and computed in O(log(1/e)) time!
  • The aforementioned Lemma 1 still holds for this
    family! We just need to set the value of k
    appropriately in terms of e, and the
    expected error changes from e·r to e·r + e.

14
Algorithmic Tools - min-hash families...
  • To conclude, we will only need O(log u log(1/e))
    bits for storing each hash function, and O(k) hash
    functions, to get the approximation of the lemma!

15
Estimating Rarity - in unbounded window
  • Recall our goal: find r_a(t), up
    to the required precision, at any time t.
  • Define: S - a multiset, the actual data stream.
    D - the set of distinct items from S.
    R_a - the set of items which appear exactly a times
    in S. => r_a = |R_a| / |D|

16
Estimating Rarity - in unbounded window
  • Note 1: R_a ⊆ D, and thus
    |R_a ∩ D| / |R_a ∪ D| = |R_a| / |D| = r_a
  • Note 2: h(R_a) = h(D) iff the min-hash
    value of D
    appears exactly a times in S. => Hence, it
    suffices to maintain min-hash values for D
    only, as long as we can count the no. of
    appearances.

17
Estimating Rarity - in unbounded window
  • To summarize: what we want is r_a, which equals
    |R_a| / |D| by our definition, which equals
    |R_a ∩ D| / |R_a ∪ D| (Note 1), which in turn equals
    |{l : 1 ≤ l ≤ k, h_l(R_a) = h_l(D)}| / k (Lemma 1), so it
    suffices to count the min-hash values of D that
    are a-rare (Note 2). These observations lead to the
    following algorithm:

18
Estimating Rarity - in unbounded window
  • The Algorithm: choose k min-hash functions
    h_1, ..., h_k (k will be determined later).
    Maintain:
    - h_i(t), the min-hash value of the stream seen
      by time t.
    - C_i(t), a counter of the no. of appearances of
      h_i(t).
    Initialize the min-hash values (h_i) to infinity,
    and the counters to 0. When item a(t+1) arrives:
    1) for each i, compute h_i(a(t+1))
    2) if h_i(a(t+1)) < h_i(t), set h_i(t+1) = h_i(a(t+1)),
       C_i(t+1) = 1
    3) if h_i(a(t+1)) = h_i(t), keep h_i(t+1) = h_i(t) and
       increment the counter: C_i(t+1) = C_i(t) + 1
    4) otherwise set h_i(t+1) = h_i(t), C_i(t+1) = C_i(t).
    Then process the next item a(t+2).
    (a code sketch follows below)

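A runnable sketch of this update rule, with random linear hashes (a·x + b mod p) standing in for the e-min-wise family; the class name, k = 100 and the hash choice are illustrative assumptions, not the paper's construction:

```python
import random

class RaritySketch:
    """Streaming estimate of a-rarity over an unbounded stream (a sketch:
    random linear hashes stand in for an e-min-wise family)."""

    def __init__(self, k=100, prime=2_147_483_647, seed=0):
        rng = random.Random(seed)
        self.hashes = [(rng.randrange(1, prime), rng.randrange(prime), prime)
                       for _ in range(k)]
        self.min_val = [None] * k      # h_i(t): current min-hash value
        self.count = [0] * k           # C_i(t): appearances of that value

    def _h(self, i, x):
        a, b, p = self.hashes[i]
        return (a * x + b) % p

    def add(self, x):
        for i in range(len(self.hashes)):
            v = self._h(i, x)
            if self.min_val[i] is None or v < self.min_val[i]:
                self.min_val[i], self.count[i] = v, 1      # new minimum
            elif v == self.min_val[i]:
                self.count[i] += 1                         # same minimum again

    def rarity(self, a):
        """Fraction of hash functions whose min-hash item appeared a times."""
        return sum(c == a for c in self.count) / len(self.count)

sketch = RaritySketch()
for x in [2, 3, 2, 4, 3, 1, 2, 4]:
    sketch.add(x)
print(sketch.rarity(1), sketch.rarity(2), sketch.rarity(3))
# ≈ 0.25, 0.5, 0.25 up to sampling error
```
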
19
Estimating Rarity - in unbounded window
  • Now, we merely need to count all the C_i(t)'s that
    equal a, since from Note 2 and our summary we
    get {l : 1 ≤ l ≤ k, h_l(R_a,t) = h_l(D_t)} = {l :
    1 ≤ l ≤ k, C_l(t) = a}.
  • Space complexity - we need O(k) space for the min-hash
    values (h_i) and the counters (C_i), and O(k) seeds
    for the e-min-hash functions (h_i), each of which needs
  • O(log u log(1/e)) bits to store. We set k
    in terms of e (the desired accuracy), but in any
    case k = O(1). Finally, we get space complexity
    O(log u log(1/e))!

20
Estimating Rarity - in unbounded window
  • Time complexity - in each step we need to compute
    k values of the e-min-hash functions, which
    takes O(k log(1/e)), and also compare and sum up k
    values. Since k = O(1), we get time complexity
    O(log(1/e)).

21
Estimating Similarity - in unbounded window
  • Our goal: given 2 data streams X & Y, we want to
    estimate their similarity |X(t) ∩ Y(t)| / |X(t) ∪ Y(t)|,
  • which, by Lemma 1, equals |{l : 1 ≤ l ≤ k, h_l(X_t) =
    h_l(Y_t)}| / k.
  • We actually use an easier version of the
    rarity algorithm - since now we only need to
    compare the h_i(t) that X & Y produce at time
    t: when item a_t arrives, we compute h_i(a_t) and
    set h_i(t) = min{ h_i(a_t), h_i(t-1) }
  • space and time complexity are as before.
    (a code sketch follows below)

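And the corresponding (simpler) similarity sketch: both streams share the same hash functions, each keeps only its running minima, and the estimate is the fraction of agreeing positions. Same hedges as above - the random linear hashes are stand-ins:

```python
import random

class MinHashStream:
    """Running min-hash signature of an unbounded stream (a sketch)."""
    def __init__(self, hashes):
        self.hashes = hashes                     # shared (a, b, p) triples
        self.sig = [None] * len(hashes)          # h_i(t) for each hash function
    def add(self, x):
        for i, (a, b, p) in enumerate(self.hashes):
            v = (a * x + b) % p
            if self.sig[i] is None or v < self.sig[i]:
                self.sig[i] = v

def make_hashes(k=200, prime=2_147_483_647, seed=0):
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(prime), prime)
            for _ in range(k)]

hashes = make_hashes()                           # both streams must share them
X, Y = MinHashStream(hashes), MinHashStream(hashes)
for x in [1, 2, 4, 2, 5]:
    X.add(x)
for y in [2, 3, 1, 3, 2, 6]:
    Y.add(y)
agree = sum(u == v for u, v in zip(X.sig, Y.sig))
print(agree / len(hashes))                       # ≈ 2/6, up to sampling noise
```
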
22
Estimating Similarity - in window data streams
  • We now consider the windowed model
  • We want to use a similar approach to the
    unbounded case, but maintaining a min-hash
    value here is difficult.
  • Instead, we keep a list of possible min-hash
    values (and prove later that it is short
    enough)
  • We use a domination property of min-hash
    functions

23
Estimating Similarity - in window data streams
  • Some definitions first:
  • an active item is an item that still lives within
    the window boundary.
  • An active item a2 dominates active item a1 if
    it arrived later in the window, but h_i(a2) <
    h_i(a1) (it has a smaller min-hash value). Notice that
    a dominated item will never become the
    min-hash value of h_i within the window,
    since there is always a preferred item...
  • dominance property example:

24
Estimating Similarity - in window data streams
  • window size N = 5
    [figure: a sample window of 5 items, with the dominating item marked]
25
Estimating Similarity - in window data streams
  • Note that now h_i(t) = h_i(a_{j+1}), the hash value of the
    first item on the list! (h_i(t) is the
    min-hash value in the window)
  • The algorithm for maintaining the list L_i:
    - when item a_{t+1} arrives, we compute h_i(a_{t+1}).
    - delete all items in the list that have a
      bigger hash value (they are all being
      dominated)
    - if h_i(a_{t+1}) equals the last
      hash value on the list, just update that
      pair with the last arrival time.
    - else, append the pair (h_i(a_{t+1}), t+1) to the end of the
      list.
    - check whether the first item on the list has
      expired. If it has - delete it (it is no
      longer active).
    (a code sketch follows below)

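A sketch of this list maintenance for a single hash function, under a toy hash; the class and the deque-based representation are our illustration of the slide's steps, not the paper's code:

```python
from collections import deque

class WindowedMinHash:
    """Min-hash of a sliding window for one hash function h (a sketch).
    L holds (hash value, latest arrival time) pairs, increasing in both hash
    value and arrival time; the front pair is the window's min-hash."""

    def __init__(self, h, window_size):
        self.h = h                        # hash function (stand-in for e-min-wise)
        self.N = window_size
        self.L = deque()                  # entries: [hash_value, latest_time]

    def add(self, x, t):
        v = self.h(x)
        while self.L and self.L[-1][0] > v:    # back entries with bigger hash
            self.L.pop()                       # are dominated by x
        if self.L and self.L[-1][0] == v:
            self.L[-1][1] = t                  # same hash value: refresh its time
        else:
            self.L.append([v, t])
        if self.L and self.L[0][1] <= t - self.N:
            self.L.popleft()                   # the front's item left the window

    def min_hash(self):
        return self.L[0][0] if self.L else None

h = lambda x: (7 * x + 3) % 31                 # toy hash, purely for illustration
w = WindowedMinHash(h, window_size=5)
for t, x in enumerate([1, 2, 4, 2, 5, 3, 1], start=1):
    w.add(x, t)
print(w.min_hash(), len(w.L))                  # current window min-hash, list length
```

Note the sketch removes dominated entries by popping them one at a time (amortized constant work per arrival); the slides instead locate the insertion point by searching the sorted list, which is what gives the O(log log N) update bound quoted later.
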
26
Estimating Similarity - in window data streams
  • Min-hash list example

[figure: the min-hash list L_i for a sample window]
We only have to make sure the list L_i isn't too
long. We use...
27
Estimating Similarity - in window data streams
  • Lemma 2 - with high probability, the length of
    L_i is Θ(H_N), where H_N is the Nth
    harmonic number (1 + 1/2 + 1/3 + ... + 1/N), which is
    O(log N).
  • Since we now know the min-hash value h_i
    in the window (the hash of the first item on the list
    L_i), we now follow the logic we used for the
    unbounded stream:
  • We saw that s(X, Y) = |X(t) ∩ Y(t)| / |X(t) ∪ Y(t)| =
    |{l : 1 ≤ l ≤ k, h_l(X_t) = h_l(Y_t)}| / k (Lemma 1)
  • So we just compare the min-hash values of the
    min-hash family for both streams X & Y.

28
Estimating Similarity - in window data streams
  • Space complexity: we use O(k) hash functions, and
    for each one we keep a linked list of size O(log
    N), with elements of size O(log u) each.
    Overall, we get space complexity O((log
    N)(log u)).
  • Time complexity: when updating the list L_i,
    we need to search for the appropriate place to
    insert the new item. Since the list is ordered,
    this is a simple search-and-insert => we get O(log
    |L_i|) = O(log log N).

29
Estimating Rarity - in window data streams
  • We use a similar concept to the one we used
    earlier:
  • we still want to keep a linked list of dominant
    min-hash values
  • But since we now need to find a instances of an
    item, we keep several arrival times of each item.
  • So now, each entry is a pair (h_i(x), T_x), where
    T_x is an ordered list of the latest a arrival
    times of the item x
  • So the list now looks like:
    [figure: the list of (hash value, arrival-time list) pairs]

30
Estimating Rarity - in window data streams
  • Note that here we store a list of a instances of
    each item, while previously we stored only the
    latest arrival time of each item in the list - which
    is the largest value in that time list.
  • The algorithm for maintaining the list resembles
    the one before: - when item a_{t+1} arrives, we
    compute h_i(a_{t+1}). - delete all items in the
    list that have a bigger hash value (they
    are all being dominated) - if h_i(a_{t+1})
    equals the last hash value on the list, append
    t+1 to its arrival-time list. If the list
    now has more than a items, delete the first one.

31
Estimating Rarity - in window data streams
  • - else, append the pair (h_i(a_{t+1}), [t+1])
    to the end of the list, where the arrival
    list here is a singleton. - check whether the first
    arrival time of the first item on the list has
    expired. If it has - delete it (it is
    no longer active).
  • The list length here is O(a log N), using Lemma 2
    (here we have a elements for each item).
    (a code sketch follows below)

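A sketch of the windowed rarity bookkeeping. One deviation is flagged explicitly: each entry keeps the latest a+1 arrival times instead of a, so that a front entry with exactly a unexpired times certifies exactly a in-window appearances; everything else follows the slides, with random linear hashes again standing in for the e-min-wise family:

```python
import random
from collections import deque

class WindowedRaritySketch:
    """Sliding-window a-rarity sketch (illustrative, not the paper's code).
    Per hash function, the dominance list stores [hash value, recent arrival
    times]; assumption: we keep the latest a+1 times rather than a."""

    def __init__(self, a, window_size, k=100, prime=2_147_483_647, seed=0):
        rng = random.Random(seed)
        self.a, self.N = a, window_size
        self.hashes = [(rng.randrange(1, prime), rng.randrange(prime), prime)
                       for _ in range(k)]
        self.lists = [deque() for _ in range(k)]   # entries: [hash, deque(times)]

    def add(self, x, t):
        for (ah, bh, p), L in zip(self.hashes, self.lists):
            v = (ah * x + bh) % p
            while L and L[-1][0] > v:              # dominated entries go
                L.pop()
            if L and L[-1][0] == v:                # same hash: extend its times
                times = L[-1][1]
            else:
                times = deque()
                L.append([v, times])
            times.append(t)
            if len(times) > self.a + 1:            # keep the latest a+1 arrivals
                times.popleft()
            while L:                               # expire the front entry
                front = L[0][1]
                while front and front[0] <= t - self.N:
                    front.popleft()
                if front:
                    break
                L.popleft()

    def rarity(self):
        """Fraction of hash functions whose window min-hash item
        appears exactly a times in the current window."""
        hits = sum(1 for L in self.lists if L and len(L[0][1]) == self.a)
        return hits / len(self.lists)

ws = WindowedRaritySketch(a=1, window_size=5)
for t, x in enumerate([2, 3, 2, 4, 3, 1, 2, 4], start=1):
    ws.add(x, t)
print(ws.rarity())   # ≈ 3/4: in the last window [4, 3, 1, 2, 4], items 3, 1, 2 are 1-rare
```
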
32
Estimating Rarity - in window data streams
  • And the same logic holds: since R_a,t ⊆ D_t,
    we get r_a(t) = |R_a,t ∩ D_t| / |R_a,t ∪ D_t|;
    from Lemma 1 we get that this is estimated by
    |{l : 1 ≤ l ≤ k, h_l(R_a,t) = h_l(D_t)}| / k;
    and from Note 2, h_l(R_a,t) = h_l(D_t) iff the min-hash value of D
    appears exactly a times in the window.
  • Thus we only have to count the min-hash values
    h_i (= h_i(a_{j+1})) whose arrival-time list is exactly a
    entries long!

33
Estimating Rarity - in window data streams
  • Space complexity: we use O(k) hash functions, and
    for each one we keep a linked list of size O(a
    log N), with elements of size O(log u) each.
    Overall, we use space complexity O((log
    N)(log u)) (treating a as a constant).
  • Time complexity: updating the list L_i
    costs exactly as much as in the similarity case. We get
    time complexity O(log log N).

34
Concluding remarks
  • The algorithms presented here are the first
    solutions for the windowed rarity and similarity
    problems (so the authors claim...)
  • Citation from the article: "We expect our
    technique to find applications in practice"