Improve search in unstructured P2P overlay

Provided by: OPA72



Peer-to-peer Networks
  • Peers are connected by an overlay network.
  • Users cooperate to share files (e.g., music,
    videos, etc.)

(Search in) Basic P2P Architectures
  • Centralized: central directory server (Napster).
  • Decentralized: search is performed by probing
    peers.
  • Structured (DHTs) (CAN, Chord, ...): location is
    coupled with topology; search is routed by the
    query.
  • Only exact-match queries, tightly controlled.
  • Unstructured (Gnutella): search is blind;
    probed peers are unrelated to the query.

  • Search strategies
  • Beverly Yang and Hector Garcia-Molina, Improving
    Search in Peer-to-Peer Networks, ICDCS 2002
  • Arturo Crespo, Hector Garcia-Molina, Routing
    Indices For Peer-to-Peer Systems, ICDCS 2002
  • Short cuts
  • Kunwadee Sripanidkulchai, Bruce Maggs and Hui
    Zhang, Efficient Content Location Using
    Interest-based Locality in Peer-to-Peer Systems,
    INFOCOM 2003.
  • Replication
  • Edith Cohen and Scott Shenker, Replication
    Strategies in Unstructured Peer-to-Peer
    Networks, SIGCOMM 2002.

Improving Search in Peer-to-Peer Networks
  • ICDCS 2002
  • Beverly Yang
  • Hector Garcia-Molina

  • The purpose of a data-sharing P2P system is to
    accept queries from users, and to locate and
    return data (or pointers to the data).
  • Metrics
  • Cost
  • Average aggregate bandwidth
  • Average aggregate processing cost
  • Quality of results
  • Number of results
  • Satisfaction: a query is satisfied if Z (a value
    specified by the user) or more results are
    returned.
  • Time to satisfaction

Current Techniques
  • Gnutella
  • BFS with depth limit D.
  • Wastes bandwidth and processing resources.
  • Freenet
  • DFS with depth limit D.
  • Poor response time.

Broadcast policies
  • Iterative deepening
  • Directed BFS
  • Local Indices

Iterative Deepening
  • In systems where satisfaction is the metric of
    choice, iterative deepening is a good technique.
  • A policy P = {a, b, c} lists the search depths
    to try; W is the waiting time between iterations.
  • A source node S first initiates a BFS of depth a.
  • The query is processed and then becomes frozen at
    all nodes that are a hops from the source.
  • S then waits for a time period W.

Iterative Deepening (cont)
  • If the query is not satisfied, S starts the next
    iteration, initiating a BFS of depth b.
  • S sends a Resend message with a TTL of a.
  • A node that receives a Resend message simply
    forwards it; if the node is at depth a, it
    instead drops the Resend message and unfreezes
    the corresponding query by forwarding the query
    message with a TTL of b-a to all its neighbors.
  • A node need only freeze a query for slightly more
    than W time units before deleting it.
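
The two slides above can be illustrated with a small centralized simulation. This is only a sketch under simplifying assumptions: the overlay is an in-memory adjacency dict, each iteration re-runs the BFS from the source instead of unfreezing the query at the frontier with Resend messages, and the names `bfs_within`, `iterative_deepening`, and `has_result` are hypothetical.

```python
from collections import deque


def bfs_within(graph, source, depth):
    """Return {node: hop distance} for every node within `depth` hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == depth:
            continue  # frontier node: the query would be frozen here
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def iterative_deepening(graph, source, has_result, policy, z):
    """Try each depth in `policy` until at least z result nodes are found."""
    found = set()
    for depth in policy:
        reached = bfs_within(graph, source, depth)
        found = {n for n in reached if n != source and has_result(n)}
        if len(found) >= z:
            return depth, found  # query satisfied, stop deepening
    return policy[-1], found  # policy exhausted without satisfaction


# Toy overlay: a chain s - a - b - c - d, with the only result at c.
graph = {
    "s": ["a"],
    "a": ["s", "b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
}
depth, found = iterative_deepening(graph, "s", lambda n: n == "c", [1, 3, 5], 1)
print(depth, found)  # 3 {'c'}
```

With policy {1, 3, 5} the first iteration (depth 1) finds nothing, and the query is satisfied at depth 3 without ever flooding to depth 5.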

Directed BFS
  • If minimizing response time is important to an
    application, iterative deepening may not be
    suitable.
  • A source sends query messages to just a subset of
    its neighbors.
  • A node maintains simple statistics on its
    neighbors:
  • Number of results received from each neighbor
  • Latency of connection

Directed BFS (cont)
  • Candidate neighbors to select:
  • The neighbor that has returned the highest number
    of results
  • The neighbor that returns response messages that
    have taken the lowest average number of hops
  • The neighbor with the highest message count
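
The neighbor-selection heuristics above can be sketched as a ranking over per-neighbor statistics. This is a minimal illustration, not the paper's implementation: the `stats` layout, the heuristic names, and `pick_neighbors` are assumptions made for the example.

```python
def pick_neighbors(stats, k, heuristic="results"):
    """Return the k neighbors ranked best under the chosen heuristic."""
    # For hop counts lower is better; for the other two higher is better.
    higher_is_better = {"results": True, "avg_hops": False, "messages": True}
    if heuristic not in higher_is_better:
        raise ValueError(f"unknown heuristic: {heuristic}")
    ranked = sorted(
        stats,
        key=lambda nb: stats[nb][heuristic],
        reverse=higher_is_better[heuristic],
    )
    return ranked[:k]


# Hypothetical statistics a node might have collected about its neighbors.
stats = {
    "n1": {"results": 12, "avg_hops": 4.0, "messages": 300},
    "n2": {"results": 30, "avg_hops": 2.5, "messages": 120},
    "n3": {"results": 5, "avg_hops": 3.0, "messages": 900},
}
print(pick_neighbors(stats, 1, "results"))   # ['n2']
print(pick_neighbors(stats, 2, "avg_hops"))  # ['n2', 'n3']
```

The source then forwards the query only to the returned subset, and nodes further along the path forward it as in a normal BFS.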

Local Indices
  • Each node n maintains an index over the data of
    all nodes within r hops radius.
  • All nodes at depths not listed in the policy
    simply forward the query.
  • Example policy: P = {1, 5}

Experimental result
Routing Indices For Peer-to-Peer Systems
Arturo Crespo, Hector Garcia-Molina
Stanford University
  • A key part of a P2P system is document discovery
  • The goal is to help users find documents with
    content of interest across potential P2P sources
  • The mechanisms for searching can be classified in
    three categories
  • Mechanisms without an index
  • Mechanisms with specialized index nodes
    (centralized search)
  • Mechanisms with indices at each node (distributed
    search)

Motivation (cont.)
  • Gnutella uses a mechanism where nodes do not have
    an index
  • Queries are propagated from node to node until
    matching documents are found
  • Although this approach is simple and robust, it
    has the disadvantage of the enormous cost of
    flooding the network every time a query is
    generated
  • Centralized-search systems, such as Napster, use
    specialized nodes that maintain an index of the
    documents available in the P2P system
  • The user queries an index node to identify nodes
    having documents with the requested content
  • A centralized system is vulnerable to attack and
    it is difficult to keep the indices up-to-date

Motivation (cont.)
  • A distributed-index mechanism
  • Routing Indices (RIs)
  • Give a direction towards the document, rather
    than its actual location
  • By using routes the index size is proportional
    to the number of neighbors

Peer-to-peer Systems
  • A P2P system is formed by a large number of nodes
    that can join or leave the system at any time
  • Each node has a local document database that can
    be accessed through a local index
  • The local index receives content queries and
    returns pointers to the documents with the
    requested content

Query Processing in a Distributed Search P2P System
  • In a distributed-search P2P system, users submit
    queries to any node along with a stop condition
  • A node receiving a query first evaluates the
    query against its own database and returns to the
    user pointers to any results
  • If the stop condition has not been reached, the
    node selects one or more of its neighbors and
    forwards the query to them
  • Queries can be forwarded to the best neighbors in
    parallel or sequentially
  • A parallel approach yields better response time,
    but generates higher traffic and may waste
    resources
Routing indices
  • The objective of a Routing Index (RI) is to allow
    a node to select the best neighbors to send a
    query to
  • An RI is a data structure that, given a query,
    returns a list of neighbors, ranked according to
    their goodness for the query
  • Each node has a local index for quickly finding
    local documents when a query is received. Nodes
    also have a compound RI (CRI) containing:
  • the number of documents along each path
  • the number of documents on each topic

Routing indices (cont.)
  • Thus, the number of results in a path can be
    estimated from the CRI
  • CRI(s_i) is the value for the cell at the column
    for topic s_i and at the row for a neighbor
  • Example: the goodness of neighbors B, C, and D
    (B: 6)
  • Note that these numbers are just estimates and
    they are subject to overcounts and/or undercounts
  • A limitation of using CRIs is that they do not
    take into account the difference in cost due to
    the number of hops necessary to reach a document
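
A compound-RI estimator can be sketched as follows, assuming the estimate is the path's document count multiplied by the fraction of documents on each query topic (topics are treated as independent). The table values, the topic labels `DB`, `N`, `T`, `L`, and the function names are illustrative assumptions; they are chosen so that neighbor B evaluates to 6 as on the slide.

```python
def cri_goodness(row, query_topics):
    """Estimate the number of matching documents along one neighbor's path."""
    total = row["docs"]
    if total == 0:
        return 0.0
    estimate = float(total)
    for topic in query_topics:
        # Fraction of the path's documents that are on this topic.
        estimate *= row.get(topic, 0) / total
    return estimate


def rank_neighbors(cri, query_topics):
    """Return neighbors ordered from best to worst estimated goodness."""
    return sorted(
        cri,
        key=lambda nb: cri_goodness(cri[nb], query_topics),
        reverse=True,
    )


# Hypothetical CRI at one node: one row per neighbor, one column per topic.
cri = {
    "B": {"docs": 100, "DB": 20, "N": 0, "T": 10, "L": 30},
    "C": {"docs": 1000, "DB": 0, "N": 300, "T": 0, "L": 50},
    "D": {"docs": 200, "DB": 100, "N": 0, "T": 100, "L": 150},
}
query = ["DB", "L"]
for neighbor in rank_neighbors(cri, query):
    print(neighbor, round(cri_goodness(cri[neighbor], query), 2))
```

Because the estimate multiplies independent per-topic fractions, it can overcount or undercount, which matches the caveat on the slide.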

Using Routing Indices
Using Routing Indices (cont.)
  • The storage space required by an RI in a node is
    modest as we are only storing index information
    for each neighbor
  • t is the counter size in bytes, c is the number
    of categories, N the number of nodes, and b the
    branching factor
  • A centralized index would require t (c + 1) N
    bytes
  • The total for the entire distributed system is
    t (c + 1) b N bytes
  • Although the RIs require more storage space
    overall than a centralized index, the cost of the
    storage space is shared among the network nodes
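
A quick arithmetic check of the storage formulas above. The parameter values are made up for illustration, and the per-node figure is simply the distributed total divided by N.

```python
def centralized_bytes(t, c, n):
    """Storage for one central index: t * (c + 1) * N."""
    return t * (c + 1) * n


def per_node_bytes(t, c, b):
    """Storage for the RI at a single node: t * (c + 1) * b."""
    return t * (c + 1) * b


def distributed_total_bytes(t, c, b, n):
    """Storage summed over all N nodes: t * (c + 1) * b * N."""
    return per_node_bytes(t, c, b) * n


# Hypothetical setting: 4-byte counters, 10 categories, 10,000 nodes,
# branching factor 4.
t, c, n, b = 4, 10, 10_000, 4
print(centralized_bytes(t, c, n))           # 440000
print(per_node_bytes(t, c, b))              # 176
print(distributed_total_bytes(t, c, b, n))  # 1760000
```

The distributed total is b times the centralized cost, but each node only stores a few hundred bytes in this example.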

Creating Routing Indices
Maintaining Routing Indices
  • Maintaining RIs is identical to the process used
    for creating them
  • For efficiency, we may delay exporting an update
    for a short time so we can batch several updates,
    thus trading RI freshness for a reduced update
    cost
  • We can also choose not to send minor updates, at
    the cost of reduced RI accuracy

Hop-count Routing Indices
Hop-count Routing Indices (cont.)
  • The estimator of a hop-count RI needs a cost
    model to compute the goodness of a neighbor
  • We assume that document results are uniformly
    distributed across the network and that the
    network is a regular tree with fanout F
  • We define the goodness (goodness_hc) of Neighbor
    i with respect to query Q for a hop-count RI as a
    discounted sum: documents that are one hop
    further away count F times less
  • If we assume F = 3, the goodness of X for a query
    about DB documents would be 13 + 10/3 = 16.33 and
    for Y would be 0 + 31/3 = 10.33
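
The worked numbers above can be reproduced with a short sketch. It assumes the hop-count RI keeps one document count per hop and that the goodness is the sum of count_j / F^(j-1) over hops j = 1, 2, ...; `hop_count_goodness` and its argument names are hypothetical.

```python
def hop_count_goodness(docs_per_hop, fanout):
    """Sum the per-hop document counts, discounting hop j by fanout**(j-1)."""
    # enumerate() starts at 0, so the first hop is divided by fanout**0 == 1.
    return sum(count / fanout**j for j, count in enumerate(docs_per_hop))


# DB documents reachable through X and Y at 1 hop and at 2 hops.
x_counts = [13, 10]
y_counts = [0, 31]
print(round(hop_count_goodness(x_counts, 3), 2))  # 16.33
print(round(hop_count_goodness(y_counts, 3), 2))  # 10.33
```

Y has more DB documents in total (31 versus 23), but X is preferred because its documents are closer.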

Exponentially aggregated RI
  • Each entry of the ERI for node N contains a value
    computed as an exponentially discounted sum over
    hops
  • th is the height and F the fanout of the assumed
    regular tree, goodness() is the Compound RI
    estimator, N_j is the summary of the local
    index of neighbor j of N, and T is the topic of
    interest of the entry
  • Problems?!

Exponentially aggregated RI (cont.)
Cycles in the P2P Network
  • There are three general approaches for dealing
    with cycles
  • No-op solution: no changes are made
  • Cycle avoidance solution: we do not allow nodes
    to create an update connection to other nodes if
    such a connection would create a cycle
  • Difficulty: absence of global information
  • Cycle detection and recovery: this solution
    detects cycles sometime after they are formed
    and then takes recovery actions to eliminate the
    effect of the cycles

Experimental Results
  • Modeling search mechanisms in a P2P system
  • We consider three kinds of network topologies
  • a tree, because it does not have cycles
  • a tree with cycles: we start with a tree and add
    extra links at random (creating cycles)
  • a power-law graph, which is considered a good
    model for P2P systems and allows us to test our
    algorithms against a realistic topology
  • We model the location of document results using
    two distributions: uniform and an 80/20 biased
    distribution
  • 80/20 assigns uniformly 80% of the document
    results to 20% of the nodes
  • In this paper we focus on the network and we use
    the number of messages generated by each
    algorithm as a measure of cost

Experimental Results (cont.)
Experimental Results (cont.)
  • In particular, CRI uses all nodes in the network,
    HRI uses nodes within a predefined horizon, and
    ERI uses nodes until the exponentially decayed
    value of an index entry reaches a minimum value
  • In the case of the No-RI approach, an 80/20
    document distribution penalizes performance as
    the search mechanism needs to visit a number of
    nodes until it finds a content-loaded node

Experimental Results (cont.)
  • RIs perform better in a power-law network than in
    a tree network (Query)
  • In a power-law network a few nodes have a
    significantly higher connectivity than the rest
  • Power-law distributions generate network
    topologies where the average path length between
    two nodes is lower than in tree topologies

Experimental Results (cont.)
  • The tradeoff between query and update costs for
    RIs
  • The cost of CRI is much higher when compared with
    HRI and ERI
  • ERI only propagates the update to a subset of the
    network
  • Achieve greater efficiency by placing Routing
    Indices in each node. Three possible RIs:
    compound RIs, hop-count RIs, and exponential RIs
  • In the experiments, ERIs and HRIs offer
    significant improvements versus not using an RI,
    while keeping update costs low

Efficient Content Location Using Interest-based
Locality in Peer-to-Peer Systems
  • Each peer is connected randomly, and searching is
    done by flooding.
  • Allow keyword search

Example of searching for an mp3 file in a Gnutella
network. The query is flooded across the network.
  • DHT (Chord)
  • Given a key, Chord maps the key to a node.
  • Each node needs to maintain O(log N) information.
  • Each query uses O(log N) messages.
  • Key search means searching by exact name.

Interest-based Locality
  • Peers with similar interests will share similar
    content

  • Shortcuts are modular.
  • Shortcuts are performance enhancement hints.

Creation of shortcuts
  • The peer uses the underlying topology (e.g.
    Gnutella) for the first few searches.
  • One of the peers that returned results is
    selected at random and added to the shortcut
    list.
  • Shortcuts are ordered by a metric, e.g. success
    rate or path latency.
  • Subsequent queries go through the shortcut list
    first.
  • If that fails, the lookup falls back to the
    underlying topology.
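
The shortcut mechanism above can be sketched as a small class layered over any lookup function. This is a hedged illustration: `ShortcutList`, its capacity-based eviction, the success-rate ranking, and the `peer_has` / `flood` callbacks are all assumptions made for the example, not the authors' code.

```python
import random


class ShortcutList:
    def __init__(self, capacity=10, rng=None):
        self.capacity = capacity          # assumed bound on list size
        self.stats = {}                   # peer -> [successes, attempts]
        self.rng = rng or random.Random()

    def _success_rate(self, peer):
        successes, attempts = self.stats[peer]
        return successes / attempts if attempts else 0.0

    def ranked(self):
        """Shortcuts ordered by success rate, best first."""
        return sorted(self.stats, key=self._success_rate, reverse=True)

    def add_from_results(self, responders):
        """Pick one responding peer at random and remember it as a shortcut."""
        if not responders:
            return None
        peer = self.rng.choice(sorted(responders))
        if peer not in self.stats:
            if len(self.stats) >= self.capacity:
                del self.stats[self.ranked()[-1]]  # evict the worst shortcut
            self.stats[peer] = [0, 0]
        return peer

    def lookup(self, item, peer_has, flood):
        """Try shortcuts first; fall back to the underlying topology."""
        for peer in self.ranked():
            self.stats[peer][1] += 1
            if peer_has(peer, item):
                self.stats[peer][0] += 1
                return peer, "shortcut"
        return self.add_from_results(flood(item)), "flood"


# Toy content placement and a stand-in for Gnutella flooding.
content = {"p1": {"a", "b"}, "p2": {"c"}}
peer_has = lambda peer, item: item in content[peer]
flood = lambda item: [p for p in content if item in content[p]]

shortcuts = ShortcutList(rng=random.Random(0))
print(shortcuts.lookup("a", peer_has, flood))  # ('p1', 'flood')
print(shortcuts.lookup("b", peer_has, flood))  # ('p1', 'shortcut')
```

Because the class only needs a `flood` callback, the same sketch works over any underlying overlay, which is the sense in which shortcuts are modular.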

Performance Evaluation
  • Performance metric
  • success rate
  • load characteristics (number of query packets
    each peer processes in the system)
  • query scope (the fraction of peers involved in
    each query)
  • minimum reply path length
  • additional state kept in each node

Methodology: query workload
  • Create traffic traces from real applications
  • Boeing firewall proxies
  • Microsoft firewall proxies
  • Passively collected web traffic between CMU and
    the Internet
  • Passively collected typical P2P traffic (KaZaA,
    ...)
  • Use exact matching rather than keyword matching
    in the simulation.
  • "song.mp3" and "my artist song.mp3" will be
    treated as different.

Methodology: underlying peer topology
  • Based on the Gnutella connectivity graph in 2001,
    with 95% of nodes about 7 hops away.
  • Search TTL is set to 7.
  • For each kind of traffic (Boeing, Microsoft,
    etc.), run 8 simulations of 1 hour each.

Simulation Results: success rate
Simulation Results: load and path length
-- Query load for Boeing and Microsoft traffic
-- Average path length of the traces
Increase Number of Shortcuts
Enhancement of Interest-based Locality
Using Shortcuts' Shortcuts
Enhancement of Interest-based Locality
  • Idea

Add the shortcut's shortcuts
Performance gain of 7% on average
Interest-based Structures
  • When viewed as an undirected graph:
  • In the first 10 minutes, there are many connected
    components, each with a few peers.
  • At the end of the simulation, there are few
    connected components, each with several hundred
    peers. Each component is well connected.
  • The clustering coefficient is about 0.6 to 0.7,
    which is higher than that of the Web graph.

Sensitivity of Shortcuts
  • Run Interest based shortcuts over DHT (Chord)
    instead of Gnutella.

Query load is reduced by a factor of 2 to 4. Query
scope is reduced from 7/N to 1.5/N.
  • Interest-based shortcuts are modular and act as
    performance enhancement hints over existing P2P
    systems.
  • Shortcuts are shown to enhance search.
  • Shortcuts form clusters within a P2P topology,
    and the clusters are well connected.

Replication Strategies in Unstructured
Peer-to-Peer Networks
  • Edith Cohen
  • AT&T Labs-Research
  • Scott Shenker
  • ICIR
(Replication in) P2P architectures
  • No proactive replication (Gnutella)
  • Hosts store and serve only what they requested
  • A copy can be found only by probing a host with a
    copy

Question: how can replication be used to improve
search efficiency in unstructured networks with a
proactive replication mechanism?
Search and replication model

Unstructured networks with replication of keys or
copies. Peers probed (in the search and
replication process) are unrelated to the query/item.
  • Search: probe hosts, uniformly at random, until
    the query is satisfied (or the maximum search
    size is reached).
  • Replication: each host can store up to r copies
    of items.

Goal: minimize the average search size (number of
probes until the query is satisfied).
Search size
  • What is the search size of a query?
  • Soluble queries: number of probes until an answer
    is found.
  • We look at the Expected Search Size (ESS) of
    each item. The ESS is inversely proportional to
    the fraction of peers with a copy of the item.

Search Example
  • 2 probes

4 probes
Expected Search Size (ESS)
  • m items with relative query rates
  • $q_1 > q_2 > q_3 > \dots > q_m$, $\sum_i q_i = 1$
  • n nodes, capacity r per node, $R = n r$
  • $r_i$: number of copies of the ith item
  • Allocation: $p_1 (= r_1/R), p_2, p_3, \dots, p_m$,
    $\sum_i p_i = 1$
  • The ith item is allocated a $p_i$ fraction of
    the storage
  • Search size for the ith item is a Geometric r.v.
    with mean $A_i = 1/(r p_i)$
  • ESS is $\sum_i q_i A_i = (\sum_i q_i / p_i)/r$
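
The geometric search-size model above can be checked numerically. The sketch below assumes each probe hits a host holding item i with probability r * p_i (at most one copy of an item per host); the function names, seed, and example values are illustrative.

```python
import random


def mean_search_size(p_i, r):
    """Closed-form mean of the geometric search size: 1 / (r * p_i)."""
    return 1.0 / (r * p_i)


def simulate_search_size(p_i, r, trials=20000, seed=0):
    """Average number of random probes until a host with a copy is hit."""
    rng = random.Random(seed)
    hit_prob = r * p_i  # fraction of hosts that hold a copy of the item
    total = 0
    for _ in range(trials):
        probes = 1
        while rng.random() >= hit_prob:
            probes += 1
        total += probes
    return total / trials


def expected_search_size(q, p, r):
    """ESS = (sum_i q_i / p_i) / r for query rates q and allocation p."""
    return sum(qi / pi for qi, pi in zip(q, p)) / r


print(mean_search_size(0.1, 2))      # 5.0
print(simulate_search_size(0.1, 2))  # close to 5
print(expected_search_size([0.5, 0.3, 0.2], [0.4, 0.4, 0.2], 2))
```

The simulated average approaches the closed-form mean as the number of trials grows, which is what makes ESS a convenient objective to reason about.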

Uniform and Proportional Replication
  • Two natural strategies
  • Uniform Allocation: $p_i = 1/m$
  • Simple, resources are divided equally
  • Proportional Allocation: $p_i = q_i$
  • Fair, resources per item proportional to demand
  • Reflects current P2P practices

Basic Questions
  • How do Uniform and Proportional allocations
    perform/compare ?
  • Which strategy minimizes the Expected Search Size
    (ESS) ?
  • Is there a simple protocol that achieves optimal
    replication in decentralized unstructured
    networks ?

ESS under Uniform and Proportional Allocations
(soluble queries)
  • Lemma: the ESS under either Uniform or
    Proportional allocation is m/r
  • Independent of query rates (!!!)
  • Same ESS for Proportional and Uniform (!!!)
  • Proof:

Proportional ($p_i = q_i$): ESS is
$(\sum_i q_i / p_i)/r = (\sum_i q_i / q_i)/r = m/r$
Uniform ($p_i = (R/m)/R = 1/m$): ESS is
$(\sum_i q_i / p_i)/r = (\sum_i m q_i)/r = (m/r) \sum_i q_i = m/r$
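
The lemma can be verified numerically for any query-rate vector. The sketch below uses Zipf-like rates purely as an example; `zipf_rates`, the exponent, and the chosen m and r are assumptions.

```python
def zipf_rates(m, alpha=1.0):
    """Normalized query rates q_i proportional to 1 / i**alpha."""
    weights = [1.0 / (i**alpha) for i in range(1, m + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_search_size(q, p, r):
    """ESS = (sum_i q_i / p_i) / r."""
    return sum(qi / pi for qi, pi in zip(q, p)) / r


def uniform_allocation(q):
    return [1.0 / len(q)] * len(q)


def proportional_allocation(q):
    return list(q)


m, r = 50, 4
q = zipf_rates(m)
print(expected_search_size(q, uniform_allocation(q), r))       # about m/r
print(expected_search_size(q, proportional_allocation(q), r))  # about m/r
print(m / r)                                                    # 12.5
```

Changing `alpha` changes how skewed the demand is, but both allocations still give m/r, as the lemma states.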
Space of Possible Allocations
  • Definition: an allocation $p_1, p_2, p_3, \dots, p_m$
    is in-between Uniform and Proportional if,
    for $1 \le i < m$,
    $q_{i+1}/q_i < p_{i+1}/p_i < 1$
  • Theorem 1: all (strictly) in-between strategies
    are (strictly) better than Uniform and
    Proportional

Theorem 2: p is worse than Uniform/Proportional if
for all i, $p_{i+1}/p_i > 1$ (more popular gets
less) OR for all i, $q_{i+1}/q_i > p_{i+1}/p_i$ (less
popular gets less than fair share)
So, what is the best strategy for soluble queries?
Square-Root Allocation
  • $p_i$ is proportional to $\sqrt{q_i}$
  • Lies in-between Uniform and Proportional
  • Theorem: Square-Root allocation minimizes the ESS
    (on soluble queries)
  • Minimize $\sum_i q_i / p_i$ such that $\sum_i p_i = 1$
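
The minimization above has a short justification: setting the derivative of the Lagrangian to zero gives q_i / p_i^2 equal to the same constant for every i, so p_i is proportional to sqrt(q_i). The sketch below computes that allocation and compares it against Uniform, Proportional, and random allocations; the example rates, seed, and helper names are assumptions.

```python
import math
import random


def expected_search_size(q, p, r):
    """ESS = (sum_i q_i / p_i) / r."""
    return sum(qi / pi for qi, pi in zip(q, p)) / r


def square_root_allocation(q):
    """p_i proportional to sqrt(q_i), normalized to sum to 1."""
    roots = [math.sqrt(qi) for qi in q]
    total = sum(roots)
    return [x / total for x in roots]


def random_allocation(m, rng):
    """A random point on the simplex, used only as a comparison baseline."""
    weights = [rng.random() + 1e-9 for _ in range(m)]
    total = sum(weights)
    return [w / total for w in weights]


q = [0.6, 0.25, 0.1, 0.05]
r = 2
sr = square_root_allocation(q)
ess_sr = expected_search_size(q, sr, r)
ess_uniform = expected_search_size(q, [0.25] * 4, r)
print(round(ess_sr, 4), round(ess_uniform, 4))

rng = random.Random(1)
best_random = min(
    expected_search_size(q, random_allocation(len(q), rng), r)
    for _ in range(1000)
)
print(best_random >= ess_sr)  # True
```

Under Square-Root allocation the ESS equals (sum_i sqrt(q_i))^2 / r, which is never larger than m/r.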

How much can we gain by using SR ?
Zipf-like query rates
Replication Algorithms
  • Uniform and Proportional are easy:
  • Uniform: when an item is created, replicate its
    key in a fixed number of hosts.
  • Proportional: for each query, replicate the key
    in a fixed number of hosts.

Desired properties of algorithm
  • Fully distributed, where peers communicate
    through random probes; minimal bookkeeping; and
    no more communication than what is needed for
    search.
  • Converge to/obtain the SR allocation when query
    rates remain steady.

Model for Copy Creation/Deletion
  • Creation: after a successful search, C(s) new
    copies are created at random hosts.
  • Deletion: independent of the identity of the
    item; copy survival chances are non-decreasing
    with creation time (i.e., FIFO at each node).

Creation/Deletion Process
  • If

SR Replication Algorithms
  • Path replication: the number of new copies C(s)
    is proportional to the size of the search.
  • Probe memory: each peer records the number and
    combined search size of probes it sees for each
    item. C(s) is determined by collecting this info
    from a number of peers proportional to the
    search size.
  • Requires extra communication (proportional to
    that needed for search).

Path Replication
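  • Number of new copies produced per query,
    $\langle C_i \rangle$, is proportional to the
    search size $1/p_i$
  • Creation rate is proportional to
    $q_i \langle C_i \rangle$
  • Steady state: creation rate is proportional to
    the allocation $p_i$, thus $p_i \propto q_i / p_i$,
    i.e. $p_i \propto \sqrt{q_i}$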
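
The steady-state argument above can be illustrated with a toy discrete-time model. This is an assumption-laden sketch, not the paper's protocol: in each step a fixed fraction (`turnover`) of all copies is deleted regardless of item, and the freed space is refilled in proportion to each item's creation rate q_i / p_i.

```python
import math


def square_root_allocation(q):
    """Target allocation: p_i proportional to sqrt(q_i)."""
    roots = [math.sqrt(qi) for qi in q]
    total = sum(roots)
    return [x / total for x in roots]


def path_replication_steady_state(q, steps=500, turnover=0.1):
    """Iterate creation/deletion until the allocation settles."""
    m = len(q)
    p = [1.0 / m] * m  # start from the Uniform allocation
    for _ in range(steps):
        # Creation rate of item i: query rate times copies per query,
        # where copies per query are proportional to search size 1 / p_i.
        created = [qi / pi for qi, pi in zip(q, p)]
        total = sum(created)
        # Item-independent deletion frees `turnover` of the space, which
        # new copies fill in proportion to their creation rates.
        p = [
            (1 - turnover) * pi + turnover * ci / total
            for pi, ci in zip(p, created)
        ]
    return p


q = [0.6, 0.3, 0.1]
steady = path_replication_steady_state(q)
target = square_root_allocation(q)
print([round(x, 4) for x in steady])
print([round(x, 4) for x in target])
```

In this model the two printed allocations agree to several decimal places, even though no peer ever computes a square root.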

  • Random search/replication model: probes to
    random hosts
  • Soluble queries:
  • Proportional and Uniform allocations are two
    extremes with the same average performance
  • Square-Root allocation minimizes the Average
    Search Size
  • OPT (all queries) lies between SR and Uniform
  • SR/OPT allocation can be realized by simple
    algorithms