PageRank

PageRank preview image

2 collaborators

Uri_dolphin3 Uri Wilensky (Author)

Tags

computer science 

Tagged by Reuven M. Lerner about 11 years ago

Model group CCL | Visible to everyone | Changeable by group members (CCL)
Model was written in NetLogo 5.0.4 • Viewed 1953 times • Downloaded 103 times • Run 1 time
Download the 'PageRank' modelDownload this modelEmbed this model

Do you have questions or comments about this model? Ask them here! (You'll first need to log in.)


WHAT IS IT?

PageRank is an algorithm/metric that was developed at Stanford University by Larry Page and Sergey Brin, who went on to create the Google search engine (and company) based on this method. PageRank is a technique for ranking the relevancy of web pages on the internet, through analysis of the hyperlink structure that links pages together.

This model demonstrates two distinct (though related) agent-based methods for calculating the PageRank of interconnected web pages. The use of an agent-based perspective attempts to provide a deeper understanding of this algorithm and the mathematics behind it.

Early web search engines often focused on the content of web pages, such as matching appropriate keywords or words found in the text of the page. PageRank, on the other hand, is interested in ranking sites based on their general usefulness in the web (apart from any specific search query or topic). Because Google uses PageRank as one component of its immensely popular internet search engine, it is easy to mistakenly call PageRank a search algorithm. However, it is technically a ranking algorithm, which provides importance weights for each page in a network. However, these rankings turn out to be very useful when performing an internet search, because they can be used to help determine the order in which search results are displayed to the user. Suppose someone searches for "emu" -- there are millions of web sites which contain the word "emu". Therefore, it is important to figure out which of those sites are more likely to provide the user with useful information.

HOW IT WORKS

PageRank depends heavily on one basic premise about the structure of the world wide web: Web pages that are more useful to people will also be more popular, and will accordingly have more hyperlinks pointing to them from other web pages.

If this is true, a very simple approach to figure out which sites are most useful/important would be to count the number of incoming links to each page, and use that as a ranking score. However, this would be assuming that every link counts equally, which is quite wrong. A single link from a important web site (e.g. from yahoo.com or whitehouse.gov) should count for much more than a link from some little-known page that presumably no one seems interested in. Thus, a page is important if many (and/or important) pages link to it. This appears to be a rather circular definition of importance, and begs the question: how can we tell which pages are important to begin with?

PageRank handles this problem by initially ranking all pages as equally important, but then it repeatedly performs a process on the rank scores of the pages that will cause the importance rankings to change. This PageRank NetLogo model presents two different ways of calculating PageRank, both of which would eventually converge to the exact same rankings being assigned to each web site, if you could let the algorithm run forever.

Method 1: The "random-surfer" approach

Imagine that you have a small army of robotic random web surfers. All they do is surf the web, going from one page to another, by randomly clicking on hyperlink after hyperlink. They don't actually read the web pages, and they spend the same amount of time (perhaps 1 millisecond) at each page before moving on to a new page. Occasionally instead of following a link, they choose to jump to a new page somewhere on the internet, chosen entirely at random. (How often they do this is based on the DAMPING-FACTOR parameter) They will also do a random jump if they reach a dead-end web page that has no outgoing links. But otherwise they are following links, and because we assume that links are more likely to lead to more important web sites, these random surfer robots are likely to spend more time at important pages than at the unimportant ones (which will have few incoming links). For each web page, suppose you count up the number of times that some random surfer visited that page, then divide that number by the total number of pages that all the random surfers visited. What you have calculated is the PageRank for that web site. (In more formal mathematical terminology, the resulting PageRanks can be viewed as the stationary distribution for a certain "Markov chain", where each page is a state, and there are transitional probabilities specified between each pair of states based on the hyperlinks between them).

Method 2: The "diffusion" approach

In the previous approach, our primary agents were the robotic web surfers, while the web pages themselves were mostly passive agents, simply acting as counters, incrementing a number each time a robot browsed them. In the "diffusion" approach, the web pages themselves are the central agents we are concerned with. Each web page starts with some RANK value, which is a measure of how important it is in the network. Initially, every page gets the same RANK value as every other page, and the sum of all the pages RANK values is 1. Then, in each time step, every web page distributes its RANK value (importance) to those web sites that it has outgoing hyperlinks to. Each page's new RANK value will thus be based on how much rank it receives from each of the sites that link to it, combined in a weighted average with a baseline amount of RANK value which each website gets each time step regardless of its neighbors. (The weight of the baseline amount is determined by the DAMPING-FACTOR parameter.) Over time, this process causes the RANK values of each page to converge to the actual PageRank values for each page. (In more formal mathematical terminology, this method is similar to using the "power method" for finding the principal eigenvector associated with a modified adjacency matrix of the directed hyperlink graph.)

HOW TO USE IT

First, you can decide what hyperlink network you would like to calculate the PageRank algorithm for, using the NETWORK-CHOICE chooser. Choices include two simple example networks, or a larger network that is created using a "preferential attachment" algorithm. The preferential attachment mechanism creates networks with scale-free degree distributions, which is one characteristic found to be true of the real world wide web. (For more information, see the Preferential Attachment model in the Models Library).

Then press SETUP to create the network.

Press GO to run the model, and watch as the PageRank is calculated. The area of each node is roughly proportional to its PageRank value, and you can see the PageRank numbers if the SHOW-PAGE-RANKS? switch is turned ON.

The DAMPING-FACTOR slider controls a parameter of the PageRank algorithm that affects how much the rank values are affected purely by the link structure. With DAMPING-FACTOR = 0, the link structure is completely damped and doesn't matter at all, and all pages will receive equal ranks regardless of who is linked to whom. With DAMPING-FACTOR = 1, there is no damping of the effect of link structure, meaning that pages with no inbound hyperlinks will receive 0 for PageRanks.

For the "random-surfer" method, the DAMPING-FACTOR controls the probability that a surfer robot will follow a link, as opposed to jumping randomly to a new page in the web. For the "diffusion" method, the DAMPING-FACTOR controls what fraction of a page's RANK value is determined by incoming links (as opposed to being given out gratis).

The CALCULATION-METHOD chooser controls whether the model will use the "random-surfer" or "diffusion" method to calculate the PageRank.

If the "random-surfer" method is chosen, the NUMBER-OF-SURFERS slider can be adjusted to change the number of surfing robots that are wandering the web. If the WATCH-SURFERS? switch is ON, then you can watch the surfer robots move around, and each time a random surfer robot follows a hyperlink, that link will be colored the same color as the random surfer that just followed it, to help you visually follow the movement of the surfers. If the WATCH-SURFERS? switch is OFF, then you will only see the PageRank values (and node sizes) adjusting over time, because the surfers are hidden.

Note: you may want to use the speed slider to slow down the model, so you can better examine what's happening.

THINGS TO NOTICE

In the "Example 1" network, five of the pages have no inbound links, meaning that nobody else in the network links to them. And yet, they usually still end up with a positive PageRank score. Why is this? Is there any scenario where they would end up with zero for their PageRank score?

Which calculation method ("random-surfer" or "diffusion") converges more quickly? Is one tick of the diffusion method comparable to one tick of the random-surfer method?

Which calculation method do you think is more amenable to being extended (with the goal of developing a ranking algorithm that provides better relevancy than PageRank does)? Why?

Is there an advantage to using more than one robot at a time when using the "random-surfer" method? Do you think this algorithm could actually be run in parallel on very large networks? Why or why not?

THINGS TO TRY

It is fairly common for the damping factor for the PageRank algorithm to be set somewhere near 0.85. Why do you think this is? Why do you think the creators of the algorithm included a damping factor? (What happens to the rankings if the damping factor is very low, or very high?)

When the damping-factor is set to 1.0, for some network configurations, the "diffusion" method and the "random-surfer" methods can arrive at different resulting PageRanks, and will never converge to the same thing, regardless of how long you let them run. Why?

The two example networks (Example 1 and Example 2) are the same every time, but the "Preferential Attachment" network is randomly generated, so each time you set it up, it is very likely to be different. How large is the PageRank of the most important page when you use the Preferential Attachment network? How much does this fluctuate between when the network changes?

EXTENDING THE MODEL

How does PageRank work if the whole network is not connected? What if there are 2 (or more) separate components of the network, that can never reach each other by browsing through hyperlinks? Extend this model by designing additional network configurations for the user to try out with the NETWORK-CHOICE chooser.

The random-surfer method offers an oversimplified view of how someone might surf the internet. Given only the structural information about the network (and not any content, such as topics, keywords, etc), can you come up with any more realistic behavior for a surfer to follow? Do you think this would produce better or worse relevancy measurements than the PageRank algorithm?

There are now many people who make a living by doing "search engine optimization" (SEO), to help improve the visibility of company web sites. While some of the methods used to improve search engine scores are completely legitimate and legal, people will sometimes engage in more ethically questionable practices (so-called "black hat SEO"), essentially trying to "game the system" and artificially increase the PageRank of their sites. Google (and other search engines) must spend considerable effort trying to counter tactics for unfairly manipulating search results. How might you extend this model to consider some attempts at "gaming" the PageRank algorithm (such as creating new nodes or links). You could also measure the effectiveness of these manipulations, and consider counter-measures to prevent them.

NETLOGO FEATURES

While NetLogo has a built-in diffuse primitive that can be used with patches, there is no equivalent primitive for diffusing a value through a network. However, it is not too difficult to write with NetLogo code, but we need to be careful to make sure everything updates at the same time, so that the total sum of PageRank values across the network remains constant. This can be accomplished by having two turtles-own variables rank and new-rank. First we compute the new-rank for each of the pages based on the old rank variable, before updating the rank variable of all the turtles using the new-rank value. (You'll also see this kind of "synchronous updating" in cellular automata models such as "Life", as well as many other models.)

RELATED MODELS

Preferential Attachment, Diffusion on a Directed Network, Link Walking Turtles Example (Code Example)

CREDITS AND REFERENCES

The network configurations given in "Example 1" (with DAMPING-FACTOR 0.85) and "Example 2" (with DAMPING-FACTOR 1.0) are the same as the examples given in the figures of http://en.wikipedia.org/wiki/PageRank (as of January 2009).

See also: Page et al. (1998) "The PageRank Citation Ranking: Bringing Order to the Web." Technical report, Stanford Digital Library Technologies Project.

HOW TO CITE

If you mention this model in a publication, we ask that you include these citations for the model itself and for the NetLogo software:

  • Stonedahl, F. and Wilensky, U. (2009). NetLogo PageRank model. http://ccl.northwestern.edu/netlogo/models/PageRank. Center for Connected Learning and Computer-Based Modeling, Northwestern Institute on Complex Systems, Northwestern University, Evanston, IL.
  • Wilensky, U. (1999). NetLogo. http://ccl.northwestern.edu/netlogo/. Center for Connected Learning and Computer-Based Modeling, Northwestern Institute on Complex Systems, Northwestern University, Evanston, IL.

COPYRIGHT AND LICENSE

Copyright 2009 Uri Wilensky.

CC BY-NC-SA 3.0

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.

Commercial licenses are also available. To inquire about commercial licenses, please contact Uri Wilensky at uri@northwestern.edu.

Comments and Questions

Is PageRank a Fuzzy Cognitive Network? (Question)

Is the PageRank model essentially the same as a fuzzy cognitive network as describe by Bart Kasko, http://sipi.usc.edu/~kosko/FCM.pdf That paper introduces a graphical way to show relations between connected nodes based on matrix algebra. I wonder if the movement of turtles in a similarly connected matrix of nodes would eventually settle down into the same solution in a sort of dynamic equilibrium effect? Is that what you have done with your Internet PageRank model? If not, can you make a Fuzzy Cognitive map of one of Kasko's examples using NetLogo? Kasko envisioned a mathematical solution to the problem of showing the relative changes in the importance of the connected nodes given a disturbance of the system. I just thought it would be more visually interesting to watch the ants travel to the different nodes and their final number at each node would show the same relative importance. Bye for now, George gDombi@chm.uri.edu

Posted over 10 years ago

Click to Run Model

breed [ pages page ]
breed [ surfers surfer ]

pages-own [
  rank new-rank ; for the diffusion approach
  visits ; for the random-surfer approach
]

surfers-own [ current-page ]

globals [ total-rank max-rank ]

;;
;; Setup Procedures
;;

to setup
  clear-all
  set-default-shape pages "circle"

  ifelse network-choice = "Example 1"
  [ create-network-example-1 ][
    ifelse network-choice = "Example 2"
    [ create-network-example-2 ][
      ifelse network-choice = "Preferential Attachment"
      [ create-network-preferential 100 2 ]
      [ user-message word "Error: unknown network-choice: " network-choice ] ] ]

  ask patches [ set pcolor white ]
  ask pages
  [ set rank 1 / count pages ]
  update-globals
  ask pages
  [
    setxy random-xcor random-ycor
    set label-color black
    update-page-appearance
  ]

  repeat 300 [ do-layout ]

  ask links [ set shape "curved" ]
  reset-ticks
end 

to create-network-example-1
  create-pages 11
  ask page 0 [ set color blue create-link-from page 3 ]
  ask page 1 [ set color red create-links-from (turtle-set page 2 page 3 page 4 page 5 page 6 page 7 page 8 ) ]
  ask page 2 [ set color orange create-link-from page 1 ]
  ask page 3 [ set color green create-link-from page 4 ]
  ask page 4 [ set color yellow create-links-from (turtle-set page 5 page 6 page 7 page 8 page 9 page 10) ]
  ask page 5 [ set color green create-link-from page 4 ]
  ask pages with [who > 5] [ set color violet ]
end 

to create-network-example-2
  create-pages 8
  ask page 0 [ die ]
  ask page 1 [ create-links-from (turtle-set page 2 page 3 page 5 page 6) ]
  ask page 2 [ create-links-from (turtle-set page 1 page 3 page 4) ]
  ask page 3 [ create-links-from (turtle-set page 1 page 4 page 5) ]
  ask page 4 [ create-links-from (turtle-set page 1 page 5) ]
  ask page 5 [ create-links-from (turtle-set page 1 page 4 page 6 page 7) ]
  ask page 6 [ create-links-from (turtle-set page 5) ]
  ask page 7 [ create-links-from (turtle-set page 1) ]
end 

to create-network-preferential [ n k ]
  create-pages n [ set color sky ]
  link-preferentially pages k
end 

; The parameter k (always an integer) gives the number of edges to add at
; each step (e.g. k=1 builds a tree)

to link-preferentially [nodeset k]
  ;; get the nodes in sorted order
  let node-list sort nodeset

  ;; get a sublist of the nodes from 0 to k
  let neighbor-choice-list sublist node-list 0 k

  ;; ask the kth node...
  ask item k node-list
  [
    ;; to make a link either to or from each preceding
    ;; node in the sorted list.
    foreach neighbor-choice-list
    [
      ifelse random 2 = 0
      [ create-link-to ? ]
      [ create-link-from ? ]
    ]
    ;; add k copies of this node to the beginning of the sublist
    set neighbor-choice-list sentence (n-values k [self]) neighbor-choice-list
  ]

  ;; ask each node after the kth node in order...
  foreach sublist node-list (k + 1) (length node-list)
  [
    ask ?
    [
      ;; ...to make k links
      let temp-neighbor-list neighbor-choice-list
      repeat k
      [
        ;; link to one of the nodes in the neighbor list
        ;; we remove that node from the list once it's been linked to
        ;; however, there may be more than one copy of some nodes
        ;; since those nodes have a higher probability of being linked to
        let neighbor one-of temp-neighbor-list
        set temp-neighbor-list remove neighbor temp-neighbor-list
        ;; when we've linked to a node put another copy of it on the
        ;; master neighbor choice list as it's now more likely to be
        ;; linked to again
        set neighbor-choice-list fput neighbor neighbor-choice-list
        ifelse random 2 = 0
        [ create-link-to neighbor ]
        [ create-link-from neighbor ]
      ]
      set neighbor-choice-list sentence (n-values k [self]) neighbor-choice-list
    ]
  ]
end 

to do-layout
  layout-spring pages links 0.2 20 / (sqrt count pages) 0.5
end 

;;
;; Runtime Procedures
;;

to go
  ifelse calculation-method = "diffusion"
  [
    if any? surfers [ ask surfers [ die ] ] ;; remove surfers if the calculation-method is changed

    ;; return links and pages to initial state
    ask links [ set color gray set thickness 0 ]
    ask pages [ set new-rank 0 ]

    ask pages
    [
      ifelse any? out-link-neighbors
      [
        ;; if a node has any out-links divide current rank
        ;; equally among them.
        let rank-increment rank / count out-link-neighbors
        ask out-link-neighbors [
          set new-rank new-rank + rank-increment
        ]
      ]
      [
        ;; if a node has no out-links divide current
        ;; rank equally among all the nodes
        let rank-increment rank / count pages
        ask pages [
          set new-rank new-rank + rank-increment
        ]
      ]
    ]

    ask pages
    [
      ;; set current rank to the new-rank and take the damping-factor into account
      set rank (1 - damping-factor) / count pages + damping-factor * new-rank
    ]
  ]
  [ ;;; "random-surfer" calculation-method
    ; surfers are created or destroyed on the fly if users move the
    ; NUMBER-OF-SURFERS slider while the model is running.
    if count surfers < number-of-surfers
    [
      create-surfers number-of-surfers - count surfers
      [
        set current-page one-of pages
        ifelse watch-surfers?
        [ move-surfer ]
        [ hide-turtle ]
      ]
    ]
    if count surfers > number-of-surfers
    [
      ask n-of (count surfers - number-of-surfers) surfers
        [ die ]
    ]
    ;; return links to their initial state
    ask links [ set color gray set thickness 0 ]

    ask surfers [
      let old-page current-page
      ;; increment the visits on the page we're on
      ask current-page [ set visits visits + 1 ]
      ;; with a probability depending on the damping-factor either go to a
      ;; random page or a random one of the pages that this page is linked to
      ifelse random-float 1.0 <= damping-factor and any? [my-out-links] of current-page
      [ set current-page one-of [out-link-neighbors] of current-page ]
      [ set current-page one-of pages ]

      ;; update the visualization
      ifelse watch-surfers?
      [
        show-turtle
        move-surfer
        let surfer-color color
        ask old-page [
          let traveled-link out-link-to [current-page] of myself
          if traveled-link != nobody [
            ask traveled-link [ set color surfer-color set thickness 0.08 ]
          ]
        ]
      ]
      [ hide-turtle ]
    ]
    ;; update the rank of each page
    let total-visits sum [visits] of pages
    ask pages [
      set rank visits / total-visits
    ]
  ]

  update-globals
  ask pages [ update-page-appearance ]
  tick
end 

to move-surfer ;; surfer procedure
  face current-page
  move-to current-page
end 

to update-globals
  set total-rank sum [rank] of pages
  set max-rank max [rank] of pages
end 

to update-page-appearance ;; page procedure
  ; keep size between 0.1 and 5.0
  set size 0.2 + 4 * sqrt (rank / total-rank)
  ifelse show-page-ranks?
  [ set label word (precision rank 3) "     " ]
  [ set label "" ]
end 


; Copyright 2009 Uri Wilensky.
; See Info tab for full copyright and license.

There are 10 versions of this model.

Uploaded by When Description Download
Uri Wilensky over 11 years ago Updated to NetLogo 5.0.4 Download this version
Uri Wilensky almost 12 years ago Updated version tag Download this version
Uri Wilensky almost 12 years ago Updated to version from NetLogo 5.0.3 distribution Download this version
Uri Wilensky over 12 years ago Updated to NetLogo 5.0 Download this version
Uri Wilensky about 14 years ago Updated from NetLogo 4.1 Download this version
Uri Wilensky about 14 years ago Updated from NetLogo 4.1 Download this version
Uri Wilensky about 14 years ago Updated from NetLogo 4.1 Download this version
Uri Wilensky about 14 years ago Updated from NetLogo 4.1 Download this version
Uri Wilensky over 14 years ago Model from NetLogo distribution Download this version
Uri Wilensky over 14 years ago PageRank Download this version

Attached files

File Type Description Last updated
PageRank.png preview Preview for 'PageRank' over 11 years ago, by Uri Wilensky Download

This model does not have any ancestors.

This model does not have any descendants.