<<DiscreteMath`GraphPlot` ;

Some Preliminary Functions

How Google Works

Our goal is the following.  Given a universe of S webpages, we'd like to rank the importance of each webpage based only on the geometry of the connections in the universe (and not on content).  Toward this end we'll label each webpage with a number from 1 to S, and our goal is to find a vector

v = (v_1, v_2, . . ., v_S)^T

where the entry v_i gives the importance of the ith webpage.  We can then rank webpages according to the entries v_i.

Depicting web connections by sparse matrices

Let's see an example of a web universe.  The following function will generate a random web universe with S pages.  An arrow from page i to page j means that there is a link to page j from page i.  (The one exception is if a page links to itself, in which case you can't see the arrow. Sorry!)

S = 10 ;

A = RandomSparseMatrix[S, S, 2, False] ;

coord = GraphCoordinates[A] ;

labels = VertexList[A] ;

esf[i_, j_] := Block[{}, {Blue, Line[{i, j}], Arrow[coord[[j]], coord[[j]] + 0.95 (coord[[i]] - coord[[j]]), HeadLength -> 0.03]}] ;

[Graphics: directed graph of the random 10-page web universe (HTMLFiles/index_11.gif)]

We can capture all the information in this graph with an adjacency matrix.  The corresponding matrix is SxS and has an entry 1 in position (i,j) if there is a link to page i from page j, and a zero otherwise.  (With this convention, row i records the links coming into page i.)

A//MatrixForm

We could rank pages based solely on the number of links into that page.  If this were our method, which page would rank first?  Don't forget that a page might link to itself! One counts the number of links into page i by summing across row i.  We can collect all these sums together by multiplying by the vector with a 1 in every position.  We'll call this vector u.

one[s_, t_] := 1. ;

u = Array[one, {S, 1}] ;

A . u

( {{3.}, {2.}, {1.}, {3.}, {2.}, {0.}, {2.}, {3.}, {3.}, {1.}} )
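If you'd like to check this row-summing trick outside Mathematica, here is a minimal Python sketch using a small hand-made adjacency matrix (the 4-page universe and its links are hypothetical, not the notebook's random example):

```python
# Hypothetical 4-page universe.  A[i][j] = 1 means page j+1 links to
# page i+1, matching the convention used in the ranking computation.
A = [
    [0, 1, 1, 0],   # pages 2 and 3 link to page 1
    [1, 0, 0, 1],   # pages 1 and 4 link to page 2
    [0, 0, 1, 0],   # page 3 links to itself
    [1, 1, 1, 0],   # pages 1, 2, and 3 link to page 4
]
S = len(A)

# Multiplying by the all-ones vector u sums across each row,
# which counts the links *into* each page.
u = [1.0] * S
in_links = [sum(A[i][j] * u[j] for j in range(S)) for i in range(S)]
print(in_links)  # [2.0, 2.0, 1.0, 3.0]
```

Under this (hypothetical) link structure, page 4 would rank first with three links in.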

Improving the ranking method

This method really isn't a bad measure of a page's importance, but it means that a page which recommends everyone and their brother has more clout in determining webpage worth than a page which recommends very few webpages.  Shorter lists presumably contain more refined (and important and selective) information, and so we want to weight the value of a link from a page in inverse proportion to the number of links it contains.  So if page i contains n_i links, we'll weight each link with a value of 1/n_i.  (The only problem is that if a page links nowhere then this formula has us dividing by 0.  To fix this problem, Google pretends that a page which has no links out actually has a link to every webpage in the universe.)  So the importance of a webpage is measured by

v_i = a_i1/n_1 + a_i2/n_2 + . . . + a_iS/n_S

How can we compute this number?  If we take our adjacency matrix and multiply column i by 1/n_i (and replace a zero column appropriately), we can just sum across each row to compute these values.   I made a function which turns an adjacency matrix into a matrix of the desired form.  Notice that it turns our adjacency matrix into a transition matrix (which is why I've named it in this silly way).
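The Transitionification function itself lives among the preliminary functions at the top of the notebook; as a sketch of what it has to do, here is a minimal Python version under the same conventions (the 3-page adjacency matrix is a hypothetical example):

```python
def transitionification(A):
    """Divide each column j of A by its sum n_j (the number of links
    page j gives out); a zero column -- a page that links nowhere --
    is replaced by a column of 1/S, as if that page linked to every
    page in the universe."""
    S = len(A)
    n = [sum(A[i][j] for i in range(S)) for j in range(S)]
    return [[A[i][j] / n[j] if n[j] > 0 else 1.0 / S for j in range(S)]
            for i in range(S)]

# Hypothetical 3-page universe: page 1 links to pages 2 and 3,
# page 2 links to page 1, and page 3 links nowhere.
A = [[0, 1, 0],
     [1, 0, 0],
     [1, 0, 0]]
T = transitionification(A)
# Each column of T now sums to 1, so T is a transition matrix.
```

Note how page 3's empty column has become a column of 1/3's, exactly the "pretend it links everywhere" fix described above.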

Transitionification[A]//MatrixForm

Now the rank of the page in this scheme is supposed to be given by summing across the rows of this matrix.  To sum across rows we'll use the previous trick:  multiplying by the vector u.  Let's do it.

Transitionification[A] . u

( {{1.2}, {1.4}, {0.7}, {1.23333}, {1.53333}, {0.2}, {0.733333}, {1.23333}, {1.23333}, {0.533333}} )

So the rank of the webpages in the new scheme is given by ordering the coordinates of this vector.

Another improvement

We've adjusted the value of a link from a page based on the number of links the page has, but we should also adjust the weight of a recommendation by the importance of the page.  So whatever our value for the importance of a page (we have called these numbers v_i), we should have

v_i = (a_i1/n_1) v_1 + (a_i2/n_2) v_2 + . . . + (a_iS/n_S) v_S

This might seem weird, since we have numbers we're interested in (the v_i) and they appear on both sides of the equation.  This might have made your head spin a long time ago, but now we have the technology to handle this situation.  In particular, notice that this equation means that the vector v satisfies the property

v = ( a_11/n_1   a_12/n_2   . . .   a_1S/n_S
      a_21/n_1   a_22/n_2   . . .   a_2S/n_S
         .          .                  .
         .          .                  .
      a_S1/n_1   a_S2/n_2   . . .   a_SS/n_S ) v

Aha!  So v should be an eigenvector of our transitionification matrix with eigenvalue 1.  So to rank the pages in this new way (where we not only diminish the importance of a link based on the number of links it gives out, but also weight the value of a recommendation from a given page by the importance of the given page), we must find an eigenvector with eigenvalue 1 for the Transitionification of our adjacency matrix.  The following command will let Mathematica do the work in finding an eigenvector with eigenvalue 1.

N[Transpose[{Eigensystem[Transitionification[A]][[2, 1]]}]]

( {{2.74194}, {3.63441}, {1.45161}, {2.39785}, {3.03226}, {0.666667}, {1.56989}, {2.48387}, {2.84946}, {1.}} )

Solving some problems

There are a few problems with things as we've set them up so far.  

(1) There are lots of pages tied with 0, and even in our small universe it's likely there are ties for first.  We don't want ties, because we want a nice clean ranking.
(2) Although it was easy to find an eigenvector in this particular example, in practice it is difficult to compute an eigenvector: since S is enormous, this would mean solving a ridiculously huge system of equations, which would literally take eons to calculate.
(3) Since our matrix is only a transition matrix, there might be more than one eigenvector with eigenvalue 1.  In this case, how do we know which to choose to make our ranking vector?

To solve all but the first problem we'll use the Big Theorem on regular transition matrices.  To do this, we need to take our matrix and replace it with a regular transition matrix.  Google does this by shifting the transitionification matrix slightly.  It chooses a number r between 0 and 1 and then takes the matrix

r * Transitionification[A] + (1 - r)/S * ( 1   1   . . .   1
                                           1   1   . . .   1
                                           .   .           .
                                           .   .           .
                                           1   1   . . .   1 )

instead.  This will be a regular transition matrix, so there is only one eigenvector with eigenvalue 1 (up to scaling).

GoogleMatrix[A_, r_] := Module[ {B}, B = r * Transitionification[A] + (1 - r)/S * Array[one, {S, S}] ] ;

GoogleMatrix[A, .85]
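As a sketch of what GoogleMatrix computes, here is a minimal Python version (the 3x3 transition matrix below is a hypothetical example, not our random universe):

```python
def google_matrix(T, r, S):
    """r * T + (1 - r)/S * (all-ones matrix).  If T is a transition
    matrix, the result is too, and every entry is at least
    (1 - r)/S > 0, which makes the result a *regular* transition
    matrix."""
    return [[r * T[i][j] + (1 - r) / S for j in range(S)]
            for i in range(S)]

# Illustrative 3x3 transition matrix (columns sum to 1):
T = [[0.0, 1.0, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.0, 0.0]]
G = google_matrix(T, 0.85, 3)

# The columns of G still sum to 1, and every entry is positive.
for j in range(3):
    assert abs(sum(G[i][j] for i in range(3)) - 1.0) < 1e-12
```

With r = 0.85 every zero entry of T becomes 0.05 in G, which is what makes the Big Theorem apply.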

Now instead of actually computing an eigenvector corresponding to eigenvalue 1 (which would take zillions of years), Google instead uses the fact that any initial condition will eventually approach this eigenvector.  So if G is the google matrix, it just computes

G^t . u

until this vector `settles down.'  The steady state vector will be the ranking vector we want.

MatrixPower[GoogleMatrix[A, .85], 300] . u

( {{1.2131}, {1.53787}, {0.720591}, {1.09371}, {1.35184}, {0.395625}, {0.764626}, {1.12007}, {1.25054}, {0.552035}} )
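The `settling down' process is just repeated matrix-vector multiplication.  Here is a minimal Python sketch of the iteration, run on a hypothetical 3-page transition matrix rather than our random universe:

```python
def page_rank(G, S, iterations=200):
    """Repeatedly apply G to the all-ones starting vector.  By the Big
    Theorem on regular transition matrices, the iterates approach the
    unique (up to scaling) eigenvector of G with eigenvalue 1."""
    v = [1.0] * S
    for _ in range(iterations):
        v = [sum(G[i][j] * v[j] for j in range(S)) for i in range(S)]
    return v

# Hypothetical 3-page transition matrix (columns sum to 1):
T = [[0.0, 1.0, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.0, 0.0]]
G = [[0.85 * T[i][j] + 0.15 / 3 for j in range(3)] for i in range(3)]
v = page_rank(G, 3)

# v has settled down: applying G once more barely changes it.
w = [sum(G[i][j] * v[j] for j in range(3)) for i in range(3)]
assert all(abs(v[i] - w[i]) < 1e-9 for i in range(3))
```

Because G is column-stochastic, each multiplication preserves the sum of the entries of v, so the iterates stay bounded and converge to the steady state rather than blowing up or dying out.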

A bigger example (for fun)

Let's try our procedure with a bigger system.

S = 600 ;

B = RandomSparseMatrix[S, S, 4, False] ;

coord = GraphCoordinates[B] ;

labels = VertexList[B] ;

esf[i_, j_] := Block[{}, {Blue, Line[{i, j}], Arrow[coord[[j]], coord[[j]] + 0.95 (coord[[i]] - coord[[j]]), HeadLength -> 0.03]}] ;

[Graphics: directed graph of the random 600-page web universe (HTMLFiles/index_40.gif)]

Isn't it just beautiful?  Ok, let's try to compute the PageRank for this system in two ways.  The first way will calculate an eigenvector by actually row reducing an appropriate matrix.  I won't bother to give the ranking to you, but I will tell you how long it takes Mathematica to run the calculation.

AbsoluteTiming[N[Transpose[{Eigensystem[GoogleMatrix[B, .85]][[2, 1]]}]]][[1]]

82.8691600 Second

Instead of finding the eigenvector by row reducing, we'll use the trick that any initial condition will eventually lead us to the desired ranking (it will, in general, be a scalar multiple of the magical vector provided by the Big Theorem).  In our particular case, I'll see how long it takes Mathematica to compute a high power of the GoogleMatrix acting on the u vector from before.  You'll notice it's a lot quicker than the previous calculation.

u = Array[one, {S, 1}] ;

G = GoogleMatrix[B, .85] ;

AbsoluteTiming[MatrixPower[G, 200] . u][[1]]

8.4521536 Second

The clever student will notice that this is backwards from how we computed large powers of matrices before:  we used to compute eigenvectors in order to calculate high powers of matrices, and now we're using a high power of a matrix to find an eigenvector.  The reason that we can take the latter approach in this case is that the Transitionification of the adjacency matrix is `sparse,' and so multiplying by this matrix (or by the GoogleMatrix, which is a simple shift of the Transitionification of our adjacency matrix) happens much faster computationally than standard matrix multiplication.
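To see concretely why sparsity helps, note that G . v can be computed without ever forming the dense GoogleMatrix: the all-ones part of G contributes the same number, (1 - r)/S times the sum of the entries of v, to every coordinate.  A minimal Python sketch (the column-by-column sparse format and the 3-page example are hypothetical conveniences, not Google's actual data structure):

```python
def google_multiply(T_sparse, r, S, v):
    """Compute G . v without forming the dense GoogleMatrix, using
    G . v = r*(T . v) + ((1 - r)/S * sum(v)) * (all-ones vector).
    T_sparse maps each column index j to a dict {i: T[i][j]} of that
    column's nonzero entries, so the cost is proportional to the
    number of links rather than to S^2."""
    result = [(1 - r) / S * sum(v) for _ in range(S)]
    for j, column in T_sparse.items():
        for i, t in column.items():
            result[i] += r * t * v[j]
    return result

# Hypothetical 3-page transition matrix, stored column by column:
T_sparse = {0: {1: 0.5, 2: 0.5}, 1: {0: 1.0}, 2: {0: 0.5, 1: 0.5}}
v = [1.0, 2.0, 3.0]
sparse_result = google_multiply(T_sparse, 0.85, 3, v)

# The same product computed from the dense matrices, for comparison:
T = [[0.0, 1.0, 0.5], [0.5, 0.0, 0.5], [0.5, 0.0, 0.0]]
G = [[0.85 * T[i][j] + 0.15 / 3 for j in range(3)] for i in range(3)]
dense_result = [sum(G[i][j] * v[j] for j in range(3)) for i in range(3)]
assert all(abs(sparse_result[i] - dense_result[i]) < 1e-12
           for i in range(3))
```

Iterating this sparse multiply is how one would run the power method at web scale, where storing all S^2 entries of G would be hopeless.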