How search engines crawl the web: the breadth-first crawling strategy

1. Breadth-first crawling strategy:


As we all know, most websites lay out their pages in a tree structure. In such a tree-shaped link structure, which pages are crawled first, and why those pages? A breadth-first crawling strategy follows the tree level by level: it crawls all the links on the same level first, and only after that level is complete does it move on to the links on the next level. See the following figure:
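The level-by-level order described above can be sketched with a queue. This is only a minimal illustration, not how any particular search engine implements it: the `links` dictionary is a hypothetical in-memory link tree standing in for real pages, and the page names A to H are assumptions chosen for the example.

```python
from collections import deque


def breadth_first_order(links, seed):
    """Return pages in the order a breadth-first crawl would visit them.

    links: dict mapping a page to the list of pages it links to
           (a hypothetical stand-in for fetching and parsing real pages).
    seed:  the page where the crawl starts.
    """
    visited = {seed}          # pages already queued, to avoid repeats
    queue = deque([seed])     # FIFO queue gives level-by-level order
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return order


# Hypothetical link tree: A is the root, G links down to H.
links = {
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "C": ["G"],
    "G": ["H"],
}

print(breadth_first_order(links, "A"))
# ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
```

Every page on one level (B, C, D) is visited before any page on the next level (E, F, G), which is the defining property of breadth-first crawling.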

Take K links as a set. Let R denote the PageRank a link has obtained, S the number of links that link contains, and Q whether the link participates in weight transfer, together with a damping factor. The weight a link obtains is then given by the formula:
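As a rough illustration of how these quantities can combine, the sketch below assumes the classic PageRank form, in which each of the K links passes on its R divided by its S, scaled by the damping factor. This form is an assumption and may differ from the author's formula. The function name `link_weight`, treating Q as 0 or 1, and the default damping value of 0.85 (the value conventionally used with PageRank) are likewise assumptions made for the example.

```python
def link_weight(inlinks, damping=0.85):
    """Estimate the weight one page obtains from the K links in `inlinks`.

    inlinks: list of (R, S, Q) tuples, one per link in the set, where
             R is the PageRank that link has obtained,
             S is the number of links that link contains,
             Q is 1 if it participates in weight transfer, else 0.
    damping: the damping factor (0.85 is assumed here).
    """
    transferred = sum(q * r / s for r, s, q in inlinks if s > 0)
    return (1 - damping) + damping * transferred


# Two hypothetical links: the second has Q = 0 and passes no weight.
print(link_weight([(1.0, 2, 1), (1.0, 4, 0)]))
```

Under this assumed form, a link whose Q is 0 contributes nothing regardless of how large its R is, and the `(1 - damping)` term keeps the result above zero even when no link transfers weight.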

 

 

In the figure above, when the Spider retrieves link G, the algorithm finds that page G has no value, so link G and its subordinate link H are unfortunately discarded by the Spider. As for why link G should be discarded, let us analyze.

Every search engine has its own method of calculating PageRank (here meaning page weight in general, not Google's PR), and updates it often. The Internet is practically infinite, and a huge number of new links are produced every day, so a search engine can only calculate link weights by incomplete traversal. Why is Google's PR updated only about once every three months? Why does Baidu have a big update only once or twice a month? Because search engines use an incomplete-traversal algorithm to calculate link weights. In fact, with current technology it is not difficult to update weights more frequently, since computing and storage speeds can keep up. So why not do it? Because it is not that necessary, or it has already been done but not announced. So what is incomplete-traversal link weight calculation?


You may have noticed that in my description I used "link structure" rather than "website structure". The link structure here can be made up of links to any pages, not necessarily a site's internal links. This is the ideal breadth-first crawling strategy. In actual crawling, it is not possible to be so completely breadth-first; a limited breadth-first approach is used instead, as shown below:
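One way to picture limited breadth-first crawling is a breadth-first queue that drops pages judged to have no value and stops following links past a depth limit. The sketch below is a hedged illustration only: the `is_valuable` callback, the `max_depth` parameter, and the page names are hypothetical, and a real search engine's value judgment is far more involved.

```python
from collections import deque


def limited_breadth_first_order(links, seed, is_valuable, max_depth):
    """Breadth-first crawl that skips valueless pages and limits depth.

    links:       dict mapping a page to the pages it links to (hypothetical).
    is_valuable: callback returning False for pages the crawler should drop.
    max_depth:   links of pages at this depth are not followed.
    """
    visited = {seed}
    queue = deque([(seed, 0)])   # (page, depth from the seed)
    order = []
    while queue:
        page, depth = queue.popleft()
        if not is_valuable(page):
            continue             # drop the page; its links are never followed
        order.append(page)
        if depth == max_depth:
            continue             # depth limit reached; do not go deeper
        for target in links.get(page, []):
            if target not in visited:
                visited.add(target)
                queue.append((target, depth + 1))
    return order


# Hypothetical link tree in which page G is judged to have no value.
links = {
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "C": ["G"],
    "G": ["H"],
}

print(limited_breadth_first_order(links, "A", lambda p: p != "G", max_depth=3))
# ['A', 'B', 'C', 'D', 'E', 'F']
```

Because G is dropped before its links are read, its subordinate page H is never reached either.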


2. Incomplete-traversal link weight calculation:

On the surface, a search engine's work looks simple: crawl, store, and query. But the algorithms hidden in each of these stages are very complex. A search engine crawls pages with a spider (Spider). The crawling action itself is easy to implement, but which pages to crawl, and which to crawl first, must be decided by an algorithm. Here are a few crawling algorithms:
