February 19, 2004

Homemade Web Crawl Robot :: [Search Engine]


Liang

If you find a referrer from http://blog.wespoke.com in your logs, it does not necessarily mean I visited your website; it may have been visited by my crawler :)
I was tired of working today, so to relax I coded a small crawler, which attaches my own HTTP User-Agent and Referer headers and crawls a whole website.

For example, my crawler will leave a log entry like this in your /var/log/httpd/access_log:

"luliang.dhs.org - - [19/Feb/2004:16:29:54 -0600] "GET / HTTP/1.0" 200 20945 "http://blog.wespoke.com/" "Power by LiangLu at Differential Technology""

It says the request was referred from "http://blog.wespoke.com/", but it was not. Haha, a nice promotion method.
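To see which part of the log entry is which, here is a rough Python sketch that splits the line above into fields. It assumes the server writes Apache's "combined" log format, which is what the sample looks like; the regular expression and the name parse_access_line are my own illustration, not part of any library.

```python
import re

# Assumed layout: Apache "combined" log format
# host ident user [time] "request" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_access_line(line):
    """Return the fields of one access-log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


sample = (
    'luliang.dhs.org - - [19/Feb/2004:16:29:54 -0600] "GET / HTTP/1.0" 200 20945 '
    '"http://blog.wespoke.com/" "Power by LiangLu at Differential Technology"'
)
fields = parse_access_line(sample)
print(fields["referer"])  # the Referer header the crawler sent
print(fields["agent"])    # the User-Agent header the crawler sent
```

The last two quoted fields are whatever the client chose to send, so the server has no way to tell a real click from a crawler that filled them in.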

And I set the browser type (User-Agent) to "Power by LiangLu at Differential Technology". Well, my crawler works great now.
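The idea can be sketched in a few lines of Python. This is not my original code, just a minimal illustration using only the standard library: the two header values come from the log entry above, while the function names, the same-host rule, and the max_pages cap are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen

# Header values taken from the log entry above.
USER_AGENT = "Power by LiangLu at Differential Technology"
REFERER = "http://blog.wespoke.com/"


def build_request(url):
    """Return a GET request carrying the custom User-Agent and Referer."""
    return Request(url, headers={"User-Agent": USER_AGENT, "Referer": REFERER})


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(html, base_url):
    """Resolve links against base_url and keep only those on the same host."""
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href).split("#")[0]  # drop fragments
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


def crawl(start_url, max_pages=20):
    """Breadth-first crawl of one site, capped at max_pages (an assumed limit)."""
    queue, seen = [start_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            with urlopen(build_request(url), timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to load
        queue.extend(link for link in same_site_links(html, url) if link not in seen)
    return seen
```

Calling crawl("http://example.com/") would fetch up to 20 pages of that host, and every request would show up in its access log with the Referer and User-Agent set above.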

So, what is next?

1] Write an API client that connects to Baidu.com, downloads the top 1000 search results for XXX, and crawls all of those websites.

2] Write an API client that connects to Google.com and does the same thing.

3] Just relax
:D

Posted by Liang at February 19, 2004 05:17 PM | Comments (0) | TrackBack (0)


