A program for Linux; it requires a Ruby environment.
Input is a log file whose first whitespace-separated field must be the IP address.
Output is the Google and Baidu crawler log records with the fake crawlers removed.
For example:
cat log | ruby fake_spider_filter.rb > real_spider.log
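The exact log format does not matter as long as the IP comes first; a hypothetical input line in Apache combined format might look like this (IP and timestamp made up for illustration):

123.125.71.12 - - [21/Sep/2011:10:00:00 +0800] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"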
The filtering principle is to reverse-resolve each crawler IP and check whether it resolves back to a Baidu or Google domain; a pure-Ruby variant of this check is sketched after the source code.
Source code follows:

# fake_spider_filter.rb
# spider UA token => [expected rDNS domain, cache of verified IPs, cache of fake IPs]
Spider = { "Baiduspider" => ["baidu.com", [], []],
           "Googlebot"   => ["googlebot.com", [], []] }
#Fake = open("./fake_spider_log.txt", "a")

# Known-good IPs are printed, known-bad IPs are dropped,
# and unknown IPs go through a reverse-DNS check.
def filter(factor, ip, line, spider)
  host   = factor[0]
  goodip = factor[1]
  badip  = factor[2]
  if goodip.include?(ip)
    puts line
  elsif badip.include?(ip)
    #Fake.puts(line)
  else
    check(ip, host, line, spider)
  end
end

# Reverse-resolve the IP with the system host command and cache the verdict.
def check(ip, host, line, spider)
  check_host = `host #{ip}`
  if check_host.include?(host)
    puts line
    Spider[spider][1] << ip
  else
    #Fake.puts(line)
    Spider[spider][2] << ip
  end
end

# Read the log from stdin; the first field of each line is the IP.
while line = gets
  line.chomp!
  ip = line.split[0]
  Spider.each_pair do |spider, factor|
    filter(factor, ip, line, spider) if line.include?(spider)
  end
end
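A note on the reverse lookup: shelling out to host spawns a process for every uncached IP and trusts the PTR record alone. Below is a minimal sketch of the same check using Ruby's standard resolv library, with an extra forward lookup to confirm the PTR record; the method name real_spider? is mine for illustration, not part of the original script:

require 'resolv'

# Returns true only if the IP's PTR name is under the expected domain AND
# that name resolves back to the same IP (forward-confirmed reverse DNS).
def real_spider?(ip, domain)
  name = Resolv.getname(ip)                # PTR lookup, e.g. "crawl-....googlebot.com"
  return false unless name.end_with?(".#{domain}")
  Resolv.getaddresses(name).include?(ip)   # forward lookup must return the same IP
rescue Resolv::ResolvError
  false                                    # unresolvable IPs are treated as fake
end

This could be dropped into check in place of the backtick call. The good/bad IP arrays in Spider serve the same purpose either way: each IP is resolved at most once per run, so repeated hits from the same crawler cost no extra DNS traffic.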