A program for Linux; it requires a Ruby environment.
Input is a log file whose first whitespace-separated field must be the IP address.
Output is the Google and Baidu crawler log records with the fake crawlers removed.
For example:
cat log | ruby fake_spider_filter.rb > real_spider.log
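The exact log format does not matter as long as the IP comes first; a hypothetical input line in Apache combined format might look like this (IP and timestamp made up for illustration):

123.125.71.12 - - [21/Sep/2011:10:00:00 +0800] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"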
The filtering principle is to reverse-resolve each crawler IP and check whether it resolves back to a Baidu or Google domain; a pure-Ruby variant of this check is sketched after the source code.
Source code follows:

# fake_spider_filter.rb
# spider UA token => [expected rDNS domain, cache of verified IPs, cache of fake IPs]
Spider = { "Baiduspider" => ["baidu.com", [], []],
           "Googlebot"   => ["googlebot.com", [], []] }
#Fake = open("./fake_spider_log.txt", "a")

# Known-good IPs are printed, known-bad IPs are dropped,
# and unknown IPs go through a reverse-DNS check.
def filter(factor, ip, line, spider)
  host   = factor[0]
  goodip = factor[1]
  badip  = factor[2]
  if goodip.include?(ip)
    puts line
  elsif badip.include?(ip)
    #Fake.puts(line)
  else
    check(ip, host, line, spider)
  end
end

# Reverse-resolve the IP with the system host command and cache the verdict.
def check(ip, host, line, spider)
  check_host = `host #{ip}`
  if check_host.include?(host)
    puts line
    Spider[spider][1] << ip
  else
    #Fake.puts(line)
    Spider[spider][2] << ip
  end
end

# Read the log from stdin; the first field of each line is the IP.
while line = gets
  line.chomp!
  ip = line.split[0]
  Spider.each_pair do |spider, factor|
    filter(factor, ip, line, spider) if line.include?(spider)
  end
end
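A note on the reverse lookup: shelling out to host spawns a process for every uncached IP and trusts the PTR record alone. Below is a minimal sketch of the same check using Ruby's standard resolv library, with an extra forward lookup to confirm the PTR record; the method name real_spider? is mine for illustration, not part of the original script:

require 'resolv'

# Returns true only if the IP's PTR name is under the expected domain AND
# that name resolves back to the same IP (forward-confirmed reverse DNS).
def real_spider?(ip, domain)
  name = Resolv.getname(ip)                # PTR lookup, e.g. "crawl-....googlebot.com"
  return false unless name.end_with?(".#{domain}")
  Resolv.getaddresses(name).include?(ip)   # forward lookup must return the same IP
rescue Resolv::ResolvError
  false                                    # unresolvable IPs are treated as fake
end

This could be dropped into check in place of the backtick call. The good/bad IP arrays in Spider serve the same purpose either way: each IP is resolved at most once per run, so repeated hits from the same crawler cost no extra DNS traffic.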