Mirroring a website with wget

Posted on 2006-12-18 10:07:04
What tool do you all normally use to mirror a website?
I mostly use wget:
[PHP]
wget -p -np -k -m http://www.abc.com
[/PHP]
But on one site this doesn't work: it only fetches the first page, and I don't know why.
http://worldhello.net/doc/docbook_howto/index.html
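For reference, the same command with the long options written out (these should be the equivalent GNU wget spellings of -m, -k, -p and -np):
[PHP]
wget --mirror --convert-links --page-requisites --no-parent http://www.abc.com
[/PHP]
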
Posted on 2006-12-18 11:11:45
User-agent: wget
Disallow: /
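A quick way to check what the site actually serves (assuming the file sits at the host root, as the robots exclusion standard expects):
[PHP]
wget -q -O - http://worldhello.net/robots.txt
[/PHP]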

Posted on 2006-12-18 12:26:19
Or: wget -r -p -np -k www.example.com

Posted on 2006-12-18 12:32:21
That won't help. Look at its robots.txt; wget does honor that protocol.
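For what it's worth, that behavior is just a wget setting; it can be switched off in ~/.wgetrc, or per run with -e (please only do this with the site owner's blessing):
[PHP]
# ~/.wgetrc -- equivalent to passing "-e robots=off" on the command line
robots = off
[/PHP]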

OP | Posted on 2006-12-18 14:30:32
Its robots.txt doesn't seem to be as short as the usual ones:
#
# robots.txt for http://whodo.worldhello.net
#

# enable google adsense (advertising-related bots):
# User-agent: Mediapartners-Google*
# Disallow:

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: Zealbot
Disallow: /

User-agent: MSIECrawler
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: Fetch
Disallow: /

User-agent: Offline Explorer
Disallow: /

User-agent: Teleport
Disallow: /

User-agent: TeleportPro
Disallow: /

User-agent: WebZIP
Disallow: /

User-agent: linko
Disallow: /

User-agent: HTTrack
Disallow: /

User-agent: Microsoft.URL.Control
Disallow: /

User-agent: Xenu
Disallow: /

User-agent: larbin
Disallow: /

User-agent: libwww
Disallow: /

User-agent: ZyBORG
Disallow: /

User-agent: Download Ninja
Disallow: /

#
# Sorry, wget in its recursive mode is a frequent problem.
# Please read the man page and use it properly; there is a
# --wait option you can use to set the delay between hits,
# for instance.
#
User-agent: wget
Disallow: /

#
# The 'grub' distributed client has been *very* poorly behaved.
#
User-agent: grub-client
Disallow: /

#
# Doesn't follow robots.txt anyway, but...
#
User-agent: k2spider
Disallow: /

#
# Hits many times per second, not acceptable
# http://www.nameprotect.com/botinfo.html
User-agent: NPBot
Disallow: /

# A capture bot, downloads gazillions of pages with no public benefit
# http://www.webreaper.net/
User-agent: WebReaper
Disallow: /

# A Japan crawler: goo.ne.jp
User-agent: ichiro/2.0
Disallow: /

#
# Friendly, low-speed bots are welcome viewing article pages, but not
# dynamically-generated pages please.
#
User-agent: *
Disallow: /w/
Disallow: /images/
Disallow: /cgi-bin/
Disallow: /wiki/Special:Randompage
Disallow: /wiki/Special%3ARandompage
Disallow: /wiki/Special:Search
Disallow: /wiki/Special%3ASearch
Crawl-delay: 1

Can any of you tell me what all of this means?
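The file's own comments spell it out: recursive wget is blocked outright (User-agent: wget / Disallow: /), and the site asks crawlers to pause between hits (see the --wait note and Crawl-delay: 1). A sketch of a mirror run that overrides the block while honoring the requested delay (again, only with the site owner's OK):
[PHP]
wget -e robots=off --wait=1 -m -k -p -np http://worldhello.net/doc/docbook_howto/index.html
[/PHP]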

OP | Posted on 2006-12-18 14:35:14
Ugh... did the network hiccup?
How did it end up posting so many times?

Mods, please help delete the duplicate.