Mirroring a website with wget

Posted on 2006-12-18 10:07:04
What tool do you all normally use to mirror a website?
I mostly use wget:
[PHP]
wget -p -np -k -m http://www.abc.com
[/PHP]
But on one site this doesn't work: it only fetches the first page, and I don't know why.
http://worldhello.net/doc/docbook_howto/index.html
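For reference, the same command with the long options written out (these should be the equivalent GNU wget spellings of -m, -k, -p and -np):
[PHP]
wget --mirror --convert-links --page-requisites --no-parent http://www.abc.com
[/PHP]
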
Posted on 2006-12-18 11:11:45
User-agent: wget
Disallow: /
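A quick way to check what the site actually serves (assuming the file sits at the host root, as the robots exclusion standard expects):
[PHP]
wget -q -O - http://worldhello.net/robots.txt
[/PHP]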

Posted on 2006-12-18 12:26:19
Or: wget -r -p -np -k www.example.com

Posted on 2006-12-18 12:32:21
That won't help. Look at its robots.txt; wget does honor that protocol.
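For what it's worth, that behavior is just a wget setting; it can be switched off in ~/.wgetrc, or per run with -e (please only do this with the site owner's blessing):
[PHP]
# ~/.wgetrc -- equivalent to passing "-e robots=off" on the command line
robots = off
[/PHP]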

OP | Posted on 2006-12-18 14:30:32
Its robots.txt doesn't seem to be as short as the usual ones:
#
# robots.txt for http://whodo.worldhello.net
#

# enable google adsense (advertising-related bots):
# User-agent: Mediapartners-Google*
# Disallow:

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: Zealbot
Disallow: /

User-agent: MSIECrawler
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: Fetch
Disallow: /

User-agent: Offline Explorer
Disallow: /

User-agent: Teleport
Disallow: /

User-agent: TeleportPro
Disallow: /

User-agent: WebZIP
Disallow: /

User-agent: linko
Disallow: /

User-agent: HTTrack
Disallow: /

User-agent: Microsoft.URL.Control
Disallow: /

User-agent: Xenu
Disallow: /

User-agent: larbin
Disallow: /

User-agent: libwww
Disallow: /

User-agent: ZyBORG
Disallow: /

User-agent: Download Ninja
Disallow: /

#
# Sorry, wget in its recursive mode is a frequent problem.
# Please read the man page and use it properly; there is a
# --wait option you can use to set the delay between hits,
# for instance.
#
User-agent: wget
Disallow: /

#
# The 'grub' distributed client has been *very* poorly behaved.
#
User-agent: grub-client
Disallow: /

#
# Doesn't follow robots.txt anyway, but...
#
User-agent: k2spider
Disallow: /

#
# Hits many times per second, not acceptable
# http://www.nameprotect.com/botinfo.html
User-agent: NPBot
Disallow: /

# A capture bot, downloads gazillions of pages with no public benefit
# http://www.webreaper.net/
User-agent: WebReaper
Disallow: /

# A Japan crawler: goo.ne.jp
User-agent: ichiro/2.0
Disallow: /

#
# Friendly, low-speed bots are welcome viewing article pages, but not
# dynamically-generated pages please.
#
User-agent: *
Disallow: /w/
Disallow: /images/
Disallow: /cgi-bin/
Disallow: /wiki/Special:Randompage
Disallow: /wiki/Special%3ARandompage
Disallow: /wiki/Special:Search
Disallow: /wiki/Special%3ASearch
Crawl-delay: 1

Can any of you tell me what all of this means?
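The file's own comments spell it out: recursive wget is blocked outright (User-agent: wget / Disallow: /), and the site asks crawlers to pause between hits (see the --wait note and Crawl-delay: 1). A sketch of a mirror run that overrides the block while honoring the requested delay (again, only with the site owner's OK):
[PHP]
wget -e robots=off --wait=1 -m -k -p -np http://worldhello.net/doc/docbook_howto/index.html
[/PHP]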

OP | Posted on 2006-12-18 14:35:14
Ugh... did the network hiccup?
How did it end up posting so many times?

Mods, please help delete the duplicate.