Webdunia's Web Crawler:
Webdunia Crawler is a web-crawling robot. It fetches documents from the web to build
the index for the Webdunia search engine (http://www.search.webdunia.com/). On this
page, you'll find answers to the most frequently asked questions about the behavior
of the Webdunia Crawler.

1. What is your crawler's HTTP user-agent string?
2. How often will Webdunia Crawler access my web site?
3. How do I request that Webdunia Crawler not crawl parts or all of my site?
4. How can I control how frequently Webdunia Crawler visits my site?
5. Why is Webdunia Crawler trying to access a file called robots.txt that isn't on my server?
6. Why is Webdunia Crawler attempting to download incorrect or non-existent links from my server?
7. Why isn't Webdunia Crawler respecting my robots.txt file?
8. I'd like to filter my logs. What IP addresses does Webdunia Crawler crawl from?
9. What other user-agents are/were used by Webdunia Crawler?
10. Why is Webdunia Crawler retrieving the same page on my site multiple times?
11. I have additional questions or comments about Webdunia Crawler. Who should I contact?

Answers

1. What is your crawler's HTTP user-agent string?

Webduniabot/1.0

2. How often will Webdunia Crawler access my web site?

Webdunia Crawler waits at least one second between successive requests to the same
web server. Network delays can stretch this interval to two seconds or more. Access
frequency may also increase periodically while we test new crawler software alongside
our operational crawl.
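
As an illustration of the per-server spacing described above, here is a minimal
Python sketch of a politeness throttle. It is not Webdunia Crawler's actual code:
the class name HostThrottle, its methods, and the injectable clock are all
hypothetical, and the one-second default simply mirrors the figure quoted above.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum gap between requests to the same host (illustrative only)."""

    def __init__(self, min_delay=1.0, clock=time.monotonic):
        self.min_delay = min_delay   # seconds between requests to one host
        self.clock = clock           # injectable so the logic can be tested
        self.last_request = {}       # host -> time of the most recent request

    def wait_time(self, url):
        """Return how many seconds to wait before fetching url."""
        host = urlsplit(url).netloc
        last = self.last_request.get(host)
        if last is None:
            return 0.0               # first request to this host: no wait
        return max(0.0, self.min_delay - (self.clock() - last))

    def record(self, url):
        """Note that a request to url's host was just issued."""
        self.last_request[urlsplit(url).netloc] = self.clock()
```

A caller would sleep for wait_time(url) seconds, fetch the page, then call
record(url). Each host is tracked separately, so a wait on one server does not
delay requests to another.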

3. How do I request that Webdunia Crawler not crawl parts or all of my site?

The Robot Exclusion Standard provides a way for web site administrators to restrict
robots' access to their web server by specifying crawler directives in a file called
/robots.txt. Webdunia Crawler caches a copy of /robots.txt for each web server and
refreshes it every 24 hours.
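
To make the directives concrete, the sketch below feeds a sample /robots.txt to
Python's standard urllib.robotparser module and checks which paths it permits. The
/private/ path and the www.example.com host are hypothetical, and the parser shows
how a standards-following robot reads the rules, not necessarily how Webdunia
Crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# Sample rules: keep Webduniabot out of /private/, leave other robots unrestricted.
# To exclude the whole site instead, the rule would be "Disallow: /".
ROBOTS_TXT = """\
User-agent: Webduniabot
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="Webduniabot/1.0"):
    """Return True if agent may fetch path on the (hypothetical) example host."""
    return parser.can_fetch(agent, "http://www.example.com" + path)
```

With these rules, allowed("/index.html") is True and
allowed("/private/report.html") is False.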

4. How can I control how frequently Webdunia Crawler visits my site?

Webdunia Crawler respects a new directive in the /robots.txt file called "Crawl-delay".
The syntax is "Crawl-delay: xx", where "xx" is the delay in seconds between successive
crawler visits. If Webdunia Crawler's access rate is inappropriate for your server,
you can throttle it back to, say, once every 10 seconds with the following lines:

    User-agent: Webduniabot
    Crawl-delay: 10

As with all /robots.txt changes, it can take up to 24 hours for Webdunia Crawler
to pick up the change.
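
If you want to check that your Crawl-delay lines parse as intended, one option is
Python's standard urllib.robotparser module, which understands this directive. This
is only a sanity check from the site owner's side; the one-second fallback in
effective_delay is an assumption taken from the minimum interval mentioned in this
FAQ, and the function name is hypothetical.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: Webduniabot",
    "Crawl-delay: 10",
])


def effective_delay(agent, default=1.0):
    """Return the Crawl-delay for agent in seconds, or default if none is set."""
    delay = parser.crawl_delay(agent)
    return default if delay is None else float(delay)
```

Here effective_delay("Webduniabot/1.0") gives 10.0, while an agent with no
matching record falls back to the default.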

5. Why is Webdunia Crawler trying to access a file called robots.txt that isn't
on my server?

/robots.txt is a file that contains directives for web robots that restrict access
to all or part of a web site. For information on how to create a /robots.txt file,
see the Robot Exclusion Standard. If you just want to prevent the "file not found"
error messages in your web server log, you can create an empty file named /robots.txt.

6. Why is Webdunia Crawler attempting to download incorrect or non-existent
links from my server?

Webdunia Crawler discovers web pages by extracting links from other web pages that
it already knows about. Sometimes a page is removed from a web site, but links to
it remain on other pages. Incorrect page references may also be created directly
by a web page author due to a typo or misspelling. When Webdunia Crawler discovers
these bogus links, it will attempt to crawl them.
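
The sketch below shows, under simplifying assumptions, how link extraction of this
kind can surface bad URLs. It uses Python's standard html.parser and urljoin; the
LinkExtractor class and extract_links function are hypothetical names, and
Webdunia Crawler's real extraction logic is not documented here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative references against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return all link targets found in html, resolved against base_url."""
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    return extractor.links
```

The extractor cannot tell whether a target still exists or was mistyped. A link
such as href="contcat.html" is resolved and queued like any other, and the error
only shows up when the crawler requests it.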

7. Why isn't Webdunia Crawler respecting my robots.txt file?

For efficiency reasons, Webdunia Crawler caches a copy of the /robots.txt file locally,
which it refreshes every 24 hours. It can therefore take up to 24 hours for changes
in a /robots.txt file to get picked up by the crawler.

If the /robots.txt file is not in the proper location, it won't get picked up. Make
sure you're following the Robot Exclusion Standard exactly.

If your web server is configured to block access to /robots.txt, the crawler won't
be able to read it and will assume access to your entire site is disallowed.
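
The two behaviors above (the 24-hour cache and the "blocked means disallowed" rule)
can be sketched as follows. The function names and return values are hypothetical.
Treating HTTP 401 and 403 as "blocked", a 404 as "no restrictions", and other
errors as "retry later" are assumptions based on common crawler practice, not
documented Webdunia behavior.

```python
ROBOTS_TTL_SECONDS = 24 * 60 * 60  # the 24-hour refresh interval stated above


def needs_refresh(fetched_at, now, ttl=ROBOTS_TTL_SECONDS):
    """Return True when a cached /robots.txt copy is at least ttl seconds old."""
    return now - fetched_at >= ttl


def robots_policy(status_code):
    """Map the HTTP status of a /robots.txt fetch to a crawl policy."""
    if status_code in (401, 403):
        return "disallow_all"   # access blocked: treat the whole site as off limits
    if 200 <= status_code < 300:
        return "parse_rules"    # file readable: apply its directives
    if status_code == 404:
        return "allow_all"      # assumption: no file means no restrictions
    return "retry_later"        # assumption: other errors are treated as temporary
```

If your rules seem to be ignored, requesting /robots.txt yourself and checking the
status code your server returns is a quick way to see which of these cases applies.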

8. I'd like to filter my logs. What IP addresses does Webdunia Crawler crawl
from?

We recommend that you filter on the user-agent string to identify Webdunia Crawler's
requests. Webdunia Crawler's IP addresses vary over time.
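
A minimal Python sketch of user-agent filtering is shown below. It assumes your
access log records the user-agent string somewhere on each line (as the common
"combined" log format does). The function names are hypothetical, and the tokens
are substrings of the user-agent strings listed in this FAQ.

```python
# Lowercased substrings of the user-agent strings listed in this FAQ.
CRAWLER_TOKENS = ("webduniabot", "webdunia-crawler")


def is_webdunia_crawler(log_line, tokens=CRAWLER_TOKENS):
    """Return True if the log line mentions one of the crawler's user-agents."""
    lowered = log_line.lower()
    return any(token in lowered for token in tokens)


def split_log(log_lines):
    """Separate log lines into (crawler_lines, other_lines)."""
    crawler, other = [], []
    for line in log_lines:
        (crawler if is_webdunia_crawler(line) else other).append(line)
    return crawler, other
```

Matching on a substring rather than the exact string also catches variants such
as image-webduniabot/1.0.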

9. What other user-agents are/were used by Webdunia Crawler?

Webdunia-crawler and image-webduniabot/1.0

Please use Webduniabot/1.0 in your /robots.txt file to specify rules for our crawler.

10. Why is Webdunia Crawler retrieving the same page on my site multiple times?

Webdunia Crawler keeps track of how frequently pages change so that it can maintain
a fresh copy of each page. Pages that change frequently get crawled frequently.

11. I have additional questions or comments about Webdunia Crawler. Who should
I contact?

Please write to us at crawler@webdunia.net with your questions.