Frequently Asked Questions
Webdunia's Web Crawler:
Webdunia Crawler is a web crawling robot. It fetches documents from the web to build the index for the Webdunia search engine at http://www.search.webdunia.com/. On this page, you'll find answers to the most frequently asked questions about the behavior of Webdunia Crawler.
 
1. What is your crawler's HTTP user-agent string?
2. How often will Webdunia Crawler access my web site?
3. How do I request that Webdunia Crawler not crawl parts or all of my site?
4. How can I control how frequently Webdunia Crawler visits my site?
5. Why is Webdunia Crawler trying to access a file called robots.txt that isn't on my server?
6. Why is Webdunia Crawler attempting to download incorrect or non-existent links from my server?
7. Why isn't Webdunia Crawler respecting my robots.txt file?
8. I'd like to filter my logs. What IP addresses does Webdunia Crawler crawl from?
9. What other user-agents are/were used by Webdunia Crawler?
10. Why is Webdunia Crawler retrieving the same page on my site multiple times?
11. I have additional questions or comments about Webdunia Crawler. Who should I contact?
 
Answers
 
1. What is your crawler's HTTP user-agent string?
Webduniabot/1.0
 
2. How often will Webdunia Crawler access my web site?
Webdunia Crawler waits at least one second between successive requests to the same web server. Network delays can stretch this interval to two seconds or more. It may also increase periodically as we test new crawler software while running our operational crawl at the same time.
 
3. How do I request that Webdunia Crawler not crawl parts or all of my site?
The Robot Exclusion Standard provides a way for web site administrators to restrict robots' access to their web server by specifying crawler directives in a file called /robots.txt. Webdunia Crawler caches a copy of /robots.txt for each web server and refreshes it every 24 hours.
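If you want to check a rule before publishing it, the following is a minimal sketch in Python using the standard library's urllib.robotparser module. It only shows how a standards-following parser reads the file; it is not Webdunia Crawler's own code, and the /private/ path and the www.example.com host are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical /robots.txt content: the /private/ path is an assumption,
# used only to illustrate a Disallow rule aimed at Webduniabot.
ROBOTS_TXT = """\
User-agent: Webduniabot
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    agent = "Webduniabot/1.0"
    # Placeholder URLs on a documentation-only host name.
    print(is_allowed(ROBOTS_TXT, agent, "http://www.example.com/private/page.html"))
    print(is_allowed(ROBOTS_TXT, agent, "http://www.example.com/index.html"))
```

To keep the crawler away from the whole site rather than one directory, the rule would be "Disallow: /" under the same User-agent line.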
 
4. How can I control how frequently Webdunia Crawler visits my site?
Webdunia Crawler respects a new directive in the /robots.txt file called "Crawl-delay". The syntax is "Crawl-delay: xx", where "xx" is the delay in seconds between successive crawler visits. If Webdunia Crawler's access rate is inappropriate for your server, you can throttle it back to, say, once every 10 seconds with the following lines:
User-agent: Webduniabot
Crawl-delay: 10

As with all /robots.txt changes, it will take up to 24 hours for Webdunia Crawler to pick up the change.
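To confirm that the directive is written in a form a parser can read, here is a small sketch, again using Python's standard urllib.robotparser module (its crawl_delay method requires Python 3.6 or later). This is an illustration of how a generic parser interprets the lines above, not a description of Webdunia Crawler's internals; the helper name crawl_delay_for is made up for this example.

```python
from urllib.robotparser import RobotFileParser

# The same two lines shown in the answer above.
ROBOTS_TXT = """\
User-agent: Webduniabot
Crawl-delay: 10
"""


def crawl_delay_for(robots_txt: str, user_agent: str):
    """Return the Crawl-delay (in seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    print(crawl_delay_for(ROBOTS_TXT, "Webduniabot/1.0"))
```

A result of None means no delay rule matched that user-agent, which usually points to a typo in the User-agent or Crawl-delay line.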
 
5. Why is Webdunia Crawler trying to access a file called robots.txt that isn't on my server?
/robots.txt is a file that contains directives for web robots that restrict access to all or part of a web site. For information on how to create a /robots.txt file, see The Robot Exclusion Standard. If you just want to prevent the "file not found" error messages in your web server log, you can create an empty file named /robots.txt.
 
6. Why is Webdunia Crawler attempting to download incorrect or non-existent links from my server?
Webdunia Crawler discovers web pages by extracting links from other web pages that it already knows about. Sometimes a page is removed from a web site, but links to it remain on other pages. Incorrect page references may also be created directly by a web page author due to a typo or misspelling. When Webdunia Crawler discovers these bogus links, it will attempt to crawl them.
 
7. Why isn't Webdunia Crawler respecting my robots.txt file?
For efficiency reasons, Webdunia Crawler caches a copy of the /robots.txt file locally, which it refreshes every 24 hours. It can therefore take up to 24 hours for changes in a /robots.txt file to get picked up by the crawler.
If the /robots.txt file is not in the proper location, it won't get picked up. Make sure you're following the Robot Exclusion Standard exactly.
If your web server is configured to block access to /robots.txt, the crawler won't be able to read it and will assume access to your entire site is disallowed.
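On the question of location: under the Robot Exclusion Standard, the file must sit at the root of the host, so a copy placed in a subdirectory is ignored. The sketch below, in Python, derives the address a robot would look at for any page on your site. The function name robots_txt_url and the example URLs are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level /robots.txt address for the host serving page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # A robots.txt saved under /docs/ is not where a robot will look.
    print(robots_txt_url("http://www.example.com/docs/robots.txt"))
    print(robots_txt_url("http://www.example.com:8080/a/b.html?x=1"))
```

If the address printed for your site does not return your rules when opened in a browser, the crawler will not see them either.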
 
8. I'd like to filter my logs. What IP addresses does Webdunia Crawler crawl from?
We recommend that you filter on the user-agent string rather than on IP address, because Webdunia Crawler's IP addresses vary over time.
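As a rough sketch of user-agent filtering, the Python snippet below drops crawler requests from a list of log lines. It assumes your server writes the Combined Log Format, where the user-agent is the last double-quoted field; adjust the pattern if your log layout differs. The sample lines, IP addresses, and function names are invented for illustration.

```python
import re

# Lower-case substrings of the user-agent names listed in this FAQ.
CRAWLER_TOKENS = ("webduniabot", "webdunia-crawler")

# Assumption: Combined Log Format, so the user-agent is the final quoted field.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')


def is_webdunia_request(log_line: str) -> bool:
    """Return True if the line's user-agent field names a Webdunia crawler."""
    match = USER_AGENT_RE.search(log_line)
    if match is None:
        return False
    agent = match.group(1).lower()
    return any(token in agent for token in CRAWLER_TOKENS)


def drop_crawler_lines(lines):
    """Return only the log lines that were not requested by a Webdunia crawler."""
    return [line for line in lines if not is_webdunia_request(line)]


if __name__ == "__main__":
    # Made-up sample entries using documentation-only IP addresses.
    sample = [
        '192.0.2.10 - - [10/Mar/2008:13:55:36 +0530] "GET /robots.txt HTTP/1.1" 200 68 "-" "Webduniabot/1.0"',
        '192.0.2.20 - - [10/Mar/2008:13:56:01 +0530] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    ]
    for line in drop_crawler_lines(sample):
        print(line)
```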
 
9. What other user-agents are/were used by Webdunia Crawler?
Webdunia-crawler and image-webduniabot/1.0
Please use Webduniabot/1.0 in your /robots.txt file to specify rules for our crawler.
 
10. Why is Webdunia Crawler retrieving the same page on my site multiple times?
Webdunia Crawler keeps track of how frequently pages change so that it can maintain a fresh copy of each page. Pages that change frequently get crawled frequently.
 
11. I have additional questions or comments about Webdunia Crawler. Who should I contact?
Please write to us at crawler@webdunia.net with your questions.
 
 
© 2007-2008 Webdunia.com