This is a discussion on Googlebot Auto Email within the PHP Programming forums, part of the Web Development category.
| |||
| Auto Email on Googlebot Detected Crawling Page
A simple script that you can insert in a .php page and that will email you when Google is indexing your site. You will need to change the values in the script for your own site and contact details. Simply cut and paste it from the box that follows. Don't forget the opening and closing <?php and ?> tags.
__________________ Thanks & Regards Sabari... |
| |||
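| This script is completely free for you to use and modify however you see fit, but if you make any cool changes, please share them with us.
<?php
// $_SERVER replaces the old register_globals variable $HTTP_USER_AGENT,
// and stripos() replaces eregi(), which was removed in PHP 7.
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($agent, "googlebot") !== false) {
    mail("you@youremail.com", "Googlebot detected on yourdomainname.com", "Google has crawled yourdomainname.com");
}
?>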
__________________ Thanks & Regards Sabari... |
| |||
| Advanced Version: This is a much better version that automatically fills in the domain and the actual page (including any query string), and tells you the date and time the page was crawled. Very useful if you want to add this script to many pages.
<?php
// Values are read from $_SERVER because the register_globals variables
// ($HTTP_USER_AGENT, $SERVER_NAME, $PHP_SELF, $QUERY_STRING) no longer exist.
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($agent, "googlebot") !== false) {
    $url = "http://" . $_SERVER['SERVER_NAME'] . $_SERVER['PHP_SELF'];
    if (!empty($_SERVER['QUERY_STRING'])) {
        $url .= '?' . $_SERVER['QUERY_STRING'];
    }
    $today = date("F j, Y, g:i a");
    mail("you@youremail.com", "Googlebot detected on http://" . $_SERVER['SERVER_NAME'], "$today - Google crawled $url");
}
?>
__________________ Thanks & Regards Sabari... |
| |||
| Hi Sabari, it is really fantastic to learn about new concepts like this. Can I know more about Googlebot? I am very new to these concepts.
__________________ With, J. Jeyaseelan Everything Possible Last edited by Jeyaseelansarc : 11-15-2007 at 10:18 PM. |
| |||
| Sure, Mr. Jeyaseelansarc, I'll explain this topic step by step. Googlebot is the name of Google's indexing robot, which scans the web from link to link looking for new pages. You can find out whether Googlebot has visited your website by looking at your server's log files.
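One way to check the log files mentioned above is to scan them for the Googlebot user-agent. This is only a minimal sketch: it assumes a log with one request per line and the user-agent string somewhere in that line, and the function name countGooglebotHits and the sample lines are made up for illustration.

```php
<?php
// Count log lines that mention "googlebot" (case-insensitive).
// Assumption: one request per line, with the user-agent included in the line.
function countGooglebotHits(array $logLines): int
{
    $hits = 0;
    foreach ($logLines as $line) {
        if (stripos($line, 'googlebot') !== false) {
            $hits++;
        }
    }
    return $hits;
}

// Hypothetical sample lines in Apache combined-log style.
$sampleLog = [
    '66.249.66.1 - - [15/Nov/2007:22:10:01 +0000] "GET /index.php HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.10 - - [15/Nov/2007:22:11:45 +0000] "GET /about.php HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1)"',
];

echo countGooglebotHits($sampleLog), "\n"; // prints 1
```

For a real log you could pass in the result of file() called with the FILE_IGNORE_NEW_LINES flag; the location of the log depends on your server setup.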
__________________ Thanks & Regards Sabari... |
| |||
| What is Googlebot?
"Googlebot" is the term Google uses for its web crawler. Essentially, Googlebot visits pages all over the internet, mainly by following links from existing pages, and creates the Google index based on what it finds. Googlebot parses the HTML code that is the backbone of most web pages and stores what it finds in the index, which is then quickly and effectively searchable by the Google search engine.
When users enter a search term or phrase into the search box on Google's web site, what they get are not the live results of Googlebot's crawling but recorded results. In other words, the results the user receives can be from a week or two earlier, when Googlebot last crawled those websites and their content.
There are two versions of Googlebot: deepbot and freshbot. Both are true to their names. Deepbot attempts to index all there is to index, following every link it finds and indexing all content. Freshbot, on the other hand, is geared toward maintaining a fresh index of frequently updated websites. Using these two variations allows Google to keep a fresh index of constantly updated websites without delaying those results until a complete crawl of the web is done. Freshbot runs a lot more often than deepbot.
__________________ Thanks & Regards Sabari... Last edited by Sabari : 11-15-2007 at 11:06 PM. |
| |||
| Hi, this is very useful information for me. I have another question: what is a web crawler? Can I know more about this?
__________________ With, J. Jeyaseelan Everything Possible |
| |||
| Web Crawler
A web crawler is an automated program that accesses a web site and traverses it by following the links present on its pages. It is also known as a bot, robot or spider.
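The idea of traversing a site by following links can be sketched in a few lines of PHP. This is a toy illustration, not how Googlebot itself works: it assumes the DOM extension is available, the function names extractLinks and crawl are invented here, and the "site" is a hypothetical in-memory array so that nothing is fetched over the network.

```php
<?php
// Extract the href value of every <a> tag in an HTML string.
function extractLinks(string $html): array
{
    if ($html === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Suppress warnings caused by imperfect real-world HTML.
    @$doc->loadHTML($html);
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// Breadth-first traversal: visit a page, queue its unseen links, repeat.
// $fetch is a caller-supplied function that returns the HTML for a URL.
function crawl(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $visited = [];
    $queue = [$startUrl];
    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;
        foreach (extractLinks($fetch($url)) as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited);
}

// Hypothetical three-page site held in memory.
$site = [
    '/'  => '<a href="/a">A</a> <a href="/b">B</a>',
    '/a' => '<a href="/">Home</a>',
    '/b' => '<a href="/a">A</a>',
];
$fetch = function (string $url) use ($site): string {
    return $site[$url] ?? '';
};

print_r(crawl('/', $fetch)); // visits "/", "/a", "/b" in that order
```

The visited list is what keeps the crawler from looping forever when pages link back to each other.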
__________________ Thanks & Regards Sabari... |
| |||
| Spider
A spider is an automated program that accesses a web site and traverses it by following the links present on its pages. It is also known as a bot, robot or web crawler.
__________________ Thanks & Regards Sabari... |
| |||
| Bot
A bot is an automated program that accesses a web site and traverses it by following the links present on its pages. It is also known as a robot, spider or web crawler.
__________________ Thanks & Regards Sabari... |
| |||
| Robot
A robot is an automated program that accesses a web site and traverses it by following the links present on its pages. It is also known as a bot, spider or web crawler.
__________________ Thanks & Regards Sabari... |
| |||
| The number of pages Googlebot crawls
The Googlebot activity reports in webmaster tools show you the number of pages of your site Googlebot has crawled over the last 90 days. We've seen some of you asking why this number might be higher than the total number of pages on your sites.
Googlebot crawls pages of your site based on a number of things including:
* pages it already knows about
* links from other web pages (within your site and on other sites)
* pages listed in your Sitemap file
More specifically, Googlebot doesn't access pages, it accesses URLs. And the same page can often be accessed via several URLs. Consider the home page of a site that can be accessed from the following four URLs:
* Example Web Page
* Example Web Page
* Example Web Page
* Example Web Page
Although all URLs lead to the same page, all four URLs may be used in links to the page. When Googlebot follows these links, a count of four is added to the activity report.
Many other scenarios can lead to multiple URLs for the same page. For instance, a page may have several named anchors, such as:
* http://www.example.com/mypage.html#heading1
* http://www.example.com/mypage.html#heading2
* http://www.example.com/mypage.html#heading3
And dynamically generated pages often can be reached by multiple URLs, such as:
* http://www.example.com/furniture?type=chair&brand=123
* http://www.example.com/hotbuys?type=chair&brand=123
As you can see, when you consider that each page on your site might have multiple URLs that lead to it, the number of URLs that Googlebot crawls can be considerably higher than the total number of pages on your site.
Of course, you (and we) only want one version of the URL to be returned in the search results. Not to worry -- this is exactly what happens. Our algorithms select a version to include, and you can provide input on this selection process.
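To see how the URL count can outrun the page count, here is a small sketch that collapses a list of crawled URLs into distinct pages. The rules are assumptions chosen for the illustration only (drop the #anchor, ignore a leading "www.", treat /index.html as /), the function name pageKey is hypothetical, and the four home-page URLs are assumed examples of the www/non-www and index.html variations. Google's own selection process is not described in this thread.

```php
<?php
// Reduce a URL to a rough "page identity".
// Assumed rules (illustrative only): drop the #fragment, ignore a leading "www.",
// and treat "/index.html" as "/".
function pageKey(string $url): string
{
    $parts = parse_url($url);
    $host = strtolower($parts['host'] ?? '');
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    $path = $parts['path'] ?? '/';
    if ($path === '/index.html') {
        $path = '/';
    }
    $query = isset($parts['query']) ? '?' . $parts['query'] : '';
    return $host . $path . $query;
}

// Hypothetical crawl list: four home-page variants plus two anchor links.
$crawled = [
    'http://www.example.com/',
    'http://www.example.com/index.html',
    'http://example.com',
    'http://example.com/index.html',
    'http://www.example.com/mypage.html#heading1',
    'http://www.example.com/mypage.html#heading2',
];

$pages = array_unique(array_map('pageKey', $crawled));
echo count($crawled), " URLs crawled, ", count($pages), " distinct pages\n";
// prints: 6 URLs crawled, 2 distinct pages
```

Note that this sketch would still count the furniture and hotbuys URLs as two pages, because nothing in the URL itself says they show the same content.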
__________________ Thanks & Regards Sabari... Last edited by Sabari : 11-16-2007 at 02:52 AM. |
| |||
| Redirect to the preferred version of the URL
You can do this using a 301 (permanent) redirect. In the first example, which shows four URLs that point to a site's home page, you may want to redirect index.html to Example Web Page. And you may want to redirect example.com to Example Web Page so that any URLs that begin with one version are redirected to the other version. Note that you can do this latter redirect with the Preferred Domain feature in webmaster tools. (If you also use a 301 redirect, make sure that this redirect matches what you set for the preferred domain.)
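In PHP, a 301 redirect of this kind could look roughly like the sketch below. It assumes that www.example.com is the preferred host and that "/" is preferred over "/index.html"; the function name preferredUrl is hypothetical, and many sites would do the same job in the web server configuration instead.

```php
<?php
// Return the preferred URL for a request, or null if the request is already canonical.
// Assumptions: www.example.com is the preferred host and "/" is preferred over "/index.html".
function preferredUrl(string $host, string $requestUri, string $preferredHost = 'www.example.com'): ?string
{
    $uri = $requestUri;
    if ($uri === '/index.html') {
        $uri = '/';
    }
    if (strtolower($host) === $preferredHost && $uri === $requestUri) {
        return null; // nothing to redirect
    }
    return 'http://' . $preferredHost . $uri;
}

// Typical use at the top of a page, before any output is sent:
// $target = preferredUrl($_SERVER['HTTP_HOST'], $_SERVER['REQUEST_URI']);
// if ($target !== null) {
//     header('Location: ' . $target, true, 301);
//     exit;
// }
```

The third argument to header() sets the response code, so the browser or crawler sees a permanent redirect rather than PHP's default 302.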
__________________ Thanks & Regards Sabari... |
| |||
| Block the non-preferred versions of a URL with a robots.txt file
For dynamically generated pages, you may want to block the non-preferred version using pattern matching in your robots.txt file. (Note that not all search engines support pattern matching, so check the guidelines for each search engine bot you're interested in.) For instance, in the third example, which shows two URLs that point to a page about the chairs available from brand 123, the "hotbuys" section rotates periodically and the content is always available from a primary and permanent location. In that case, you may want to index the first version and block the "hotbuys" version. To do this, add the following to your robots.txt file:
    User-agent: Googlebot
    Disallow: /hotbuys?*
To ensure that this directive will actually block and allow what you intend, use the robots.txt analysis tool in webmaster tools. Just add this directive to the robots.txt section on that page, list the URLs you want to check in the "Test URLs" section and click the Check button. The tool then reports, for each test URL, whether it is blocked or allowed.
Don't worry about links to anchors, because while Googlebot will crawl each link, our algorithms will index the URL without the anchor. And if you don't provide input such as that described above, our algorithms do a really good job of picking a version to show in the search results.
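If you want a rough local check of what a pattern such as /hotbuys?* would catch, the sketch below translates it into a regular expression. The semantics are assumptions for the sketch (the pattern matches from the start of the path, "*" matches any run of characters, a trailing "$" anchors the end), and the function name robotsPatternMatches is hypothetical. It does not replace the robots.txt analysis tool.

```php
<?php
// Check whether a robots.txt path pattern matches a URL path (plus query string).
// Assumed semantics: the pattern matches from the start of the path, "*" matches
// any run of characters, and a trailing "$" anchors the pattern to the end.
function robotsPatternMatches(string $pattern, string $path): bool
{
    $anchored = substr($pattern, -1) === '$';
    if ($anchored) {
        $pattern = substr($pattern, 0, -1);
    }
    // Quote regex metacharacters, then turn the escaped "*" back into ".*".
    $regex = str_replace('\*', '.*', preg_quote($pattern, '#'));
    return preg_match('#^' . $regex . ($anchored ? '$' : '') . '#', $path) === 1;
}

var_dump(robotsPatternMatches('/hotbuys?*', '/hotbuys?type=chair&brand=123'));   // bool(true)
var_dump(robotsPatternMatches('/hotbuys?*', '/furniture?type=chair&brand=123')); // bool(false)
```

With these two test URLs, only the "hotbuys" version is blocked and the "furniture" version stays crawlable, which is the outcome the directive is meant to produce.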
__________________ Thanks & Regards Sabari... Last edited by Sabari : 11-16-2007 at 03:06 AM. |
| |||
| Googlebot activity reports
The webmaster tools team has a very exciting mission: we dig into our logs, find as much useful information as possible, and pass it on to you, the webmasters. Our reward is that you more easily understand what Google sees, and why some pages don't make it to the index.
The latest batch of information that we've put together for you is the amount of traffic between Google and a given site. We show you the number of requests, number of kilobytes (yes, yes, I know that tech-savvy webmasters can usually dig this out, but our new charts make it really easy to see at a glance), and the average document download time. You can see this information in chart form, as well as in hard numbers (the maximum, minimum, and average).
For instance, here's the number of pages Googlebot has crawled in the Webmaster Central blog over the last 90 days. The maximum number of pages Googlebot has crawled in one day is 24 and the minimum is 2. That makes sense, because the blog was launched less than 90 days ago, and the chart shows that the number of pages crawled per day has increased over time. The number of pages crawled is sometimes more than the total number of pages in the site -- especially if the same page can be accessed via several URLs. So Official Google Webmaster Central Blog: Learn more about Googlebot's crawl of your site and more! and Official Google Webmaster Central Blog: Learn more about Googlebot's crawl of your site and more! are different, but point to the same page (the second points to an anchor within the page).
And here's the average number of kilobytes downloaded from this blog each day. As you can see, as the site has grown over the last two and a half months, the average number of kilobytes downloaded has increased as well.
The first two reports can help you diagnose the impact that changes in your site may have on its coverage. If you overhaul your site and dramatically reduce the number of pages, you'll likely notice a drop in the number of pages that Googlebot accesses.
The average document download time can help pinpoint subtle networking problems. If the average time spikes, you might have network slowdowns or bottlenecks that you should investigate. Here's the report for this blog that shows that we did have a short spike in early September (the maximum time was 1057 ms), but it quickly went back to a normal level, so things now look OK.
In general, the load time of a page doesn't affect its ranking, but we wanted to give this info because it can help you spot problems. We hope you will find this data as useful as we do!
__________________ Thanks & Regards Sabari... Last edited by Sabari : 11-16-2007 at 02:51 AM. |
| |||
| Hi, how can we opt out of this crawling by Google? Is there any method to stop Google from entering our sites?
__________________ With, J. Jeyaseelan Everything Possible |
| |||
| Yes, we can block it. Please go through the points below.
Blocking Googlebot
Google uses several user-agents. You can block access to any of them by including the bot name on the User-Agent line of an entry. Blocking Googlebot blocks all bots that begin with "Googlebot".
* Googlebot: crawls pages for our web index and our news index
* Googlebot-Mobile: crawls pages for our mobile index
* Googlebot-Image: crawls pages for our image index
* Mediapartners-Google: crawls pages to determine AdSense content. We only use this bot to crawl your site if you show AdSense ads on your site.
* Adsbot-Google: crawls pages to measure AdWords landing page quality. We only use this bot if you use Google AdWords to advertise your site. Find out more about this bot and how to block it from portions of your site.
For instance, to block Googlebot entirely, you can use the following syntax:
    User-agent: Googlebot
    Disallow: /
Allowing Googlebot
If you want to block access to all bots other than Googlebot, you can use the following syntax:
    User-agent: *
    Disallow: /
    User-agent: Googlebot
    Disallow:
Googlebot follows the line directed at it, rather than the line directed at everyone.
The Allow extension
Googlebot recognizes an extension to the robots.txt standard called Allow. This extension may not be recognized by all other search engine bots, so check with other search engines you're interested in to find out. The Allow line works exactly like the Disallow line. Simply list a directory or page you want to allow.
You may want to use Disallow and Allow together. For instance, to block access to all pages in a subdirectory except one, you could use the following entries:
    User-Agent: Googlebot
    Disallow: /folder1/
    Allow: /folder1/myfile.html
Those entries would block all pages inside the folder1 directory except for myfile.html.
If you block Googlebot and want to allow another of Google's bots (such as Googlebot-Mobile), you can allow access to that bot using the Allow rule. For instance:
    User-agent: Googlebot
    Disallow: /
    User-agent: Googlebot-Mobile
    Allow: /
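How Disallow and Allow combine in the folder1 example can be sketched as follows. The precedence used here is an assumption for the illustration (the longest matching rule wins, and Allow wins a tie), patterns are treated as plain path prefixes, and the function name isPathAllowed is hypothetical. Other crawlers may resolve conflicts differently.

```php
<?php
// Decide whether a path is allowed under one user-agent group's rules.
// Assumed precedence: the longest matching rule wins, and Allow wins a tie.
// Patterns are treated as plain path prefixes here (no wildcards).
function isPathAllowed(array $rules, string $path): bool
{
    $allowed = true;   // no matching rule means the path may be crawled
    $bestLength = -1;
    foreach ($rules as [$type, $prefix]) {
        if ($prefix === '' || strpos($path, $prefix) !== 0) {
            continue;  // an empty Disallow blocks nothing; non-matching rules are ignored
        }
        $length = strlen($prefix);
        $isAllow = ($type === 'allow');
        if ($length > $bestLength || ($length === $bestLength && $isAllow)) {
            $bestLength = $length;
            $allowed = $isAllow;
        }
    }
    return $allowed;
}

// The folder1 example from the post, written as [type, path] pairs.
$googlebotRules = [
    ['disallow', '/folder1/'],
    ['allow', '/folder1/myfile.html'],
];

var_dump(isPathAllowed($googlebotRules, '/folder1/myfile.html')); // bool(true)
var_dump(isPathAllowed($googlebotRules, '/folder1/other.html'));  // bool(false)
var_dump(isPathAllowed($googlebotRules, '/index.html'));          // bool(true)
```

The empty-prefix check also covers the "Disallow:" line with no path, which blocks nothing.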
__________________ Thanks & Regards Sabari... |
| |||
| Blocking Googlebot
If a webmaster wishes to restrict the information on their site that is available to Googlebot, or to another well-behaved spider, they can do so with the appropriate directives in a robots.txt file, or by adding the meta tag <META NAME="Googlebot" CONTENT="nofollow"> to the web page, which tells Googlebot not to follow the links on that page. Googlebot requests to web servers can be recognized by the string 'Googlebot' in their user-agent. Regards, R.Kamalakannan. |
| |||
| Webmaster Tools A problem which webmasters have often noted with the Googlebot is that it takes up an enormous amount of bandwidth. This can cause websites to exceed their bandwidth limit and be taken down temporarily. This is especially troublesome for mirror sites which host many gigabytes of data. Google provides "Webmaster Tools" that allow website owners to throttle the crawl rate. Regards, R.Kamalakannan. |