In this article, David Bolton shows how to create a simple Web robot that performs multiple parallel searches on a search engine then visits each Web site in the results and downloads that page. It uses the ActiveX Components provided by Internet Explorer 4 or 5.
Caveat: the code as originally written worked with AltaVista, but that site has changed probably a dozen times since, so it's about as likely to work as you are to bicycle up Mt Everest! Copernic (www.copernic.com) is an awesome (and free) search engine searcher, and they issue upgrades for specific engines on a regular basis. If you want to write one, play with Copernic. I rate it 11 out of 10. (No, I have no financial or other connection to them; I'm just a very, very satisfied customer.)
Although it sounds exotic, a bot (also known as a spider, intelligent agent, Web robot, crawler, robot, and so on) is simply a program that visits a number of Web sites. The best-known bots are, of course, the spiders used by various search engines to catalog new content. Take a look on the Web and you'll find lots of references and details. There's even a book on the subject, published by Microsoft Press: Programming Bots, Spiders and Intelligent Agents in Visual C++, by David Pallmann (ISBN 0-7356-0565-3). It's well worth getting if you're interested in writing bots and you don't mind wading through C++ code.
When you create a bot, be aware that material it gathers from the sites it visits may well be copyrighted, so be careful how you use it. Another thing to keep in mind: if your bot visits a Web site repeatedly, it might upset the site's owners, particularly if they carry paid advertising. For a similar reaction, just mention TiVo, or video recorders that automatically skip ads, to advertising people. Of course, if your bot does hammer a particular Web site and gets noticed, you might find that your IP address is no longer allowed access to that site (the dreaded 403!). In that case, a dialup account where the ISP gives you a dynamic IP address is probably a much better idea. I'll discuss the Robot Exclusion Standard, which is relevant here, later in the article.
The major problem with rolling your own bot isn't writing the code; it's the speed of your Internet link. For serious crawling, you need a permanent link, not dialup!
Microsoft has made life a lot easier for bot creators (and for virus and trojan authors!) by its usual practice of including a couple of ActiveX browsing objects in Internet Explorer (IE) since version 4. Actually, this reusable "engine" approach is to be admired, if only it weren't misused so much! If you use these objects, they take care of 99 percent of the difficult stuff like Internet access, firewalls, and using HTTP to download the pages' HTML. IE has a lot of functionality built in, and much of it is accessible. IE 3 had some objects in it, but I'm not sure whether these are usable in the same way.
If you're an ardent IE hater, take heart! You don't have to betray your principles or skip this article. When you use IE's objects, you never actually see IE; it's fully integrated into Windows.
WebBrowser is the name of the ActiveX object from IE. With Delphi 3, if you have IE installed on your PC, you must create the type library unit: go to Import ActiveX Controls in Delphi, select Microsoft Internet Controls, and click Install. You should now see TWebBrowser_V1, TWebBrowser, and TShellFolderViewOC on the ActiveX tab of the component palette. We'll be using TWebBrowser. Delphi 4 presents a problem due to changes in ActiveX handling between Delphi versions 3 and 4. A program that ran fine under Delphi 3 generates an EOLESysError under Delphi 4: "CoInitialize not Called." The type library Pascal source for the control is also twice as large in Delphi 5 as it was in Delphi 3. If you have Delphi 4, I suggest you either upgrade to Delphi 5 or find someone who has it and see whether the Delphi 5 shdocvw.pas works for you. All of the IE object functionality is contained in shdocvw.dll.
If you have Delphi 5 or 6, you don't need to do this. TWebBrowser has replaced the older THTML component and is the last component on the Internet Tab. You also get a demo of using this in the folder Demos/Coolstuff, and the Help file has some stuff on TWebBrowser, but curiously, Borland hasn't added the TWebBrowser_V1 component, even though it's in the source file. If you want to find out more about using TWebBrowser or TWebBrowser_V1, go to www.microsoft.com and do a search, or get Delphi 5 for the Help file!
TWebBrowser is a very easy component to work with. About half of its properties can be ignored, as they control the onscreen look of the visible IE interface, such as the toolbars or full-screen display. The Visible property determines whether we can see the browser window or not. In the final application, the user will never see it, but it can be useful for debugging.
The simplest way of using WebBrowser is to call the Navigate(URL) method, then handle the OnNavigateComplete2 event and use the Document property to access the downloaded page. Although there are two other events, OnDocumentComplete and OnDownloadComplete, that should help determine whether a Web page was successfully downloaded, I found it easier to process everything from OnNavigateComplete2. This triggers only when the browser has successfully moved to the specified URL; however, it's confused by multiple frames, so some extra care has to be taken, as you'll see.
WebBrowser provides you with several properties that simplify the task of extracting the data. These properties include links, anchors, applets, forms, frames, style sheets, and a few more. The only problem, especially when frames are used, is sorting the wheat from the chaff: which links are valid, and which might be ads or other services? In that case, the only way to do it is to scan the HTML and extract the relevant information. As each page is downloaded, it can be accessed directly.
As you might know, many search engines don't index frames. An HTML document can consist of one or more frames. Which frame holds the stuff you want? It's usually the first (and often only) one. The only reliable way to find out is to walk through each frame looking for text and then search this text for the strings that identify results. A much greater problem is the multiple triggering of the various DocumentComplete, DownloadComplete, and NavigateComplete2 events. Much debugging, cursing, and hair removal occurred before I realized what was happening. Ignore DownloadComplete. Instead, use either DocumentComplete or NavigateComplete2.
I did some more experiments and found that the best way was to use one of them and check whether a document was ready with VarIsEmpty(fwebbrowser.document). Then get the document from the browser (Document := fWebBrowser.Document), count the number of frames, and index through them. Frame 0 uses script.top, while other frames use frames.Item(index). Note the Basic-style array indexing. From the frame, check document.body, and extract the actual text from this. Be aware that CreateTextRange will cause an exception if there's no text, as in a banner object, so the call needs a try-except block to catch it. At this point, we have the complete HTML code, and all we do next is get the results and navigation links from it.
When I was testing this, I searched for the word Options on AltaVista. The page that was returned contained 85 links according to WebBrowser (counted via the document.links.Items(Index) property). Of these, just 10 are results; the rest come from ads, banners, and stuff like that, as well as the navigation links. The layout is different for each search engine's results, and an HTML analysis object would make a good topic for a future article. I've stuck with AltaVista, as other search engines lay things out in their own way. To keep the code short, I've used two text strings, "AltaVista found" and "Result Pages", to mark the start and stop of the result jumps. All URLs (look for "href=") that occur between these two strings and don't contain any of the strings in the ignore text (like "jump.altavista") are used.
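The marker-based extraction can be sketched as follows. This is a minimal illustration in Python rather than Delphi, so it can be run without IE; the function name extract_result_links and the regular expression are my own assumptions, not the article's code. It keeps only the href values found between the start and stop markers and drops any that contain an ignore string.

```python
import re

# Hypothetical helper: the name and regex are illustrative assumptions.
HREF_PATTERN = re.compile(r'href\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)

def extract_result_links(html, start_marker, stop_marker, ignore_texts):
    """Return href URLs that appear between the two marker strings."""
    start = html.find(start_marker)
    if start == -1:
        return []                      # no results section on this page
    stop = html.find(stop_marker, start)
    if stop == -1:
        stop = len(html)               # no stop marker: scan to the end
    links = []
    for url in HREF_PATTERN.findall(html[start:stop]):
        # Skip redirect, ad, or navigation links named in the ignore list.
        if any(text in url for text in ignore_texts):
            continue
        links.append(url)
    return links
```

Called with "AltaVista found", "Result Pages", and ["jump.altavista"], it would mimic the filtering described above; a different engine would need its own markers and ignore list.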
Everything centers on three components: TSearchEngine, TFetchResult, and TResultHandler. A list of TSearchEngines is constructed at program start, with each object holding the details needed to use that engine and an instance of the WebBrowser object. The search string is passed to this list, and each component then starts a query with its own engine. Result links are returned to a central list, with access to this list serialized through a simple guard variable (fbusy) to prevent two list additions from occurring at the same time. If adding to the list were the only operation, this could be avoided, but I do a list search as well and this takes time, so the guard variable must be used.
For a typical search engine like AltaVista, the query is processed by a cgi-bin script called query, with various parameters added. For the search string "search text" (quotes included), they look something like this: pg=q&kl=en&q=%22search text%22&stq=20. I understand these as pg=q (it's a query), kl=en (the language is English), q (the search text, with the quotes encoded as %22), and stq=20 (results start with the 20th result).
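As a rough illustration of assembling such a query string, here is a short Python sketch. The function name build_query_url and the full base URL passed to it are assumptions for the example; only the parameter names pg, kl, q, and stq come from the query shown above. Note that the standard library encodes the space as "+", which servers treat the same as an encoded space.

```python
from urllib.parse import urlencode

def build_query_url(base_url, search_text, start=0, language="en"):
    """Build a search URL using the pg/kl/q/stq parameters described above."""
    params = [
        ("pg", "q"),                    # it's a query
        ("kl", language),               # result language
        ("q", '"%s"' % search_text),    # quoted phrase; quotes become %22
        ("stq", str(start)),            # index of the first result wanted
    ]
    return base_url + "?" + urlencode(params)

# The base URL here is a hypothetical example, not a working endpoint.
url = build_query_url("http://www.altavista.com/cgi-bin/query", "search text", start=20)
```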
The class TSearchEngine has methods PostQuestion, ExtractResultLinks, and NavigateNextResultPage and sets the WebBrowser event handlers to act accordingly. Most of the application's time is spent doing nothing more than waiting for the WebBrowser events to trigger. I've included a simple state mechanism so that users of this class can determine what's happening, and it's able to keep track of what it's supposed to be doing.
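A simple state mechanism of this kind might look like the sketch below. It is written in Python with made-up state names and a made-up class, EngineTracker, so treat it as one possible shape rather than the article's TSearchEngine. The on_page_loaded method stands in for whatever the browser's navigation-complete event handler would call, and the limit parameter assumes a cap on collected result links.

```python
from enum import Enum, auto

class EngineState(Enum):
    # State names are illustrative assumptions.
    IDLE = auto()
    WAITING_FOR_RESULTS = auto()
    EXTRACTING = auto()
    DONE = auto()

class EngineTracker:
    """Tracks what one search engine object is currently doing."""

    def __init__(self, limit=100):
        self.state = EngineState.IDLE
        self.limit = limit          # maximum number of result links to keep
        self.links = []

    def post_question(self):
        # The query has been sent; now we wait for the browser event.
        self.state = EngineState.WAITING_FOR_RESULTS

    def on_page_loaded(self, page_links, has_next_page):
        # Called when a results page has arrived.
        self.state = EngineState.EXTRACTING
        room = self.limit - len(self.links)
        self.links.extend(page_links[:room])
        if has_next_page and len(self.links) < self.limit:
            self.state = EngineState.WAITING_FOR_RESULTS  # next page requested
        else:
            self.state = EngineState.DONE
```

Because every transition happens inside an event handler, a caller can inspect state at any time to see whether the engine is still waiting or has finished.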
For each results link found, a TFetchResult component is created and added to a list of fetches. Every instance of this class has its own WebBrowser component and event handler code. I used a list to simplify tracking all fetches. I use a timeout period (default 240 seconds), and every 30 seconds, the entire list of fetches is scanned and checked for timeouts. One global timer is lighter on Windows resources than a timer for each object. As it's also difficult to exactly determine when a page has been fully downloaded, this timeout provides a tidier way of doing it.
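The single-timer sweep can be sketched like this in Python. The Fetch class and sweep_timeouts function are hypothetical names for illustration; the 240-second default matches the timeout mentioned above. The current time is passed in explicitly so that the logic can be exercised without waiting.

```python
import time

class Fetch:
    """One in-progress page download (hypothetical stand-in for TFetchResult)."""

    def __init__(self, url, started_at, timeout=240.0):
        self.url = url
        self.started_at = started_at
        self.timeout = timeout
        self.timed_out = False

def sweep_timeouts(fetches, now):
    """Mark and return every fetch that has exceeded its timeout."""
    expired = [f for f in fetches
               if not f.timed_out and now - f.started_at > f.timeout]
    for fetch in expired:
        fetch.timed_out = True
    return expired

# A single global timer would call this every 30 seconds, for example:
#   sweep_timeouts(all_fetches, time.monotonic())
```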
If a fetch succeeds, the HTML contents of the document are saved out to a results folder. I haven't included the graphics in order to keep the code shorter. The filename is derived from the frame name after removing unacceptable characters.
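Deriving a filename by stripping unacceptable characters might be done along these lines. This Python sketch uses an assumed whitelist (letters, digits, dot, underscore, hyphen) and an assumed fallback name; the article does not say which characters its code removes.

```python
import re

def safe_filename(frame_name, extension=".htm"):
    """Turn a frame name or URL into a name that is safe to save under."""
    # Replace each run of disallowed characters with one underscore.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", frame_name).strip("._")
    # Fall back to a fixed name if nothing usable is left.
    return (cleaned or "page") + extension
```

For instance, a frame named after its URL would lose the "://", "/", and "?" characters, leaving a name that Windows will accept.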
In a sense, this is a multithreaded application with a fair degree of parallelism, although there are no threads used explicitly. Each of the WebBrowser components is used independently, one in each TSearchEngine and one in each TFetchResult. There seems to be no upper limit to the number of parallel searches, although the overall speed is obviously subject to the bandwidth of the Internet link.
If you were writing a retail product, you'd probably create a config file with some way of defining, for each engine it can handle, the query format and the method of extracting the result links. Every search engine produces results in a different format and might well change that format whenever it likes.
There's also a high probability that two or more searches will yield the same URL. I get around that by searching the list of current fetches. Because this list holds only the fetches currently in progress, I suggest keeping a second list of every URL fetched so far. Before fetching the results from a particular URL, a quick search is done of this list to make sure that the URL hasn't already been fetched. If it hasn't, the URL is added to the list and the fetch proceeds.
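A separate record of every URL already seen could be as simple as the following Python sketch. FetchRegistry and claim are invented names, and a set is used in place of a searched list; URLs are compared exactly (after trimming whitespace), which is an assumption, since two different URLs can still point at the same page.

```python
class FetchRegistry:
    """Remembers every URL that has been handed out for fetching."""

    def __init__(self):
        self._seen = set()

    def claim(self, url):
        """Return True if the URL is new (and record it), False if a duplicate."""
        key = url.strip()
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

registry = FetchRegistry()
if registry.claim("http://www.example.com/page.html"):
    pass  # start the fetch here
```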
As always with search engines, you need to choose your search text carefully to avoid getting millions of results. Each TSearchEngine is given a limit value, hard-coded as 100.
One last point: be careful with the use of Application.ProcessMessages when you use WebBrowser. I've avoided it except where fetches are added, where the code waits if the resultProcessor is busy. I don't know if it's the way events are sent, but I found that setting a label and calling Application.ProcessMessages could force the same event to happen again. This occurred more in the debugger, but if you get odd behavior, comment out the ProcessMessages calls.
I think this code is somewhat rough and ready, but it works very well. Possibly the least satisfactory part is the method for detecting which frame has text. I found it hard to work out, other than by using exception catching, so if you run it in the debugger, expect a few trapped exceptions. It's also not a very polished application; there could be a better user interface, for example. But it does what I intended: it performs several parallel searches and downloads at the same time.
Robots not welcome
A standard has emerged called the Robot Exclusion Standard, which Web sites should use to specify trees that shouldn't be indexed by spiders because the pages change frequently or contain executables or non-text files.
I read that the latest estimates suggest that there are more than 800 million Web pages in existence, and that most of the search engines have indexed fewer than half of the total among them. The biggest search engines only have indexes for 150-200 million pages. So anything that limits the "noise" is welcome, and one way of doing this is through this standard. Just place a robots.txt text file in the root (for example, www.altavista.com/robots.txt) like the following:
User-agent: *
Disallow: /cgi-bin/
This stipulates that all Web robots shouldn't look in /cgi-bin. Of course, your robots don't have to heed this, as it's meant mainly for Web spiders, but it's bad manners to ignore it, and it makes sense to go along with it unless you really need to see what's there.
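If you do want your bot to honor the standard, checking a URL against the rules takes only a few lines. The sketch below uses Python's standard urllib.robotparser with a rule set that keeps all robots out of /cgi-bin; the helper name allowed, the bot name "MyBot", and the example.com URLs are assumptions for illustration. A real bot would download the site's robots.txt first instead of using a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Rules equivalent to "no robot may look in /cgi-bin".
ROBOTS_TXT = """User-agent: *
Disallow: /cgi-bin/
"""

def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

blocked = allowed(ROBOTS_TXT, "MyBot", "http://www.example.com/cgi-bin/query")
open_page = allowed(ROBOTS_TXT, "MyBot", "http://www.example.com/index.html")
```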
How, you might wonder, do Web site people know when a robot has visited them? Quite simply: it's the User-Agent name in the header passed to the Web server. This is most often Mozilla, but it might be Scooter (the AltaVista spider) or others. Mozilla (for "Mosaic Killer") was the working name for Netscape, and IE has also adopted it. It might be possible to change this through the fifth parameter of the TWebBrowser Navigate method, which holds additional headers. Or you could write your own HTTP interface instead of using IE, and then specify a name. Given that what we're doing might not be very popular with search engines, it's probably best to do it behind the anonymity of a Mozilla browser! Additionally, you might want to add a 10-second delay between jumping pages. That way, it will at least look as if a human operator, not a program, is at work.
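For the write-your-own-HTTP route, setting the User-Agent name and spacing out requests could look like this Python sketch. The function names, the agent string "Mozilla/4.0 (compatible)", and the example URLs are assumptions; nothing here is sent over the network until the request is actually opened.

```python
import urllib.request

def build_request(url, user_agent="Mozilla/4.0 (compatible)"):
    """Create an HTTP request that announces the given User-Agent name."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def schedule(urls, delay=10.0):
    """Pair each URL with the number of seconds to wait before requesting it."""
    return [(index * delay, url) for index, url in enumerate(urls)]

request = build_request("http://www.example.com/")
plan = schedule(["http://www.example.com/1", "http://www.example.com/2"])
# To fetch for real: urllib.request.urlopen(request), sleeping between pages.
```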