Selecting a Task Type

As mentioned earlier in this help file, an extraction project has two primary functions. The first is getting the pages to extract, via a download, form submit, or crawler task. The second is extracting the data using a datapage. At first, creating a datapage may seem like the most difficult part to understand, but as you gain experience you will find that the process is more or less the same each time.

 

Figuring out how to navigate to and download the web pages that contain the data you want to extract is the most challenging aspect of many, if not most, projects. This surprises most people, because the opposite is true of the old browse-click-cut-and-paste method (the obvious alternative to automated web extraction). So, why can’t automatic downloading or form submitting be as easy as using your web browser? Well, first of all, rest assured that we are working on that. But the real reason, in a nutshell, is that every site is structured differently. Here are some key differences in site architecture that make automatic navigation harder than it first appears:

·      Sites are based on different, constantly evolving technologies (ASP, JSP, PHP, ASP.NET, etc.). Each of these platforms has its own design patterns.

·      Some sites have a unique URL for each individual page, so that you can copy the URL from one browser, paste it into another, and bring up the same page (these are the easiest to navigate automatically). Other sites use things like cookies and server-side or client-side session variables to store where you are on the site and what data to send you (these are harder to navigate automatically).

·      Not all links are created equal. Some links (the good ones) actually point to another web page. Others call JavaScript functions that launch a popup window or submit a form (ASP.NET is notorious for the latter).

·      Frames. Some sites have them, some sites don’t. Since a frame is just one web page embedded within another, we need to make sure we are dealing with the page that actually contains the data we need.

·      Some sites, such as commerce sites, are very motivated to make their content easy to crawl so that it will appear in search engines. Other sites, such as many real estate and government sites, are motivated to make their content very difficult to crawl, and will actually put development time into making it hard to navigate automatically. Whois and travel databases are other prime examples of the latter.
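
To make the link and frame differences above more concrete, here is a minimal sketch in Python. It is not part of WSP and is not how WSP works internally; it only illustrates how you might survey a page by hand before choosing a task. The class name PageSurvey, the function survey, and the sample HTML are all hypothetical, made up for this illustration. The sketch sorts links into those that point to another page and those that rely on JavaScript, and lists any frames it finds.

```python
from html.parser import HTMLParser


class PageSurvey(HTMLParser):
    """Hypothetical helper: sorts links and frames found in a page."""

    def __init__(self):
        super().__init__()
        self.plain_links = []   # links that point to another web page
        self.script_links = []  # links that depend on JavaScript
        self.frames = []        # pages embedded via frame or iframe

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            href = (attrs.get("href") or "").strip()
            # A javascript: href or an onclick handler means the link
            # cannot be followed by simply requesting its URL.
            if href.lower().startswith("javascript:") or "onclick" in attrs:
                self.script_links.append(href)
            elif href and not href.startswith("#"):
                self.plain_links.append(href)
        elif tag in ("frame", "iframe") and attrs.get("src"):
            self.frames.append(attrs["src"])


def survey(html):
    """Parse an HTML string and return the populated PageSurvey."""
    parser = PageSurvey()
    parser.feed(html)
    return parser


# Made-up sample page containing one link of each kind and one frame.
SAMPLE = """
<html><body>
<a href="/listing?id=42">Listing 42</a>
<a href="javascript:__doPostBack('grid','Page$2')">Next</a>
<a href="#" onclick="openPopup()">Details</a>
<iframe src="/results.html"></iframe>
</body></html>
"""

result = survey(SAMPLE)
print("Plain links: ", result.plain_links)
print("Script links:", result.script_links)
print("Frames:      ", result.frames)
```

A page where most links land in the script list, or where the data sits inside a frame, is a hint that it will take more work to navigate automatically than a page made up of plain links.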

 

Okay, so these are challenges, but they can be overcome; Velocityscape Consulting deals with them every single day. We will share what we have learned from that experience so that you can get the most out of WSP. First, let’s take a look at task selection.

 

In a web automation package, there are three types of tasks: Download, Form, and Crawler. Which one you should use depends on the site architecture, your data quality requirements, and the tradeoff between development time and efficiency of execution.