The Deep Web

The Deep Web has been estimated to contain “500 times as much information as traditional search engines ‘know about’” (Kay, 2005), with Google’s index of 8.2 billion web pages described as “the tip of the iceberg”: the top sixty Deep Web providers alone were said to account for over 840 billion pages, a figure that is growing constantly. The Deep Web exists because of the un-indexable structure of its sites. A site may be consciously protected by the owner of the information or system, or it may only be navigable through search queries; either of these will stop a search engine’s indexing bot in its tracks. The Kay (2005) reference I use below is an example. I first found it by searching the University Library, which has protected access and so cannot be indexed by search engines (unless a search engine has its own system login, which I doubt) and is also driven by search queries. Because a search engine’s bot follows links rather than submitting search queries, such content cannot be indexed. However, when I repeated the search using Google, I found the same article as a static page on the publisher’s web site.
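To make that mechanism concrete, below is a minimal link-following crawler sketch in Python (the seed URL is a placeholder, not a real target). It only enqueues URLs found in anchor tags, so a page reachable solely by submitting a search form is never visited, which is exactly why query-driven content stays out of traditional indexes.

    # Minimal link-following crawler sketch. It discovers only pages that
    # are linked via <a href="..."> anchors; nothing here fills in <form>
    # fields or submits queries, so query-driven Deep Web pages are never
    # enqueued and never indexed. The seed URL below is hypothetical.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=10):
        seen, queue = set(), deque([seed])
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except Exception:
                continue  # protected or unreachable page: the bot simply moves on
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                queue.append(urljoin(url, link))
        return seen

    if __name__ == "__main__":
        print(crawl("http://example.com/"))  # hypothetical seed URL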

Personally, I do not consider the existence of a Deep Web a concern for commercial web sites in itself, though the two can be related, particularly when revenue of some form is the site’s objective. For example, the LexisNexis site (http://www.lexisnexis.com) is, according to CompletePlanet (2010), one of the world’s largest fee-based Deep Web sites, providing high-value specialist information. In this kind of situation it is commercially important that search engines are prevented from indexing even a portion of the Deep Web content. The same applies to academic sites such as the University of Liverpool Library, where licensing issues prevent open publication of their content, even though the student portal is itself used to search the Deep Web.
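One common way a site exercises this kind of conscious protection is the Robots Exclusion Protocol. The sketch below, using Python’s standard urllib.robotparser, shows how a compliant crawler checks a robots.txt policy before fetching anything; the policy and paths shown are hypothetical, not LexisNexis’s actual file.

    # Sketch of how a compliant crawler honours a site's "conscious
    # protection". The robots.txt content below is a hypothetical example,
    # not the real LexisNexis policy.
    from urllib.robotparser import RobotFileParser

    sample_robots_txt = """
    User-agent: *
    Disallow: /research/
    Disallow: /api/
    """.splitlines()

    rp = RobotFileParser()
    rp.parse(sample_robots_txt)

    for path in ("/", "/research/case-law"):  # hypothetical paths
        url = "http://www.lexisnexis.com" + path
        print(path, "->", "fetch" if rp.can_fetch("ExampleBot", url) else "skip")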

There are many large sites, both commercial and academic, that offer free content, but the way that content must be navigated prevents indexing. We are therefore already seeing the emergence of search engines that work in a different way, acting as a search query agent for the Deep Web, with CompletePlanet.com claiming to be the leader. Where intellectual property and revenue are not at issue, it is in the interest of content providers, academic and commercial alike, to make their content easy to access, and according to Wright (2009), Google has announced a Deep Web strategy in which its bots will apply search queries to sites they cannot index traditionally, based on what they estimate to be the relevant content behind each search routine. This is a huge challenge given the size and search complexity of the Deep Web, but it is the direction search engines must take to reach this valuable data.
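As a rough illustration of this “surfacing” approach, the Python sketch below submits candidate query terms to a site’s search form and fetches the result pages that a purely link-following bot would never reach. The form URL and parameter name are hypothetical placeholders, and a real crawler would derive its candidate terms from the page’s visible content, as Wright (2009) describes, rather than from a fixed list.

    # Sketch of "surfacing": instead of only following links, the crawler
    # fills in a site's search form with candidate terms and retrieves the
    # result pages for indexing. Form URL and parameter name are hypothetical.
    from urllib.parse import urlencode
    from urllib.request import urlopen

    def surface(form_url, param, terms):
        """Fetch the result page each candidate query term would produce."""
        pages = {}
        for term in terms:
            url = form_url + "?" + urlencode({param: term})
            try:
                pages[term] = urlopen(url, timeout=5).read()
            except Exception:
                continue  # the form rejected the query or the site is unreachable
        return pages

    # Illustrative only: in practice the candidate terms would be estimated
    # from the target page's content.
    results = surface("http://example.com/search", "q", ["deep web", "indexing"])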

References

CompletePlanet (2010) Largest Deep Web Sites [Online]. Available at http://aip.completeplanet.com/aip-engines/help/largest_engines.jsp (Accessed 5 September 2010).

Deitel, P.J. & Deitel, H.M. (2010) Internet & World Wide Web: How To Program (4th edn). Pearson Prentice Hall.

Kay, R. (2005) ‘Deep Web’, Computerworld [Online]. Available through the British Library Document Supply Centre Inside Serials & Conference Proceedings database via the University of Liverpool Online Library and at http://www.computerworld.com/s/article/107097/Deep_Web (Accessed 5 September 2010).

Wright, A. (2009) ‘Exploring a “Deep Web” That Google Can’t Grasp’, The New York Times [Online]. Available at http://www.nytimes.com/2009/02/23/technology/internet/23search.html (Accessed 5 September 2010).