A Million Search Apps

December 11, 2009

Many Roads to the Deep Web

Filed under: Deep Web,Search — Shree Pragada @ 2:51 pm

Semantifi, Google, Microsoft, Kosmix, DeepPeep, DeepDyve, Socrata, Infochimps, Data.gov and many others are trying to search the Deep Web. Each brings different technologies and tools and, most importantly, a very different approach. These approaches fall broadly into three categories.

1. Consolidate Datasets on the Web

Technically, this is not an approach to searching the Deep Web, but these efforts are a valuable first step in organizing datasets and making them available online. They include government repositories such as Data.gov, SEC.gov, SFData.gov and the DC Data Catalog, as well as sites such as Infochimps.org and Socrata.com, and they are a significant force behind Deep Web search. Without data search capabilities, these websites remain data catalogs only. Socrata goes one step further and lets users browse datasets, but browsing thousands of rows can be challenging.

2. Look / Peek through Web Forms or APIs to Search Underlying Databases

This approach is driven by the belief that the Deep Web is simply the Web behind HTML forms, when in fact it is a lot more than that. If you are looking to buy a car, you might visit Edmunds.com and fill in the search form by selecting the Manufacturer, Model, Price Range, Zip Code, etc. The information entered in the form is used to compose a database query, which is submitted to one or more databases, and the results are presented back to you as an HTML page. Because this page is created on demand, current search engines cannot see it.
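To make the form-to-query step concrete, here is a minimal sketch in Python. The `cars` table, its columns, and the `build_query` and `render_results` helpers are assumptions made up for illustration; they are not how Edmunds.com or any particular site works.

```python
import sqlite3

def build_query(form):
    """Translate submitted form fields into a parameterized SQL query."""
    clauses, params = [], []
    if form.get("make"):
        clauses.append("make = ?")
        params.append(form["make"])
    if form.get("model"):
        clauses.append("model = ?")
        params.append(form["model"])
    if form.get("max_price") is not None:
        clauses.append("price <= ?")
        params.append(form["max_price"])
    sql = "SELECT make, model, price FROM cars"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY price", params

def render_results(rows):
    """Build the on-demand HTML page that a crawler never sees."""
    items = "".join(
        f"<li>{make} {model}: ${price}</li>" for make, model, price in rows
    )
    return f"<ul>{items}</ul>"

# Hypothetical database sitting behind the search form.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (make TEXT, model TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO cars VALUES (?, ?, ?)",
    [("Honda", "Civic", 18000), ("Honda", "Accord", 24000), ("Toyota", "Camry", 23000)],
)

form = {"make": "Honda", "max_price": 20000}  # what the user selected
sql, params = build_query(form)
page = render_results(conn.execute(sql, params).fetchall())
print(page)
```

The page exists only after a form submission triggers the query, so a crawler that only follows links has nothing to index.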

Google’s approach to the Deep Web is to find HTML forms, send input to these forms, and index the resulting HTML pages. The approach is fully automated, scales easily, and fits nicely with Google’s crawl infrastructure. For further insights, read Alon Halevy’s VLDB paper published in 2008. Kosmix takes a similar approach but taps into sources through API calls instead of web forms. DeepPeep likewise taps into web forms to search the underlying databases.
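As a rough illustration of "sending input to forms", the sketch below enumerates every combination of candidate values for a form's inputs and produces the GET URLs a crawler could then fetch and index. This is a naive exhaustive enumeration, not Google's actual method; the `surface_urls` function, the example.com URL, and the candidate values are all hypothetical.

```python
from itertools import product
from urllib.parse import urlencode

def surface_urls(action_url, candidate_values):
    """Yield one GET URL per combination of candidate form input values."""
    names = sorted(candidate_values)  # fixed order keeps the URLs deterministic
    for combo in product(*(candidate_values[name] for name in names)):
        yield action_url + "?" + urlencode(dict(zip(names, combo)))

# Hypothetical form with two inputs and a few guessed values for each.
form_inputs = {"make": ["Honda", "Toyota"], "zip": ["10001", "94105"]}
urls = list(surface_urls("https://example.com/search", form_inputs))
for url in urls:
    print(url)
```

The number of URLs grows multiplicatively with the number of inputs and values, so any practical system has to be selective about which combinations it submits.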

This approach offers some benefits: given its simplicity, it can easily be scaled to a large number of web forms or APIs. It also has severe limitations in the scope of content that can be searched and the level of analytics that can be conducted. First, it leaves out a huge portion of the Deep Web, namely datasets that have no form or API in front of them, as is the case with millions of government, finance, research and other datasets on the web. Second, forms and APIs offer only a limited window into the underlying databases, so they allow simple queries but not advanced analytics.

3. Make Datasets Searchable

Tim Berners-Lee explains in this TED video that the real value of data lies in searching it directly. This, he envisions, will drive the next innovation on the web.

Semantifi is pursuing this vision. It uses pioneering technology to search databases: it can crawl any type of database, including XML data sources, index multi-terabyte databases, search based on relevance, and present automatic charts and tables. Semantifi is also built as a marketplace for “Data Search Apps”, along the lines of the App Store, where:

  • Anyone can publish “Data Search Apps”
  • Users can search at Semantifi.com or on their website
  • Apps can be shared publicly or within groups
  • Apps can be FREE or FEE based
  • Revenue is shared from commercial Apps
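To give a feel for what searching a dataset directly means, as opposed to going through a form, here is a toy sketch that indexes every row of a small CSV dataset and ranks rows by how many query words they match. It is not Semantifi's technology; the dataset, the column names, and the `build_index` and `search` functions are invented for illustration.

```python
import csv
import io
import re
from collections import defaultdict

# Hypothetical dataset; in practice this would be a published file or table.
DATA = """agency,program,budget_musd
Energy,Solar Research,120
Energy,Grid Modernization,300
Health,Vaccine Research,450
"""

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(rows):
    """Map each token to the set of row ids whose values contain it."""
    index = defaultdict(set)
    for row_id, row in enumerate(rows):
        for value in row.values():
            for token in tokenize(value):
                index[token].add(row_id)
    return index

def search(query, rows, index):
    """Return matching rows, best match first (most query tokens matched)."""
    scores = defaultdict(int)
    for token in tokenize(query):
        for row_id in index.get(token, ()):
            scores[row_id] += 1
    ranked = sorted(scores, key=lambda r: (-scores[r], r))
    return [rows[r] for r in ranked]

rows = list(csv.DictReader(io.StringIO(DATA)))
index = build_index(rows)
hits = search("energy research", rows, index)
for hit in hits:
    print(hit["agency"], hit["program"], hit["budget_musd"])
```

Because the index is built over the data itself, any row can be found without a form or API sitting in front of the dataset.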

All three paths add value. We are big fans of the Data Catalogs, Government Transparency Initiatives and others who are organizing datasets and making them available on the Web. Google, Kosmix, DeepPeep, etc. have demonstrated that we can get “answers to more search questions” by indirectly searching databases via Forms / APIs. Semantifi is focusing on directly searching databases and has built initial Apps to demonstrate that anyone can build data search apps and that the Deep Web can be wired one dataset at a time by a community of Publishers.
