A Million Search Apps

January 28, 2011

Curation is the New Search is the New Curation

Filed under: Uncategorized — Shree Pragada @ 11:35 am

Here is an exciting post by Paul Kedrosky on how curation will be key to the future of search. In the past, when the web was only a couple of million pages, one company like Yahoo could curate a directory of the web. As the web grew to billions of pages, no single company could keep up. Now algorithmic search is also failing to deliver, and the solution could be curation again. With billions of pages, it will take more than one company to curate the web. It will take a large community curating small, focused areas that together cover the vast majority of the web, just as Semantifi envisions a future web powered by millions of community-built and community-curated search apps covering the broad spectrum of the web.

Here is the link to the complete article:

http://paul.kedrosky.com/archives/2011/01/curation_is_the.html?utm_source=feedburner&utm_medium=email&utm_campaign=Feed:+InfectiousGreed+(Paul+Kedrosky’s+Infectious+Greed)

In the beginning there was curation, and it was good. People found interesting things on the web, created directories of those things, and then you found what you were looking for inside those curated lists. That was the origins of the original lists and directories, from Yahoo on outward.

But then that got too hard. The web got bigger faster than anyone could keep track. Curation steadily gave way to algorithmic search, which at first was just spidering of the web, and then more intelligent spidering with keywords. And then it became Google, with ranking algorithms that placed websites into a hierarchies of keyword-related relevance based on things like authoritativeness, as defined, in part, by links from other sites — by those original hand-curated lists, ironically enough.

That model has now begun to give way too. Any algorithm can be gamed; it’s only a matter of time. The Google algorithm is now well and thoroughly gamed, as I first wrote about late last year, and as now become an entire genre of web writing, and that has grown to include my friend Vivek Wadhwa’s smart piece on TechCrunch not long ago. Google has, they argue, lost its mojo — which is true, but it’s more interesting and complicated than that.

What has happened is that Google’s ranking algorithm, like any trading algorithm, has lost its alpha. It no longer has lists to draw and, on its own, it no longer generates the same outperformance — in part because it is, for practical purposes, reverse-engineered, well-understood and operating in an adaptive content landscape. Search results in many categories are now honey pots embedded in ruined landscapes — traps for the unwary. It has turned search back into something like it was in the dying days of first-generation algorithmic search, like Excite and Altavista: results so polluted by spam that you often started looking at results only on the second or third page — the first page was a smoking hulk of algo-optimized awfulness.

There are two things that can happen now. (Okay, three. We could stop search, which won’t happen.). We could get better algorithms, which is happening to some degree, with search engines like Blekko and others. Or, we could head back to curation, which is what I see happening, and watch new algos emerge on top of that next-gen curation again. Think of Twitter as a new stab at curation, but there are plenty of other examples.

Yes, that sounds mad. If we couldn’t index 100,000 websites in 1996 by hand, how do we propose to do 234-million by hand today?

The answer, of course, is that we won’t — do them all by hand, that is. Instead, the re-rise of curation is partly about crowd curation — not one people, but lots of people, whether consciously (lists, etc.) or unconsciously (tweets, etc) — and partly about hand curation (JetSetter, etc.). We are going to increasingly see nichey services that sell curation as a primary feature, with the primary advantage of being mostly unsullied by content farms, SEO spam, and nonsensical Q&A sites intended to create low-rent versions of Borges’ Library of Babylon. The result will be a subset of curated sites that will re-seed a new generation of algorithmic search sites, and the cycle will continue, over and over.

In short, curation is the new search. It’s also the old search. And it’s happening again, and again. [-]

[Update] A friend points out in IM that all of this makes Yahoo mothballing Delicious, a directory of curated lists, more than a little mistimed. And it’s made pointed and ironic too when you look at what the #2 most bookmarked link is on Delicious right now: Google’s weakening search results

June 22, 2010

Government Transparency: Beyond just putting Government Datasets on the Web to making them Searchable!

Filed under: Uncategorized — Shree Pragada @ 1:29 pm

Returning from Gov 2.0 Expo in Washington got me more interested in this budding Government Transparency phenomenon. At this time it is mostly a lot of buzz and very little tangible progress. Here are my two cents on what it takes to really catalyze Government Transparency:

President Obama campaigned on the vision of Government Transparency and made it a priority for his administration. Last year his CIO, Vivek Kundra, launched DATA.GOV to begin putting government datasets on the Web.

Making government datasets available on the Web as “raw datasets” is a good first step. However, raw datasets are not accessible to the millions of citizens they are intended for, because manipulating raw data can be very time-consuming and technically challenging. For government datasets to be truly transparent, they must be searchable on the Web.

We understand the administration’s goal to help citizens become better informed and to better engage them in public discourse. Semantifi can help with this goal by enabling citizens to socialize government data and insights using its “free data search platform” at www.SEMANTIFI.com.

Unlike Google, Bing and others that search Web Pages, Semantifi searches structured data and engages the community to publish datasets to make them searchable. Users can ask simple questions to search all published datasets, get meaningful answers in the form of charts & tables and share their observations & insights.

Launched earlier this year, Semantifi.com already hosts many popular government datasets covering Government Spending, Recovery, US Economic Metrics, SEC Filings, Earmarks, US Aid, FDIC, US Census, etc. At Semantifi.com, users can:

  • Explore Recovery data with questions like “Jobs recovered in New York and California”, “Award Amount by Recipient State and Performance District”
  • Discover government spending with questions like “Federal Funding to California versus New York”, “Federal funding amount for quarter 1 year 2008”
  • Research over 22,000 US Economic Metrics with questions like “Unemployed and Housing Starts for last 60 months”, “Business Loans between Jan 2005 and Dec 2009”
  • Investigate SEC Filings of publicly traded companies asking questions like “Sales and Income of Amazon and Best Buy”, “Net Sales of companies with Market Capital over 1 billion”

We believe Semantifi can catalyze the administration’s vision of government transparency by making a vast number of government databases searchable, using pioneering search technology and engaging a community of citizen publishers.

February 11, 2010

Data is the Future of Web: Latest Validation from Prominent Investors?

Filed under: Deep Web,Search — Shree Pragada @ 10:08 am

Factual is the latest to join Semantifi and others in the chorus that the “Future of the Web is Data”. Founded by Gil Elbaz of the successful Applied Semantics and backed by many prominent investors including Marc Andreessen, Ben Horowitz, Esther Dyson, Bill Gross, Marten Mickos, Scott Kurnit, and others, it is a clear validation of the market and a sign that the market is here.

Socrata is another startup in this space that preceded Factual. The two are quite similar in concept, and both lack technology to search datasets the way Semantifi does. Instead, they let users browse data, primarily tabular data, page by page. While this is a lot of fun for simple and short datasets, the approach can be limiting for larger datasets.

Here is the complete article from VentureWire.

Google Vet Grabs $1M Seed For Open-Source Data Co. Factual

By Tomio Geron  2/5/2010

Gil Elbaz is a serial entrepreneur who is closely watched because of his last success.

He co-founded Applied Semantics, which was acquired by Google for about $100 million in April 2003 and later became the technology behind Google Adsense. Elbaz joined Google with the acquisition and then left in 2007.

Now he has founded Factual Inc., an open-source data technology company.

The company has raised about $1 million in angel financing from prominent investors including Marc Andreessen and Ben Horowitz through their firm Andreessen Horowitz, Founder Collective and Miramar Venture Partners and individual investors Esther Dyson, Bill Gross, founder of Idealab; Danny Rimer, a partner at Index Ventures; Marten Mickos, former chief executive of MySQL; Richard Rosenblatt, CEO of Demand Media; Scott Kurnit, founder of About.com; Thomas Lehrman, founder of Gerson Lehrman Group; and Tom Unterman, founder of Rustic Canyon Ventures. Valuation was not disclosed.

Factual is developing an open-source version of data sources for developers to tap into. For example, if a company has an application that uses restaurant information, it can connect to Factual’s application programming interface and access that data, rather than having to build that data itself or crowdsource it or license it from someone else.

In return, developers and their users would add or fix data in Factual’s databases. Currently Factual is offering this data free, but eventually could charge for parts of it.

“I saw the concept of open data and thought it was going to be extremely important and maybe change the landscape significantly,” Elbaz said. “Developers often have an idea for Web sites or iPhone apps but often they’re data-driven and require access to some form of data aggregation.”

Elbaz cited as an example, an iPhone developer who wants to create a pizza restaurant search. Rather than different developers creating separate databases for this information they could tap into Factual. This allows developers to focus on what they do best–creating great applications.

“The question developers have to ask is do you want data to be something completely proprietary or to share in the benefits of the wider community to curate and clean crowdsourced data?” he said. “Like open-source software, it doesn’t make sense for everyone to start from scratch and maintain their own version if the community already can share in the collective efforts to build something that can be used by everyone.”

Factual already has a partnership with Demand Media’s Livestrong Web site to provide data on physicians and hopes to expand to other categories.

The service is different from Wikipedia, in that Factual enables people to select any fact and drill down to see all the comments users from different sources have made on that particular fact.

Another start-up that is probably more similar is Metaweb Technologies Inc., backed by at least $57.5 million from Benchmark Capital, DAG Ventures, Goldman Sachs & Co., Millennium Technology Ventures and Omidyar Network. However, Metaweb’s Freebase service generally gathers data from end-users directly, while Factual is focusing on developers to gather and use its data.

Here is GigaOM’s  ..

Here is TechCrunch ..

January 23, 2010

The Future of Search (Part 2)

Filed under: Deep Web,Search — Shree Pragada @ 6:31 pm

A little over a year ago I posted some thoughts on “The Future of Search” (Oct ’08). Since then, we have learned a lot more and see that future a little more clearly. We still believe in those predictions: that “relevance” will be key to search in the future, and that knowledge-base-driven vertical search engines will provide higher quality search results than keyword or NLP driven search engines. At that time, a few key questions were not as clear as they have since become:

  1. “how” these vertical search engines will be built
  2. “what” this search ecosystem will look like
  3. “when” we can expect to experience this Future Search

First, let’s talk about the “what”. Once we understand what it is, we can quickly see what the ecosystem looks like and even recognize potential stakeholders, many of whom already exist. After that, the “how” and “when” are probably much simpler questions.

Future Search will look nothing like today’s Search. It will not be one or two engines searching the whole Web for a worldwide audience; it will be many vertical search engines, or millions of small, focused search engines, working as a collection. Future Search will look a lot like Wikipedia and Apple’s App Store.

Today, Wikipedia and the App Store have nothing in common with Search, so how can these models be synonymous with Future Search? Wikipedia is clearly the largest and most accessible knowledge base on the Web. It benefits millions of users by enabling several thousand users who are knowledgeable and passionate about their field of expertise to publish that knowledge as wiki pages. In short, Wikipedia is a platform that enables publishers and users to share knowledge of the world. The App Store, in concept, is no different. It is also a platform that enables publishers and users to share Mobile Apps, except that it has a profit motive: specifically, the model of sharing revenues from commercial Apps. Future Search will be a hybrid of these two concepts. Like Wikipedia, it will be “knowledge & passion driven” to bring the community into Search, and like the App Store it will be “commercially driven” so that it can compete and sustain itself for a very long time. Whether it can bring a “sense of permanence” to the field of Search, putting to rest the guessing game of “who or what is the next big thing in search”, is a question for the pundits. In summary, Future Search will:

  1. be a platform offering access to search technology; the technology itself could also be built by the community or be open source like MediaWiki, the technology that makes Wikipedia platform possible
  2. search both web pages and databases
  3. enable anyone to publish Search Apps using this search technology
  4. provide highest quality search results to all users by searching the collection of Search Apps
  5. share revenues of commercial Search Apps to offer value for all involved: Publishers, Users & The Platform.

The ecosystem will include many established and young stakeholders providing:

  1. Platform(s) like Semantifi, Socrata, etc. to enable community to publish Search Apps
  2. Knowledge base driven Search technologies like that of Powerset (now Bing), Hakia, ExeCue
  3. Knowledge Repositories like Freebase, DBPedia, etc.
  4. Raw Data Catalogs like Data.gov, SEC.gov, etc.
  5. Linked Data Catalogs like LinkedData.org
  6. Data & Content Owners/Distributors like Thomson Reuters, Sungard, Acxiom, etc.
  7. Non-profit organizations like Sunlight Foundation, Apps for Democracy, etc. especially to evangelize content that may not be of high commercial value but has high community value
  8. Publishers driven by purpose, passion, and/or profit
  9. Internet Users who will benefit immensely from this collective effort, and,
  10. Capital to catalyze this

While not much is discussed of the current search leaders in this context, we strongly believe they will be significant stakeholders in this future search ecosystem as long as they adapt to these changes.

Coming to the “how”, Semantifi is built with this vision of Future Search. Semantifi will soon launch its Publisher Console, which will enable anyone to publish Search Apps with “NO PROGRAMMING SKILLS!” required. Lastly, on “when” we can all come to experience this future: no one knows the answer, but I have a simple question. Why should it take any longer than it did for Wikipedia or the App Store?


January 15, 2010

Semantifi for the pros or the rest of us — a case study

Filed under: Deep Web — vishydasari @ 8:06 am

Here is an example to illustrate that you don’t need to be a professional trader, investor, or research analyst to investigate company financials or SEC Filings data. Imagine you are a New York Times reporter researching political influence on big enterprises versus medium-sized businesses during the Bush Administration. You have a choice: publish your opinions, or publish observations based on facts that you can investigate by asking a few simple questions.

First, find out how many big businesses (say, with sales over $3B) there were in year 2000, just before the Bush administration began, by asking “year 2000 companies with sales > 3 billion”. There are about 600 publicly traded companies.

Next, let’s find out how these companies fared in terms of income and taxes paid over the last 12 years, which covers the period both before and after year 2000. You can ask the question “last 12 years Income Tax Payables, net income for year 2000 companies with sales > 3 billion” and see that income shows steady and good growth (given the sharp slope) but taxes are fairly flat, especially since year 2000.

Tax and Income of Major corporations

Are medium-sized companies benefiting similarly on taxes? We can find out by revising the sales range in the earlier question to “between 300 million and 3 billion”, giving “last 12 years Income Tax Payables, net income for year 2000 companies with sales between 300 million and 3 billion”. The chart for income shows a good slope (discounting the 2001 recession). Looking at the slope of taxes, it should be clear that mid-sized companies were NOT benefiting the same way as big enterprises.

Tax and Income of Midsize companies

So you do not have to be a pro to get some quick facts. How much of this is political influence is where more journalistic experience may be needed.
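
For readers curious what such a question might look like underneath, here is a minimal sketch of one plausible structured reading of “year 2000 companies with sales between 300 million and 3 billion”. How Semantifi actually interprets questions is not described in this post, so the table name, column names, and every figure below are assumptions invented for illustration; they are not SEC filing data. The “last 12 years” filter is left out for brevity.

```python
import sqlite3

# Hypothetical schema and made-up figures; not actual SEC filing data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE financials "
    "(company TEXT, year INTEGER, sales REAL, net_income REAL, tax_payable REAL)"
)
conn.executemany(
    "INSERT INTO financials VALUES (?, ?, ?, ?, ?)",
    [
        ("BigCo", 2000, 5.0e9, 4.0e8, 1.0e8),
        ("BigCo", 2001, 5.5e9, 4.5e8, 1.0e8),
        ("MidCo", 2000, 1.0e9, 8.0e7, 2.0e7),
        ("MidCo", 2001, 1.1e9, 8.5e7, 2.5e7),
    ],
)

# The inner query picks the cohort: companies whose year 2000 sales fall
# in the requested range. The outer query totals both metrics per year.
QUERY = """
SELECT year, SUM(tax_payable), SUM(net_income)
FROM financials
WHERE company IN (
    SELECT company FROM financials
    WHERE year = 2000 AND sales BETWEEN ? AND ?
)
GROUP BY year
ORDER BY year
"""

rows = conn.execute(QUERY, (300e6, 3e9)).fetchall()
for year, tax, income in rows:
    print(year, tax, income)
```

Changing the two range parameters is all it takes to switch from the mid-sized cohort to the big-enterprise one, which mirrors how the question was revised above.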

January 11, 2010

Publisher Console soon to be available. Check out & Suggest Apps Ideas.

Filed under: Uncategorized — semantifi @ 11:13 am

We recently added a page for  App Ideas to our Wiki section of the website.

http://wiki.semantifi.com/index.php/App_Ideas

This page illustrates the kind of Apps anyone can build using the Semantifi Search Platform. Since it is a Wiki page, please feel free to add your ideas for potential Apps. Cheers.

December 12, 2009

On Government Data: Data Catalogs. Transparency Initiatives. What’s next?

Filed under: Government Transparency — Shree Pragada @ 3:55 pm

In a recent interview with Fast Company, Lisa Strausfeld who is a partner at Pentagram said “If we could be as obsessive with government data as we are with baseball stats, maybe it would change the form of democracy”. This says it all about the power of government data.

A vast number of government datasets are already available on the Web at various government data catalogs. A number of Government Transparency Initiatives, such as Global Integrity, Sunlight Foundation, and Apps for Democracy, are leading the charge in making these datasets accessible to citizens / internet users.

For instance, Sunlight Foundation is helping citizens become better informed and more involved in public discourse by making “select” government datasets accessible to the public through its millions of dollars in grants. Sunlight Foundation has created over a dozen websites such as OpenCongress.org, FedSpending.org, OpenSecrets.org, EarmarkWatch.org, and LOUISdb.org. Some of these projects are hugely successful. For instance, FedSpending.org logged over 10 million searches within a year of launch. This statistic should settle the argument about the value of government data and whether anyone cares about it. However, with many thousands of government datasets coming out, a new approach is needed to supplement current efforts of “enabling select datasets through grants”.

Semantifi offers a new approach, using a new kind of technology to make a vast number of datasets searchable via its OPEN & FREE data search platform. At Semantifi, anyone can configure any dataset (government or otherwise) for its search engine and publish it as a “Data Search App”. Apps can be shared publicly or within groups, allowing citizens / users to search these datasets by asking simple questions like “Top 5 Senators requesting Earmarks in 2008”, “Federal Spending in New York versus Illinois”, etc.

While Semantifi can make datasets searchable with its data search platform, it is the citizen publishers who can really catalyze the government transparency movement and revolutionize democracy by making “vast numbers of government datasets searchable”.

December 11, 2009

Many Roads to the Deep Web

Filed under: Deep Web,Search — Shree Pragada @ 2:51 pm

Semantifi, Google, Microsoft, Kosmix, DeepPeep, DeepDyve, Socrata, Infochimps, Data.gov and many others are trying to search the Deep Web. Each has different technologies, tools, and most importantly some very different approaches. All their approaches can be broadly put into three categories.

1. Consolidate Datasets on the Web

Technically, this is not an approach to searching the Deep Web but these efforts are a valuable first step in organizing datasets and making them available online. They include various government repositories like Data.gov, SEC.gov, SFData.gov, DC Data Catalog, Infochimps.org, Socrata.com, etc. and are a significant force behind Deep Web Search. In the absence of data search capabilities these properties/websites remain as Data Catalogs only. Socrata goes one step further to allow users to browse datasets but browsing 1000s of rows can be challenging.

2. Look / Peek thru Web Forms or APIs to search Underlying Databases

This approach is driven by the belief that the Deep Web is simply the Web behind HTML forms, when in fact it is a lot more than that. If you are looking to buy a car, you might visit Edmunds.com and fill in the search form by selecting the Manufacturer, Model, Price Range, Zip Code, etc. The information entered into the form is used to compose a database query, which is then submitted to one or more databases, and the results are presented back to you as an HTML page. Because this page is created on demand, current search engines can’t see it.
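
To make the form-to-query step concrete, here is a minimal sketch of a form handler, assuming a made-up inventory table. Nothing here reflects how Edmunds.com or any real site is built: the `cars` table, its columns, the sample rows, and the function names are all hypothetical.

```python
import sqlite3

# Hypothetical inventory table; the schema and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (make TEXT, model TEXT, price INTEGER, zip TEXT)")
conn.executemany(
    "INSERT INTO cars VALUES (?, ?, ?, ?)",
    [
        ("Honda", "Civic", 18000, "10001"),
        ("Honda", "Accord", 24000, "10001"),
        ("Toyota", "Camry", 23000, "94105"),
    ],
)


def search_cars(form):
    """Turn submitted form fields into a parameterized query and run it."""
    clauses, params = [], []
    if form.get("make"):
        clauses.append("make = ?")
        params.append(form["make"])
    if form.get("max_price"):
        clauses.append("price <= ?")
        params.append(int(form["max_price"]))
    if form.get("zip"):
        clauses.append("zip = ?")
        params.append(form["zip"])
    where = " AND ".join(clauses) if clauses else "1 = 1"
    sql = f"SELECT make, model, price FROM cars WHERE {where} ORDER BY price"
    return conn.execute(sql, params).fetchall()


def render_html(rows):
    """Build the results page on demand; it exists only for this one request."""
    items = "".join(
        f"<li>{make} {model}: ${price}</li>" for make, model, price in rows
    )
    return f"<ul>{items}</ul>"


results = search_cars({"make": "Honda", "max_price": "20000", "zip": "10001"})
page = render_html(results)
print(page)
```

The page only comes into being when someone submits the form, so a crawler that just follows links never encounters it.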

Google’s approach to the Deep Web is to find HTML forms, send input to these forms, and index the resulting HTML pages. Google’s approach is fully automated, can easily scale and fits nicely with its crawl infrastructure. For further insights, read Alon’s VLDB paper published in 2008. Kosmix takes a similar approach of tapping into web forms as Google does but using API calls instead. DeepPeep follows a similar approach of tapping into web forms to search underlying databases.
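
The idea of sending input to forms can be sketched roughly as follows. This is a simplification under stated assumptions, not Google's actual implementation: the form URL, the input names, and the candidate values are all invented, and a real crawler would also have to choose which values are worth trying, then fetch and index each resulting page.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical search form: the URL and the candidate values are invented.
FORM_ACTION = "https://example.com/cars/search"
FORM_INPUTS = {
    "make": ["Honda", "Toyota"],
    "max_price": ["20000", "30000"],
}


def surface_urls(action, inputs):
    """Enumerate every combination of candidate values as a GET URL."""
    names = sorted(inputs)
    urls = []
    for combo in product(*(inputs[name] for name in names)):
        query = urlencode(list(zip(names, combo)))
        urls.append(f"{action}?{query}")
    return urls


urls = surface_urls(FORM_ACTION, FORM_INPUTS)
for url in urls:
    print(url)
```

Each generated URL stands for one form submission whose result page could then be indexed like any other web page. The number of URLs grows multiplicatively with the number of inputs and values, which is one reason the choice of values matters.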

This approach offers some benefits: given its simplicity, it can be easily scaled to a large number of Web Forms or APIs. However, it has severe limitations in the scope of content that can be searched and the level of analytics that can be conducted. First, it leaves out a huge portion of the Deep Web comprising datasets that do not have a Form or API in front of them, as is the case with millions of government, finance, research, and other datasets on the web. Second, Forms and APIs offer only a limited window into the underlying databases and hence allow only simple queries, not advanced analytics.

3. Make Datasets Searchable

Tim Berners-Lee explains in this TED video that the real value of data is in searching it directly. This, he envisions, will drive the next innovation on the web.

Semantifi is pursuing this vision. It uses pioneering technology to search databases: it can crawl any type of database, including XML data sources, index multi-terabyte databases, search based on relevance, and present automatic charts & tables. Semantifi is also built as a marketplace for “Data Search Apps”, along the lines of the App Store, where:

  • Anyone can publish “Data Search Apps”
  • Users can search at Semantifi.com or on their website
  • Apps can be shared publicly or within groups
  • Apps can be FREE or FEE based
  • Revenue is shared from commercial Apps

All three paths add value. We are big fans of the Data Catalogs, Government Transparency Initiatives and others who are organizing datasets and making them available on the Web. Google, Kosmix, DeepPeep, etc. have demonstrated that we can get “answers to more search questions” by indirectly searching databases via Forms / APIs. Semantifi is focusing on directly searching databases and has built initial Apps to demonstrate that anyone can build data search apps and that the Deep Web can be wired one dataset at a time by a community of Publishers.

Microsoft Is Losing Fight for Consumers, Analyst Says

Filed under: Consumer vs. Enterprise — Shree Pragada @ 10:12 am

I saw this article in the New York Times this morning.

Mark Anderson, the writer behind the Strategic News Service, a predictive newsletter with a wide following among technology executives and venture capitalists, predicts that “Except for gaming, it is ‘game over’ for Microsoft in the consumer market. It’s time to declare Microsoft a loser in phones.”

While Mark’s comments were focused on the smartphones market, saying “If Microsoft loses in smartphones, it is pretty grim.”, what caught my attention is his explanation that the underlying problem is cultural. “Phones are consumer items, and Microsoft doesn’t have consumer DNA”.

So, does this DNA predict the outcomes of Microsoft’s other battles for Browsers, Search,  Semantic Search, Deep Web Search, …..?

See full article at … Mark Anderson says Microsoft Is Losing Fight for Consumers

Tim Berners-Lee on the next Web. Right on the Vision. What about the Approach?

Filed under: Deep Web — Shree Pragada @ 12:45 am

Here is Tim Berners-Lee talking about how “data” makes up the next Web, the so-called “Data Web” or “Deep Web”. This clearly validates our enthusiasm for Deep Web Search. While Semantifi shares his vision for the Deep Web, it does not follow the Linked Data approach. Here is why:

When Databases are converted into RDF triples, they become graphs. Graphs are perfect for representing knowledge bases, navigating and querying them, and, finally, connecting them with other knowledge bases. However, Graphs are not efficient at capturing record-level data. Graphs are also very inefficient at aggregating columns of data, which is what Databases and OLAP tools are optimized for. With the Linked Data approach, the metadata or knowledge base and the actual record-level data are both converted into RDF triples. See an example at http://www.rdfabout.com/demo/census/

We think only the Knowledge Base should be converted to RDF, to enable linkages with other knowledge bases, and Data should be left inside Databases to ensure the most efficient search and retrieval.
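
The aggregation point can be illustrated with a small sketch. It does not measure performance, and real triple stores use indexes and query planners that this toy version ignores; it only shows the extra regrouping step that record-level data needs once it has been flattened into triples. The sample records and column names are invented.

```python
from collections import defaultdict

# Invented sample records; the column names are hypothetical.
rows = [
    {"id": "r1", "state": "NY", "amount": 120},
    {"id": "r2", "state": "CA", "amount": 200},
    {"id": "r3", "state": "NY", "amount": 80},
]


def to_triples(records):
    """Flatten each record into (subject, predicate, object) triples."""
    return [
        (rec["id"], key, value)
        for rec in records
        for key, value in rec.items()
        if key != "id"
    ]


def total_by_state_rows(records):
    """Tabular form: both columns sit in the same row, so one pass suffices."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["state"]] += rec["amount"]
    return dict(totals)


def total_by_state_triples(triples):
    """Triple form: first regroup triples by subject, then aggregate."""
    by_subject = defaultdict(dict)
    for subject, predicate, obj in triples:
        by_subject[subject][predicate] = obj
    totals = defaultdict(int)
    for fields in by_subject.values():
        totals[fields["state"]] += fields["amount"]
    return dict(totals)


triples = to_triples(rows)
print(total_by_state_rows(rows))
print(total_by_state_triples(triples))
```

Both functions return the same totals, but the triple version has to reassemble each record before it can add anything up, which is work a row-oriented or columnar database avoids.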
