Tuesday, May 25, 2004

On Information Quality

A fundamental difference I have with some people is that I cannot insist that information quality be specified from the outset.  Rather, information must be injected into the system with whatever quality it currently has.
 
To me, this is like the debate over closed versus open hypertext systems that raged before the WWW took off.  Closed systems required high-quality data with full referential integrity.  Yes, this provided a nice clean environment and, once complete, was generally of high quality.  The problem was that producing fully 'cleaned' information with full referential integrity from the outset is difficult.  The onus is put on the human to make all of this happen before the information is even captured in the system.
 
This is the critical breakdown.  The hurdles to adding information to the system must be low.  Information must be able to be injected into the system ad hoc, with little preparation and cleansing.  It should be added, captured, and annotated.
 
The system should be responsible for helping determine the quality of the information.  It should have mechanisms in place to help 'clean' up the information: anti-entropy mechanisms, data validation mechanisms, anomalous data isolation mechanisms, and sufficient feedback loops to ensure a constant increase in the overall quality of the information in the system.
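To make that concrete, here is a minimal sketch (in Python) of what an ingest path along these lines might look like.  The record shape, the validators, the quality score, and the quarantine list are all illustrative assumptions of mine, not a description of any particular system; the point is only that nothing is rejected at the door, and quality is assessed and improved after the fact.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    """An item injected into the system as-is, with whatever quality it has."""
    data: dict
    quality: float = 0.0                      # a score the system can keep revising
    issues: list = field(default_factory=list)

# Illustrative validators -- each reports problems, never rejects the record.
def check_required_fields(record, required=("id", "source")):
    return [f"missing field: {f}" for f in required if f not in record.data]

def check_dangling_refs(record, known_ids):
    refs = record.data.get("refs", [])
    return [f"dangling reference: {r}" for r in refs if r not in known_ids]

def ingest(record, known_ids, quarantine):
    """Accept the record unconditionally, then let the system assess it."""
    record.issues = check_required_fields(record) + check_dangling_refs(record, known_ids)
    # Feedback loop: quality is a score to improve over time, not a gate.
    record.quality = 1.0 / (1 + len(record.issues))
    if record.quality < 0.5:
        quarantine.append(record)             # isolate anomalous data for later repair
    return record

quarantine = []
known = {"a1", "a2"}
r = ingest(Record({"id": "a3", "refs": ["a1", "zz"]}), known, quarantine)
print(round(r.quality, 2), r.issues)          # 0.33 ['missing field: source', 'dangling reference: zz']
```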
 
You can certainly understand both the desire and the reasoning behind 'closed' information stores.  The database community has lived and died by the 'quality' of the information in the database.  If a foreign key constraint is missing, or a table lacks referential integrity, the whole thing starts to fall apart.  Why is this?  It's because the system was built to depend on the upfront quality of the data rather than relying on mechanisms that handle 'dirty' data and are able to recover from situations in which everything is not pristine.
 
This model is still valuable for some domains of problems, but that set of problems is ever decreasing.
 
The other thing that has driven the requirement for 'clean' data has been the cost of computing, networks, algorithms, and storage.  All of these factors made it much more cost-effective to demand data quality up front rather than to build a system that can deal with uncertainty.  By storing only 'clean' data, the system stores only what it needs, no more.  The semantics and relationships of all of the data are captured within the structure of the information, requiring no external representation of those relationships.
 
The Web, as an open hypertext system, was able to grow super-exponentially in large part because of its open architecture, which did not require any kind of referential integrity, quality criteria, or 'gatekeeper' function.  Information has been injected into the WWW at such a rapid rate that it is currently very difficult even to estimate its global size.  There is no doubt that there is 'bad' data and low-quality information on the Web today.  What is interesting to watch is some of the natural (or evolved) mechanisms that have come into play to isolate and improve the overall quality of information on the Web.  Many of these mechanisms are human in nature today (such as manual linking), but there is much that can be done to improve quality through automated systems.
 
An early attempt was made with the Atlas Link Database that we worked on in '95.  The goal of this system was to use knowledge embedded within the system (link hrefs and referer information) to continuously update and correct broken links.  This is just a small element of the overall problem, but it certainly starts to demonstrate the power of using the information itself to do self-correction.
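I won't try to reconstruct Atlas here, but a minimal sketch of the general idea might look like the following.  The link table, the map of observed moves, and the helper names are hypothetical placeholders; the only point is that a broken href can be repaired using knowledge the system has already collected elsewhere.

```python
import urllib.request
import urllib.error

# Hypothetical stores: hrefs held in the hypertext base, plus moves the system
# has observed elsewhere (e.g. from redirects or referer logs).
links = {"doc-42": "http://example.com/old/page.html"}
observed_moves = {"http://example.com/old/page.html": "http://example.com/new/page.html"}

def is_alive(url, timeout=5):
    """Return True if the URL still resolves; treat any error as 'broken'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, ValueError):
        return False

def repair_links(links, observed_moves):
    """Self-correction pass: rewrite broken hrefs using knowledge already in the system."""
    for doc_id, url in links.items():
        if not is_alive(url) and url in observed_moves:
            links[doc_id] = observed_moves[url]
    return links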
 
With the continual improvements in the Semantic Web, Web Services, and other related technologies, the Web is moving from a massively linked open hypertext system to a massively linked open database system.  I believe we are currently embroiled in some of the same closed-versus-open arguments in the database world that we had 10 years ago in the hypertext world.
 
Ultimately, the open model will prevail for its ability to scale across domains, networks, and technologies.  But this will not happen without a lot of work on the mechanisms to judge, rank, correct, and clean the information that is injected into the system.
 
Kipp
 

Kipp Jones - CTO
nuBridges, LLC - www.nubridges.com
eBusiness is Business

cell:  404.213.9293
work:  770.730.3722

 
