Thursday, December 30, 2004

Unstructured Information Management

UIM is something that IBM has been doing a lot of research and development on. I caught this reference from an article in the NY Times (http://www.nytimes.com/2004/12/26/business/yourmoney/26techno.html) entitled “At I.B.M., That Google Thing Is So Yesterday”. The article refers to work being done under the direction of Arthur Ciccolo within IBM Research (http://www.research.ibm.com/UIMA/index.htm) at the T.J. Watson Research Center.

 

Their goal is to create an infrastructure that makes it possible to combine the various information extraction and knowledge discovery techniques in an effective fashion: rather than each technique having to re-process the information from scratch, is there a way to leverage all of them on a common infrastructure and get better throughput and results? I've only begun to dig through their publications, but they recently had an entire IBM Systems Journal issue dedicated to this area of research (http://www.research.ibm.com/journal/sj43-3.html).
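To make the idea concrete, here's a rough Python sketch of the pattern as I understand it (purely illustrative, not the actual UIMA API): a chain of annotators that each enrich a shared analysis structure instead of re-parsing the raw text.

```python
# Illustrative only -- not the UIMA API. A shared analysis structure is passed
# through a chain of annotators; each one adds annotations rather than
# re-processing the raw text from scratch.

class Document:
    def __init__(self, text):
        self.text = text
        self.annotations = []          # (type, start, end)

    def add(self, ann_type, start, end):
        self.annotations.append((ann_type, start, end))

def tokenize(doc):
    pos = 0
    for word in doc.text.split():
        start = doc.text.index(word, pos)
        doc.add("token", start, start + len(word))
        pos = start + len(word)

def find_acronyms(doc):
    # Works off the token annotations left by the previous step, not the raw text.
    for ann_type, start, end in list(doc.annotations):
        if ann_type == "token" and doc.text[start:end].isupper():
            doc.add("acronym", start, end)

pipeline = [tokenize, find_acronyms]
doc = Document("UIMA combines analysis engines from IBM Research")
for annotator in pipeline:
    annotator(doc)
print(doc.annotations)
```

The payoff is that each downstream technique gets the accumulated annotations for free, which is (as I read it) the throughput and quality argument IBM is making.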

 

Kipp Jones

 

Wednesday, December 22, 2004

I'd like to explore the idea behind CyberINFOstructure a bit further. How is this different from the CyberINFRAstructure? What is meant by this distinction?

To date, a lot of high performance computing and the related infrastructure is about computing -- CPU, performance, bandwidth, and throughput. The focus seems to be on the bottom layer, the infrastructure if you will.

The information that lives on top of this has been given some cycles, but generally in a specific manner related to a problem at hand. I believe there needs to be a more concerted effort to understand and create a more reliable infostructure on top of this infrastructure. This layer should support:

  • Information and source discovery
  • Information pedigree
  • Information access
  • Information classification and semantics
  • Information composition
  • Information translation
These capabilities reside on top of the basic infrastructure and provide common facilities for applications to access and use information that resides within the grid/enterprise/world.
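As a strawman (my own sketch, not any existing API), these facilities might be exposed to applications as a thin set of interfaces, something like:

```python
# Strawman interfaces for an "infostructure" layer -- names and methods are
# hypothetical, meant only to show how applications might consume the
# facilities listed above.

from abc import ABC, abstractmethod

class InfoSource(ABC):
    @abstractmethod
    def describe(self) -> dict:
        """Classification, semantics, and pedigree metadata for this source."""

    @abstractmethod
    def fetch(self, query: str) -> list:
        """Information access: retrieve items matching a query."""

class Infostructure(ABC):
    @abstractmethod
    def discover(self, topic: str) -> list:
        """Information and source discovery across the grid/enterprise/world."""

    @abstractmethod
    def translate(self, item, target_schema: str):
        """Information translation between representations."""

    @abstractmethod
    def compose(self, items: list):
        """Information composition: combine items from multiple sources."""
```

The point is less the specific methods than the idea that applications program against the infostructure layer rather than against individual sources.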

I'll dig deeper into each of these topics next.


Tuesday, December 21, 2004

Focus on Research

Key topics of interest include:
  • Web Technologies (web services, semantic web, search, retrieval, etc.)
  • Information Quality and Availability (get it when you need it, proper semantics, verifiable pedigree, etc.)
  • Cyberinfrastructure (or as I prefer, cyberINFOstructure -- a focus on the infrastructure that provides access to the information rather than the computing side of the infrastructure)

I intend to build on and improve these 3 pillars to understand how they can be applied towards two areas:
  • Enterprise Computing
  • Scientific Computing
I believe there are some fundamental processes, algorithms, and capabilities that can be created to provide support for both of these activities. Clearly each has some specific needs that differ from the other, and we may need to go down one path quite a ways to fully understand those particular needs, but ultimately I'd like to circle back and find the commonalities across these domains.

The next step is to identify and isolate relevant research related to these topics and begin to distill it into a more focused set of ideas.



Friday, June 11, 2004

Research, Development and Operations

I'm back from my hiatus. I spent a good 10 days on a whirlwind trip across the great plains of Nebraska and Kansas (with some Missouri time in there). It was a good break and time to spend with friends and family. Always a good idea in my book.

But now that I'm back, I've been thinking a lot about a lot of things. In particular, I'm still going through the Atkins report, published by the NSF to detail what should be done to fund the CyberInfrastructure. There are a lot of good nuggets in this document. Below is one I thought elucidated some good distinctions between Research, Development, and Operations.


Research

Research is a competition of ideas. Allocation of resources starts with the program announcement and evaluation of the resulting proposals. This is bottom-up, stating the evaluation criteria with detailed initiatives arising from the research community. Overlap or duplication is acceptable where different researchers pursue competing visions for accomplishing similar ends. Post-evaluation is based on the intellectual quality and impact of the research outcomes.

Development

Development is a competition of plans. An overriding goal of development is to limit duplication of effort, and concentrate resources on a set of integrated and maintained software distributions collectively covering the scope of the ACP. Thus, development is partitioned and assigned to organizations based on the responsiveness to needs and credibility of their plan for pre-defined concrete outcomes. Post-evaluation is based on how effectively the plan has been implemented and also on how extensively the outcomes are adopted and used and on user satisfaction.


Operations

Operations is a competition for users. Operations serve end-users, domain scientists, and engineering researchers, responsively providing service and support. There should be two or more competitive operational options available to users. A primary point of post-evaluation should be the satisfaction of the users who are served, and to a lesser extent the number of users who are served, based on input from the user community.


-----------

Kipp





Thursday, May 27, 2004

Semantic integration

Another article from David McGoveran[1] (gotta admit he writes some good stuff) notes that the primary types of semantic transformation are:



  • Combining two or more fields having different data types
  • Decomposing a field into two or more new fields
  • Aggregating multiple values
  • Disaggregating aggregate values
  • Consolidation
  • Synchronization
  • Generalization
  • Sub-typing

I include them here so I don't lose them.
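To make a few of these concrete, here's a quick toy sketch in Python (my own examples, not McGoveran's) of combining, decomposing, and aggregating fields:

```python
# Toy examples of three of the transformation types above; field names are
# made up for illustration.

# Combining two or more fields having different data types
def combine(record):
    return f"{record['street']}, {record['zip']:05d}"   # str + int -> one field

# Decomposing a field into two or more new fields
def decompose(full_name):
    first, _, last = full_name.partition(" ")
    return {"first_name": first, "last_name": last}

# Aggregating multiple values
def aggregate(line_items):
    return sum(item["amount"] for item in line_items)

print(combine({"street": "10 Main St", "zip": 30004}))
print(decompose("Jane Doe"))
print(aggregate([{"amount": 19.95}, {"amount": 5.00}]))
```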

He goes on to state something that fits in with my thoughts regarding TDQM (Total Data Quality Management) although my thoughts extend outside of an enterprise and really drive towards a supporting infrastructure for TDQM rather than the processes and responsibility questions that are addressed by TDQM.

Here are the elements of TDQM:


1. Fill the repository incrementally, never “upfront.” This is a pragmatic, not academic, effort.
2. Ensure that the repository supports a theory of semantic types. It should define data semantics by capturing constraints and not just data syntax, and relate them to existing types through dependencies.
3. Don’t accept application software unless data semantics has been defined in an importable data model.
4. Use versioning, partitioning, and type relationships to organize metadata. Never delete it.
5. Commit to driving application development and integration projects from the repository.
6. Use data integration tasks as opportunities to use, refine and validate the repository.

These are all very good points and certainly make a lot of sense. #3 is a tough one for companies to enforce, and we probably need to think of compensating actions to take for those instances where the absolutes are not an option.

I'd also take #5 a lot further: it's not just about driving application development and integration projects from the repository; many of the applications themselves can be made more valuable to their users by leveraging the repository. There is a lot of value in having this information available, and we need to find ways to leverage it.
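As a rough sketch of what I have in mind (hypothetical structures, not any particular product), a repository entry that captures semantics as constraints, keeps versions around, and can be queried by applications at runtime might look like:

```python
# Hypothetical repository entry: semantics captured as constraints plus
# versioning and type relationships -- entries are superseded, never deleted.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SemanticType:
    name: str
    base_type: str                                      # relationship to an existing type
    constraints: list = field(default_factory=list)     # value predicates, not just syntax

    def valid(self, value) -> bool:
        return all(check(value) for check in self.constraints)

@dataclass
class RepositoryEntry:
    element: str                                        # e.g. "PurchaseOrder.total"
    semantic_type: SemanticType
    version: int = 1
    superseded_by: Optional[int] = None                 # versioned, never deleted

usd_amount = SemanticType("USDAmount", base_type="decimal",
                          constraints=[lambda v: v >= 0])
entry = RepositoryEntry("PurchaseOrder.total", usd_amount)
print(entry.semantic_type.valid(125.50))   # True -- usable by running applications, not just designers
```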


[1] Data Integration, Part VIII, eAI Journal, January 2003








Kipp Jones - CTO
nuBridges, LLC - www.nubridges.com
eBusiness is Business
cell: 404.213.9293
work: 770.730.3722



Tuesday, May 25, 2004

Business Semantics

Case in point:
 
 
An article published back in 2000 about the semantics of data and why it is important to understand the meaning of the data in order to do business-to-business integration. The author, David McGoveran, argues that 2 of the 3 main ingredients required for an integrated information exchange have been adequately solved:
 
1) Connectivity
2) Timely capture and purveyance of data
3) Understanding the data
 
I agree with this premise, as well as the conclusion he draws: treat all "public data elements pertaining to a set of integrated applications, whether stored in a relational database or not, as though they were attributes of a formally designed relational database."
 
I disagree with the method by which we need to get there.  According to the author, "maintainable business semantics...is worth a little formal design effort involving application vendors and developers."  Yes, formal design is necessary, but I say it is insufficient.
 
Why? Because the tools for formal design, and the methods for maintaining the semantics after the formal design and for managing change to these formal designs, are fundamentally missing. It is like having CASE tools that don't do round-tripping. Sure, it's valuable the first time, but the value degrades exponentially over time; in fact, I would argue that the uncertainty caused by the documentation losing integrity with respect to the actual system makes the documentation virtually worthless right off the bat.
 
If I don't have assurances that I have the right version of the documentation for the system I'm using, I'll probably spend extra time verifying it...and thus I've lost a good amount of the value of having the documentation, even if only a small fraction of it is actually invalid. I use the term documentation to refer to any artifacts regarding the semantics of the information, its relationship to other information, and the methods by which I can gain access to and/or manipulate the information.
 
I would also point out that much of the metadata about an actual data model is embedded in databases, contained in their system catalogs.  And yet this information is not easily accessible to most humans, it's not easily exchanged, nor is it sufficient to specify the semantics of the information.
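For example, here's a minimal sketch against SQLite's catalog (other engines expose information_schema or similar); note how little meaning actually comes back with the names and declared types:

```python
# Minimal sketch: what a system catalog actually gives you. Names and declared
# types come back, but nothing about meaning, units, or allowed relationships.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, ship_date TEXT)")

# SQLite's catalog equivalent; most engines expose information_schema instead.
for cid, name, col_type, notnull, default, pk in conn.execute("PRAGMA table_info(orders)"):
    print(name, col_type)   # e.g. "total REAL" -- but is it dollars? tax included?
```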
 
We need more, and it can't be solely dependent on the actions of the few, the proud, the "data architects." We are dealing with living systems that evolve over time. They can't rely on rigid constructs and upfront knowledge. They must be able to adapt, grow, expand, connect, and self-correct over time.
 
I regard this as the challenge.
 
Kipp
 

Kipp Jones - CTO
nuBridges, LLC - www.nubridges.com
eBusiness is Business

cell:  404.213.9293
work:  770.730.3722

On Information Quality

A fundamental difference that I have with some people is that I cannot insist that information quality be specified from the outset. Rather, information must be injected into the system with whatever quality it currently has.
 
To me, this is like the debate of closed hypertext systems versus open hypertext systems that raged prior to the WWW taking off. Closed systems required high quality data with full referential integrity. Yes, this provided a nice clean environment and, once complete, was generally of a high quality. The problem was that producing fully 'cleaned' information with full referential integrity from the outset is difficult. The onus is put on the human to make all of this happen prior to the information even being captured in the system.
 
This is the critical breakdown. The hurdles to adding information to the system must be low. Information must be able to be injected into the system ad hoc, with little preparation and cleansing. It should be added, captured, and annotated.
 
The system should be responsible for helping determine the quality of the information. It should have mechanisms in place to help 'clean' up the information. It should have anti-entropy mechanisms, data validation mechanisms, anomalous data isolation mechanisms, and sufficient feedback loops to ensure a constant increase in the overall quality of the information in the system.
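A crude sketch of the shape I have in mind (the names are mine, purely illustrative): accept every record as-is, then let validators route suspect records to quarantine rather than rejecting them at the door.

```python
# Illustrative sketch: ingest everything as-is, then let the system apply
# validation checks, quarantining suspect records (instead of rejecting them)
# so they can be corrected and fed back in later.

def ingest(record, store, quarantine, validators):
    issues = [msg for check in validators if (msg := check(record))]
    if issues:
        quarantine.append({"record": record, "issues": issues})   # isolate, don't reject
    else:
        store.append(record)

def non_negative_total(record):
    if record.get("total", 0) < 0:
        return "total is negative"

store, quarantine = [], []
for rec in [{"id": 1, "total": 19.95}, {"id": 2, "total": -5}]:
    ingest(rec, store, quarantine, validators=[non_negative_total])

print(len(store), "clean;", len(quarantine), "awaiting correction")
```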
 
You can certainly understand both the desire and the reasoning behind 'closed' information stores.  The database community has lived or died by the 'quality' of the information in the database.  If a foreign key constraint is missing, or a table is missing referential integrity, the whole thing starts to fall apart.  Why is this?  It's because the system was built to depend on the upfront quality of the data rather than relying on mechanisms and systems that handle 'dirty' data and are able to recover from situations in which everything is not pristine.
 
This model is still valuable for some domain of problems, but it is an ever decreasing set of problems.
 
The other thing that has driven the requirement for 'clean' data has been the cost of computing, networks, algorithms, and storage. All of these factors made it much more cost effective to demand data quality up front rather than worry about creating a system that can deal with uncertainty. By storing only 'clean' data, the system stores only what it needs, no more. The semantics and the relationships of all of the data are stored within the structure of the information, thus not requiring any external representation of the relationships.
 
The Web, as an open hypertext system, was able to grow super-exponentially due in large part to its open architecture, which did not require any kind of referential integrity, quality criteria, or 'gatekeeper' function. Information has been injected into the WWW at such a rapid rate that it is currently very difficult to even estimate the global size of the WWW. There is no doubt that there is 'bad' data and low quality information on the Web today. What is interesting to watch is some of the natural(?) or evolved mechanisms which have come into play to isolate and improve the overall quality of information on the Web. Many of these mechanisms are human in nature today (such as manual linking), but there is much that can be done to improve the quality through automated systems.
 
An early attempt was made with the Atlas Link Database that we worked on in '95. The goal of this system was to use embedded knowledge within the system (link hrefs and referer information) to continuously update and correct broken links. This is just a small element of the overall problem, but it certainly starts to demonstrate the power of using the information to do self-correction.
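Very roughly, the flavor of the idea in Python (my reconstruction of the concept, not the original Atlas code): when a link breaks, use what the link database already knows -- the hrefs and the referer trails -- to propose replacements.

```python
# Rough reconstruction of the concept (not the original Atlas code): use the
# link database's own knowledge -- hrefs seen and referer trails -- to suggest
# a replacement when a link breaks.

link_db = {
    # href -> set of pages that referred to it (from referer information)
    "http://example.edu/old/paper.html": {"http://example.edu/index.html"},
    "http://example.edu/pubs/paper.html": {"http://example.edu/index.html"},
}

def suggest_fix(broken_href):
    referers = link_db.get(broken_href, set())
    # Candidate replacements: other hrefs cited by the same referring pages.
    return [href for href, refs in link_db.items()
            if href != broken_href and refs & referers]

print(suggest_fix("http://example.edu/old/paper.html"))
```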
 
With the continual improvements in the Semantic Web, Web Services and other related technologies, the Web is moving from a massively linked open hypertext system to a massively linked open database system.  I believe that we are currently embroiled in some similar arguments about closed versus open in the database world that we had 10 years ago in the hypertext world.
 
Ultimately, the open model will prevail for its ability to scale across domains, networks, and technologies.  But this will not happen without a lot of work on the mechanisms to judge, rank, correct, and clean the information that is injected into the system.
 
Kipp
 

Kipp Jones - CTO
nuBridges, LLC - www.nubridges.com
eBusiness is Business

cell:  404.213.9293
work:  770.730.3722

 

Wednesday, May 19, 2004

WWW2004

Arrived in NY (after a 2 hour delay in Atlanta due to a storm in DC) last night. Checked in and slept.

Registered this morning and ran into Joseph Hardin, now at U of Michigan. I bumped into him several times when he was at UIUC back in the day when I was working on the NCSA web server adding the atlas functionality to it...which was mere months after the whole Mosaic thing took place and everybody moved from Urbana-Champaign out to the Valley to make it rich.

We'll see what today has in store, should be a lot of semantics!

Kipp

Monday, May 17, 2004

Monday

A couple of items to note:
 
First, Ken pointed me to a very nice package for setting up sites: Mambo http://www.mamboserver.com/  This open source software seems to rock.
 
Second, started integrating Jess into BizCQ.  Need to continue working with that and expand the web service interface.  Need a better way to create rules for Jess related to BizCQ.
 
Third, heading to NY tomorrow for WWW2004 to present a poster on BizCQ. 
 
Lastly, good weekend with the kids.  Two soccer games and one Ballet picture day.  Also, had fun having Aunt LaLa in town.  Too bad Brian couldn't come and play...
 
Kipp
 

Kipp Jones - CTO
nuBridges, LLC - www.nubridges.com
eBusiness is Business

cell:  404.213.9293
work:  770.730.3722

 

Friday, May 14, 2004

On Expenses

Why is it that a company you have asked to help you feels it is okay to have you pay more for something than they would pay themselves?
 
Case in point: we recently had a training course (which is another topic in and of itself) on site, for which we were responsible for expenses. The expenses for a 2-day training course came in at over $1000. Granted, this includes flight and hotel, but it certainly wasn't an international flight, and it's not like staying in NY.
 
It would seem to me that normal business sense would have you charge your customer what you would expect to be charged if the roles were reversed.  Just because somebody else is paying doesn't mean that there are no rules about reasonable expense.  Companies would do well to set a limit for 'reasonable' and require notification/approval for out of bounds items.
 
Rant, rant, rant
 

Monday, May 10, 2004

On another note...

I'm going to my 20th high school reunion in a few days. Class of '84, which graduated with 14 people from Beaver Valley High School in Lebanon, Nebraska. Probably not a lot of people out there have visited. The town now has about 75 people in it, none of whom are my classmates.
 
In fact, we're having it in Lincoln, Nebraska...which would be really great if a Husker game was going on...
 
In addition to that travel, I'm presenting a poster at the 13th International World Wide Web conference in NY, NY next week.  Haven't been to one of these since '96.  Looking forward to seeing where things are and getting a sense for what's next in the WWW world.
 
Kipp
 

Kipp Jones - CTO
nuBridges, LLC - www.nubridges.com
eBusiness is Business

cell:  404.213.9293
work:  770.730.3722

 
Checking to see if I can still blog...

Been awhile since I was out here, hopefully can start it up for real this time.

Some interesting things in technology I've been looking at:
- Space technology - interested in where business is going to go wrt space
- BPM - we've gone deep into BPM for our new product release; where else are BPM and workflow technology useful?
- Semantics - still interested in semantic mapping, semantics of change, information quality via semantics
- Rules engines - how can rules engines help with semantic problems?

On the science side:
- evolution - very interesting topic
- things nano
- anti-gravity

Other than that, keeping the kids going takes up any extra time I may have...