Title: Building An Ontology of the NHIN: Status Report 3
Slide 1: Building An Ontology of the NHIN: Status Report 3
- Brand Niemann
- Co-Chair, Semantic Interoperability Community of Practice (SICoP), Best Practices Committee (BPC), CIO Council, and
- Enterprise Architecture Team, Office of Environmental Information, U.S. Environmental Protection Agency
- April 5, 2005
Slide 2: Overview
- 1. The National Health Information Network (NHIN) Request for Information (RFI)
  - 1.1 Scope and Quality
  - 1.2 Statistics
  - 1.3 Analysis and Reporting Strategy
  - 1.4 Business Cases
  - 1.5 Leadership Statements
  - 1.6 Related Activities
  - 1.7 Building Ontologies
- 2. Results and Next Steps
- Appendices
Slide 3: NHIN RFI: 1.1 Scope and Quality
- The NHIN RFI stimulated substantial and unprecedented interest.
- Cumulatively, the 512 responses yielded nearly 5,000 pages of information.
- The National Coordinator established a federal government-wide RFI review task force (RTF) to review, summarize, and analyze the RFI responses.
- The RTF consists of more than 120 Federal officials from 17 agencies.
Slide 4: NHIN RFI: 1.1 Scope and Quality
- The responses to these initial questions yielded the richest and most descriptive collection of thoughts on interoperability and health information exchange that has likely ever been assembled in the United States.
- The responses to the general questions are a treasure trove of the best thinking on the topic.
Slide 5: NHIN RFI: 1.2 Statistics
Slide 6: NHIN RFI: 1.3 Analysis and Reporting Strategy
- The NHIN RFI consisted of:
  - Twenty-four (24) questions, in
  - Six (6) basic groups
- The NHIN Team divided the RFI responses into two basic groups:
  - Individuals (283)
  - Organizations (229)
- The NHIN Team organized the Organization responses for review in:
  - Thirty (30) sets with 2-3 reviewers for each set
  - Templates (matrices) with 13 entities by about 4 categories of the 24 questions mapped to each of the three Work Groups (see next slide).
  - For example, WG1: Standards (Questions 4b, 14-18), Technical Development/Architecture (Questions 2-4a, 23), Technical Services/Operations (Questions 9-11), and General Comments, by Federal Government, Industry Software/Hardware Vendors, etc.
Slide 7: NHIN RFI: 1.3 Analysis and Reporting Strategy
- NHIN Team divided the participants into three Work Groups:
  - Technical and Architecture
  - Organization and Business Framework
  - Finance, Privacy, Regulatory, and Legal
- Each Work Group created Major Themes:
  - WG1: 3, WG2: 2, and WG3: 3
- Each Work Group reported out on Sub-teams:
  - WG1: 5, WG2: 5, and WG3: 4
- NHIN Team mapped the Work Group results to new structures for two reports:
  - Report 1 - Sections: 7, Sub-sections: 17, and Sub-sub-sections: 18
  - Report 2 - Sections: 4, Sub-sections: 16, and Sub-sub-sections: 86
Slide 8: NHIN RFI: 1.3 Analysis and Reporting Strategy
- There is and will be criticism:
  - "It is important to note, in the front when talking about the process, that approximately 270 RFIs were not reviewed by the interagency process. The process that ONCHIT used to select and review these responses should be made clear." (name withheld)
- There will be responses to criticism:
  - Statistical Summary Analysis of Responses from Individuals
  - 85% of the responses had strong concerns about the potential loss of privacy, along with 53% of health officials who had the same concern.
  - 17% of health officials shared their experiences with implementations of EHR systems.
  - Only about 4% expressed enthusiasm for the creation of a system that would facilitate interoperability.
Slide 9: NHIN RFI: 1.4 Business Cases
- "Veterans Can Personalize Medical Records on VA Web Site," GCN, November 9, 2004
- My HealtheVet (also copies parts of VistA)
- Could allow the VA to share patient data with other providers.
- Patients can request changes to their medical records and allow their loved ones or their physicians to access portions of their records.
- iHealthBeat, November 13, 2004.
Slide 10: NHIN RFI: 1.4 Business Cases
- Canadian Health Infoway
- An EHR solution is a combination of people, organizational entities, business processes, systems, technology and standards that interact and exchange clinical data. A network of interoperable EHR solutions, one that links clinics, hospitals, pharmacies, and other points of care, will help enhance quality of care and patient safety, improve Canadians' access to health services, and make the health care system more efficient.
- Interoperability for electronic health records is the capability of computer and software systems to seamlessly communicate with each other. It is central to Infoway's mission, making clinical data available across the continuum of care and across health delivery organizations and regions, promoting reusable and replicable solutions that can be aligned with jurisdictional priorities and deployed across the country more cost-efficiently. Without a common framework and sets of standards, EHR systems across Canada would be a patchwork of incompatible systems and technologies.
Accelerating the development of Electronic Health Information Systems for Canadians: http://www.infoway-inforoute.ca/ehr/index.php?lang=en
Slide 11: NHIN RFI: 1.4 Business Cases
Canadian Health Infoway: Standards Collaboration, http://www.infoway-inforoute.ca/ehr/standards_overview.php?lang=en
Slide 12: NHIN RFI: 1.4 Business Cases
- One recent study estimated that net savings from national implementation of fully standardized interoperability between providers and five other types of organizations could reach $77.8 billion annually, or approximately 5 percent of the projected $1.7 trillion spent on U.S. health care in 2003.
- Source: J. Walker et al., "The Value of Health Care Information Exchange and Interoperability," Health Affairs, January 19, 2005.
Slide 13: NHIN RFI: 1.5 Leadership Statements
- HHS Secretary Leavitt's Keynote Address at AFCEA International's Homeland Security Conference, February 22, 2005 (see http://www.fcw.com/article88110)
- The next frontier of human productivity is the Interoperability Era.
- Collaboration is the premium leadership skill that's needed in this new era.
- Interoperability begins by setting standards and should be organically grown through the "messy, complex, difficult process called collaboration."
- Several elements (8) will improve the chances for success: a common pain; a convener of stature; a committed leader; openness, transparency, and voluntary participation; a critical mass of stakeholders; representatives of substance; a clearly defined purpose and goal; and a formally written and signed charter.
Slide 14: NHIN RFI: 1.5 Leadership Statements
- Dr. Brailer's Keynote Address at HIMSS Conference, February 17, 2005
- Interoperability Themes from RFIs:
  - Standards (WG1 and WG2)
  - Governance (WG2)
  - Privacy (WG3)
  - Regionalization (Initially none, then WG2)
  - Financing (WG3)
  - Architecture (WG1)
  - Regulation (WG3)
Mappings to WGs added by the author of this presentation.
Slide 15: NHIN RFI: 1.6 Related Activities
- Federal Health Architecture (FHA) Interoperability Work Group, March 17 and 24, 2005
- Goal: Technology Standards Harmonization
  - Strive for consensus on some of the potential technical specifications (see next slide)
  - Draft Health Information Interoperability Standards Profile
  - Present standards to OMB as Draft Standards for Trial Use (DSTU)
  - Follow up with more detailed guidance on implementation
- Concern: Narrow focus of Work Group is on the less crucial aspect of interoperability (technical standards)
Slide 16: Approach for Technology Classification
[Figure: matrix of technologies by layer, compared across Web Services, ebXML, and Other approaches.]
- Security: XML Digital Signature, XKMS, SAML, WS-Security, XACML, PKI, SSL
- Data: HL7 (V 3.0, V 2.x); XML (XSLT, XSL, etc.); Other (ASCII, Binary, e.g., image)
- Business Process: BPEL, BPSS
- Message Oriented Interchange
- Registry (RIM) and Discovery: UDDI
- Description: WSDL, CPP/A
- Message: SOAP; SOAP w/ attach., ebMS; SOAP
- Transport: HTTP; HTTP; HTTP, SMTP, FTP
Source: FHA Health Interoperability Work Group, March 24, 2005.
Slide 17: NHIN RFI: 1.6 Related Activities
- FHA Architectural Peer Review Group (APRG) Initial Meeting, February 11, 2005
- Scope: Health Domains as identified by the FHA Health Domain WG and incorporated into the FHA BRM (see FEA06 Revision Summary, page 4).
- Semantics: Recommendations were made to consider an ontology that is being developed for this purpose by the CIO Council (actually by GSA, TopQuadrant, and SICoP).
- See Slide 18 for an example.
Slide 18: NHIN RFI: 1.6 Related Activities
- Healthcare Informatics Online, January 2004 Cover Story on Emerging Technologies
- Concept introduced in a 2001 Scientific American article and described using the scenario of a man who goes online, employing intelligent agents on the Semantic Web to set up a series of physician appointments and physical therapy sessions for his ailing mother. (It could be 10 years before such agent-enabled scenarios play out, but simpler semantic functions are already emerging.)
- My Note: Semantic Web Applications for National Security (SWANS), April 7-8, 2005, Crystal City, Virginia.
Slide 19: NHIN RFI: 1.6 Related Activities
- Healthcare Informatics Online, January 2004 Cover Story on Emerging Technologies
- It's not a Web replacement; it's an evolution based largely on eXtensible Markup Language (XML) with added technologies that allow computers to interpret and process data ontologies, or relationships between disparate pieces of information.
- The Semantic Web would represent a worldwide Web of connected data, radically different from today's Web of discrete documents, which is why it could be the affordable answer to the electronic health record.
- My Note: The Semantic Web could also deal with the privacy and security concerns expressed in the RFI Individual Responses.
Slide 20: NHIN RFI: 1.6 Related Activities
Slide 21: NHIN RFI: 1.7 Building Ontologies
- The Mind Map Book: How to Use Radiant Thinking to Maximize Your Brain's Untapped Potential (Tony Buzan)
- Before the web came hypertext. And before hypertext came mind maps.
- A mind map consists of a central word or concept. Around the central word you draw the 5 to 10 main ideas that relate to that word. You then take each of those child words and again draw the 5 to 10 main ideas that relate to it.
- Mind maps allow associations and links to be recorded and reinforced.
- The non-linear nature of mind maps makes it easy to link and cross-reference different elements of the map.
- See next slide for examples from the Explorer's Guide to the Semantic Web, Thomas Passin, Manning Publications, 2004, pages 106 and 141.
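The mind map structure described on this slide (a central concept, a handful of child ideas, each expanded again) is a tree. A minimal Python sketch follows; the `MindMapNode` class is a hypothetical helper, not part of any mind-mapping tool, and the branch labels are illustrative picks from this presentation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class MindMapNode:
    """One word or concept in a mind map (hypothetical helper class)."""
    label: str
    children: List["MindMapNode"] = field(default_factory=list)

    def add(self, label: str) -> "MindMapNode":
        """Attach a child idea and return it so the branch can be extended."""
        child = MindMapNode(label)
        self.children.append(child)
        return child

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, str]]:
        """Yield (depth, label) pairs in depth-first order."""
        yield depth, self.label
        for child in self.children:
            yield from child.walk(depth + 1)


# Central concept with a few illustrative branches.
root = MindMapNode("NHIN")
work_groups = root.add("Work Groups")
for name in ("Technical and Architecture",
             "Organization and Business Framework",
             "Finance, Privacy, Regulatory, and Legal"):
    work_groups.add(name)
root.add("Frameworks").add("Semantic")

# Render the tree as an indented outline, two spaces per level.
outline = [("  " * depth) + label for depth, label in root.walk()]
print("\n".join(outline))
```

Because each node only knows its children, cross-links between branches (the "associations" a mind map records) would need an extra edge list; that is left out of this sketch.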
Slide 22: Mind Maps for Searching and Ontologies
[Figure: two mind maps, "Searching" and "Ontologies".]
- Branch labels: ENVIRONMENT, CLASSIFICATION, KINDS, NAMES, STRATEGIES, LANGUAGES
- Terms shown: informal, formal, distinctions, multiple trees, hierarchies, taxonomies, vocabularies; ad hoc categories, predefined; internet (huge, changing, growing, inconsistent); keywords, ontologies, classification, metadata, semantic focusing, social analysis, multiple passes, clustering; combining, specifying, commitment; properties, relationships, constraints, identifiers; RDFS, OWL, DAML, Description Logics
Note: These are not complete.
Slide 23: NHIN RFI: 1.7 Building Ontologies
[Figure: NHIN mind map showing possible/probable interrelationships. Central concept: NHIN.]
- DR. BRAILER: standards, governance, privacy, regionalization, financing, architecture, regulation
- FRAMEWORKS: organizational, technical, semantic
- RFI: general, organizational, business, management, operational, standards, policies, financial, regulatory, legal, other
- STANDARDS ORGANIZATIONS: NCVHS, CCHIT, etc.
- WORK GROUPS: technical, architecture, organization, business, financial, regulatory, legal
- ORGANIZATIONAL STRUCTURE: regional initiatives, clinical practice, population health, health interoperability, Federal Health Architecture
- STRATEGIC PLAN GOALS: Inform Clinical Practice, Interconnect Clinicians, Personalize Care, Improve Population Health
- OTHER: other
Slide 24: NHIN RFI: 1.7 Building Ontologies
- An ontology organizes things into types and categories within a well-defined structure, forming a network of concepts.
- Specific ontologies must be constructed with known vocabularies and rules of construction.
- A good ontology requires:
  - The ability to conceptualize and articulate the underlying ideas.
  - Skill at modeling abstractions.
  - Knowledge of the syntax of the modeling language.
    - OWL is poised to become the major ontology language for the Web.
  - Use of well-developed and accepted ontologies whenever possible.
    - The Suggested Upper Merged Ontology (SUMO) is a best practice example.
  - A Community of Practice with all of these skills that can collaborate to develop the ontology.
    - The Ontolog Forum is a best practice example (see next slide).
Slide 25: NHIN RFI: 1.7 Building Ontologies
- A key aspect of successful large-scale interoperability is shared meaning.
- Shared meaning requires not only a common syntax (XML), but a common vocabulary.
- That common vocabulary should be defined in terms of the broadest and most general foundation concepts and be in a formal and computable language not subject to human interpretation in English alone.
- Formal ontologies, defined in logic, and a hierarchy of ontologies that build from a common semantic foundation are needed (see next slide).
Slide 26: Current Ontology-Driven Information System for FHA/NHIN
[Figure: hierarchy of ontologies.] Examples:
- SUMO
- HL7 RIM, FEA-RMO
- EON, SNOMED CT, LOINC
Source: "Netcentric Semantic Linking (Mapping): An Approach for Enterprise Semantic Interoperability," Mary Pulvermacher et al., MITRE, October 2004.
Slide 27: NHIN RFI: 1.7 Building Ontologies
- Strategy for the NHIN Ontology:
  - Compile a repository/library of NHIN public and RFI documents in their native file formats.
  - Repurpose the documents:
    - Proprietary to text formats.
    - Proprietary to XML documents.
    - Chunk large documents into sub-documents.
  - Compile the NHIN Mind Maps for defining searches and building the ontology.
  - Work with ontology communities of practice to draw in their expertise.
    - Proposed new Ontology and Taxonomy Coordinating Group (ONTACG) of SICoP.
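The "chunk large documents into sub-documents" step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `chunk_document`, the blank-line paragraph boundary, and the `max_chars` limit are all hypothetical choices, not the behavior of any tool named in this presentation.

```python
from typing import List


def chunk_document(text: str, max_chars: int = 2000) -> List[str]:
    """Split text into sub-documents of at most max_chars, breaking on blank lines.

    A single paragraph longer than max_chars is kept whole rather than cut
    mid-sentence (an assumption made for this sketch).
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for para in paragraphs:
        added = len(para) + (2 if current else 0)  # 2 = the "\n\n" separator
        if current and size + added > max_chars:
            # Current chunk is full: flush it and start a new one.
            chunks.append("\n\n".join(current))
            current, size = [], 0
            added = len(para)
        current.append(para)
        size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "Question 1 response.\n\nQuestion 2 response.\n\nQuestion 3 response."
parts = chunk_document(sample, max_chars=45)
print(len(parts), "sub-documents")
```

Splitting on paragraph boundaries keeps each sub-document readable on its own, which matters when the chunks are later indexed and categorized individually.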
Slide 28: 2. Results and Next Steps
- 2.1 The Challenge
- 2.2 A Suggested Solution
- 2.3 The Content
- 2.4 The Pilot
- 2.5 Sample Results
- 2.6 Next Steps
Slide 29: 2.1 The Challenge
- Extract and organize the semantic concepts from about 5,000 pages of semi-structured content in support of a comprehensive analysis to recommend the plan for the National Health Information Network (NHIN).
- For example, Dr. Brailer, ONCHIT Technical Assistance Call, December 6, 2004: "NHIN refers to a specific bundle of technologies, business frameworks, financing arrangements, legal contracting or other mechanisms, policy requirements, organizational issues and related things that allow for network interoperability. So NHIN is the middleware in the grand schema of these pieces."
Slide 30: 2.2 A Suggested Solution
- Besides manual human extraction, individually and in the Work Group environment, there are machine-aided extraction, analysis, and visualization tools that could and should be brought to bear on this problem and that would lead to the building of an ontology.
- This approach was taken with the Federal Enterprise Architecture Reference Models to produce an ontology that has been released.
- http://web-services.gov/fea-rmo.html
Slide 31: 2.3 The Content
- (1) Indexing, categorization, and relationship linking.
- (2) Indexing, keyword/concept extraction, and taxonomy.
- (3) Same as (2).
Slide 32: 2.4 The Pilot
- A Recommended Start to the NHIN Ontology
- The European Interoperability Framework:
  - Organisational
  - Technical
  - Semantic
- Leavitt: interoperability should be organically grown through the "messy, complex, difficult process called collaboration."
- http://www.fcw.com/article88110
Slide 33: 2.4 The Pilot
- Tools
- Selection Criteria:
  - Selected for participation in the SWANS Conference, April 7-8, 2005, because of support for Semantic Technologies (RDF/OWL).
  - Willing to provide hardware, software, and advice for proof of concept.
  - Two or more vendors initially; more after the SWANS Conference.
- Selection:
  - NextPage FolioViews and LivePublish (recently acquired by Fast Search & Transfer)
  - FAST Data Search and ProPublish: http://www.fastsearch.com
  - Content Analyst: http://www.contentanalyst.com
Slide 34: 2.4 The Pilot
- Ontology Expertise
- Ontolog Forum:
  - Submitted Response to the RFI
  - Available on the Internet
  - Providing Ontology Engineering Advice
  - Suggests Brainstorming Session
- Proposed New SICoP Ontology and Taxonomy Coordinating Work Group (ONTACG)
Slides 35-38: 2.5 Sample Results
http://web-services.gov, see Best Practices
Slide 39: 2.5 Sample Results
Folio Views: Infobase of RFIs
Slide 40: 2.5 Sample Results
Content Analyst: Compute Taxonomy
Slide 41: 2.5 Sample Results
Content Analyst: Run Queries
Slide 42: 2.5 Sample Results
Content Analyst: Set Training Documents
Slide 43: 2.5 Sample Results
FAST ProPublish: Production Manager
Slide 44: 2.5 Sample Results
FAST ProPublish: Build Progress
Slide 45: 2.5 Sample Results
FAST Data Search: Search View
Slide 46: 2.5 Sample Results
FAST Data Search: Taxonomy Results Saved in Excel Spreadsheet
Slide 47: 2.6 Next Steps
- NHIN: Suggest a Series of Queries
  - Results can be provided in Excel spreadsheets for further analysis and reuse
- Add content from those agencies interviewed by the FHA Interoperability Work Group recently:
  - VA, DoD, EPA, CDC, FDA, NIH-NCI/DHS/HIS
- See future demonstrations with the initial public domain databases for semantic searching and ontology building (see next slide):
  - SWANS Conference, April 7-8, 2005
  - SICoP Meeting at KM Conference, April 22, 2005
Slide 48: 2.6 Next Steps
Initial Public Domain Databases for Semantic Searching and Ontology Building
Slide 49: Appendices
- A. Ontology Engineering
- B. FAST Data Search and ProPublish
- C. Content Analyst
Slide 50: Appendix A: Ontology Engineering
- A.1 What Is An Ontology?
- A.2 Basic Requirements For an Ontology
- A.3 Ontology Examples
- A.4 Formal Taxonomies for the U.S. Government
- A.5 Medical Informatics Ontologies: Examples and Design Decisions
- A.6 GLIF in Protégé
- A.7 Why Develop an Ontology?
- A.8 Ontology-Development Process
- A.9 What Is Ontology Engineering?
- A.10 Ontology-Driven Information Systems
Slide 51: A.1 What Is An Ontology?
- An ontology is an explicit description of a domain:
  - concepts
  - properties and attributes of concepts
  - constraints on properties and attributes
  - individuals (often, but not always)
- An ontology defines:
  - a common vocabulary
  - a shared understanding
Slide 52: A.2 Basic Requirements For an Ontology
- 1. Finite controlled (extensible) vocabulary.
- 2. Unambiguous interpretation of classes and term relationships.
- 3. Strict hierarchical subclass relationships between classes.
- 4. Few others
Source: Deborah McGuinness, "Ontologies Come of Age," in The Semantic Web: Why, What, and How, MIT Press, 2002, page 6.
Slide 53: A.3 Ontology Examples
- Taxonomies on the Web:
  - Yahoo! categories
- Catalogs for on-line shopping:
  - Amazon.com product catalog
- Domain-specific standard terminology:
  - SNOMED Clinical Terms: terminology for clinical medicine
  - UNSPSC: terminology for products and services
Slide 54: A.4 Formal Taxonomies for the U.S. Government
- OWL Listing:

    <?xml version="1.0"?>
    <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
        xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
        xmlns:owl="http://www.w3.org/2002/07/owl#"
        xmlns:daml="http://www.daml.org/2001/03/daml+oil#"
        xmlns="http://www.owl-ontologies.com/unnamed.owl#"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xml:base="http://www.owl-ontologies.com/unnamed.owl">
      <owl:Ontology rdf:about=""/>
      <owl:Class rdf:ID="Transportation"/>
      <owl:Class rdf:ID="AirVehicle">
        <rdfs:subClassOf rdf:resource="#Transportation"/>
      </owl:Class>
      <owl:Class rdf:about="#GroundVehicle">
        <rdfs:subClassOf rdf:resource="#Transportation"/>
      </owl:Class>
      <owl:Class rdf:about="#Automobile">
        <rdfs:subClassOf>
          <owl:Class rdf:ID="GroundVehicle"/>
        </rdfs:subClassOf>
      Etc.

Transportation Class Hierarchy
Source: "Formal Taxonomies for the U.S. Government," Michael Daconta, Metadata Program Manager, US Department of Homeland Security, XML.Com, http://www.xml.com/pub/a/2005/01/26/formtax.html
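To show how a class hierarchy can be read back out of an OWL listing like the one on this slide, here is a small sketch using Python's standard `xml.etree.ElementTree`. The embedded snippet is a simplified, self-contained variant written for this example (it uses `rdf:resource` for every parent rather than a nested `owl:Class`), and the helper names `subclass_pairs` and `ancestors` are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# Simplified stand-in for the slide's listing (an assumption for illustration).
OWL_SNIPPET = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}" xmlns:owl="{OWL}">
  <owl:Class rdf:ID="Transportation"/>
  <owl:Class rdf:ID="AirVehicle">
    <rdfs:subClassOf rdf:resource="#Transportation"/>
  </owl:Class>
  <owl:Class rdf:ID="GroundVehicle">
    <rdfs:subClassOf rdf:resource="#Transportation"/>
  </owl:Class>
  <owl:Class rdf:ID="Automobile">
    <rdfs:subClassOf rdf:resource="#GroundVehicle"/>
  </owl:Class>
</rdf:RDF>"""


def class_name(elem):
    """Return the local class name from rdf:ID or rdf:about."""
    ident = elem.get(f"{{{RDF}}}ID") or elem.get(f"{{{RDF}}}about", "")
    return ident.lstrip("#")


def subclass_pairs(owl_xml):
    """Return (child, parent) pairs for every rdfs:subClassOf with rdf:resource."""
    root = ET.fromstring(owl_xml)
    pairs = []
    for cls in root.findall(f"{{{OWL}}}Class"):
        for sub in cls.findall(f"{{{RDFS}}}subClassOf"):
            parent = sub.get(f"{{{RDF}}}resource", "").lstrip("#")
            if parent:
                pairs.append((class_name(cls), parent))
    return pairs


def ancestors(name, pairs):
    """Walk up the hierarchy; assumes each class has at most one parent."""
    parent_of = dict(pairs)
    chain = []
    while name in parent_of:
        name = parent_of[name]
        chain.append(name)
    return chain


pairs = subclass_pairs(OWL_SNIPPET)
print(ancestors("Automobile", pairs))
```

A full OWL toolkit would also handle nested class expressions and multiple inheritance; this sketch only covers the flat `rdf:resource` form.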
Slide 55: A.5 Medical Informatics Ontologies: Examples and Design Decisions
- Foundational Model of Anatomy (FMA)
  - Developed at University of Washington as part of the Digital Anatomist project.
  - Contains 70,000 distinct concepts, 110,000 terms, and 140 relations
- Gene Ontology (GO)
  - A controlled vocabulary for describing genes and gene products with three organizing components: Molecular function, Biological process, and Cellular component.
- Health Level 7 (HL7) Data Types and Top-Level RIM Classes
  - HL7 data types as Protégé classes
- Guideline Interchange Format (GLIF) (see next slide)
  - A format for sharing clinical guidelines independent of platforms and systems
  - Designed to support multiple vocabularies and medical knowledge bases.
  - Designed to work with different patient information models.
Slide 56: A.6 GLIF in Protégé
Slide 57: A.7 Why Develop an Ontology?
- To share common understanding of the structure of information:
  - among people
  - among software agents
- To enable reuse of domain knowledge:
  - to avoid re-inventing the wheel
  - to introduce standards to allow interoperability
Slide 58: A.8 Ontology-Development Process
In reality - an iterative process
Slide 59: A.9 What Is Ontology Engineering?
- Ontology Engineering: Defining terms in the domain and relations among them
  - Defining concepts in the domain (classes)
  - Arranging the concepts in a hierarchy (subclass-superclass hierarchy)
  - Defining which attributes and properties (slots) classes can have and constraints on their values
  - Defining individuals and filling in slot values
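The four ontology-engineering steps on this slide (classes, hierarchy, slots with constraints, individuals) can be mirrored in a few lines of Python. This is a rough sketch, not how Protégé or OWL represent these ideas: the classes `OntologyClass` and `Individual`, the type-only constraint check, and the "Document"/"RFIResponse" example are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class OntologyClass:
    """A concept with an optional superclass and typed slots."""
    name: str
    parent: Optional["OntologyClass"] = None
    slots: Dict[str, type] = field(default_factory=dict)

    def all_slots(self) -> Dict[str, type]:
        """Slots declared here plus those inherited from superclasses."""
        inherited = self.parent.all_slots() if self.parent else {}
        return {**inherited, **self.slots}


@dataclass
class Individual:
    """An instance of a class; slot values are checked against the slot types."""
    cls: OntologyClass
    values: Dict[str, object]

    def __post_init__(self):
        allowed = self.cls.all_slots()
        for slot, value in self.values.items():
            if slot not in allowed:
                raise ValueError(f"{self.cls.name} has no slot {slot!r}")
            if not isinstance(value, allowed[slot]):
                raise TypeError(f"slot {slot!r} expects {allowed[slot].__name__}")


# Steps 1-3: define classes, arrange them in a hierarchy, declare slots.
document = OntologyClass("Document", slots={"title": str})
rfi_response = OntologyClass("RFIResponse", parent=document,
                             slots={"page_count": int})

# Step 4: define an individual and fill in slot values.
response = Individual(rfi_response, {"title": "Sample response", "page_count": 12})
print(sorted(rfi_response.all_slots()))
```

Real ontology languages express far richer constraints (cardinality, value ranges, disjointness); the type check here stands in for that idea only.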
Slide 60: A.10 Ontology-Driven Information Systems
- Methodology Side: the adoption of a highly interdisciplinary approach
  - Analyze the structure at a high level of generality.
  - Formulate a clear and rigorous vocabulary.
- Architectural Side: the central role in the main components of an information system
  - Information resources.
  - User interfaces.
  - Application programs.
See, for example, Nicola Guarino, "Formal Ontology and Information Systems," Proceedings of FOIS '98, Trento, Italy, 6-8 June 1998.
Slide 61: Appendix B: FAST Data Search
- B.1 Gartner Magic Quadrant for Enterprise Search, 2004
- B.2 FAST Data Search
  - Categorization and Taxonomy Support
  - Integration
- B.3 FAST ProPublish System Overview
  - Gather Content
  - Process Content
  - Deliver Content
Slide 62: B.1 Gartner Magic Quadrant for Enterprise Search, 2004
Source: Gartner Research ID Number M-22-7894, Whit Andrews, 17 May 2004.
Slide 63: B.1 Gartner Analysis: Leaders
- Fast Search & Transfer (FAST) now is counted in the Leaders quadrant, moving from the Visionaries quadrant. The vendor has experienced explosive growth, providing better-than-average means and an expanding list of approaches of determining relevancy. Its architecture is superior among search vendors, and sales are strong. (Sales of enterprise search technology were $42 million in 2003, up from $36 million in 2002.) Its acquisition of the remainder of AltaVista's business has had no real impact on operations.
- Critical questions include whether FAST will:
  - 1) remain a specialist in search technologies;
  - 2) pursue "search-derivative applications" (FAST's term for the general application category founded on search platforms, including customer relationship management (CRM) knowledge base support tools and scientific research managers); or
  - 3) focus on original equipment manufacturer arrangements or on a broader suite of applications, such as those included in a smart enterprise suite.
- Search vendors typically follow an arc that leads to their acquiring a company, to failure or to a position as an enduring leader. FAST has the opportunity to pursue the last path.
- Note added by Brand Niemann: FAST acquired NextPage in December 2004, which provides electronic publishing software to 6 of the 9 leading electronic publishers in the world. I have used NextPage in the pilots to date.
Slide 64: B.2 FAST Data Search: Categorization and Taxonomy Support
Slide 65: B.2 FAST Data Search: Integration
Slide 66: B.3 FAST ProPublish System Overview
- Gather Content
- Process Content
- Deliver Content
Slide 67: B.3 FAST ProPublish System Overview
- Searches in the online FAST ProPublish system are powered by FAST's proven search technology. Search results are displayed on a results list, and additional navigation interfaces such as key words, dynamic drill-down lists, metadata structures, and hierarchy are also provided. When documents are retrieved, they are pulled from the content repository. Search hits are highlighted in HTML and XML documents.
- FAST ProPublish is designed to be a distributed application. Nearly every component may be run on a separate machine (or multiple machines) for extreme scalability and reliability. However, this same flexibility also allows all of the components to be run on a single server.
- FAST ProPublish provides the following services:
  - Search and query.
  - Data and text mining and analysis.
  - Exploration and static reporting.
Slide 68: B.3 FAST ProPublish System Overview
- Gather Content
- The Production Manager is the tool you use to create a collection. Also, through the Production Manager graphical user interface, you can establish a library. A library consists of a collection or group of related collections and enables you to structure content. That is, you can define a library hierarchically with folders, sub-folders, and collection nodes the way you want the content to appear on your site.
- Production Manager has the functionality and capability to build libraries from existing collections, or from collections that you define and build within the Production Manager interface from various sources of content.
Slide 69: B.3 FAST ProPublish System Overview
- Process Content
- A collection is, as the name implies, a collection of content/documents and is fully indexed, structured, and searchable. Documents within a collection reside in their native formats. Collections house three "chunks" of information:
  - The table of contents (TOC)
  - An index of the content
  - A copy of the content
- Because collections contain this information, they are self-contained and portable.
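The three "chunks" a collection holds (table of contents, index, and a copy of the content) can be illustrated with a toy data structure. The sketch below is not FAST ProPublish's internal format, which this presentation does not describe; the `Collection` class, its inverted index, and the all-terms-must-match search rule are assumptions chosen to keep the example short.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Collection:
    """Toy stand-in for a collection: TOC, index, and a copy of the content."""
    toc: List[str] = field(default_factory=list)
    index: Dict[str, Set[str]] = field(default_factory=lambda: defaultdict(set))
    content: Dict[str, str] = field(default_factory=dict)

    def add_document(self, title: str, text: str) -> None:
        """Record the title in the TOC, keep a copy, and index every term."""
        self.toc.append(title)
        self.content[title] = text
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.index[term].add(title)

    def search(self, query: str) -> List[str]:
        """Return titles containing every query term, in TOC order."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return []
        hits = set.intersection(*(self.index.get(t, set()) for t in terms))
        return [title for title in self.toc if title in hits]


collection = Collection()
collection.add_document("Response A", "Privacy concerns about health records")
collection.add_document("Response B", "Standards for interoperability of health records")
print(collection.search("health records"))
```

Because the object carries its own TOC, index, and content, it can be serialized and moved as one unit, which is the sense in which the slide calls collections self-contained and portable.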
Slide 70: B.3 FAST ProPublish System Overview
- Process Content
- Each node in the content tree is a library, folder, sub-folder, or collection.
- Folder nodes can contain other content nodes (such as sub-folders and collections).
- You can organize these nodes (folder and collection) within this pane according to your content and business needs to create a hierarchy of content for the library.
Slide 71: B.3 FAST ProPublish System Overview
Process Content: Content Tab Icons and Descriptions
Slide 72: B.3 FAST ProPublish System Overview
- Deliver Content
- The user interface is composed of individual components built using Velocity templates and the Struts framework. Some of the components are:
  - Search components: search forms (simple, advanced, and custom), search results page (configurable), parametric search.
  - Navigation components: hierarchical table of contents, browse-by-category, dynamic drill down for search refinement, breadcrumb trails.
  - Document display components: document retrieval, search hit highlighting, next / previous document, next / previous hit document.
Slide 73: B.3 FAST ProPublish System Overview
Deliver Content: Default User Interface
Slide 74: B.3 FAST ProPublish System Overview
Deliver Content: Advanced Search Page
Slide 75: Appendix C: Content Analyst
- C.1 Definitions
- C.2 Conceptual Mapping
- C.3 Document Proximity → Conceptual Similarity
- C.4 Term Proximity → Conceptual Similarity
- C.5 No Auxiliary Structures Required
- C.6 Retrieval Using Conceptual Comparison
- C.7 Terminology Variant Clustering
- C.8 Conceptual Generalization
Slide 76: Appendix C: Content Analyst (continued)
- C.9 Deep Conceptual Generalization
- C.10 Cross-lingual Operations
- C.11 Cross-lingual Capabilities
- C.12 Automated Information Organization
- C.13 Category Creation by Example
- C.14 Automatic Categorization
- C.15 Categorizing Items of Interest
- C.16 Automated Taxonomy Generation
Slide 77: Appendix C: Content Analyst (continued)
- C.17 Instant Context Display
- C.18 Alias Identification
- C.19 Automated Thematic Decomposition
- C.20 Conceptual Interlingua
- C.21 Product Status
- C.22 Performance
- C.23 For More Information
Slide 78: C.1 Definitions
- Content Analyst is a machine learning technique that allows conceptual comparison of text objects, based on the technique of Latent Semantic Indexing.
- Latent Semantic Indexing is a patented machine learning technique that enables technology to identify, represent, and compare concepts that exist within a collection of documents or data.
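To make the idea of Latent Semantic Indexing concrete, the sketch below applies textbook LSI (a truncated singular value decomposition of a term-document matrix) to an invented five-term, four-document corpus. It is not Content Analyst's implementation, which this presentation does not describe. To stay dependency-free it finds the top singular vectors by power iteration, and all names (`top_eigenvectors`, `to_concept_space`, the toy corpus) are assumptions for illustration.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(u: Vector, v: Vector) -> float:
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


def top_eigenvectors(matrix: List[Vector], k: int, iters: int = 300) -> List[Vector]:
    """Top-k eigenvectors of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(matrix)
    m = [row[:] for row in matrix]
    found = []
    for _ in range(k):
        v = normalize([float(i + 1) for i in range(n)])  # fixed start vector
        for _ in range(iters):
            v = normalize([dot(row, v) for row in m])
        eigenvalue = dot(v, [dot(row, v) for row in m])
        found.append(v)
        # Deflate: subtract the component just found so the next pass finds the next one.
        m = [[m[i][j] - eigenvalue * v[i] * v[j] for j in range(n)] for i in range(n)]
    return found


TERMS = ["car", "automobile", "engine", "wheat", "farm"]
DOCS = [
    ["car", "engine"],
    ["automobile", "engine"],
    ["wheat", "farm"],
    ["wheat", "farm"],
]


def term_vector(words: List[str]) -> Vector:
    return [float(words.count(t)) for t in TERMS]


doc_vectors = [term_vector(d) for d in DOCS]
# Term-by-term matrix A * A^T, where A is the term-document matrix.
gram = [[sum(d[i] * d[j] for d in doc_vectors) for j in range(len(TERMS))]
        for i in range(len(TERMS))]
concepts = top_eigenvectors(gram, k=2)  # left singular vectors of A


def to_concept_space(vec: Vector) -> Vector:
    """Project a term vector onto the two retained concept dimensions."""
    return [dot(c, vec) for c in concepts]


query = term_vector(["car"])
raw_similarity = cosine(query, doc_vectors[1])
lsi_similarity = cosine(to_concept_space(query), to_concept_space(doc_vectors[1]))
lsi_unrelated = cosine(to_concept_space(query), to_concept_space(doc_vectors[2]))
print(round(raw_similarity, 3), round(lsi_similarity, 3), round(lsi_unrelated, 3))
```

In the raw term space the query "car" shares no word with the "automobile engine" document, so their similarity is zero. In the two-dimensional concept space the two land on the same vehicle dimension because "car" and "automobile" both co-occur with "engine", while the agriculture documents stay orthogonal.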
Slide 79: C.2 Conceptual Mapping
[Figure: documents mapped to concepts such as Transportation, Biological Weapons, and Agriculture.]
Slide 80: C.3 Document Proximity → Conceptual Similarity
[Figure: Content Analyst Representation Space.]
Slide 81: C.4 Term Proximity → Conceptual Similarity
[Figure: the terms "Car" and "Automobile" in the Content Analyst Representation Space.]
Slide 82: C.5 No Auxiliary Structures Required
Slide 83: C.6 Retrieval Using Conceptual Comparison
[Figure: a query and documents in relevance order.]
Proximity → Conceptual Similarity → Natural Ranking
Slide 84: C.7 Terminology Variant Clustering
[Figure: the query "Osama bin Laden" and name variants in the representation space.]
- Variants shown: Osama bin Laden, Osama BinLadin, Osama Binladen, Osama bin Ladin, Usama bin Laden, Usama bin Ladin, Usama Binladin, Usama Binladen
Slide 85: C.8 Conceptual Generalization
[Figure: CA Space.]
- User's terminology: "Bomb"
- Author's terminology: "... devices that spread shrapnel ..."
Slide 86: C.9 Deep Conceptual Generalization
[Figure: a document passage reading "Methods of armed struggle not accepted internationally" linked to the concept "War Crimes".]
Slide 87: C.10 Cross-lingual Operations
- Documents in Multiple Languages
- Retrieved Documents in Correct Relevance Order
Slide 88: C.11 Cross-lingual Capabilities
Current
Future
- Arabic
- Chinese
- English
- Farsi
- French
- Korean
- Russian
- Spanish
- Pashtu
- Urdu
- Italian
- German
- Portuguese
- Dutch
Slide 89: C.12 Automated Information Organization
- Sorting into Predetermined Categories
- Determining the Natural Topical Breakdown of Information
Slide 90: C.13 Category Creation by Example
[Figure: CA Representation Space.] Documents like this correspond to the category Bioterrorism: "... anthrax ... smallpox ..."
Slide 91: C.14 Automatic Categorization
[Figure: CA Space.]
- Document will be assigned to this category
Slide 92: C.15 Categorizing Items of Interest
[Figure labels: Hamas, Precursors, Sept. Report, Hamas Exemplar Document]
Slide 93: C.16 Automated Taxonomy Generation
[Figure labels: New Content, Taxonomy]
Slide 94: C.17 Instant Context Display
- gb
- sarin
- organophosphorous
- poisonous
- vapors
- cholinesterase
- resorptive
- bezhenar
Last February Qatada and seven other men, said to
be members of the GSPC's British cell, were
arrested in London after the discovery of plans
to bomb or use GB against an unspecified target
in Strasbourg. Charges against Qatada were not
pursued. During the investigation, codenamed
Operation Odin, Special Branch officers raided
Qatada's home in Acton, west London.
Slide 95: C.18 Alias Identification
- ressam
- ressams
- ahmed
- benni
- charkaoui
- zubeir
- abdelrazik
- zoubeida
Five men, three of whom identified themselves as
Algerian, were arrested Thursday by federal
officials wanting to question them about their
possible links to Ahmed Ressam, an Algerian
arrested in Washington state on explosive
smuggling charges.
Slide 96: C.19 Automated Thematic Decomposition
The hardware, software, and bandwidth currently installed are adequate to support this level of downloading activity. Three people currently are engaged in developing a comprehensive list of URLs to be monitored. This is a labor-intensive task, as existing Internet indexes of online newspapers are very incomplete. Final decisions have not yet been made as to the eventual level of caching that will be done, or the total number of users to be supported. One of the most important aspects of the existing implementation is a web crawler that we have developed and refined over the past five years that is optimized for this application. This crawler can deal with the many idiosyncrasies of this type of download activity: primitive communications in some countries, bizarre naming conventions, inconsistent and partial postings, and frequent changes in web page structure. The current implementation of this crawler reflects five years of lessons learned in carrying out newspaper downloads from the Internet.

One of the functions to be carried out with the downloaded data is entity and relationship extraction. In support of this effort, SAIC personnel have conducted a comparison of current entity and relationship software packages. The test involved processing of actual downloaded material. Of the half dozen packages tested, the product from Attensity was, by far, the most complete and accurate. This package is being procured for use in the download processing. It should be noted that even the best of the entity and relationship packages still miss many entities and relationships of interest and still generate an undesirably high number of false relations. We have a current task to examine the ways in which Content Analyst and Attensity can be used together to provide significantly improved overall entity and relationship extraction capabilities.

Although not addressed in the RFI, one topic that we have paid considerable attention to is processing of images of newspapers using optical character recognition (OCR). At present, approximately 13% of all foreign newspapers posted to the web consist of images of pages, as opposed to character-encoded representations. This includes some important newspapers; for example, most of the Urdu material on the web is only available as images. In order to automatically filter these articles, and to make them available for retrieval, an OCR process must be carried out. At various times over the past five years we have implemented such capabilities for Arabic, Chinese, Farsi, and Russian materials. OCR of newspaper articles is a challenging, but not impossible task. The biggest problem is caused by the low resolution of images posted to the web

[Figure labels: Topic 1, Topic 2, Topic 3]
Slide 97: C.20 Conceptual Interlingua
[Figure: arbitrary documents mapped to concepts such as Transportation, Biological Weapons, and Agriculture.]
Slide 98: C.21 Product Status
- 6 Years Development
- 3 Years Operational Experience
  - 24X7 Operations
  - Multi-million Document Databases
- Conforms to Modern Standards
  - J2EE
  - UNICODE
  - XML
Slide 99: C.22 Performance
- Can Fully Index > 1M Documents in 14 Hours on a Single PC
- Can Categorize > 1 Million Documents per Day on a Single PC
- Can Distribute Index Creation and Retrieval Operations across Multiple PCs
Slide 100: C.23 For More Information
- Roger Bradford, 703-391-8700 x110, rbradford_at_contentanalyst.com