Building An Ontology of the NHIN: Status Report 3 - PowerPoint PPT Presentation


1
Building An Ontology of the NHIN Status Report 3
  • Brand Niemann
  • Co-Chair, Semantic Interoperability Community of
    Practice (SICoP)
  • Best Practices Committee (BPC), CIO Council, and
  • Enterprise Architecture Team, Office of
    Environmental Information
  • U.S. Environmental Protection Agency
  • April 5, 2005

2
Overview
  • 1. The National Health Information Network (NHIN)
    Request for Information (RFI)
  • 1.1 Scope & Quality
  • 1.2 Statistics
  • 1.3 Analysis & Reporting Strategy
  • 1.4 Business Cases
  • 1.5 Leadership Statements
  • 1.6 Related Activities
  • 1.7 Building Ontologies
  • 2. Results and Next Steps
  • Appendices

3
NHIN RFI: 1.1 Scope & Quality
  • The NHIN RFI stimulated substantial and
    unprecedented interest.
  • Cumulatively, the 512 responses yielded nearly
    5,000 pages of information.
  • The National Coordinator established a federal
    government wide RFI review task force (RTF) to
    review, summarize and analyze the RFI responses.
  • The RTF consists of more than 120 Federal
    officials from 17 agencies.

4
NHIN RFI: 1.1 Scope & Quality
  • The responses to these initial questions yielded
    the richest and most descriptive collection of
    thoughts on interoperability and health
    information exchange that has likely ever been
    assembled in the United States.
  • The responses to the general questions are a
    treasure trove of the best thinking on the topic.

5
NHIN RFI: 1.2 Statistics
6
NHIN RFI: 1.3 Analysis & Reporting Strategy
  • The NHIN RFI consisted of:
  • Twenty-four (24) questions, in
  • Six (6) basic groups
  • The NHIN Team divided the RFI responses into two
    basic groups:
  • Individuals (283)
  • Organizations (229)
  • The NHIN Team organized the Organization
    responses for review in:
  • Thirty (30) sets with 2-3 reviewers for each set
  • Templates (matrices) of 13 entities by about 4
    categories, with the 24 questions mapped to each
    of the three Work Groups (see next slide).
  • For example, WG1: Standards (Questions 4b,
    14-18), Technical Development/Architecture
    (Questions 2-4a, 23), Technical
    Services/Operations (Questions 9-11), and General
    Comments, by Federal Government, Industry
    Software/Hardware Vendors, etc.

7
NHIN RFI: 1.3 Analysis & Reporting Strategy
  • The NHIN Team divided the participants into three
    Work Groups:
  • Technical and Architecture
  • Organization and Business Framework
  • Finance, Privacy, Regulatory, and Legal
  • Each Work Group created Major Themes:
  • WG1: 3, WG2: 2, and WG3: 3
  • Each Work Group reported out on Sub-teams:
  • WG1: 5, WG2: 5, and WG3: 4
  • The NHIN Team mapped the Work Group results to new
    structures for two reports:
  • Report 1: 7 sections, 17 sub-sections, and
    18 sub-sub-sections
  • Report 2: 4 sections, 16 sub-sections, and
    86 sub-sub-sections

8
NHIN RFI: 1.3 Analysis & Reporting Strategy
  • There is and will be criticism:
  • "It is important to note, in the front when
    talking about the process, that approximately 270
    RFIs were not reviewed by the interagency
    process. The process that ONCCHIT used to select
    and review these responses should be made clear."
    (name withheld)
  • There will be responses to criticism:
  • Statistical Summary Analysis of Responses from
    Individuals:
  • 85% of the responses expressed strong concerns
    about the potential loss of privacy, as did 53% of
    health officials.
  • 17% of health officials shared their experiences
    with implementations of EHR systems.
  • Only about 4% expressed enthusiasm for the
    creation of a system that would facilitate
    interoperability.

9
NHIN RFI: 1.4 Business Cases
  • Veterans Can Personalize Medical Records on VA
    Web Site, GCN, November 9, 2004
  • My HealtheVet (also copy parts of VistA)
  • Could allow the VA to share patient data with
    other providers.
  • Patients can request changes to their medical
    records and allow their loved ones or their
    physicians to access portions of their records.
  • iHealthBeat, November 13, 2004.

10
NHIN RFI: 1.4 Business Cases
  • Canadian Health Infoway
  • An EHR solution is a combination of people,
    organizational entities, business processes,
    systems, technology and standards that interact
    and exchange clinical data. A network of
    interoperable EHR solutions, one that links
    clinics, hospitals, pharmacies, and other points
    of care, will help enhance quality of care and
    patient safety, improve Canadians' access to
    health services, and make the health care system
    more efficient.
  • Interoperability for electronic health records is
    the capability of computer and software systems
    to seamlessly communicate with each other. It is
    central to Infoway's mission, making clinical
    data available across the continuum of care and
    across health delivery organizations and regions,
    promoting reusable and replicable solutions that
    can be aligned with jurisdictional priorities and
    deployed across the country more
    cost-efficiently. Without a common framework and
    sets of standards, EHR systems across Canada
    would be a patchwork of incompatible systems and
    technologies.

Accelerating the development of Electronic
Health Information Systems for Canadians
http://www.infoway-inforoute.ca/ehr/index.php?lang=en
11
NHIN RFI: 1.4 Business Cases
Canadian Health Infoway Standards Collaboration
http://www.infoway-inforoute.ca/ehr/standards_overview.php?lang=en
12
NHIN RFI: 1.4 Business Cases
  • One recent study estimated that national
    implementation of fully standardized
    interoperability between providers and five other
    types of organizations could yield net savings of
    $77.8 billion annually, or approximately 5 percent
    of the projected $1.7 trillion spent on U.S.
    health care in 2003.
  • Source: J. Walker et al., "The Value of Health
    Care Information Exchange and Interoperability,"
    Health Affairs, January 19, 2005.
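
As a quick check of the percentage quoted on this slide, the arithmetic can be sketched as follows. The variable names are illustrative, and the figures are assumed to be U.S. dollars as cited from the study.

```python
# Figures as quoted on the slide (assumed to be U.S. dollars).
annual_savings = 77.8e9   # estimated net savings per year
total_spending = 1.7e12   # projected U.S. health care spending, 2003

share = annual_savings / total_spending
print(f"Savings as a share of spending: {share:.1%}")
```

The ratio comes to roughly 4.6 percent, which the slide rounds to "approximately 5 percent".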

13
NHIN RFI: 1.5 Leadership Statements
  • HHS Secretary Leavitt's Keynote Address at
    AFCEA International's Homeland Security
    Conference, February 22, 2005 (see
    http://www.fcw.com/article88110)
  • The next frontier of human productivity is the
    Interoperability Era.
  • Collaboration is the premium leadership skill
    that's needed in this new era.
  • Interoperability begins by setting standards and
    should be organically grown through the "messy,
    complex, difficult process called collaboration."
  • Several elements (8) will improve the chances for
    success: a common pain; a convener of
    stature; a committed leader; openness,
    transparency, and voluntary participation; a
    critical mass of stakeholders; representatives of
    substance; a clearly defined purpose and goal;
    and a formally written and signed charter.

14
NHIN RFI: 1.5 Leadership Statements
  • Dr. Brailer's Keynote Address at HIMSS
    Conference, February 17, 2005
  • Interoperability Themes from RFIs:
  • Standards (WG1 & WG2)
  • Governance (WG2)
  • Privacy (WG3)
  • Regionalization (Initially none, then WG2)
  • Financing (WG3)
  • Architecture (WG1)
  • Regulation (WG3)

Mappings to WGs added by author of this
presentation.
15
NHIN RFI: 1.6 Related Activities
  • Federal Health Architecture (FHA)
    Interoperability Work Group, March 17 and 24,
    2005
  • Goal: Technology Standards Harmonization
  • Strive for consensus on some of the potential
    technical specifications (see next slide)
  • Draft Health Information Interoperability
    Standards Profile
  • Present standards to OMB as Draft Standards for
    Trial Use (DSTU)
  • Follow up with more detailed guidance on
    implementation
  • Concern: The narrow focus of the Work Group is on
    the less crucial aspect of interoperability
    (technical standards)

16
Approach for Technology Classification
[Diagram: technology stack by layer, compared across
Web Services, ebXML, and Other.]
  • Data: HL7 V 2.x and V 3.0; XML (XSLT, XSL, etc.);
    Other (ASCII, binary, e.g., image)
  • Business Process: BPEL, BPSS, Message Oriented
    Interchange
  • Discovery: UDDI, Registry (RIM)
  • Description: WSDL, CPP/A
  • Message: SOAP; SOAP w/ attach., ebMS
  • Transport: HTTP, SMTP, FTP
  • Security: XML Digital Signature, XKMS, SAML,
    WS-Security, XACML, PKI, SSL

Source: FHA Health Interoperability Work Group,
March 24, 2005.
17
NHIN RFI: 1.6 Related Activities
  • FHA Architectural Peer Review Group (APRG)
    Initial Meeting, February 11, 2005
  • Scope: Health Domains as identified by the FHA
    Health Domain WG and incorporated into the FHA
    BRM (see FEA06 Revision Summary, page 4).
  • Semantics: Recommendations were made to consider
    an ontology that is being developed for this
    purpose by the CIO Council (actually by GSA,
    TopQuadrant, and SICoP).
  • See Slide 18 for an example.

18
NHIN RFI: 1.6 Related Activities
  • Healthcare Informatics Online, January 2004 Cover
    Story on Emerging Technologies
  • Concept introduced in 2001 Scientific American
    article and described using the scenario of a man
    who goes online, employing intelligent agents on
    the Semantic Web to set up a series of physician
    appointments and physical therapy sessions for
    his ailing mother. (It could be 10 years before
    such agent-enabled scenarios play out, but
    simpler semantic functions are already emerging.)
  • My Note: Semantic Web Applications for National
    Security (SWANS), April 7-8, 2005, Crystal City,
    Virginia.

19
NHIN RFI: 1.6 Related Activities
  • Healthcare Informatics Online, January 2004 Cover
    Story on Emerging Technologies
  • It's not a Web replacement, it's an evolution
    based largely on eXtensible Markup Language (XML)
    with added technologies that allow computers to
    interpret and process data ontologies, or
    relationships between disparate pieces of
    information.
  • The Semantic Web would represent a worldwide Web
    of connected data, radically different from
    today's Web of discrete documents, which is why
    it could be the affordable answer to the
    electronic health record.
  • My Note: The Semantic Web could also deal with
    the privacy and security concerns expressed in
    the RFI Individual Responses.

20
NHIN RFI: 1.6 Related Activities
21
NHIN RFI: 1.7 Building Ontologies
  • The Mind Map Book: How to Use Radiant Thinking to
    Maximize Your Brain's Untapped Potential (Tony
    Buzan)
  • Before the web came hypertext. And before
    hypertext came mind maps.
  • A mind map consists of a central word or concept;
    around the central word you draw the 5 to 10 main
    ideas that relate to that word. You then take
    each of those child words and again draw the 5 to
    10 main ideas that relate to it.
  • Mind maps allow associations and links to be
    recorded and reinforced.
  • The non-linear nature of mind maps makes it easy
    to link and cross-reference different elements of
    the map.
  • See next slide for examples from the Explorer's
    Guide to the Semantic Web, Thomas Passin,
    Manning Publications, 2004, pages 106 and 141.
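
The central-word-and-children structure described on this slide is a tree. A minimal sketch, assuming a hypothetical Node class and example labels chosen only for illustration (cross-links between branches are left out):

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One word or concept in a mind map."""
    label: str
    children: list = field(default_factory=list)

    def add(self, label: str) -> "Node":
        """Attach a child idea and return it so it can be expanded further."""
        child = Node(label)
        self.children.append(child)
        return child


def outline(node: Node, depth: int = 0) -> list:
    """Flatten the map into indented outline lines, depth first."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical example map.
root = Node("Ontologies")
languages = root.add("Languages")
for name in ("RDFS", "OWL", "DAML"):
    languages.add(name)
root.add("Kinds").add("Taxonomies")

print("\n".join(outline(root)))
```

Printing the outline gives the central word at the left margin, with each level of child ideas indented one step further.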

22
Mind Maps for Searching and Ontologies
[Figure: two mind maps, one for Searching and one for
Ontologies.]
Branches: ENVIRONMENT, CLASSIFICATION, KINDS, STRATEGIES,
NAMES, LANGUAGES
Terms: informal, formal, distinctions, multiple trees,
hierarchies, taxonomies, vocabularies, ad hoc categories,
internet, huge, changing, growing, inconsistent, predefined,
keywords, ontologies, classification, metadata, semantic
focusing, social analysis, multiple passes, clustering,
combining, specifying, commitment, properties,
relationships, constraints, identifiers, RDFS, OWL, DAML,
Description Logics
Note: These are not complete.
23
NHIN RFI: 1.7 Building Ontologies
[Figure: mind map centered on NHIN. Branch labels are in
capitals; the other lines are the terms attached to them.]
standards governance privacy regionalization financing
architecture regulation
organizational technical semantic
general organizational business management
operational standards policies financial,
regulatory, legal other
DR. BRAILER
RFI
FRAMEWORKS
STANDARDS ORGANIZATIONS
NHIN
WORK GROUPS
NCVHS CCHIT Etc.
technical architecture organization
business financial, regulatory, legal
ORGANIZATIONAL STRUCTURE
OTHER
other
STRATEGIC PLAN GOALS
regional initiatives clinical practice population
health health interoperability Federal Health
Architecture
Possible/probable interrelationships
Inform Clinical Practice Interconnect
Clinicians Personalize Care Improve Population
Health
24
NHIN RFI: 1.7 Building Ontologies
  • An ontology organizes things into types and
    categories within a well-defined structure,
    forming a network of concepts.
  • Specific ontologies must be constructed with
    known vocabularies and rules of construction.
  • A good ontology requires:
  • The ability to conceptualize and articulate the
    underlying ideas.
  • Skill at modeling abstractions.
  • Knowledge of the syntax of the modeling language.
  • OWL is poised to become the major ontology
    language for the Web.
  • Use of well-developed and accepted ontologies
    whenever possible.
  • The Suggested Upper Merged Ontology (SUMO) is a
    best practice example.
  • A Community of Practice with all of these skills
    that can collaborate to develop the ontology.
  • The Ontolog Forum is a best practice example (see
    next slide).

25
NHIN RFI: 1.7 Building Ontologies
  • A key aspect of successful large scale
    interoperability is shared meaning.
  • Shared meaning requires not only a common syntax
    (XML), but a common vocabulary.
  • That common vocabulary should be defined in terms
    of the broadest and most general foundation
    concepts and be in a formal and computable
    language not subject to human interpretation in
    English alone.
  • Formal ontologies, defined in logic, and a
    hierarchy of ontologies that build from a common
    semantic foundation are needed (see next slide).

26
Current Ontology-Driven Information System for
FHA/NHIN
Examples
SUMO
HL7 RIM FEA-RMO
EON SNOMED CT LOINC
Source: Netcentric Semantic Linking (Mapping): An
Approach for Enterprise Semantic
Interoperability, Mary Pulvermacher et al.,
MITRE, October 2004.
27
NHIN RFI: 1.7 Building Ontologies
  • Strategy for the NHIN Ontology:
  • Compile repository/library of NHIN public and RFI
    documents in their native file formats.
  • Repurpose the documents:
  • Proprietary to text formats.
  • Proprietary to XML documents.
  • Chunk large documents into sub-documents.
  • Compile the NHIN Mind Maps for defining
    searches and building the ontology.
  • Work with ontology communities of practice to
    draw in their expertise.
  • Proposed new Ontology and Taxonomy Coordinating
    Group (ONTACG) of SICoP.
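
The "chunk large documents into sub-documents" step could look something like the sketch below. The slides do not say how chunking was actually done in the pilot; the function name, the blank-line paragraph rule, and the word limit are all assumptions made for illustration.

```python
def chunk_document(text: str, max_words: int = 200) -> list:
    """Group paragraphs into sub-documents of roughly max_words words.

    Breaks fall only between paragraphs (blank lines), so a single
    paragraph longer than max_words is kept whole.
    """
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        words = len(para.split())
        # Start a new sub-document when this paragraph would overflow it.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "a b c\n\nd e\n\nf g h i"
print(chunk_document(sample, max_words=5))
```

With a five-word limit, the first two paragraphs of the sample fit together and the third starts a new sub-document.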

28
2. Results and Next Steps
  • 2.1 The Challenge
  • 2.2 A Suggested Solution
  • 2.3 The Content
  • 2.4 The Pilot
  • 2.5 Sample Results
  • 2.6 Next Steps

29
2.1 The Challenge
  • Extract and organize the semantic concepts from
    about 5000 pages of semi-structured content in
    support of a comprehensive analysis to recommend
    the plan for the National Health Information
    Network (NHIN).
  • For example, Dr. Brailer, ONCHIT Technical
    Assistance Call, December 6, 2004: "NHIN refers to
    a specific bundle of technologies, business
    frameworks, financing arrangements, legal
    contracting or other mechanisms, policy
    requirements, organizational issues and related
    things that allow for network interoperability.
    So NHIN is the middleware in the grand schema of
    these pieces."

30
2.2 A Suggested Solution
  • Besides manual human extraction, individually and
    in the Work Group environment, there are
    machine-aided extraction, analysis, and
    visualization tools that could and should be
    brought to bear on this problem and that would
    lead to the building of an ontology.
  • This approach was taken with the Federal
    Enterprise Architecture Reference Models to
    produce an ontology that has been released.
  • http://web-services.gov/fea-rmo.html

31
2.3 The Content
  • (1) Indexing, categorization, and relationship
    linking.
  • (2) Indexing, keyword/concept extraction, and
    taxonomy.
  • (3) Same as (2).

32
2.4 The Pilot
  • A Recommended Start to the NHIN Ontology
  • The European Interoperability Framework:
  • Organisational
  • Technical
  • Semantic
  • Leavitt on interoperability: "...interoperability
    should be organically grown through the 'messy,
    complex, difficult process called collaboration.'"
  • http://www.fcw.com/article88110

33
2.4 The Pilot
  • Tools
  • Selection Criteria:
  • Selected for participation in the SWANS
    Conference, April 7-8, 2005, because of support
    for Semantic Technologies (RDF/OWL).
  • Willing to provide hardware, software, and advice
    for proof of concept.
  • Two or more vendors initially; more after the
    SWANS Conference
  • Selection:
  • NextPage FolioViews and LivePublish (recently
    acquired by Fast Search & Transfer)
  • FAST Data Search and ProPublish
  • http://www.fastsearch.com
  • Content Analyst
  • http://www.contentanalyst.com

34
2.4 The Pilot
  • Ontology Expertise
  • Ontolog Forum
  • Submitted Response to the RFI
  • Available on the Internet
  • Providing Ontology Engineering Advice
  • Suggests Brainstorming Session
  • Proposed New SICoP Ontology and Taxonomy
    Coordinating Work Group (ONTACG)

35
2.5 Sample Results
http://web-services.gov, see Best Practices
36
2.5 Sample Results
http://web-services.gov, see Best Practices
37
2.5 Sample Results
http://web-services.gov, see Best Practices
38
2.5 Sample Results
http://web-services.gov, see Best Practices
39
2.5 Sample Results
Folio Views: Infobase of RFIs
40
2.5 Sample Results
Content Analyst: Compute Taxonomy
41
2.5 Sample Results
Content Analyst: Run Queries
42
2.5 Sample Results
Content Analyst: Set Training Documents
43
2.5 Sample Results
FAST ProPublish: Production Manager
44
2.5 Sample Results
FAST ProPublish: Build Progress
45
2.5 Sample Results
FAST Data Search: Search View
46
2.5 Sample Results
FAST Data Search: Taxonomy Results Saved in Excel
Spreadsheet
47
2.6 Next Steps
  • NHIN: Suggest a Series of Queries
  • Results can be provided in Excel spreadsheets for
    further analysis and reuse
  • Add content from those agencies interviewed by
    the FHA Interoperability Work Group recently
  • VA, DoD, EPA, CDC, FDA, NIH-NCI/DHS/HIS
  • See future demonstrations with the initial public
    domain databases for semantic searching and
    ontology building (see next slide)
  • SWANS Conference, April 7-8, 2005
  • SICoP Meeting at KM Conference, April 22, 2005

48
2.6 Next Steps
Initial Public Domain Databases for Semantic
Searching and Ontology Building
49
Appendices
  • A. Ontology Engineering
  • B. FAST Data Search and ProPublish
  • C. Content Analyst

50
Appendix A: Ontology Engineering
  • A.1 What Is An Ontology?
  • A.2 Basic Requirements For an Ontology
  • A.3 Ontology Examples
  • A.4 Formal Taxonomies for the U.S. Government
  • A.5 Medical Informatics Ontologies Examples and
    Design Decisions
  • A.6 GLIF in Protégé
  • A.7 Why Develop an Ontology?
  • A.8 Ontology-Development Process
  • A.9 What Is Ontology Engineering?
  • A.10 Ontology-Driven Information Systems

51
A.1 What Is An Ontology?
  • An ontology is an explicit description of a
    domain:
  • concepts
  • properties and attributes of concepts
  • constraints on properties and attributes
  • individuals (often, but not always)
  • An ontology defines:
  • a common vocabulary
  • a shared understanding

52
A.2 Basic Requirements For an Ontology
  • 1. Finite controlled (extensible) vocabulary.
  • 2. Unambiguous interpretation of classes and term
    relationships.
  • 3. Strict hierarchical subclass relationships
    between classes.
  • 4. Few others

Source: Deborah McGuinness, "Ontologies Come of
Age," in the Semantic Web: Why, What, and How, MIT
Press, 2002, page 6.
53
A.3 Ontology Examples
  • Taxonomies on the Web
  • Yahoo! categories
  • Catalogs for on-line shopping
  • Amazon.com product catalog
  • Domain-specific standard terminology
  • SNOMED Clinical Terms - terminology for clinical
    medicine
  • UNSPSC - terminology for products and services

54
A.4 Formal Taxonomies for the U.S. Government
  • OWL Listing
    <?xml version="1.0"?>
    <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
        xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
        xmlns:owl="http://www.w3.org/2002/07/owl#"
        xmlns:daml="http://www.daml.org/2001/03/daml+oil#"
        xmlns="http://www.owl-ontologies.com/unnamed.owl#"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xml:base="http://www.owl-ontologies.com/unnamed.owl">
      <owl:Ontology rdf:about=""/>
      <owl:Class rdf:ID="Transportation"/>
      <owl:Class rdf:ID="AirVehicle">
        <rdfs:subClassOf rdf:resource="#Transportation"/>
      </owl:Class>
      <owl:Class rdf:about="#GroundVehicle">
        <rdfs:subClassOf rdf:resource="#Transportation"/>
      </owl:Class>
      <owl:Class rdf:about="#Automobile">
        <rdfs:subClassOf>
          <owl:Class rdf:ID="GroundVehicle"/>
        </rdfs:subClassOf>
    Etc.

Transportation Class Hierarchy
Source: Formal Taxonomies for the U.S.
Government, Michael Daconta, Metadata Program
Manager, US Department of Homeland Security,
XML.com,
http://www.xml.com/pub/a/2005/01/26/formtax.html
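
A listing like the one on this slide can be read with a standard XML parser to recover the class hierarchy. The sketch below uses Python's xml.etree.ElementTree. The slide's listing is truncated at "Etc.", so the closing tags in OWL_DOC are an assumption added to make a complete document, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# Shortened version of the slide's listing; closing tags are assumed.
OWL_DOC = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:ID="Transportation"/>
  <owl:Class rdf:ID="AirVehicle">
    <rdfs:subClassOf rdf:resource="#Transportation"/>
  </owl:Class>
  <owl:Class rdf:about="#GroundVehicle">
    <rdfs:subClassOf rdf:resource="#Transportation"/>
  </owl:Class>
  <owl:Class rdf:about="#Automobile">
    <rdfs:subClassOf>
      <owl:Class rdf:ID="GroundVehicle"/>
    </rdfs:subClassOf>
  </owl:Class>
</rdf:RDF>
"""


def class_name(elem):
    """Return the local name from rdf:ID, rdf:about, or rdf:resource."""
    for attr in ("ID", "about", "resource"):
        value = elem.get(f"{{{RDF}}}{attr}")
        if value is not None:
            return value.lstrip("#")
    return None


def subclass_pairs(xml_text):
    """Return sorted (subclass, superclass) pairs from top-level classes."""
    root = ET.fromstring(xml_text)
    pairs = []
    for cls in root.findall(f"{{{OWL}}}Class"):
        child = class_name(cls)
        for sub in cls.findall(f"{{{RDFS}}}subClassOf"):
            parent = class_name(sub)
            if parent is None:
                # The superclass is given as a nested owl:Class element.
                nested = sub.find(f"{{{OWL}}}Class")
                parent = class_name(nested) if nested is not None else None
            if child and parent:
                pairs.append((child, parent))
    return sorted(pairs)


for child, parent in subclass_pairs(OWL_DOC):
    print(f"{child} is a subclass of {parent}")
```

This handles only the two ways of naming a superclass that appear in the listing (an rdf:resource reference and a nested class). A dedicated RDF library would be the more robust choice for real ontologies.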
55
A.5 Medical Informatics Ontologies Examples and
Design Decisions
  • Foundational Model of Anatomy (FMA)
  • Developed at University of Washington as part of
    the Digital Anatomist project.
  • Contains 70,000 distinct concepts, 110,000
    terms, and 140 relations
  • Gene Ontology (GO)
  • A controlled vocabulary for describing genes and
    gene products with three organizing components:
    molecular function, biological process, and
    cellular component.
  • Health Level 7 (HL7) Data Types and Top-Level RIM
    Classes
  • HL7 data types as Protégé classes
  • Guideline Interchange Format (GLIF) (see next
    slide)
  • A format for sharing clinical guidelines
    independent of platforms and systems
  • Designed to support multiple vocabularies and
    medical knowledge bases.
  • Designed to work with different patient
    information models.

56
A.6 GLIF in Protégé
57
A.7 Why Develop an Ontology?
  • To share common understanding of the structure of
    information
  • among people
  • among software agents
  • To enable reuse of domain knowledge
  • to avoid re-inventing the wheel
  • to introduce standards to allow interoperability

58
A.8 Ontology-Development Process
  • In this tutorial

In reality - an iterative process
59
A.9 What Is Ontology Engineering?
  • Ontology Engineering: Defining terms in the
    domain and relations among them
  • Defining concepts in the domain (classes)
  • Arranging the concepts in a hierarchy
    (subclass-superclass hierarchy)
  • Defining which attributes and properties (slots)
    classes can have and constraints on their values
  • Defining individuals and filling in slot values
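
The four activities listed on this slide can be illustrated with a small sketch. This is not Protégé's data model or any standard API. OntClass, Individual, and the slot names are hypothetical, and the class names are borrowed from the transportation example in A.4.

```python
from dataclasses import dataclass, field


@dataclass
class OntClass:
    """A concept (class) with an optional superclass and its own slots."""
    name: str
    parent: "OntClass | None" = None
    slots: dict = field(default_factory=dict)  # slot name -> required type

    def all_slots(self) -> dict:
        """Slots defined here plus those inherited from superclasses."""
        inherited = self.parent.all_slots() if self.parent is not None else {}
        return {**inherited, **self.slots}

    def is_a(self, other: "OntClass") -> bool:
        """True if this class is `other` or one of its subclasses."""
        cls = self
        while cls is not None:
            if cls is other:
                return True
            cls = cls.parent
        return False


@dataclass
class Individual:
    """An instance of a class, with slot values checked on creation."""
    name: str
    cls: OntClass
    values: dict

    def __post_init__(self):
        allowed = self.cls.all_slots()
        for slot, value in self.values.items():
            if slot not in allowed:
                raise ValueError(f"{self.cls.name} has no slot {slot!r}")
            if not isinstance(value, allowed[slot]):
                raise TypeError(f"slot {slot!r} expects {allowed[slot].__name__}")


# Concepts, arranged in a subclass-superclass hierarchy, with slots.
transportation = OntClass("Transportation", slots={"max_passengers": int})
ground_vehicle = OntClass("GroundVehicle", transportation, {"wheels": int})
automobile = OntClass("Automobile", ground_vehicle)

# An individual with its slot values filled in.
my_car = Individual("my_car", automobile, {"wheels": 4, "max_passengers": 5})
print(automobile.is_a(transportation), sorted(automobile.all_slots()))
```

The only constraint checked here is the value type of each slot. Ontology languages such as OWL can express much richer constraints, for example cardinality and value restrictions.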

60
A.10 Ontology-Driven Information Systems
  • Methodology Side: the adoption of a highly
    interdisciplinary approach
  • Analyze the structure at a high level of
    generality.
  • Formulate a clear and rigorous vocabulary.
  • Architectural Side: the central role in the main
    components of an information system
  • Information resources.
  • User interfaces.
  • Application programs.

See, for example, Nicola Guarino, "Formal Ontology
and Information Systems," Proceedings of FOIS '98,
Trento, Italy, 6-8 June 1998.
61
Appendix B: FAST Data Search
  • B.1 Gartner Magic Quadrant for Enterprise Search,
    2004
  • B.2 FAST Data Search
  • Categorization and Taxonomy Support
  • Integration
  • B.3 FAST ProPublish System Overview
  • Gather Content
  • Process Content
  • Deliver Content

62
B.1 Gartner Magic Quadrant for Enterprise Search,
2004
Source: Gartner Research, ID Number M-22-7894,
Whit Andrews, 17 May 2004.
63
B.1 Gartner Analysis Leaders
  • Fast Search & Transfer (FAST) now is counted in
    the Leaders quadrant, moving from the Visionaries
    quadrant. The vendor has experienced explosive
    growth, providing better-than-average means and
    an expanding list of approaches for determining
    relevancy. Its architecture is superior among
    search vendors, and sales are strong. (Sales of
    enterprise search technology were $42 million in
    2003, up from $36 million in 2002.) Its
    acquisition of the remainder of AltaVista's
    business has had no real impact on operations.
  • Critical questions include whether FAST will:
  • 1) remain a specialist in search technologies,
  • 2) pursue "search-derivative applications"
    (FAST's term for the general application category
    founded on search platforms, including customer
    relationship management (CRM) knowledge base
    support tools and scientific research managers),
    or
  • 3) focus on original equipment manufacturer
    arrangements or on a broader suite of
    applications, such as those included in a smart
    enterprise suite. Search vendors typically follow
    an arc that leads to their acquiring a company,
    to failure or to a position as an enduring
    leader. FAST has the opportunity to pursue the
    last path.
  • Note added by Brand Niemann: FAST acquired
    NextPage in December 2004. NextPage provides
    electronic publishing software to 6 of the 9
    leading electronic publishers in the world. I
    have used NextPage in the pilots to date.

64
B.2 FAST Data Search Categorization and Taxonomy
Support
65
B.2 FAST Data Search Integration
66
B.3 FAST ProPublish System Overview
Gather Content
Process Content
Deliver Content
67
B.3 FAST ProPublish System Overview
  • Searches in the online FAST ProPublish system are
    powered by FAST proven search technology. Search
    results are displayed on a results list and
    additional navigation interfaces such as key
    words, dynamic drill-down lists, metadata
    structures, and hierarchy are also provided. When
    documents are retrieved, they are pulled from the
    content repository. Search hits are highlighted
    in HTML and XML documents.
  • FAST ProPublish is designed to be a distributed
    application. Nearly every component may be run on
    a separate machine (or multiple machines) for
    extreme scalability and reliability. However,
    this same flexibility also allows all of the
    components to be run on a single server.
  • FAST ProPublish provides the following services
  • Search and query.
  • Data and text mining and analysis.
  • Exploration and static reporting.

68
B.3 FAST ProPublish System Overview
  • Gather Content
  • The Production Manager is the tool you use to
    create a collection. Also, through the Production
    Manager graphical user interface, you can
    establish a library. A library consists of a
    collection or group of related collections and
    enables you to structure content. That is, you
    can define a library hierarchically with folders,
    sub-folders, and collection nodes the way you
    want the content to appear on your site.
  • Production Manager has the functionality and
    capability to build libraries from existing
    collections, or from collections that you define
    and build within the Production Manager interface
    from various sources of content.

69
B.3 FAST ProPublish System Overview
  • Process Content
  • A collection is, as the name implies, a
    collection of content/documents and is fully
    indexed, structured, and searchable. Documents
    within a collection reside in their native
    formats. Collections house three "chunks" of
    information
  • The table of contents (TOC)
  • An index of the content
  • A copy of the content
  • Because collections contain this information,
    they are self-contained and portable.
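
To make the three "chunks" concrete, here is a generic sketch of a self-contained collection that holds a table of contents, an index, and a copy of the content. It does not reflect how FAST ProPublish stores collections internally. The Collection class and its methods are hypothetical, and the index is a plain inverted index from words to document titles.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Collection:
    """Hypothetical stand-in for a collection: TOC, index, and content."""
    toc: list = field(default_factory=list)
    index: dict = field(default_factory=lambda: defaultdict(set))
    content: dict = field(default_factory=dict)

    def add(self, title: str, text: str) -> None:
        """Store a document copy, list it in the TOC, and index its words."""
        self.toc.append(title)
        self.content[title] = text
        for word in text.lower().split():
            self.index[word.strip(".,;:()")].add(title)

    def search(self, word: str) -> list:
        """Return the titles of documents containing the word."""
        return sorted(self.index.get(word.lower(), set()))


rfis = Collection()
rfis.add("RFI 1", "Privacy and interoperability.")
rfis.add("RFI 2", "Standards for interoperability")
print(rfis.search("interoperability"))
```

Because the table of contents, index, and content travel together in one object, the collection can be moved or copied as a unit, which is the portability point the slide makes.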

70
B.3 FAST ProPublish System Overview
  • Process Content
  • Each node in the content tree is a library,
    folder, sub-folder, or collection.
  • Folder nodes can contain other content nodes
    (such as sub-folders and collections).
  • You can organize these nodes (folder and
    collection) within this pane according to your
    content and business needs to create a hierarchy
    of content for the library.

71
B.3 FAST ProPublish System Overview
Process Content: Content Tab Icons and
Descriptions
72
B.3 FAST ProPublish System Overview
  • Deliver Content
  • The user interface is composed of individual
    components built using Velocity templates and the
    Struts framework. Some of the components are:
  • Search components: search forms (simple,
    advanced, and custom), search results page
    (configurable), parametric search.
  • Navigation components: hierarchical table of
    contents, browse-by-category, dynamic drill down
    for search refinement, breadcrumb trails.
  • Document display components: document retrieval,
    search hit highlighting, next / previous
    document, next / previous hit document.

73
B.3 FAST ProPublish System Overview
Deliver Content: Default User Interface
74
B.3 FAST ProPublish System Overview
Deliver Content: Advanced Search Page
75
Appendix C: Content Analyst
  • C.1 Definitions
  • C.2 Conceptual Mapping
  • C.3 Document Proximity ? Conceptual Similarity
  • C.4 Term Proximity ? Conceptual Similarity
  • C.5 No Auxiliary Structures Required
  • C.6 Retrieval Using Conceptual Comparison
  • C.7 Terminology Variant Clustering
  • C.8 Conceptual Generalization

76
Appendix C: Content Analyst (continued)
  • C.9 Deep Conceptual Generalization
  • C.10 Cross-lingual Operations
  • C.11 Cross-lingual Capabilities
  • C.12 Automated Information Organization
  • C.13 Category Creation by Example
  • C.14 Automatic Categorization
  • C.15 Categorizing Items of Interest
  • C.16 Automated Taxonomy Generation

77
Appendix C: Content Analyst (continued)
  • C.17 Instant Context Display
  • C.18 Alias Identification
  • C.19 Automated Thematic Decomposition
  • C.20 Conceptual Interlingua
  • C.21 Product Status
  • C.22 Performance
  • C.23 For More Information

78
C.1 Definitions
  • Content Analyst
  • is a Machine Learning Technique
  • that allows Conceptual Comparison of Text
    Objects
  • based on the Technique of Latent Semantic
    Indexing.
  • Latent Semantic Indexing is a patented machine
    learning technique that enables technology to
    identify, represent, and compare concepts that
    exist within a collection of documents or data.
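
The general idea behind Latent Semantic Indexing can be sketched in a few lines: build a term-by-document matrix, keep only its strongest k directions, and compare terms or documents in that reduced space. The slides do not describe Content Analyst's implementation, so everything below is an assumption for illustration. In particular, a pure-Python subspace iteration stands in for the singular value decomposition that LSI normally uses, and the four tiny documents are invented.

```python
import math


def term_document_matrix(docs):
    """Rows are terms, columns are documents, entries are raw counts."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    matrix = [[float(doc.split().count(term)) for doc in docs] for term in vocab]
    return vocab, matrix


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def orthonormalize(vectors):
    """Gram-Schmidt: return an orthonormal basis for the given vectors."""
    basis = []
    for v in vectors:
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        basis.append([x / norm for x in v])
    return basis


def rank_k_approximation(matrix, k, iterations=100):
    """Project the matrix onto its top-k document directions."""
    gram = matmul(transpose(matrix), matrix)  # document-by-document
    n = len(gram)
    # Deterministic, linearly independent start vectors.
    basis = orthonormalize(
        [[float((i + 1) ** j) for i in range(n)] for j in range(k)]
    )
    for _ in range(iterations):
        basis = orthonormalize(
            [[sum(g * x for g, x in zip(row, v)) for row in gram] for v in basis]
        )
    projector = matmul(transpose(basis), basis)  # n x n projection
    return matmul(matrix, projector)


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


docs = [
    "car engine wheel",
    "automobile engine wheel",
    "wheat farm crop",
    "farm crop harvest",
]
vocab, matrix = term_document_matrix(docs)
car, auto = vocab.index("car"), vocab.index("automobile")

raw = cosine(matrix[car], matrix[auto])
latent_matrix = rank_k_approximation(matrix, k=2)
latent = cosine(latent_matrix[car], latent_matrix[auto])
print(f"raw similarity: {raw:.2f}, latent similarity: {latent:.2f}")
```

"Car" and "automobile" never occur in the same document, so their raw similarity is zero. In the two-dimensional latent space they become nearly identical because they share the same context words, which is the behavior the Term Proximity slide (C.4) illustrates.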

79
C.2 Conceptual Mapping
[Figure: documents mapped to the concepts Transportation,
Biological Weapons, and Agriculture.]
80
C.3 Document Proximity ? Conceptual Similarity
[Figure: Content Analyst Representation Space.]
81
C.4 Term Proximity ? Conceptual Similarity
[Figure: the terms Car and Automobile in the Content
Analyst Representation Space.]
82
C.5 No Auxiliary Structures Required
83
C.6 Retrieval Using Conceptual Comparison
[Figure: a query and the documents in relevance order.]
Proximity ? Conceptual Similarity ? Natural
Ranking
84
C.7 Terminology Variant Clustering
[Figure: spelling variants clustered around one name.]
Osama bin Laden, Osama BinLadin, Osama Binladen,
Usama bin Ladin, Usama bin Laden, Osama bin Ladin,
Usama Binladin, Usama Binladen
85
C.8 Conceptual Generalization
User's Terminology: Bomb
Author's Terminology: "... devices that spread shrapnel ..."
CA Space
86
C.9 Deep Conceptual Generalization
Document text: "... Methods of armed struggle not
accepted internationally ..."
Concept: War Crimes
87
C.10 Cross-lingual Operations
  • Documents in Multiple Languages
  • English Query
  • Farsi
  • Results
  • English
  • Arabic
  • English
  • Retrieved Documents
  • in Correct Relevance
  • Order
  • English
  • Doc

88
C.11 Cross-lingual Capabilities
Current
Future
  • Near-term
  • Arabic
  • Chinese
  • English
  • Farsi
  • French
  • Korean
  • Russian
  • Spanish
  • Pashtu
  • Urdu
  • Italian
  • German
  • Portuguese
  • Dutch
  • Japanese

89
C.12 Automated Information Organization
  • Sorting into Predetermined Categories
  • Determining the Natural Topical Breakdown of
    Information

90
C.13 Category Creation by Example
Documents like this Correspond to the Category
Bioterrorism
[Figure: an example document mentioning anthrax and smallpox, placed in the CA Representation Space]
91
C.14 Automatic Categorization
  • Exemplar Document
  • Newly Acquired Document
  • Document will be Assigned to this Category

CA Space
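
Categorization by example can be sketched as a nearest-centroid rule. This is an illustration under stated assumptions, not the product's algorithm: the three-dimensional vectors, the category names' exemplar coordinates, the use of a centroid per category, and the 0.8 similarity threshold are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def categorize(doc_vec, exemplars, threshold=0.8):
    """Return the category whose exemplar centroid is most similar to doc_vec.

    Returns None when no category reaches the (assumed) threshold.
    """
    best_category, best_score = None, threshold
    for category, vectors in exemplars.items():
        score = cosine(doc_vec, centroid(vectors))
        if score >= best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical exemplar documents, already projected into the space.
EXEMPLARS = {
    "Bioterrorism": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "Agriculture": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.2]],
}

assigned = categorize([0.85, 0.15, 0.05], EXEMPLARS)   # close to the Bioterrorism exemplars
unassigned = categorize([0.0, 0.1, 0.9], EXEMPLARS)    # far from every exemplar
```

Defining a category takes only a handful of example documents; no keyword lists or rules are written by hand. The threshold keeps documents that resemble no exemplar from being forced into a category.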
92
C.15 Categorizing Items of Interest
  • Newly Acquired Document

Hamas
Precursors
Sept. Report
Hamas Exemplar Document
93
C.16 Automated Taxonomy Generation
New Content
Taxonomy
94
C.17 Instant Context Display
  • gb
  • sarin
  • organophosphorous
  • poisonous
  • vapors
  • cholinesterase
  • resorptive
  • bezhenar

Last February Qatada and seven other men, said to
be members of the GSPC's British cell, were
arrested in London after the discovery of plans
to bomb or use GB against an unspecified target
in Strasbourg. Charges against Qatada were not
pursued. During the investigation, codenamed
Operation Odin, Special Branch officers raided
Qatada's home in Acton, west London.
95
C.18 Alias Identification
  • ressam
  • ressams
  • ahmed
  • benni
  • charkaoui
  • zubeir
  • abdelrazik
  • zoubeida

Five men, three of whom identified themselves as
Algerian, were arrested Thursday by federal
officials wanting to question them about their
possible links to Ahmed Ressam, an Algerian
arrested in Washington state on explosive
smuggling charges.
96
C.19 Automated Thematic Decomposition
The hardware, software, and bandwidth currently
installed are adequate to support this level of
downloading activity. Three people currently are
engaged in developing a comprehensive list of
URLs to be monitored. This is a labor-intensive
task, as existing Internet indexes of online
newspapers are very incomplete. Final decisions
have not yet been made as to the eventual level
of caching that will be done, or the total number
of users to be supported.

One of the most
important aspects of the existing implementation
is a web crawler that we have developed and
refined over the past five years that is
optimized for this application. This crawler can
deal with the many idiosyncrasies of this type of
download activity: primitive communications in
some countries, bizarre naming conventions,
inconsistent and partial postings, and frequent
changes in web page structure. The current
implementation of this crawler reflects five
years of lessons learned in carrying out
newspaper downloads from the Internet.

One of
the functions to be carried out with the
downloaded data is entity and relationship
extraction. In support of this effort, SAIC
personnel have conducted a comparison of current
entity and relationship software packages. The
test involved processing of actual downloaded
material. Of the half dozen packages tested, the
product from Attensity was, by far, the most
complete and accurate. This package is being
procured for use in the download processing. It
should be noted that even the best of the entity
and relationship packages still miss many
entities and relationships of interest and still
generate an undesirably high number of false
relations. We have a current task to examine the
ways in which Content Analyst and Attensity can
be used together to provide significantly
improved overall entity and relationship
extraction capabilities.

Although not
addressed in the RFI, one topic that we have paid
considerable attention to is processing of images
of newspapers using optical character recognition
(OCR). At present, approximately 13 of all
foreign newspapers posted to the web consist of
images of pages, as opposed to character-encoded
representations. This includes some important
newspapers; for example, most of the Urdu
material on the web is only available as images.
In order to automatically filter these articles,
and to make them available for retrieval, an OCR
process must be carried out. At various times
over the past five years we have implemented such
capabilities for Arabic, Chinese, Farsi, and
Russian materials. OCR of newspaper articles is
a challenging, but not impossible task. The
biggest problem is caused by the low resolution
of images posted to the web
Topic 1
Topic 2
Topic 3

97
C.20 Conceptual Interlingua
[Figure: Arbitrary Documents mapped to the concepts Transportation, Biological Weapons, and Agriculture]
98
C.21 Product Status
  • 6 Years Development
  • 3 Years Operational Experience
  • 24X7 Operations
  • Multi-million Document Databases
  • Conforms to Modern Standards: J2EE, UNICODE, XML

99
C.22 Performance
  • Can Fully Index > 1M Documents in 14 Hours on a
    Single PC
  • Can Categorize > 1 Million Documents per Day on a
    Single PC
  • Can Distribute Index Creation and Retrieval
    Operations across Multiple PCs
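
For comparison, the quoted figures can be converted into sustained per-second rates. This is simple arithmetic on the numbers above, treating them as lower bounds and assuming a "day" means 24 hours of continuous processing; the function name is illustrative.

```python
def docs_per_second(documents, hours):
    """Average throughput for a batch of documents processed in the given hours."""
    return documents / (hours * 3600)

# 1 million documents indexed in 14 hours: roughly 19.8 documents per second.
index_rate = docs_per_second(1_000_000, 14)

# 1 million documents categorized in 24 hours: roughly 11.6 documents per second.
categorize_rate = docs_per_second(1_000_000, 24)
```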

100
C.23 For More Information
  • Roger Bradford, 703-391-8700 x110,
    rbradford_at_contentanalyst.com