About
A large portion of the knowledge in companies was stored in unstructured textual formats, with millions of documents often spread across various locations and accessible through intranets. For instance, Boeing had 1 million internal web pages where employee knowledge was stored. These unstructured documents could not be queried easily, which prevented the knowledge they contained from being used automatically or managed effectively and, in turn, reduced company efficiency and competitiveness. Additionally, at a time when companies were increasingly valued for their “intangible assets” such as knowledge, this lack of management led to a loss in company value.
The Semantic Web initiative aimed to address these issues by adding structured content to web documents, allowing knowledge to be managed automatically. Efforts focused on defining standards for organizing knowledge (e.g., XML, RDF), creating knowledge structures (e.g., ontologies), and populating these structures. While new documents could be annotated manually as they were written, the large body of existing unstructured documents remained a problem. Automatic or semi-automatic methods for extracting information from such documents were crucial for improving Knowledge Management (KM), especially in the context of the Semantic Web.
Information Extraction (IE) from texts emerged as a promising area within Human Language Technologies (HLT) for KM. IE involved automatically identifying key facts in electronic documents for future use, such as annotating documents or populating ontologies. It supported knowledge identification and extraction from web documents, either automatically or by assisting human annotators. However, adapting IE systems to new scenarios and tasks was a significant obstacle, as most systems required expert intervention to be ported. This made it difficult for non-experts, particularly in small and medium enterprises, to implement IE. A key challenge for the future was making IE accessible to people with knowledge of the Semantic Web but limited expertise in IE or Computational Linguistics, thereby broadening its application within KM.
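As a rough illustration of what "identifying key facts" for annotation or ontology population can mean, the following minimal sketch extracts one kind of fact from free text and emits it both as subject-predicate-object triples and as inline annotations. It is not the project's method: the project emphasized adaptive, Machine Learning based IE, whereas this sketch uses a single hand-written pattern. The pattern, the relation name `worksAt`, the tag names, and the sample sentence are all assumptions made for the example.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical hand-written rule: "<Person> works at <Organisation>".
# A person is two or more capitalised words; an organisation is one or more.
WORKS_AT = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+) works at "
    r"(?P<org>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
)


def extract_triples(text: str) -> List[Triple]:
    """Return one (subject, predicate, object) triple per pattern match."""
    return [
        (m.group("person"), "worksAt", m.group("org"))
        for m in WORKS_AT.finditer(text)
    ]


def annotate(text: str) -> str:
    """Wrap each matched person and organisation in inline XML-style tags."""
    def tag(m: "re.Match[str]") -> str:
        return (
            f"<person>{m.group('person')}</person> works at "
            f"<org>{m.group('org')}</org>"
        )
    return WORKS_AT.sub(tag, text)


# Invented sample text, used only to exercise the sketch.
doc = (
    "Jane Smith works at Acme Aerospace. The report was filed by "
    "Tom Brown, and Tom Brown works at Northwind."
)

triples = extract_triples(doc)
annotated = annotate(doc)

if __name__ == "__main__":
    for t in triples:
        print(t)
    print(annotated)
```

The triples could be used to populate a knowledge structure, while the tagged text corresponds to document annotation. A hand-written rule like this has to be rewritten by an expert for every new domain, which is the adaptation problem described above.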
Objectives
The consortium studied, designed, and implemented innovative methodologies for Knowledge Management (KM) based on Information Extraction (IE). From a scientific perspective, it focused on two aspects: how KM requirements posed challenges for IE, and how IE influenced KM practices. Practically, it defined tools and methodologies for IE-based KM. It concentrated on developing user-driven IE systems that could be easily adapted to new application domains by users with little or no knowledge of Natural Language Processing. Its approach supported users throughout the entire IE application development process, from design to deployment. The consortium emphasized adaptive IE technology using Machine Learning, leveraging the expertise of its members, who were leaders in the field.
For KM, they designed and implemented methodologies that used IE to capture knowledge from textual documents. They studied how IE impacted the reuse, sharing, and diffusion of knowledge within companies, and how it could be integrated with existing KM tools.
The project provided several benefits: adaptive IE reduced the burden of manual knowledge identification and extraction, making the process more effective and efficient. Human annotators received support in creating semantic web annotations, allowing ontology-based tools to seamlessly integrate into KM environments. IE also provided basic information for KM applications, such as augmented browsers, which produced on-the-fly annotations for web pages, creating additional links or associated knowledge.
Team
Project Partners
IST – Information Society Technologies
The University of Sheffield (project co-ordinator)
The University of Karlsruhe
The Open University
Ontoprise
ITC-irst
Quinary
Publications
Zhu, J., Goncalves, A., Uren, V., Motta, E., Pacheco, R., Song, D. and Rueger, S. (2007) Community Relation Discovery by Named Entities, International Conference on Machine Learning and Cybernetics 2007, Hong Kong, China, IEEE.
Lopez, V. and Motta, E. (2004) Ontology-Driven Question Answering in AquaLog, 9th International Conference on Applications of Natural Language to Information Systems (NLDB 2004), Manchester, UK.