Text-grammatical foundations for the (semi)automated text-to-hypertext conversion

Research topics and goals

Converting linear text documents into documents publishable in a hypertext environment is a complex task requiring conversion software on the technical side as well as conversion strategies and methods on the conceptual side. While most of the research on text-to-hypertext conversion has concentrated on technical aspects or has been tied to specific projects and systems, there is now a growing need for general principles and strategies for handling conceptual problems of text-to-hypertext conversion such as:

  • Segmentation: What are the criteria for segmenting documents into text segments to be used as hypertext units?
  • Reorganization: What are the guidelines for reorganizing the content of these segments so that the resulting hypertext modules are semantically "autonomous", i.e. unchained from the text sequence in the original document, and can thus be integrated into different user-selected pathways?
  • Linking: What are the guidelines and principles for reconnecting the modules via hyperlinks and linking patterns?

The project focuses on these conceptual problems, using XML as the technical basis for hypertext modelling and viewing. The central idea of the project is to base conversion strategies on annotations which explicitly mark up the text-grammatical structures and relations between text segments, e.g. co-reference relations, semantics of connectives, text-deictic expressions, and expressions indicating topic handling. The project developed a methodology which (semi-)automatically constructs hypertext layers and views, using the text-grammatical annotations.

Our conversion approach operates on two levels:

  • On the document level we annotate the documents in our corpus on different linguistic and text-grammatical annotation layers. This markup is then used for automatic segmentation, linking and reorganization.
  • On the domain knowledge level, we represent the main concepts of our subject domains in a WordNet-style semantic net called TermNet.

In this approach, we store the hypertext views as an additional document layer. Since we preserve structure and content of the original text documents, the reader still has the choice between sequential and selective (hypertext-driven) reading modes.

The users we have in mind when generating our hypertext views are searching for information in a scientific domain in which they have some prior knowledge but no expert knowledge. Their time is constrained, and they have to solve a very specific type of problem. Such situations are typical for many contexts, e.g. interdisciplinary research, scientific journalism, or specialised lexicography. In scenarios like these, users often read excursively and perceive only parts of longer documents. When these documents are sequentially organised, i.e. designed to be read from beginning to end, this selective reading may result in coherence problems. For example, a reader who jumps right into the middle of a sequential document may not understand (or may misunderstand) a paragraph because they lack the prerequisite knowledge given in the preceding text. The goal of our conversion approach is to generate hypertext views on sequential documents which avoid these coherence problems and make selective reading and browsing more efficient and more convenient than would be possible with print media.

Feasibility and performance of the conversion methodology are tested and evaluated using a sample text corpus containing documents from the domains "hypertext research" and "text technology".


Project phase I (finished)

The text-grammar-based conversion methodology was developed and tested using 20 German documents from a corpus dealing with the subject domains "text technology" and "hypertext research".

On the document level of our two-level architecture, these documents were annotated on three annotation layers:

  • Document structure layer: Structural units, such as chapters, paragraphs, footnotes, and enumerated and unordered lists. The annotation scheme was derived from DocBook.
  • Terms and definitions layer: Occurrences of technical terms as well as text segments in which these terms are explicitly defined.
  • Cohesion layer: Text-grammatical information of various types, e.g. co-reference, connectives, text-deictic expressions.

The annotations of these three layers were then combined using a unification approach developed in our partner project Sekimo.
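As a rough illustration of how several annotation layers over the same text can be brought together, the following Python sketch merges stand-off annotations by character offset. The layer names follow the list above; the `Annotation` class, the `merge_layers` and `annotations_at` functions, and the sample sentence are hypothetical. The sketch does not reproduce the Sekimo unification approach, whose details are not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    layer: str   # hypothetical layer name: "structure", "terms" or "cohesion"
    label: str   # what the span is annotated as
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive


def merge_layers(*layers):
    """Combine several annotation layers into one list sorted by position.

    At equal start offsets the longer span comes first, so enclosing
    units precede the units nested inside them.
    """
    merged = [a for layer in layers for a in layer]
    return sorted(merged, key=lambda a: (a.start, -a.end, a.layer))


def annotations_at(merged, offset):
    """Return all annotations, from any layer, that cover a character offset."""
    return [a for a in merged if a.start <= offset < a.end]


# Illustrative sample data (not taken from the project corpus).
text = "A hyperlink connects two nodes. It can be typed."
structure = [Annotation("structure", "paragraph", 0, len(text))]
terms = [Annotation("terms", "term:hyperlink", 2, 11)]
cohesion = [Annotation("cohesion", "anaphor->hyperlink", 32, 34)]

merged = merge_layers(structure, terms, cohesion)
for a in annotations_at(merged, 5):
    print(a.layer, a.label, repr(text[a.start:a.end]))
```

Keeping each layer as a separate list of offsets leaves the original text untouched, which matches the idea of preserving the source document while adding further layers on top of it.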

On the domain knowledge level, we represent all technical terms occurring in these documents in a WordNet-style representation. The technical basis for this representation is XML Topic Maps. All technical terms are related to their definitions and to their occurrences in the documents.
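To make the shape of such a term net more concrete, here is a minimal in-memory sketch in Python. It is only an illustration under stated assumptions: the project itself uses XML Topic Maps, whereas this sketch uses plain Python objects, and the `Term` and `TermNet` classes, their method names, and the sample terms and document identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    name: str
    definitions: list = field(default_factory=list)   # (doc_id, segment_id) pairs
    occurrences: list = field(default_factory=list)   # (doc_id, segment_id) pairs
    hypernyms: set = field(default_factory=set)       # names of broader terms


class TermNet:
    """Hypothetical, simplified stand-in for a WordNet-style term net."""

    def __init__(self):
        self.terms = {}

    def term(self, name):
        """Return the entry for a term, creating it on first use."""
        return self.terms.setdefault(name, Term(name))

    def add_hyponym(self, hyponym, hypernym):
        """Record that `hyponym` is a narrower term than `hypernym`."""
        self.term(hypernym)
        self.term(hyponym).hypernyms.add(hypernym)

    def ancestors(self, name):
        """Collect all broader terms reachable via hypernym relations."""
        seen = set()
        stack = list(self.term(name).hypernyms)
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.term(current).hypernyms)
        return seen


# Illustrative sample data (not taken from TermNet).
net = TermNet()
net.add_hyponym("typed link", "hyperlink")
net.add_hyponym("hyperlink", "relation")
net.term("hyperlink").definitions.append(("doc01", "p3"))
net.term("hyperlink").occurrences.append(("doc01", "p7"))

print(sorted(net.ancestors("typed link")))
```

Storing definitions and occurrences as references into the documents is what connects the domain knowledge level back to the document level.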

On this basis we implemented our segmentation, linking, and reorganization procedures that generate hypertext views on these documents. Our conversion strategies were implemented and evaluated in a demonstration prototype.
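One way to picture the linking step is a procedure that connects every module using a term to the module that defines it, so that a reader entering at any point can reach the prerequisite definition. The Python sketch below assumes this simplified setting; the `definition_links` function, the module identifiers, and the sample data are hypothetical and are not the procedures implemented in the prototype.

```python
def definition_links(modules, definitions):
    """Propose links from modules that use a term to the module defining it.

    modules:     dict mapping module id -> set of terms used in that module
    definitions: dict mapping term -> id of the module that defines the term
    Returns a sorted list of (source_module, term, target_module) triples.
    """
    links = []
    for module_id, used_terms in modules.items():
        for term in used_terms:
            target = definitions.get(term)
            # Skip terms without a known definition and self-links.
            if target is not None and target != module_id:
                links.append((module_id, term, target))
    return sorted(links)


# Illustrative sample data.
modules = {
    "m1": {"hypertext", "node"},
    "m2": {"node", "link"},
    "m3": {"link", "path"},
}
definitions = {"node": "m1", "link": "m2"}

links = definition_links(modules, definitions)
for source, term, target in links:
    print(f"{source} --[{term}]--> {target}")
```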

For further information, please consult our list of publications.


Project phase II

The HyTex project has been in its second phase since August 2005. The main issues of this phase are:

  • Automatic detection and annotation of technical terms and definitions in domain-specific documents.
  • Methods for extracting semantic relations (hyponymy, meronymy) from definitions.
  • Use of lexical chains for segmentation and linking.
  • Connecting our domain-specific word net (TermNet) to other lexical resources (e.g. GermaNet) using the Web Ontology Language (OWL).
  • Topic-based linking strategies using lexical and thematic chains.
  • Evaluation of the approach with respect to our usage scenario.
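The idea of using lexical chains for segmentation can be sketched as follows. This is a deliberately reduced illustration: chains are built from plain term repetition only, whereas lexical chaining normally also follows semantic relations such as synonymy or hyponymy. The functions `build_chains` and `boundary_candidates`, the vocabulary, and the sample sentences are hypothetical.

```python
def build_chains(sentences, vocabulary):
    """Map each term to the indices of the sentences in which it occurs."""
    chains = {}
    for i, sentence in enumerate(sentences):
        words = set(sentence.lower().split())
        for term in vocabulary:
            if term in words:
                chains.setdefault(term, []).append(i)
    return chains


def boundary_candidates(chains, n_sentences):
    """Return positions k such that no chain spans the gap between
    sentence k-1 and sentence k; such gaps are candidate segment boundaries."""
    candidates = []
    for k in range(1, n_sentences):
        spanning = sum(
            1 for indices in chains.values() if indices[0] < k <= indices[-1]
        )
        if spanning == 0:
            candidates.append(k)
    return candidates


# Illustrative sample data.
sentences = [
    "a link connects two nodes",
    "each link has an anchor",
    "a corpus contains documents",
    "the corpus is annotated",
]
vocabulary = {"link", "anchor", "corpus", "nodes"}

chains = build_chains(sentences, vocabulary)
print(boundary_candidates(chains, len(sentences)))
```

In this example the chain for "link" holds the first two sentences together and the chain for "corpus" holds the last two together, so the only gap that no chain crosses lies between the second and third sentence.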

For further information, please consult our list of publications.
