IBM-Based Legacy Document Conversion to XML





Documents, electronic or otherwise, are now the preferred medium for disseminating and distributing information. PostScript, from Adobe Systems, has been the language understood by most printers used for producing paper documents. A PostScript print file can be transformed into a PDF (Portable Document Format) file with Adobe's Distiller and then viewed with Adobe's Acrobat. However, a PDF file cannot readily be edited.

The language of the Internet is HTML. It is far better suited than PDF to document distribution, but it is not intended for document transformation. HTML is concerned with how a document is rendered and cares little about the document's content.

Documents can be simple, with a few component elements, or very complex. It is therefore desirable to have a framework for creating new document types as demanded by the information to be stored. The information viewer or browser must also be customizable on the fly whenever information with an unknown document type arrives. XML is the language of choice for describing documents and document types. XML is a simplified form of SGML that imposes some discipline on markup nesting and end tags. An XML document need not have an explicit Document Type Definition (DTD). XML has no implicit exclusion or inclusion of document elements, both of which are allowed in SGML. XML documents can be handled in an OS-independent way with client-side and server-side scripts and Java programs.

SGML and its derivatives, such as XML and HTML, are structuring techniques available today that are increasingly understood by publishing software such as Internet browsers and word processors. Some of this software also provides application programming interfaces to manipulate SGML documents and present them to a worldwide audience.

Legacy Document Conversion

Current technology is adequate to convert any legacy data stored in digital archives over the last four decades into electronic documents that can be further manipulated and published. Four components are required to transform legacy data into a publishable document.

·         Analyze the data in order to discover the latent document structure.

·         Define the correspondence between character codes (ordinal numbers) and character glyphs, that is, their appearance.

·         Transform the data to incorporate its latent structure and make it a document.

·         Present the document in a publishing medium.
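
The second component, the correspondence between character codes and glyphs, can be illustrated with a minimal Python sketch. It assumes the legacy text is stored in EBCDIC code page 037 and that the target is Windows ANSI (code page 1252); an actual installation may use a different EBCDIC code page, and the function names are hypothetical rather than part of any tool described here.

```python
# Assumption: the legacy data uses EBCDIC code page 037 (US/Canada).
# Other installations may use a different code page, for example cp500.
EBCDIC_CODEC = "cp037"
ANSI_CODEC = "cp1252"  # Windows ANSI, Western European


def build_table(src=EBCDIC_CODEC, dst=ANSI_CODEC):
    """Return a 256-byte translation table from src codes to dst codes.

    Characters that have no counterpart in dst are mapped to '?'.
    """
    table = bytearray(256)
    for code in range(256):
        char = bytes([code]).decode(src)
        table[code] = char.encode(dst, errors="replace")[0]
    return bytes(table)


def ebcdic_to_ansi(data: bytes) -> bytes:
    """Translate a byte string from EBCDIC to Windows ANSI."""
    return data.translate(build_table())


# "Hello" encoded in EBCDIC code page 037.
legacy = b"\xc8\x85\x93\x93\x96"
print(ebcdic_to_ansi(legacy))
```

A table of this kind only maps codes to codes; choosing the fonts that supply the actual glyphs remains part of the presentation step.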

There are several reasons for converting legacy documents to XML, HTML, PDF, Word or FrameMaker format. Once converted, documents can be transported and distributed easily over the Internet or on CD-ROM. The conversion should be quick, economical and error free, and the converted documents should either retain their original appearance or be given a new look. When many hundreds of thousands of pages are to be converted, the conversion should also be automatic and inherently error free.

An SGML document is an intermediate, time-independent representation of any source document. Format specifications can be provided by a style sheet written in DSSSL. Formatting can also be specified in any text processor that can read and process SGML documents. An SGML document has a structure definition called a DTD. The structure latent in the legacy documents must be identified and formalized as a DTD, and a formatting style must be specified for each component element in that DTD. These styles have to be implemented in the text processing environment, such as Word or FrameMaker, in which the document is formatted. It may also be necessary to transform the document into an Internet document format such as HTML or XML.


The text processor on the IBM mainframe is SCRIPT. There are nearly 200 SCRIPT commands. Documents are created with SCRIPT commands and macros (extensions of the GML starter set supplied by IBM). BookMaster is likewise a SCRIPT-based system for producing any kind of technical document containing graphics, equations and so on. GML stands for Generalized Markup Language; in some ways it is the forerunner of the Standard Generalized Markup Language (SGML). A pure SCRIPT document may not have any structure, whereas GML imposes some manually applied structure on the document composition. Structuring has advantages: all documents with the same structure can be browsed by a single browser and printed by a single formatter. Legacy documents created with SCRIPT, GML and BookMaster may not strictly adhere to a single structure; the structure may differ from document to document.

Technology for Document Conversion

SGML is the most dependable way to archive documents for the future. Legacy text processors or filters can be used to transform documents into SGML documents, that is, documents tagged with SGML element tags. The IBM SCRIPT processor can produce three kinds of output: a (terminal) text file, a PostScript file or a printer file. One need not create an SGML document: one can recreate documents from the text file. One can even write an interpreter for SCRIPT and SCRIPT macros that, instead of actually formatting the text, produces formatting statements for a preferred word processor such as MS Word or Adobe FrameMaker. Documents converted in this way can be imported into the word processor for which they are intended, but they will naturally lack structure and will need further work in the future to be transformed into SGML documents.

An SGML document is a browser-independent representation. Producing one requires finding the latent structure of the document and giving the SGML elements a formatting style. The formatting style is the lesser concern. Different element tags can be given different formats, or can override some formatting properties and inherit the rest from their context. SGML itself can be used to define formatting styles, as in XML. SGML elements carry some associated meaning, for example Glossary. Styles can be predefined for SGML elements and changed to suit the composer's or publisher's taste. Greater effort is required to find the document structure in terms of an SGML DTD.
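
Tagging a legacy document with SGML element tags can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it handles a small, hypothetical subset of starter-set tags written in the `:tag.` form with `:etag.` end tags, ignores tag attributes and SCRIPT control words, does not escape markup characters in the text, and assumes the SGML element names equal the GML tag names, which may not hold for the real DTD.

```python
import re

# Hypothetical subset of GML starter-set tags. The target SGML element
# names are assumed to be identical to the GML tag names.
KNOWN_TAGS = {"h1", "h2", "p", "ul", "ol", "li"}

# A GML tag without attributes: a colon, a name, and a closing period.
GML_TAG = re.compile(r":([a-z][a-z0-9]*)\.")


def gml_to_sgml(text):
    """Replace known GML start and end tags with SGML element tags."""

    def replace(match):
        name = match.group(1)
        if name in KNOWN_TAGS:
            return f"<{name}>"
        # GML end tags are the tag name prefixed with 'e', e.g. ':eul.'
        if name.startswith("e") and name[1:] in KNOWN_TAGS:
            return f"</{name[1:]}>"
        return match.group(0)  # unknown tag: leave it for manual review

    return GML_TAG.sub(replace, text)


source = ":h1.Introduction\n:ul.\n:li.First item\n:eul."
print(gml_to_sgml(source))
```

Leaving unknown tags untouched keeps them visible, so that they can later be mapped to an element or added to the DTD.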

The IBM starter set GML has a simple DTD, and any extension of GML is an extension of this DTD. The following iterative steps are taken to arrive at a DTD for an IBM-based legacy document.

·         Filter the document and replace its tags with SGML element tags of the IBM-supplied GDOC DTD.

·         Parse the converted document with the current DTD.

·         If there are errors, modify the DTD or the tagged document and repeat the previous step. Otherwise, the SGML document is produced: the tagged document and its corresponding DTD.
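
The parse-and-modify loop can be sketched as follows. This is a toy model, not a real SGML parser: the DTD is reduced to a hypothetical dictionary that lists the child elements allowed inside each element, every element is assumed to have an explicit end tag, and each error is always resolved by extending the DTD, whereas in practice a person decides whether the DTD or the tagged document should change.

```python
import re

TAG = re.compile(r"<(/?)([a-z][a-z0-9]*)>")


def first_error(doc, dtd):
    """Return (parent, child) for the first element the toy DTD does not
    allow, or None if the document parses cleanly."""
    stack = ["#document"]  # virtual parent of the root element
    for match in TAG.finditer(doc):
        closing, name = match.group(1) == "/", match.group(2)
        if closing:
            if stack[-1] != name:
                raise ValueError(f"mismatched end tag </{name}>")
            stack.pop()
        elif name not in dtd.get(stack[-1], set()):
            return stack[-1], name
        else:
            stack.append(name)
    return None


def converge(doc, dtd, max_rounds=50):
    """Parse repeatedly, extending a copy of the DTD after each error,
    until the document parses cleanly. Returns (dtd, changes)."""
    dtd = {name: set(children) for name, children in dtd.items()}
    changes = []
    for _ in range(max_rounds):
        error = first_error(doc, dtd)
        if error is None:
            return dtd, changes
        parent, child = error
        dtd.setdefault(parent, set()).add(child)  # new production rule
        dtd.setdefault(child, set())
        changes.append(error)
    raise RuntimeError("DTD did not converge")


start_dtd = {"#document": {"gdoc"}, "gdoc": {"h1", "p"}, "h1": set(), "p": set()}
document = "<gdoc><h1>T</h1><p>x</p><note><p>y</p></note></gdoc>"
final_dtd, changes = converge(document, start_dtd)
print(changes)
```

The list of changes doubles as a record of how the DTD was extended for a given batch of documents.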

Mainframe Document Conversion tool set

A tool set has been developed to perform large-scale conversion of IBM mainframe-based legacy documents written in SCRIPT and GML into FrameMaker+SGML, Word, HTML, XML or PDF documents. Large-scale conversion is a multiple-step process.

  1. Study the document management system in existence.

  2. Analyze the SCRIPT localization for the installation.

  3. A preprocessor, an interpreter for GML tags, executes under SCRIPT with the same localization to process the files. The result of processing under this arrangement is a single SGML document conforming to a preconceived DTD that supports a large class of document elements.

  4. Some document elements must be converted using special filters (for example, IBM PSEG graphics files to Windows BMP format). Such a document element exists in the SGML document as an external entity reference that refers to the filtered file. These files need to be downloaded and filtered separately. The SGML converter reports the files that need special filters.

  5. Parsing the document may report errors. It is then necessary to change the DTD or make minor modifications to the SGML document. The DTD is very rich and can be regarded as a superset of all kinds of legacy document structure. The converted SGML document may nevertheless contain a different, unknown element tag. In that case, locate a similar document element in the DTD and add a new production rule with the changed tag.

  6. The elements must be modified or rendered using styles for the target system. Attributes of the document elements are used for this. When the document parses cleanly, the SGML document and its corresponding DTD are produced.
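
The report mentioned in step 4 can be approximated with a short Python sketch that scans the converted document for external entity declarations and lists those whose notation calls for a special filter. The declaration form, the notation names and the mapping from notation to target format are assumptions made for illustration; the actual converter's report format is not described here.

```python
import re

# Matches declarations such as:
#   <!ENTITY fig1 SYSTEM "FIG1.PSEG" NDATA pseg>
ENTITY = re.compile(
    r'<!ENTITY\s+(\S+)\s+SYSTEM\s+"([^"]+)"\s+NDATA\s+(\w+)\s*>',
    re.IGNORECASE,
)

# Hypothetical notation names mapped to the format a filter would produce.
FILTER_TARGETS = {"pseg": "bmp"}


def filter_report(sgml_text):
    """Return (entity, file, target format) for every external entity
    whose notation needs a special filter."""
    report = []
    for name, system_id, notation in ENTITY.findall(sgml_text):
        target = FILTER_TARGETS.get(notation.lower())
        if target is not None:
            report.append((name, system_id, target))
    return report


sample = (
    '<!ENTITY fig1 SYSTEM "FIG1.PSEG" NDATA pseg>\n'
    '<!ENTITY logo SYSTEM "logo.gif" NDATA gif>\n'
)
for entity, filename, target in filter_report(sample):
    print(f"{entity}: download {filename} and filter to {target}")
```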

A template and the corresponding Element Definition Document (EDD) of Adobe FM+SGML for the DTD are part of the tool set. The styles for any new elements in the DTD are incorporated into the corresponding EDD. The EDD can be further customized with user-preferred styles for document elements. An FM+SGML template is then generated and used for rendering the SGML document.

By carefully choosing the initial documents to be converted, it is possible to cover all the document elements that occur in an installation's set of documents. Thus a single preprocessor, DTD and template may be generated for an installation, which will convert all its documents to FM+SGML. Some document elements in an SGML document may need to be transformed further for correct rendering; Perl scripts can be developed for this purpose.

It is possible to define HTML document elements for SGML elements in FM+SGML. SGML documents can then be converted to HTML or XML documents with style sheets for rendering. An index, a table of contents and so on can be generated in FM+SGML using special templates. It is also possible to transform an SGML document into an XML document and render it using XSL, with the Jade and SP software developed by James Clark.

The tool set consists of a GML interpreter, DTD, EDD, template, filters and an EBCDIC to Windows ANSI character table. It takes about 2 minutes of turnaround time on an IBM mainframe under VM/CMS to convert a 1000-page legacy document to an SGML document, and about 15 minutes to convert that SGML document to an FM+SGML document. When the application is fully configured and customized for an installation, conversion of 1000 pages is expected to be possible within one hour.

The SCRIPT text processor on the IBM mainframe is intended for print media. Footnote, cross-reference and index elements created with SCRIPT and GML occur freely anywhere in legacy documents. They are picked up by the SCRIPT processor at text processing time and replaced with page numbers and the like. The transformed document, however, is intended not only for paper but also for screen viewing from CD-ROM or on the Web, so a hypertext jump to the reference point is required. XML as a language does not permit a document element to be inserted anywhere and everywhere. The preprocessor therefore converts the legacy document to SGML. Once the document has been converted to a FrameMaker structured document, FM+SGML is used to transform it to XML or HTML in a straightforward manner.

WebWorks Publisher Professional Edition can be used to convert an FM+SGML document, with the same or a changed appearance, for browsing in Internet Explorer or Netscape. The documents can be broken up into smaller files for fast transmission over the Web. The quality of the document's appearance can be enhanced to meet user requirements. Documents are converted to HTML or XML with Cascading Style Sheets.
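
The hypertext jump to a reference point can be illustrated with a small Python sketch that turns cross-reference elements into links. It is not how FM+SGML performs the transformation. The element and attribute names (`xref`, `refid`, `id`) are assumptions, and the link text is simply taken from the text of the referenced element.

```python
import xml.etree.ElementTree as ET


def link_cross_references(xml_text):
    """Replace each <xref refid="..."/> with an <a href="#..."> link whose
    text is the text of the element carrying the matching id."""
    root = ET.fromstring(xml_text)
    targets = {el.get("id"): el for el in root.iter() if el.get("id")}
    for ref in list(root.iter("xref")):
        target = targets.get(ref.get("refid"))
        if target is None:
            continue  # unresolved reference: leave it for review
        label = "".join(target.itertext()).strip()
        ref.tag = "a"
        ref.attrib.clear()
        ref.set("href", "#" + target.get("id"))
        ref.text = label
    return ET.tostring(root, encoding="unicode")


sample = (
    '<doc><h1 id="intro">Introduction</h1>'
    '<p>See <xref refid="intro"/> for details.</p></doc>'
)
print(link_cross_references(sample))
```

In print, the same reference would have been resolved to a page number; on screen, the link target replaces it.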

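Process Flow of IBM-Based Legacy Document Conversion

[Figure: process flow of the conversion]

IBM-Based Legacy Document Conversion in Snapshots

[Snapshot: SGML Converter]
[Snapshot: Converted SGML Documents]
[Snapshot: DTD for SGML Document]
[Snapshot: Parsing and rendering]
[Snapshot: Report on Conversion]
[Snapshot: Converted Books]
[Snapshot: Generated TOC and Index]
[Snapshot: Front matter]
[Snapshot: Hyper Text Links]
[Snapshot: Table]
[Snapshot: Flow]
[Snapshot: Code]
[Snapshot: Chapter]
[Snapshot: TOC]
[Snapshot: Index]
[Snapshot: Back matter]
[Snapshot: Converted HTML documents]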

References

  1. Adobe FrameMaker+SGML 6.0 Developer's Guide, online manual
  2. WebWorks Publisher's Professional Edition User Guide
  3. GML Starter Set User's Guide, IBM Document No. SH20-9186-07
  4. SCRIPT/VS User's Guide, IBM Document No. S5444-3191-01
  5. James Clark's home page
  6. ISO/IEC 10179:1996, Information technology -- Text and office systems -- Document Style Semantics and Specification Language (DSSSL), dated April 1, 1996, © 1996 ISO/IEC
  7. ArborText, Inc., SGML Exceptions and XML
  8. Clark, James, Comparison of SGML and
  9. World Wide Web Consortium (W3C), Extensible Markup Language (XML) 1.0: W3C Recommendation 10-February-1998

Author: Kankan Roy

This paper was submitted to the XML 2000 Conference in Washington.
