CODEX RFC/Proposal

Benjamin (Mako) Hill

Version 0.1.1 | Sun, 29 Sep 2002 16:51:28 -0400

The Problem

There is no good, free version control system for documents or documentation. The creation of any piece of literature, especially one that involves multiple authors working on the document concurrently, involves adding, removing, and changing text constantly. There is no good way to keep track of these changes in the way that there is for source code. Existing solutions are proprietary and/or kludgy.

The Consequences

An author has no easy way to isolate and display a list of changes that an editor has made to their document;
When multiple authors work on a document simultaneously, authors create changes that conflict--they have both rewritten or edited the same paragraph, sentence, or word. Perhaps this alone is not a problem but there is no good way to resolve these conflicts. This acts to discourage collaborative creation--at least in an asynchronous fashion (Edit this document and I'll wait until you send it back to me);
An author is unable to keep an easily accessible history of changes they've made to a document. An author may be hesitant to remove or delete a paragraph or section that they think they may want to retrieve in the future, or they may simply delete parts that they later want back;
Authors cannot branch a document in the way that programmers can with source code. For example, an author who wants to do major work or reorganization on a document may want to do it separately while letting other work on the document (spelling and grammar fixes, minor additions) continue. He or she would then merge the two branches back together when the major changes were ready.

Current Solutions

Current solutions seem to fall into a few major categories. Each has its own benefits and shortcomings. Some of these include:

Source-level VCSs: The technique I currently use involves putting my documents into a VCS intended for source code like RCS, CVS, or Subversion (SVN), or one of many proprietary alternatives with similar functionality.

CVS will track changes and allow multiple authors to work on a document simultaneously--just as it does with source code. However, there are at least two major drawbacks:
- text-only formats: CVS works only with plain text formats. CVS works great for text/plain, HTML, SGML, or XML formats but is not particularly useful for binary formats. Because CVS was written for source code and because source is always text/plain, this shortcoming cannot be worked around. Since many word processors seem to be moving slowly toward XML-based formats, this is a shortcoming I may be able to overlook.
- line-by-line diffs: CVS and Subversion each approach diffs in a line-by-line fashion. The smallest change one can make is to one line. This is problematic for those tracking changes to documents. One might change a single word or a piece of white-space and reformat a seven-line paragraph, and CVS would display this as seven changed lines. To an editor or an author, finding the change or changes within this paragraph often proves prohibitively difficult.
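
The line-by-line drawback can be reproduced with Python's standard difflib module. What follows is only an illustrative sketch, not CVS itself and not part of CODEX: the sample paragraph and the helper names changed_line_count and word_changes are invented for the example. It compares the same one-word edit first by lines and then by words.

```python
import difflib

OLD = (
    "The committee reviewed the draft and\n"
    "asked for a short summary of changes\n"
    "before the next meeting on Friday.\n"
)
# Same paragraph with one word replaced ("draft" -> "proposal") and re-wrapped.
NEW = (
    "The committee reviewed the proposal\n"
    "and asked for a short summary of\n"
    "changes before the next meeting on\n"
    "Friday.\n"
)


def changed_line_count(old, new):
    """Count the lines a line-based comparison reports as removed or added."""
    a, b = old.splitlines(), new.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return sum(
        (i2 - i1) + (j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def word_changes(old, new):
    """Compare word by word, ignoring how the words are wrapped into lines."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


print(changed_line_count(OLD, NEW))  # 7: all 3 old lines removed, all 4 new lines added
print(word_changes(OLD, NEW))        # [('replace', ['draft'], ['proposal'])]
```

Splitting on whitespace discards the wrapping, which is why the word-level comparison reports a single replacement where the line-level one reports every line as changed.
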
Microsoft Word's Track Changes Function: This method provides perhaps the most successful solution I've seen so far. However, because Word is proprietary software, it is unacceptable for my purposes. On the other hand, OpenOffice.org (a free alternative to Word) provides similar types of functionality in a free package.

Track Changes acts as the name implies: using the function, Word or OpenOffice will log changes made to a document. When the document is sent to a friend, the word processor can display the changes in an intelligent, interactive interface that will let the author check the changes one by one and approve, deny, or edit them.

I've seen offices and organizations make heavy use of this feature, in conjunction with an SMB (Windows file-sharing) network share folder, to work on documents as a group. However, this collaboration must be tightly controlled and well organized, as this solution introduces several major drawbacks.
- collisions: I've yet to see a piece of software that could present a useful list of word-for-word or markup-for-markup collisions between two documents that have each been edited from the same source document--or two that have been edited from different versions of a source document. This means that authors must work under a system where only one author or editor can have a document checked out at any one point.
- branching: As I mentioned in the consequences section above, literary creation, especially the creation of documentation, can benefit heavily from the ability to branch.
- elegance, scalability: It may be possible to use Word to keep lists of the last 2,000 levels of changes to a document but it was not intended to do this. At best, it will be slow, ugly, and require massive .docs. I'd like a solution specifically designed to follow and track changes to a document over its entire lifespan.
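
To make the collisions drawback above concrete, here is a minimal sketch of word-for-word collision detection between two documents edited from the same source document. It is an assumption about how such a check could work, not a description of any existing tool: the function names changed_spans and find_collisions are hypothetical, markup is ignored, and edits that merely touch each other are conservatively reported as collisions.

```python
import difflib


def changed_spans(base, edited):
    """List (start, end, new_words) for each span of base words that `edited` changes."""
    matcher = difflib.SequenceMatcher(a=base, b=edited, autojunk=False)
    return [
        (i1, i2, edited[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def find_collisions(base_text, first_text, second_text):
    """Return (base_words, first_words, second_words) where both edits touch the same words."""
    base = base_text.split()
    first_spans = changed_spans(base, first_text.split())
    second_spans = changed_spans(base, second_text.split())
    collisions = []
    for a1, a2, a_new in first_spans:
        for b1, b2, b_new in second_spans:
            touching = a1 <= b2 and b1 <= a2  # overlapping or adjacent spans
            identical = (a1, a2, a_new) == (b1, b2, b_new)  # both made the same edit
            if touching and not identical:
                collisions.append((base[min(a1, b1):max(a2, b2)], a_new, b_new))
    return collisions


BASE = "The quick brown fox jumps over the lazy dog"
FIRST = "The quick red fox jumps over the lazy dog"
SECOND = "The quick crimson fox jumps over the sleepy dog"

print(find_collisions(BASE, FIRST, SECOND))
# [(['brown'], ['red'], ['crimson'])]
```

Only the word both editors changed is reported; the second editor's change to "lazy" does not collide with anything and could be merged automatically.
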
GNU wdiff: While this program can show diffs on a useful word-by-word basis, its inflexible attitude toward whitespace as mark-up is limiting. It also provides no real VCS features to speak of. Even if hacked to sit on top of CVS, wdiff would be a less flexible version of my solution.

My Solution

I propose a robust, free version control system specifically designed for working with documents--especially in an asynchronous collaborative environment. I'll refer to this (non-existent) system as CODEX, or the Collaborative Online Documentation (D)ifference Extractor. The software will be free software and will be distributed under the terms of the GNU GPL. The core engine will be written in either Perl or Python.

Since my software will be free software, I will seek to not duplicate effort wherever possible. I think that building off a system like CVS or Subversion will be the logical first step. Since a diff will show every changed line, it will by default show every changed word and piece of white space. A contextual diff (which both CVS and Subversion can provide) will include even more information. Either of these programs will be able to provide information useful for resolving conflicts and will provide the ability to commit, checkout, release or watch a project. They also both provide servers with several methods to interface over a LAN or the Internet. A future version of Subversion will allow for different client-side diff programs.
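
Because the changed lines in a diff already contain every changed word, a tool can recover word-level changes from ordinary diff output. The sketch below assumes unified diff text as its input; the sample diff and the names hunks and word_changes are made up for illustration, and pooling all removed and added lines of a hunk together is a simplification.

```python
import difflib
import re

# Matches a unified-diff hunk header such as "@@ -1,2 +1,2 @@".
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def hunks(diff_text):
    """Yield (removed_lines, added_lines) for each hunk of a unified diff."""
    lines = iter(diff_text.splitlines())
    for line in lines:
        match = HUNK_HEADER.match(line)
        if not match:
            continue  # file headers ("Index:", "---", "+++") and other noise
        old_left = int(match.group(1) or 1)  # a missing count means 1
        new_left = int(match.group(2) or 1)
        removed, added = [], []
        while old_left > 0 or new_left > 0:
            body = next(lines, None)
            if body is None:
                break  # truncated diff
            if body.startswith("-"):
                removed.append(body[1:])
                old_left -= 1
            elif body.startswith("+"):
                added.append(body[1:])
                new_left -= 1
            elif body.startswith("\\"):
                continue  # "\ No newline at end of file"
            else:
                old_left -= 1  # context line, present on both sides
                new_left -= 1
        yield removed, added


def word_changes(removed, added):
    """Re-compare a hunk's removed and added lines word by word."""
    old, new = " ".join(removed).split(), " ".join(added).split()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (" ".join(old[i1:i2]), " ".join(new[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


SAMPLE_DIFF = """\
Index: rfc.txt
--- rfc.txt
+++ rfc.txt
@@ -1,2 +1,2 @@
-CODEX tracks changes to documents
-at the level of lines.
+CODEX tracks changes to documents at
+the level of words.
"""

for removed, added in hunks(SAMPLE_DIFF):
    print(word_changes(removed, added))  # [('lines.', 'words.')]
```

Reading the line counts from each hunk header keeps a following file's "---" header from being mistaken for a removed line.
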

I do NOT want this project to involve creating a new word processor. There are more than enough of them, most of them bad. I would almost certainly create another bad one. I want my software to be able to work with many other word processors so that it might be picked up and incorporated as a back-end to existing pieces of software.

Taking this approach, my software will act as an interface between the user (or their word processing software) and the VCS.

Along the lines I'm considering right now, the software might:

Interpret and relay commands from the user to commit, checkout, and update to/from the VCS, running on the local machine or a server on a LAN or the Internet;
Parse diffs from CVS output to display changes in terms of units of language or mark-up. This would require that the software be aware of the particular markup system--at least enough to distinguish mark-up from text/data. For SGML and XML-based languages, this will be easier. For some other documentation languages, this is a much more difficult problem;
Relay these changes to the word-processor, editor, or viewer in a format that lets it display the changes usefully to the user and lets the user edit or interact with them. Because I want to interface with many different existing projects, I will probably need to write this data into an intermediary format. I might use or create an XML DTD toward this end.
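
The last two steps above can be sketched together: separate mark-up from text, compare the resulting tokens, and write the differences to an intermediary XML format. Everything here is an assumption made for illustration. No DTD exists yet, so the element and attribute names (changes, change, old, new, kind, type) are invented, and the regular-expression tokenizer is a rough stand-in that ignores comments, CDATA sections, and ">" characters inside attribute values.

```python
import difflib
import re
import xml.etree.ElementTree as ET

# A tag such as <em> or </em>, or a run of non-space text up to the next tag.
TOKEN = re.compile(r"<[^>]+>|[^\s<]+")


def tokenize(source):
    """Split SGML/XML-like text into ('markup', tag) and ('text', word) tokens."""
    return [
        ("markup" if token.startswith("<") else "text", token)
        for token in TOKEN.findall(source)
    ]


def to_change_xml(old_source, new_source):
    """Build an XML tree listing each changed run of tokens (hypothetical format)."""
    old, new = tokenize(old_source), tokenize(new_source)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    root = ET.Element("changes")
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        change = ET.SubElement(root, "change", type=tag)
        for kind, value in old[i1:i2]:
            ET.SubElement(change, "old", kind=kind).text = value
        for kind, value in new[j1:j2]:
            ET.SubElement(change, "new", kind=kind).text = value
    return root


OLD_DOC = "<p>Free <em>software</em> matters.</p>"
NEW_DOC = "<p>Free <strong>software</strong> matters.</p>"

tree = to_change_xml(OLD_DOC, NEW_DOC)
print(ET.tostring(tree, encoding="unicode"))
```

For this sample the tree holds two changes, both tagged as mark-up (the opening and closing tags), while the unchanged word "software" between them is left alone. A front-end could use the kind attribute to present mark-up changes differently from text changes.
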

To accomplish this, my software will actually need to be two distinct pieces.

Back-end with modular plug-ins for:
- different VCSs
- different markup-languages
Frontend that is wholly editor/word processor specific. This might be written in an editor's extension language. It could also be a web application or a viewer specifically written to display the content of the XML. Hopefully, there will be many of these. It will be able to call CODEX with a simple set of commands and it will receive and parse codeXML (the current working name for my proposed XML DTD).
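
One way the modular back-end could be organized is sketched below. It is a guess at a possible structure, not a settled design: the class and method names (VCSDriver, MarkupDriver, Engine, fetch, tokenize, changed_tokens) are all hypothetical, and an in-memory stand-in replaces a real CVS or Subversion driver so that the example runs without either installed.

```python
import difflib
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class VCSDriver(ABC):
    """Plug-in point for a particular version control system."""

    @abstractmethod
    def fetch(self, path: str, revision: str) -> str:
        """Return the text of `path` at `revision`."""


class MarkupDriver(ABC):
    """Plug-in point for a particular markup language."""

    @abstractmethod
    def tokenize(self, source: str) -> List[str]:
        """Split a document into comparable units."""


class InMemoryVCS(VCSDriver):
    """Stand-in driver holding revisions in a dict, for illustration only."""

    def __init__(self, revisions: Dict[Tuple[str, str], str]):
        self._revisions = revisions

    def fetch(self, path: str, revision: str) -> str:
        return self._revisions[(path, revision)]


class PlainTextMarkup(MarkupDriver):
    """Treats every whitespace-separated word as one unit."""

    def tokenize(self, source: str) -> List[str]:
        return source.split()


class Engine:
    """Combines one VCS driver and one markup driver."""

    def __init__(self, vcs: VCSDriver, markup: MarkupDriver):
        self.vcs = vcs
        self.markup = markup

    def changed_tokens(self, path: str, old_rev: str, new_rev: str):
        old = self.markup.tokenize(self.vcs.fetch(path, old_rev))
        new = self.markup.tokenize(self.vcs.fetch(path, new_rev))
        matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
        return [
            (old[i1:i2], new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"
        ]


vcs = InMemoryVCS({
    ("rfc.txt", "1.1"): "a free version control system",
    ("rfc.txt", "1.2"): "a free document control system",
})
engine = Engine(vcs, PlainTextMarkup())
print(engine.changed_tokens("rfc.txt", "1.1", "1.2"))
# [(['version'], ['document'])]
```

Because the engine depends only on the two abstract interfaces, supporting another VCS or markup language would mean adding one driver class and leaving the engine and any front-end untouched.
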

In this way, what I aim to create in this project will be more of a framework for creation, transmission, and handling of this type of data. I will aim to define this framework and get example code written as a proof of concept. Hopefully, with this out there, other developers will be able to contribute and expand the scope and usability of the software.

This diagram shows how some of the internals of the CODEX engine might work.

In creating my proof of concept this year, I'll aim to create (in this order):

The codeXML XML DTD;
A working diff parser and codeXML generator;
One mark-up specific driver for the diff parser (probably for HTML, DocBook SGML, XML or any easier target);
One back-end specific interface (for either CVS or SVN);
One CODEX Frontend. This will probably be an Emacs mode written in Elisp, unless I can find a better/easier interactive target. Perhaps I will simply write an HTML or wiki parser-driver and then use a web front-end. I think an Elisp front-end would buy the most mileage at this early stage of development.

This is what I have so far and a lot of it is right off the top of my head. This is an RFC. Please email me back at mako@debian.org.

Mako Hill

Last modified: Fri Sep 27 18:55:16 EDT 2002