All of the web applications I’ve worked on have had at least one textarea where a client could type in HTML (or perhaps use a WYSIWYG editor), submit the form and save that information. Sometimes it was abstracted using Wiki formatting or some other faux HTML markup. The textareas are often related to a blog post, comments, rental properties, vacation spots or what have you.

Even now, most of the web applications I use do the same thing – this blog being the perfect example. All of these posts end up saving HTML into some column in a table in the database.

Perhaps the separation of presentation and data has gone to my head, but doesn’t the whole idea of saving HTMLified text seem wrong? It seems like it is corrupting the data. For example, take a search query; why should a database query have to search through HTML markup to find matches? Why should word counts have to understand that there are tags involved?

It also seems like a big waste to store all that extra tag text in a database. How many <p> tags do you think are in the average web application database? I am sure most databases have compression techniques to handle duplicated data, but still it seems like extra work for no real gain.

Converting HTML into XML + a stylesheet makes the separation a bit better, but you’re still storing a bunch of markup text – which isn’t too bad, especially if the database you are working with is XML savvy. However, there is still the problem of having to deal with the markup tags when text searching or mining the data. Unless you actually load the XML into a DOM during the search, then search again inside each XML tree, you are trudging through tags and creating the need for more if statements.

Add to that the fact that I am not really talking about structured data. With the data I am talking about, it is highly unlikely that someone is going to mark text as <address> – more likely they will pick bold and red.

One solution I am kicking around is trying to write / find some sort of text style markup language that is stored separately from the text data (this has to exist somewhere, probably as an old-school Unix format, but I am not even sure where to start looking). I am thinking it could work something like this:

The stylesheet, in its most basic form, would be a style type followed by one or more position-length pairs, with positions counted from zero. So for the text:

This <b>is</b> <i>example</i> text, <b>man</b>.

A parser would sniff out the tags, and make a stylesheet that could look like:

(sheet (bold (5,2), (22,3)), (italic (8,7)) )
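A minimal sketch of such a parser in Python, assuming the example text was marked up with plain <b> and <i> tags (the post does not say which tags were used). The names StyleExtractor, split_markup and STYLE_FOR_TAG are made up for illustration, and the sheet is represented as a dictionary rather than the parenthesized notation above:

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to style names (an assumption, not from the post).
STYLE_FOR_TAG = {"b": "bold", "strong": "bold", "i": "italic", "em": "italic"}


class StyleExtractor(HTMLParser):
    """Collects plain text and a sheet of (position, length) pairs per style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []      # plain-text pieces seen so far
        self.length = 0       # number of plain-text characters seen so far
        self.open_spans = []  # stack of (style, start position) for open tags
        self.sheet = {}       # style name -> list of (position, length)

    def handle_starttag(self, tag, attrs):
        style = STYLE_FOR_TAG.get(tag)
        if style:
            self.open_spans.append((style, self.length))

    def handle_endtag(self, tag):
        style = STYLE_FOR_TAG.get(tag)
        # Only well-nested tags are handled; anything else is ignored.
        if style and self.open_spans and self.open_spans[-1][0] == style:
            _, start = self.open_spans.pop()
            self.sheet.setdefault(style, []).append((start, self.length - start))

    def handle_data(self, data):
        self.chunks.append(data)
        self.length += len(data)


def split_markup(html):
    """Return (plain_text, sheet) for a fragment of simple HTML."""
    parser = StyleExtractor()
    parser.feed(html)
    parser.close()  # flushes any text still buffered after the last tag
    return "".join(parser.chunks), parser.sheet


text, sheet = split_markup("This <b>is</b> <i>example</i> text, <b>man</b>.")
print(text)   # This is example text, man.
print(sheet)  # {'bold': [(5, 2), (22, 3)], 'italic': [(8, 7)]}
```

Unknown tags are simply dropped here, and badly nested markup is not repaired; a real tool would need a policy for both.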

The tags would be stripped from the original data, and both the plain text and stylesheet stored separately. On display the sheet could be reapplied to the text – like CSS (of course it wouldn’t have to recreate HTML. It could make a PDF out of it for example).
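Reapplying the sheet could look roughly like the following Python sketch. It assumes the sheet is a dictionary of style names to zero-based (position, length) pairs, that spans are either nested or disjoint, and that HTML is the target; render_html and TAG_FOR_STYLE are hypothetical names, and a PDF or other renderer would walk the same data differently.

```python
from html import escape

# Hypothetical mapping from style names back to HTML tags (an assumption).
TAG_FOR_STYLE = {"bold": "b", "italic": "i"}


def render_html(text, sheet):
    """Rebuild HTML from plain text plus a {style: [(position, length)]} sheet."""
    opens, closes = {}, {}
    for style, spans in sheet.items():
        tag = TAG_FOR_STYLE[style]
        for start, length in spans:
            if length <= 0:
                continue  # skip empty spans
            opens.setdefault(start, []).append((length, tag))
            closes.setdefault(start + length, []).append((length, tag))

    out = []
    for pos in range(len(text) + 1):
        # Close the shortest (innermost) spans first.
        for _, tag in sorted(closes.get(pos, [])):
            out.append(f"</{tag}>")
        # Open the longest (outermost) spans first.
        for _, tag in sorted(opens.get(pos, []), reverse=True):
            out.append(f"<{tag}>")
        if pos < len(text):
            out.append(escape(text[pos]))  # the stored text is plain, so escape it
    return "".join(out)


sheet = {"bold": [(5, 2), (22, 3)], "italic": [(8, 7)]}
print(render_html("This is example text, man.", sheet))
# This <b>is</b> <i>example</i> text, <b>man</b>.
```

Because the stored text contains no markup at all, escaping it on output is safe by construction, which is a small side benefit of keeping the two apart.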

With the HTML tags stripped from the text, the text could be stored, searched, converted, faxed, put through text-to-speech, or mined with a lot less cruft to deal with. The stylesheet could be stored alongside the plain text, and applied right before it was shown in a UI (most likely HTML, but it could be anything). The key thing, I think, is to store the text and the formatting separately. With TeX, PDF, and all the other formats I know of, the formatting is stored in the text itself, because, I would imagine, those formats are generally used in files, and files in the past only had one place to store information.
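To illustrate the searching and counting point, here is a small Python sketch that works directly on the plain text and only consults the sheet when it wants to know how a match was styled. The function names are hypothetical, and the sheet format (a dictionary of zero-based position-length pairs) is the same assumption as in the example above.

```python
import re


def styles_at(sheet, start, end):
    """Return the style names whose spans overlap the range [start, end)."""
    return sorted(
        style
        for style, spans in sheet.items()
        if any(pos < end and start < pos + length for pos, length in spans)
    )


def find_word(text, sheet, word):
    """Find whole-word matches in the plain text and report their styles."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [
        (m.start(), styles_at(sheet, m.start(), m.end()))
        for m in pattern.finditer(text)
    ]


text = "This is example text, man."
sheet = {"bold": [(5, 2), (22, 3)], "italic": [(8, 7)]}

print(len(text.split()))               # 5 -- word count needs no tag handling
print(find_word(text, sheet, "man"))   # [(22, ['bold'])]
print(find_word(text, sheet, "text"))  # [(16, [])]
```

The search itself never sees a tag; the style lookup is an optional second step, for example to rank bold matches higher.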

So this is a bit of a stream of consciousness; I had this thought as I was drifting off to sleep, then got up and wrote this entry. Know of any tools that do this, or have any thoughts on the matter?