Posted by Julius
| On
June 11th, 2012
| In Big Data, Database Migration, Social Commerce, Web & Software Development
| Tags: BSON, document storage, JSON, protobuf, serialization, XML
Document Storage Formats – An Introduction
Document storage databases are all the buzz these days. Interestingly, they are not a recent invention: 20 years ago, object-oriented databases were already built on the very same concepts.
In this four-part mini blog series I will take a bird’s-eye view of the connections and relationships between object-oriented databases, object stores, serialization and document storage. I will present and explain the most common document storage formats and try to work out why object stores are all the buzz today but were not 20 years ago.
Episode 2 – Binary and XML document formats
You could argue that there is no such thing as a binary document storage format. Binary serialization has been around for a long time, well before developers wanted to persist objects directly to long-term storage (without first having to map object properties to database columns). When passing objects across process boundaries, which is known as marshaling, developers first have to serialize the objects’ state. In its simplest form, the binary document is a string of bytes representing the values of all variable object properties. For an application to interpret such a binary document, it has to know the object’s class definition, either from a declarative class definition in the application code or from type libraries. Either way, these binary documents are neither human-readable nor machine-interpretable without explicit knowledge of the class definition of the source or destination object.
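As a rough illustration of this simplest form, the following Python sketch packs a person's property values into a flat byte string. The field layout (`PERSON_LAYOUT`) and the function names are assumptions made up for this example; the point is only that the reader must know the layout in advance, because the bytes carry no description of themselves.

```python
import struct

# Hypothetical fixed layout that writer and reader must both know:
# 16-byte first name, 16-byte last name, unsigned 16-bit age (little endian).
PERSON_LAYOUT = struct.Struct("<16s16sH")


def serialize_person(first_name, last_name, age):
    """Pack the property values into a byte string that carries no metadata."""
    return PERSON_LAYOUT.pack(
        first_name.encode("utf-8"), last_name.encode("utf-8"), age
    )


def deserialize_person(blob):
    """Unpack the bytes; this only works because the layout is known up front."""
    first, last, age = PERSON_LAYOUT.unpack(blob)
    return (
        first.rstrip(b"\x00").decode("utf-8"),
        last.rstrip(b"\x00").decode("utf-8"),
        age,
    )


blob = serialize_person("John", "Smith", 25)
print(blob)                      # not meaningful to a human reader
print(deserialize_person(blob))  # readable again once the layout is applied
```

Without `PERSON_LAYOUT`, nothing in `blob` says where one property ends and the next begins, which is exactly the limitation described above.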
Recent implementations of binary document formats are more sophisticated. They contain metadata that describes and identifies the structure from which the object’s variable properties were serialized. While this increases the resulting document’s size, it makes the document more portable and allows its data to be interpreted even after the original object’s class definition has long been forgotten.
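To make the idea of embedded metadata concrete, here is a minimal sketch of a self-describing binary encoding. It is a toy format invented for this example (the type tags `TYPE_INT` and `TYPE_STR` and the byte layout are assumptions, not any real product's format): every field carries a type tag and its own name, so a reader can decode the document without the original class definition.

```python
import struct

# Hypothetical type tags for this toy format.
TYPE_INT = 1
TYPE_STR = 2


def encode(fields):
    """Write each field as: type tag, name length, name, then the value."""
    out = bytearray()
    for name, value in fields.items():
        name_bytes = name.encode("utf-8")
        if isinstance(value, int):
            out += struct.pack("<BB", TYPE_INT, len(name_bytes)) + name_bytes
            out += struct.pack("<i", value)
        elif isinstance(value, str):
            data = value.encode("utf-8")
            out += struct.pack("<BB", TYPE_STR, len(name_bytes)) + name_bytes
            out += struct.pack("<H", len(data)) + data
        else:
            raise TypeError(f"unsupported type for field {name!r}")
    return bytes(out)


def decode(blob):
    """Rebuild the fields using only the metadata stored in the document."""
    fields = {}
    pos = 0
    while pos < len(blob):
        type_tag, name_len = struct.unpack_from("<BB", blob, pos)
        pos += 2
        name = blob[pos:pos + name_len].decode("utf-8")
        pos += name_len
        if type_tag == TYPE_INT:
            (value,) = struct.unpack_from("<i", blob, pos)
            pos += 4
        elif type_tag == TYPE_STR:
            (length,) = struct.unpack_from("<H", blob, pos)
            pos += 2
            value = blob[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"unknown type tag {type_tag}")
        fields[name] = value
    return fields


person = {"firstName": "John", "lastName": "Smith", "age": 25}
document = encode(person)
print(decode(document))
```

The field names and tags are the size overhead mentioned above; in exchange, `decode` needs no knowledge of a `Person` class.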
Knowing the structure of a document is useful in many respects. That is why structured document formats carry at least some metadata that describes and “communicates” their structure; binary document formats are no longer an exception.
While binary document formats usually contain very little metadata and are therefore considered the “leaner” document formats, XML sits at the other end of that scale as a very “bloated” or, put more nicely, very “verbose” document format.
According to Wikipedia “XML” is defined as follows:
Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. It is defined in the XML 1.0 Specification produced by the W3C, and several other related specifications, all gratis open standards. (Source: http://en.wikipedia.org/wiki/XML#Comments)
XML examples:

Element-centric form (values stored as child elements):

<person>
  <firstName>John</firstName>
  <lastName>Smith</lastName>
  <age>25</age>
  <address>
    <streetAddress>21 2nd Street</streetAddress>
    <city>New York</city>
    <state>NY</state>
    <postalCode>10021</postalCode>
  </address>
  <phoneNumbers>
    <phoneNumber type="home">212 555-1234</phoneNumber>
    <phoneNumber type="fax">646 555-4567</phoneNumber>
  </phoneNumbers>
</person>

Attribute-centric form (the same data stored as attributes):

<person firstName="John" lastName="Smith" age="25">
  <address streetAddress="21 2nd Street" city="New York" state="NY" postalCode="10021" />
  <phoneNumbers>
    <phoneNumber type="home" number="212 555-1234"/>
    <phoneNumber type="fax" number="646 555-4567"/>
  </phoneNumbers>
</person>
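The two samples above carry the same information in different shapes. The following sketch, which uses Python's standard `xml.etree.ElementTree` module, reads a few values from both variants. The helper names and the shortened sample documents are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

ELEMENT_STYLE = """<person>
  <firstName>John</firstName>
  <address><city>New York</city></address>
  <phoneNumbers>
    <phoneNumber type="home">212 555-1234</phoneNumber>
    <phoneNumber type="fax">646 555-4567</phoneNumber>
  </phoneNumbers>
</person>"""

ATTRIBUTE_STYLE = """<person firstName="John">
  <address city="New York" />
  <phoneNumbers>
    <phoneNumber type="home" number="212 555-1234"/>
    <phoneNumber type="fax" number="646 555-4567"/>
  </phoneNumbers>
</person>"""


def read_element_style(xml_text):
    """Values live in element text, so we look up child elements by path."""
    root = ET.fromstring(xml_text)
    return {
        "firstName": root.findtext("firstName"),
        "city": root.findtext("address/city"),
        "phones": {
            p.get("type"): p.text
            for p in root.findall("phoneNumbers/phoneNumber")
        },
    }


def read_attribute_style(xml_text):
    """Values live in attributes, so we read them with get()."""
    root = ET.fromstring(xml_text)
    return {
        "firstName": root.get("firstName"),
        "city": root.find("address").get("city"),
        "phones": {
            p.get("type"): p.get("number")
            for p in root.findall("phoneNumbers/phoneNumber")
        },
    }


from_elements = read_element_style(ELEMENT_STYLE)
from_attributes = read_attribute_style(ATTRIBUTE_STYLE)
print(from_elements == from_attributes)
```

Both readers produce the same dictionary, which shows that the choice between elements and attributes is a matter of document design rather than of content.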
The XML document format is probably the most mature of all formats. It is very popular because it is human-readable, which makes it easy to debug application code that works with XML documents. Numerous technologies have been developed to query and manipulate XML documents, XML schemas and other entities related to or derived from the XML format. Some database management systems fully support XML as a data type and may even enable transparent indexing of XML elements or attributes. XML is very flexible and versatile and, by design, very extensible.
Nonetheless, when it comes to storing and managing large amounts of object data, XML is not commonly the document format of choice. The CPU overhead for processing and the inflation of data size often make it difficult to justify XML for object storage. Applications often cache object documents in memory and perform searches or index operations there. The greater the size overhead of your document format, the fewer object documents fit into a finite amount of memory. Application developers therefore seek more compact document formats and avoid XML object storage in such use cases. Still, for storing configuration data for a relatively small number of entities, the XML document format is very suitable and popular.
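The size argument can be checked with a few lines of Python. The sketch below serializes the same three properties once as a compact binary record and once as XML, then estimates how many documents of each kind fit into a memory budget. The binary layout, the function names and the 64 MB budget are assumptions chosen for illustration, not measurements from a real system.

```python
import struct
import xml.etree.ElementTree as ET


def to_binary(first_name, last_name, age):
    """Length-prefixed strings plus a 16-bit age: almost no metadata."""
    first = first_name.encode("utf-8")
    last = last_name.encode("utf-8")
    layout = f"<B{len(first)}sB{len(last)}sH"
    return struct.pack(layout, len(first), first, len(last), last, age)


def to_xml(first_name, last_name, age):
    """The same values wrapped in element tags."""
    person = ET.Element("person")
    ET.SubElement(person, "firstName").text = first_name
    ET.SubElement(person, "lastName").text = last_name
    ET.SubElement(person, "age").text = str(age)
    return ET.tostring(person)


def documents_that_fit(memory_budget_bytes, document_size):
    """How many documents of a given size fit into a fixed cache budget."""
    return memory_budget_bytes // document_size


binary_doc = to_binary("John", "Smith", 25)
xml_doc = to_xml("John", "Smith", 25)
budget = 64 * 1024 * 1024  # hypothetical 64 MB cache

print(len(binary_doc), len(xml_doc))
print(documents_that_fit(budget, len(binary_doc)),
      documents_that_fit(budget, len(xml_doc)))
```

For such a tiny record the tags outweigh the payload several times over; with larger values the ratio shrinks, but the markup overhead never disappears.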
Preview of next week’s Episode 3: In my next blog episode I will look closer at the JSON and BSON document formats. Currently those are probably the most popular document formats. MongoDB, for example, uses both: BSON for storing and transferring documents, and JSON-style documents as the representation your application code works with.
Resources and references: If you are interested in more information please visit the following links:
http://en.wikipedia.org/wiki/XML_Namespace
http://en.wikipedia.org/wiki/XML_Base
http://en.wikipedia.org/wiki/XPath
http://en.wikipedia.org/wiki/XSLT
http://en.wikipedia.org/wiki/XQuery
http://www.w3schools.com/xml/xml_whatis.asp
http://google-styleguide.googlecode.com/svn/trunk/xmlstyle.html
http://primates.ximian.com/~lluis/dist/binary_serialization_format.htm#intro
http://msdn.microsoft.com/en-us/library/72hyey7b(v=vs.71).aspx
Posted by Julius
| On
May 23rd, 2012
| In Big Data, Database Migration, Social Commerce, Web & Software Development
| Tags: BSON, document storage, JSON, protobuf, serialization, XML
Document Storage Formats – An Introduction
Episode 1 – Introduction and definitions
Document Storage: In the context of software development, a document is, plain and simple, a computer data file or data set. Depending on its purpose, the file or set may hold different content. When the content requires a predictable structure, metadata is introduced to define how the document’s data is formatted. In that case the document’s data is considered structured; otherwise it is unstructured. An example of unstructured data is a Word document; an example of structured data is an XML document.
An application has different options for where to store its documents’ data. Most commonly, applications employ a combination of disk storage and memory caching. The choice of storage will affect the performance and scalability of your application. More recently, storing the data in the cloud or on solid state disks (SSD) has also become an option.
Object Storage: Within the application, all object instances live in random access memory (unless they are memory mapped to, e.g., your swap file, which for now we assume they are not). If the power goes down, random access memory loses all its data, including all of the application’s object instance data. If you want to hold on to that data you have to persist it; that is where Object Storage comes into play, as a place to persist your application’s object instance data.
Most software and web development frameworks (for example .Net) provide ready-to-use boilerplate source code or complete implementations of object interfaces for your custom classes, which let you easily persist their instance data to a document of a particular structure. Such code automatically takes care of finding your instance data and converting it to the appropriate format (e.g. for date/time data types). Some implementations also handle enumerations, complex data types, object hierarchies and arrays. Implementations for the Binary format, and perhaps XML, are common, but online you can find boilerplate source code to persist your objects to pretty much any format you have ever heard of.
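The post names .Net as its example; as a stand-in, the following sketch uses only Python's standard library to show what such ready-made persistence code does. The classes, the `to_document` helper and the sample values (including the birth date) are hypothetical and exist only for this illustration. It walks an object hierarchy, including a list, and converts a date and an enumeration into a storable form.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date
from enum import Enum


class PhoneType(Enum):
    HOME = "home"
    FAX = "fax"


@dataclass
class Phone:
    type: PhoneType
    number: str


@dataclass
class Person:
    first_name: str
    born: date
    phones: list = field(default_factory=list)


def to_document(obj):
    """Serialize a dataclass instance, nested objects included, to JSON text."""

    def convert(value):
        # Called by json.dumps for values it cannot encode on its own.
        if isinstance(value, date):
            return value.isoformat()
        if isinstance(value, Enum):
            return value.value
        raise TypeError(f"cannot serialize {type(value).__name__}")

    # asdict() walks the object hierarchy, lists included.
    return json.dumps(asdict(obj), default=convert, sort_keys=True)


person = Person(
    first_name="John",
    born=date(1987, 3, 14),  # made-up sample value
    phones=[Phone(PhoneType.HOME, "212 555-1234")],
)
document = to_document(person)
print(document)
```

JSON is used here only because the standard library makes it short; the same pattern applies to binary or XML output.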
Some challenges for Object Storage are object versioning and object inheritance. Handling object references rather than instances can also be tricky. Don’t assume that those “advanced features” are implemented in every development framework. Always verify before using, and benchmark after coding: default implementations are often not the best-performing ones.
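The reference problem is easy to demonstrate. In this Python sketch, two entries of a hypothetical `household` structure point at the same address object. One serializer in the standard library (`pickle`) keeps that shared reference intact on a round trip, while another (`json`) silently turns it into two independent copies. The variable names are assumptions for illustration.

```python
import json
import pickle

# Two people share one address object (a reference, not a copy).
address = {"city": "New York", "state": "NY"}
household = {"john": address, "jane": address}

# Round trip through each serializer.
via_json = json.loads(json.dumps(household))
via_pickle = pickle.loads(pickle.dumps(household))

# JSON writes the address twice, so the shared reference is lost.
json_keeps_reference = via_json["john"] is via_json["jane"]

# pickle records that both entries point at the same object.
pickle_keeps_reference = via_pickle["john"] is via_pickle["jane"]

print(json_keeps_reference, pickle_keeps_reference)
```

After the JSON round trip, changing John's address no longer changes Jane's, which is the kind of subtle behavior worth verifying before relying on a default implementation.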
Serialization: According to Wikipedia “serialization is the process of converting a data structure or object state into a format that can be stored (for example, in a file or memory buffer, or transmitted across a network connection link) and “resurrected” later in the same or another computer environment.”
(Source: http://en.wikipedia.org/wiki/Serialization)
Most commonly, serialization refers to what you have to do when you want to persist an object instance in your application onto a storage medium other than memory. The basic motivation for serialization is that, unlike random access memory, other storage types (like disk, cloud or SSD) do NOT provide a fast mechanism to access any byte, anywhere, at any time. Performance suffers heavily when reading from, e.g., a hard drive using random access. To make up for some of this inherent performance disadvantage, it is better to “stream” the data to and from the storage device sequentially. That is why serialization usually produces a structured document that is suitable for streaming to its storage destination, the document storage.
Serialization comes at a price. Depending on the document storage format you choose, you will observe a different impact on CPU utilization and transmission bandwidth. Serialization tends to “bloat” the amount of data you have to transmit and store.
Sometimes memory-mapped disk storage can be an alternative to serialization but, I believe, it is not widely used. As SSDs become cheaper and faster, there may come a day when memory mapping becomes a viable alternative to serialization (I have pitched this idea to FusionIO but they didn’t seem to be impressed).
Common Document Storage Formats: As mentioned before, there are many different formats for structured document storage. The best known are probably XML and Binary. With MongoDB becoming a “household document store”, JSON and BSON are also becoming more widely known (yes, I know, the Java programmers out there will disagree with me that it took MongoDB to make JSON famous). A more exotic one is Protocol Buffers, a very compact, binary, structured document storage format introduced and used by Google.
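For readers who have not used memory mapping, here is a minimal sketch with Python's standard `mmap` module. The record layout (`RECORD`), the file name and the helper functions are assumptions invented for this example. A file is mapped into memory and a single record is written and read in place, without serializing the whole data set into a stream first.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-size record: 16-byte name and unsigned 16-bit age.
RECORD = struct.Struct("<16sH")


def create_store(path, capacity):
    """Pre-allocate a zero-filled file with room for `capacity` records."""
    with open(path, "wb") as f:
        f.write(b"\x00" * RECORD.size * capacity)


def write_record(path, index, name, age):
    """Update one record in place through the memory mapping."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mem:
        RECORD.pack_into(mem, index * RECORD.size, name.encode("utf-8"), age)
        mem.flush()  # push the change to the underlying file


def read_record(path, index):
    """Read one record directly from its offset in the mapping."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mem:
        name, age = RECORD.unpack_from(mem, index * RECORD.size)
    return name.rstrip(b"\x00").decode("utf-8"), age


store_path = os.path.join(tempfile.mkdtemp(), "people.bin")
create_store(store_path, capacity=4)
write_record(store_path, 2, "John", 25)
print(read_record(store_path, 2))
```

The operating system decides when pages are actually read from or written to the disk, so the application addresses the data as if it were ordinary memory.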
(JSON sample document)
Preview of next week’s Episode 2: In my next blog post I will look closer at the XML and Binary document formats. Also I will start to look at different aspects of storage formats in general which have an impact on performance and scalability and provide reasons as to why one format might be more suitable for a certain use case than another.
Resources and references: If you are interested in more information please visit the following links:
http://www.json.org/
http://bsonspec.org/
http://code.google.com/p/protobuf/
http://www.w3.org/XML/
http://msdn.microsoft.com/en-us/library/72hyey7b(v=vs.71).aspx
http://en.wikipedia.org/wiki/Serialization
http://www.cs.cornell.edu/info/people/chichao/ccc-ch5.pdf
http://www.teamjohnston.net/blogs/jesse/post/2007/04/08/Serialization-Problems-and-Solutions.aspx
http://www.boost.org/doc/libs/1_35_0/libs/serialization/doc/special.html
http://java.dzone.com/articles/object-serialization-evil
(“consider the source” on this one!) http://www.versant.com/pdf/wp_vsnt_serialization.pdf