call us toll-free +1 855 855 3600
 
  • Home
  • >
  • Blog
  • >
  • Extract-Transform-Load (ETL) Technologies – Part 1
DB Best Chronicles
 
Talks on Big Data, Mobile Apps, Web and Software Development

Extract-Transform-Load (ETL) Technologies – Part 1

Posted by | On December 31st, 2012 | In Big Data, Database Migration | Tags: , , -->
 

Big Data – Extract-Transform-Load (ETL) 001 (click to download)

*~:~* Happy New Year *~:~*

My last blog (Column Oriented Database Technologies) discussed the differences between Row and Column oriented databases and some key players in this space.  Concepts and technologies on Big Data have been discussed in previous blogs (Big Data & NoSQL Technologies & NoSQL .vs. Row .vs. Column).  From these blogs one should surmise that deciding upon the best database technology (or DBMS vendor) really depends on schema complexities, how you intend to retrieve your data, and how to get it there in the first place.  We’re going to dive into this next, but before we do it is imperative that we briefly examine the differences between OLTP and OLAP database designs.  Then let’s leave OLTP details for a future blog as I expect most readers already know plenty about transactional database systems.  Instead we’ll focus here on OLAP details and how we process Big Data using ETL technologies for data warehouse applications.

Generally the differences between OLTP and OLAP database applications center upon how frequently data must be stored and retrieved, the integrity of that data, and how much of there is and its growth.  OLTP database schemas are optimized for processing transactions by an increasing number of users while OLAP database schemas are optimized for aggregations against an increasing amount of data and exponential query permutations.  Design considerations involve normalization, indexing, datatypes, user load, storage requirements, performance, and scalability.  We will need to defer the many interesting details on these considerations for a future blog.

      OLTP-DW

On-Line Transactional Processing (OLTP) applications, like Customer Relationship Management (CRM), (ERP), Corporate Financials (AR/AP/GL), e-Commerce, or other enterprise systems use traditional SQL queries that are either embedded within application code (not a great practice in my humble opinion), or within stored-procedures (a much better practice) in order to store and retrieve data.  Where data integrity and performance is critical, an OLTP database is the most appropriate.  OLTP schema designs are generally highly normalized, comprehensive structures that are optimized for fast transactional data processing.  Typically they represent the source data that feed a data warehouse.

OLTP-100410-300x242

On-Line Analytical Processing (OLAP) applications however focus on Data Warehouse and Business Intelligence systems where the volume of data can grow quite large.  Performance and scalability become the driving factors and data integrity is generally inherited from the source system reducing its importance.  Typically OLAP schema designs follow well known modeling practices like Ralph Kimball’s Star Schema or Dan Linstedt’s Data Vault (read my blog Data Vault – What is it?).  Perhaps the best way to think of an OLAP system is to think of the wizard behind the curtain who as all answers, for everyone, all the time.  Boy if were only true!  Instead, OLAP schema designs are generally refactored data structures based upon source OLTP systems that are optimized for fast query processing.

star schemathe data vault——————————————————————————————————————————————

And so we arrive at the crux of this blog.  How do we get data from the source OLTP system into the target OLAP data warehouse in a practical, efficient way?  And what do we need to do to that data to conform and resolve the clearly different schema designs?  So along comes ETL, or Extract-Transform-Load; fundamentally understandable, but potentially very hard to do once you dive into all the complexities involved; like peeling an onion or melting the wicked witch!  Let’s examine what is really involved.

etlOLTP-Systems-to-OLAP-Systems

————————————————————

Extracting data from a source dataset, transforming that data in potentially many particular ways, and loading the resulting data into a target dataset is the essence of ETL.  Understanding that is simple.  Yet consider that the source data may originate from several different files, tables, views, or databases; furthermore these are potentially varied in structure, location, and host systems (ie: Oracle, MS SQL Server, MySQL, etc.).  Also consider that transformations can include a myriad of different requirements from normalization to de-normalization of source data, lookups from other datasets, merging, sorting, truncating, datatype conversion, inner joins, outer joins, matched and/or unmatched records, etc.  These requirements are anywhere from simple to daunting.  Consider the data load; perhaps there is one target, perhaps many; data might reside in a data warehouse, and/or data marts; maybe it’s a Star Schema, maybe not!  Finally the entire ETL data process may be a full data set, or an incremental one.  OMG!  ETL data processing permutations are endless and one thing is certain, getting it right is critical!

ETL_input_output

——————————————————————————————————————————————

hand-power-toolsWhat should we do then to deal with these comprehensive and complex data processes?

We use tools of course!

Some may build their ETL data processing with SQL scripts, or maybe embedded SQL in another scripting language like PHP or Python.  Others may actually craft programs using high level languages like C#, Java, or Visual Basic.  These solutions are fine, of course, but they present a cumulative burden of development, maintenance, and cost that can easily exceed the alternative of using tools designed specifically for ETL data processing.  Not that there is no burden if you use ETL tools, but a greatly reduced burden, in my humble opinion.  There are several ETL tool vendors out there all providing the various functionalities needed.  I found this survey result, which is about as I would expect.

mostWantedETLFeatures

Keep in mind however that there are several other key features that should be considered when looking at ETL tools, including:

  • Automation
    • Task Scheduler & Triggers
    • Environment Control (ie: DB Connections, Global Variables)
    • Restart/Recover/Abort
    • Load Balancing
  • Administration
    • Admin Console
    • Project, User, Task Management
    • Integrated Source Control
    • Distributed Processing
    • Task Execution Analysis
  • Monitoring & Logging
    • Real-Time execution logging
    • File and Database error capture

Using ETL tools with the appropriate features and usability is very important, yet there is one more aspect I submit must be considered before crafting any ETL process.  Data Warehouse Schema design! All the objects involved within a Data Warehouse are really driven by the overarching architecture.  This aspect is so easily overlooked, or under-valued.  Many just slap something together and hope for the best.  Maybe this contributes to the almost 80% failure rates of first DW/BI project efforts.  Let’s take a step back then and look at this from 35,000 feet for a moment.

Ok, we all know that an OLAP Data Warehouse commonly supports analytics and reporting, often referred as Business Intelligence.  We typically understand what our data sources are and that we will use ETL data processes to move data through a system that in turn must be a sustainable, pliable, expanding DW/BI solution.  Yet what is the architecture of that system?  Do you notice we started from a comfortable position of factual information of what is known and quickly culminated in a place of genuine uncertainty about what will be constructed?  The good news is that we also do know what the results should be.

figureII01

Consider carefully then goals and requirements of the business and the metrics involved and then decide what the end data stores should be.  From there you can fill in the gaps; ask yourself these many questions:

  • What are the business metrics and how should they be defined?
  • What permutations are involved? (ie: reporting periods, frequency, filters, etc.)
  • Should the target data stores be column, or row based?
  • Should Star Schemas or Data Vaults be used?  and when?
  • What, if any, Data Marts would be employed?
  • Do they inter-relate or are they stand-alone?
  • How much source data exists already?
  • How big will the target data grow?  and how fast?
  • What are the expected complexities of the transformations involved?
  • How about hardware and software requirements?
  • Optimizations?  Scalability?  Functionality?  ~~ you get the picture …

I surmise that anyone who stops to think about the architecture involved, many more questions will easily come to mind.  The more questions you can consider and answer regarding any DW/BI system architecture the better; and sooner is better than later.  One principle I always try to employ in any design I craft, is that change is inevitable so DW/BI system architectures must be pliable and extensible.  Achieve this in your solution and you’re at least half way there.  Just remember to click your heels together three times first, close your eyes, and repeat: there’s no place like home!

One final good practice common to many DW/BI system architectures aims at staging data first, before it goes into the data warehouse.  Often called an ODS, or Operational Data Store, having a pre-emptive place to prepare, or stage, data before it moves into the data warehouse can be a tremendous advantage to any DW/BI system architecture.  It does introduce an additional ETL step, but usually quite worth it.  A decent ODS will contain a reduced dataset from the source with few to no transformations thus eliminating data you know you don’t need.  Then the next ETL step does the real work in transformation and eventually loading into the data warehouse.

BI_ETL

ETL data processing provides an essential methodology for moving data from point A to point B.  Crafting these processes can be straight-forward to highly complex.  I often wonder how we all got by with SQL scripting before ETL tools came along; today when faced with a data migration and/or date warehouse population task, I break out these tools and get to work.  Best advice: Follow the yellow brick road…

thumbnailCAY614JNAs this topic is quite extensive, my next blog, “Extract-Transform-Load (ETL) Technologies – Part 2”, will focus on ETL vendors so we can examine who the key players are and what they offer.

As always, don’t be afraid to comment, question, or debate… I learn new things every day!

Column Oriented Database Technologies

Posted by | On July 24th, 2012 | In Big Data, Web & Software Development | Tags: -->
 

Column Oriented Database Technologies (click to download)

My recent blog (Big Data & NoSQL Technologies) discussed various NoSQL technologies and market vendors.  Today let’s dive into column-oriented databases and why they should play an important role in any data warehouse whose focus is on aggregations or metrics (and whose isn’t?).

So you are all probably familiar with row-oriented databases.  Tables of data where rows of fields (also called columns) represent the structural storage and the corresponding SQL queries that select, insert, update, and delete that data.  Most database vendors like Oracle, Microsoft, Sybase, Informix, and many others all base their technology on this ANSI standard.  Column-oriented databases are indeed what you might surmise; tables of data where columns of data values represent the structural storage.  What you might not expect is that on the surface many column-oriented databases look and feel like row oriented databases also using SQL queries in much the same way.  Creating tables, storing data, querying them are all pretty much identical.  They may appear similar, but two principal things to understand is that the significant differences under the hood, in particular, physical storage and query optimization.

As noted in my previous blogs on NoSQL, there is also a column-store technology out there.  Let’s not confuse that with column oriented databases.  They are different.  Since several NoSQL column-store vendors were highlighted before, we will focus instead on the column oriented database vendors here.

First, some key benefits to column oriented databases:

  • High performance on aggregation queries (like COUNT, SUM, AVG, MIN, MAX)
  • Highly efficient data compression and/or partitioning
  • True scalability and fast data loading for Big Data
  • Accessible by many 3rd party BI analytic tools
  • Fairly simple systems administration

Due to their aggregation capabilities which compute large numbers of similar data items, column oriented databases offer key advantages for certain types of systems, including:

  • Data Warehouses and Business Intelligence
  • Customer Relationship Management (CRM)
  • Library Card Catalogs
  • Ad hoc query systems

Column oriented database technology has actually been around for many years originating in 1969 with an application called TAXIR which provided abstracts for mathematical biosciences.  In 1976, Canada implemented the RAPID system for processing and retrieval of population and housing census statistics.  Sybase IQ was the only commercially available column-oriented database for many years, yet that has changed rapidly in the last few years.  Let’s take a quick look at some of today’s key players:

SAP Sybase IQ  (www.sybase.com)

A highly optimized analytics server designed specifically to deliver superior performance for mission-critical business intelligence, analytics and data warehousing solutions on any standard hardware and operating system.  Its column oriented grid-based architecture, patented data compression, and advanced query optimizer delivers high performance, flexibility, and economy in challenging reporting and analytics environments.

Essentially a data partitioned, index based storage technology, Sybase IQ’s engine offers several key features which include:

    • Web enabled analytics
    • Communications & Security
    • Fast Data Loading
    • Query Engine supporting Full Text Search
    • Column Indexing Sub System
    • Column Storage Processor
    • User Friendly CUI based Administration & Monitoring
    • Multiplex Grid Architecture
    • Information Live-cycle management

The Sybase IQ Very Large Data Base (VLDB) option provides partitioning and placement where a table can have a specified column partition key with value ranges.  This partition allows data that should be grouped together to be grouped together and separates data where they should be separated.  The drawback to this methodology is that it is not always known which is which.

Infobright  (www.infobright.com)

Offering both a commercial (IEE) and a free community (ICE) edition, the combination of a column oriented database with their Knowledge Grid architecture delivers a self-managed, scalable, high performance analytics query platform.  Allowing 50Tb using a single server, their industry-leading data compression (10:1 up to 40:1) significantly reduces storage requirements and expensive hardware infrastructures.  Delivered as a MySQL engine, Infobright runs on multiple operating systems and processors needing only a minimum of 4Gb of RAM (however 16Gb is a recommended starting point).

Avoiding partition schemes, Infobright data is stored in data packs, each node containing pre-aggregated statistics about the data stored within them.  The Knowledge Grid above provides related metadata providing a high level view of the entire content of the database.  Indexes, projections, partitioning or aggregated tables are not needed as these metadata statistics are managed automatically.  The granular computing engine processes queries using the Knowledge Grid information to optimize query processing eliminating or significantly reducing the amount of data required for decompressing and access to answer a query.  Some queries may not need to access the data at all, finding instead the answer in the Knowledge Grid itself.

The Infobright Data Loader is highly efficient so data inserts are very fast.  This performance gain does come at a price so avoid updates unless absolutely necessary, design de-normalized tables, and don’t plan on any deletes.  New features to the data loader include a reject option which allows valid rows to commit while invalid rows are logged.  This is highly useful when loading millions of rows and only having a few rows with bad data.  Without this feature the entire data load would be rolled back.

Vertica (HP)  (www.vertica.com)

Recently acquired by Hewlett Packard, this platform was purpose built from the ground up to enable data values having high performance real-time analytics needs.  With extensive data loading, queries, columnar storage, MPP architecture, and data compression features, diverse communities can develop and scale with a seamless integration ecosystem.

Claiming elasticity, scale, performance, and simplicity the Vertica analytics platform uses transformation partitioning to specify which rows belong together and parallelism for speed.  Several key features include:

    • Columnar Storage & Execution
    • Real-Time Query & Loading
    • Scale-out MPP Architecture
    • Automatic High Availability
    • Aggressive Data Compression
    • Extensible In-Database Analytics Framework
    • In-Database Analytics Library
    • Database Designer & Administration Tools
    • Native BI & ETL support for MapReduce & Hadoop

The Vertica Optimizer is the brains of the analytics platform producing optimal query execution plans where several choices exist.  It does this through traditional considerations like disk I/O and further incorporates CPU, memory, network, concurrency, parallelism factors and the unique details of the columnar operator and runtime environment.

ParAccel  (www.paraccel.com)

Analytic-driven companies need a platform, not just a database where speed, agility, and complexity drive the data ecosystem.  The ParAccel Analytic Platform streamlines the delivery of complex business decisions through its high performance analytic database.  Designed for speed, its extensible framework supports on-demand integration and embedded functions.

The ParAccel Database (PADB) present four main components: the ‘Leader’ node, the ‘Compute’ node, the Parallel Communications Fabric, and an optional Storage Area Network (SAN).  The ‘Leader’ controls the execution of the ‘Compute’ nodes and all nodes communicate with each other via the ‘Fabric’ running on standard x86 Linux servers.  Each ‘Compute’ node is subdivided into a set of parallel processes called ‘slices’ that include a CPU  core, and thier allocation of memory, and local disk storage.  The ‘Communication Fabric’ provides a low-level MPP network protocol for increased performance.

Key PADB features include:

    • High Performance & Scalability
    • Columnar Orientation
    • Extensible Analytics
    • Query Compilation
    • High Availability
    • Solution Simplicity

ParAccel Integrated Analytics Library and Extensibility Framework incorporates advanced functions along with an API to add your own functions to help address complex business problems right in the core database enabling customers to focus upon their specific data complexities.

Microsoft SQL Server 2012  (www.microsoft.com)

Released this year, Microsoft has now embraced the columnar database idea. The latest SQL Server release 2012 includes xVelocity, a column-store index feature that stores data similar to a column-oriented DBMS.  While not a true column oriented database, this technique allows for the creation of a memory optimized index that groups and stores data for each column then and joins them together to complete the index.  For certain types of queries, like aggregations, the query processor can take advantage of the column-store index to significantly improve execution times.  Column store indexes can be used with partitioned tables providing a new way to think about how to design and process large datasets.

The column-store index can be very useful on large fact tables in a Star schema improving overall performance, however the cost model approach utilized may choose the column-store index for a table when a row based index would have been better.  Using the IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX query hint will work around this if it occurs.  When data is stored with a column-store index, data can often be compressed more effectively over a row based index.  This is accomplished as typically there is more redundancy within a column than within a row.  Higher compression means less IO is required to retrieve data into memory which can significantly reduce response times.

There are several restrictions and limitation in using a column-store index.  For example, which data types are supported or not and that you can only create one column-store index on any table can be problematic.  Become familiar with what it can do and where best to use it.  Currently the column-store index is not supported on Microsoft Azure.

Column-oriented databases provide significant advantages over traditional row oriented system applied correctly; In particular for data warehouse and business intelligence environments where aggregations prevail.  It would not be fair however to ignore the disadvantages.  Let’s look at these two:

  • Column-Oriented Advantages
    • Efficient storage and data compression
    • Fast data loads
    • Fast aggregation queries
    • Simplified administration & configuration
  • ‘Column-Oriented Disadvantages
    • Transactions are to be avoided or just not supported
    • Queries with table joins can reduce high performance
    • Record updates and deletes reduce storage efficiency
    • Effective partitioning/indexing schemes can be difficult to design

The real value in using column-oriented database technology comes from high performance, scalable storage and retrieval of large to massive datasets (Big Data) focused on aggregation queries.  Simply put: Reports!  You can design Star schema’s or Data Vaults (The Data Vault – What is it? – Why do we need it?) incorporating these technologies and you will find that column-oriented databases provide a clear solution in data warehouse and business intelligence.

Look for future blogs on Hadoop/Hive/HBase and Extract-Transform-Load (ETL) technologies, and don’t be afraid to comment, question, or debate, there is always room to learn new things…

The Data Vault – What is it? – Why do we need it?

Posted by | On June 25th, 2012 | In Big Data -->
 

The Data Vault – What is it? – Why do we need it? (click to download)

As mentioned in my last blog (Big Data & NoSQL Technologies) understanding and using the right tools for the job is essential.  Database systems and applications are no exception. For Big Data or Small Data, knowing and using the right data modeling and methodology is also critical. Today, business uses the Internet to make money yet harvesting the data in an efficient way often develops into a considerable IT challenge. The ongoing debate behind the many issues involved are not the focus of this blog however, the concepts presented here are.

Data Warehouse and Business Intelligence (DW/BI) Systems have become a solution of choice for industries to address the many critical business questions ingrained in this vital data. Providing an essential data integration process supporting a variety of reporting requirements, DW/BI Systems often involve significant construction effort, on-going maintenance, and reengineering effort when data structures and business rules change. And they do change; presenting us with problem #1.

Building effective and efficient DW/BI Systems can quickly become a daunting and difficult technical ordeal even for veteran engineering teams. Several integrated technologies are required from database systems, data transformation (ETL) tools, various programming languages, administration and reporting software to high performance networks and computers having very large storage capacities. In many cases even the sheer volume of data can be overwhelming; Yup, problem #2.

Once reporting is delivered to diverse user communities that consume the data simplified usability is expected yet technological realities today often require highly trained end-users. The design, creation, delivery, and support of a robust, effortless DW/BI System for intelligent use, sadly, are complex. You guessed it; problem #3.

Furthermore, as upstream systems change, and as DW/BI technology plows ahead, and as the dynamic complexities involved prevail, we also find that every so often new data sources need to be added to the mix. These are usually unpredicted and unplanned for. The integration impact can be enormous of requiring complete regeneration of the aggregated data; hence, problem #4.

So how do we solve these problems?  Well …

Bill Inmon widely regarded as the father of data warehousing, defines a data warehouse as:

a subject oriented, nonvolatile, time-variant collection of data in support of management’s decisions (http://en.wikipedia.org/wiki/Bill_Inmon).

Ralph Kimball, a pioneering data warehousing architect, developed the “dimensional modeling” methodology now regarded as the de-facto standard in the area of decision support. The Dimensional Model (called a “star schema”) is different from Inman’s “normalized modeling” (sometimes called a “snowflake schema”) methodology.  In Kimball’s Star Schema, transactional data is partitioned into aggregated “facts” with referential “dimensions” surrounding and providing descriptors that define the facts.  The Normalized Model (3NF or “third normal form”) stores data in related “tables” following relational database design rules established by E. F. Codd and Raymond F. Boyce in the early 1970’s that eliminate data redundancy.  Fostering vigorous debate amongst DW/BI Architects as to which methodology is best, both have weakness when dealing with inevitable changes in the systems feeding the data warehouse and in cleansing data to conform to strict methodology requirements

Further, the OLAP cube (for “online analytical processing”) is a data structure that allows fast analysis of data from multiple perspectives.  The cube structure is created from either a Star or Snowflake Schema stored as metadata from which one can view or “pivot” the data in various ways.  Generally cubes have one time based dimension that supports a historical representation of data.  Creating OLAP cubes can be very expensive and often create a significant amount of data that is of little or no use.  The 80/20 rule appears in many cases to hold true (where only 20% of the OLAP cube data proves useful) which begs the question: Built upon a traditional architecture does an OLAP cube truly deliver sufficient ROI?  Often, the answer is a resounding, NO!  Durable DW/BI systems must deliver real value.

** A Different Approach **

Data Vault Modeling is a hybrid methodology providing historical data storage from multiple sources designed to be resilient to environmental changes.  Dan Linstedt, the creator of this methodology, describes the resulting database as:

A detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business.  It is a hybrid approach encompassing the best of breed between 3NF and Star Schemas.  The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.  (http://en.wikipedia.org/wiki/Data_Vault_Modeling)

One Data Vault philosophy is that all data is relevant, even if it is wrong. Data being wrong is a business problem, not a technical one.  I agree!  This simple premise of the Data Vault is to ingest 100% of the source data 100% of the time, good, bad, or ugly. Relevant in today’s world, auditability and traceability of all the data in the data warehouse thus become a standard requirement. This data model is architected specifically to meet the needs of today’s enterprise DW/BI Systems:

To understand the Data Vault is to understand the business

Focused on the business process, the Data Vault as a data integration architecture, has robust standards and definitional methods which unite information in order to make sense if it. The Data Vault model is comprised of three basic table types:

HUB       – containing a list of unique business keys having its own surrogate key

LNK        - establishes relationships between business keys (typically hubs, but links can link to other links)

SAT         - holding descriptive attributes that can change over time (similar to a Kimball Type II slowly changing dimension)

 

There are several advantages to the Data Vault approach:

    • Simplifies the data ingestion process
    • Removes the cleansing requirement of a Star Schema
    • Instantly provides auditability for HIPPA and other regulations
    • Puts the focus on the real problem instead of programming around it
    • Easily allows for the addition of new data sources without disruption to existing schema

Take a look at this short video showing the power of the Data Vault model. Visit www.danlinstedt.com and http://learndatavault.com for much more on Data Vault modeling and methodology.

In conclusion, the Data Vault modeling and methodology addresses most elements of the problems we identified above:

    • It adapts to a changing business environment
    • It simplifies the DW/BI design complexities
    • It increases usability by business users because it is modeled after the business domain
    • It allows for new data sources to be added without impacting the existing design

Very Cool!

As the architect of TrinityIDC, a DV/BI cloud based solution, I can say that the more I use the Data Vault modeling and methodology, the more I love to use it!  Ask us at DB Best to help you design yours!

Big Data and NoSQL Technologies

Posted by | On June 15th, 2012 | In Big Data -->
 

Big Data & NoSQL Technologies (click to download)

As a follow-up to my last blog (NoSQL .vs. Row .vs. Column) let’s take a closer look at Big Data and the emerging ‘NoSQL’ technologies in today’s marketplace.  As a refresher, two points to reiterate first:

      • NoSQL means ‘Not Only SQL’, as opposed to ‘Not SQL’, as many perceive
      • There are three main variations of NoSQL out there:
        • Key Value
        • Document Store
        • Column Store

It is these variants that we can now examine in more detail, plus have a first look at some of the vendors playing in this field.  As organizations increasingly capture large amounts of both structured and non-structured data, the NoSQL phenomena has come at a perfect time.  Database architects today are wise to consider several factors in choosing the right tool for the job as Big Data is found everywhere:

      • Daily Weather
      • Energy geophysical
      • Pharmaceutical drug  testing
      • Telecom traffic
      • Social media messaging
      • On-line gamming
      • Stock Market
      • Court documents
      • Email/text messaging

So what is Big Data?  This new industry ‘catch phrase’ currently has about as many definitions as there are people talking about it.  Perhaps the best definition may be:

Massive datasets that are organized, manipulated, and managed by tools, processes, procedures, and storage facilities

Realistically today’s Big Data will be tomorrow’s little data, as it is growing at an increasing pace.  Big Data offers a competitive advantage presenting a formidable opportunity, yet it also presents a daunting challenge to IT experts as a transformative technology.  Making Big Data complexity even more complex is the many commonly used document/data storage formats.  Read about them in another blog (Episode 1- Introduction and definitions) posted by my associate, Julius Gabby.

Big Data datasets are largely greater than 100GB yet are expected to easily reach the terabyte to petabyte range and beyond (exabytes & zettabytes).  So when a traditional RDBMS just isn’t the answer, NoSQL may in fact be.  In processing big data one should consider what each NoSQL technology provides and then choose the right vendor.  Let’s take a look:

  • Key Value

This variant is best suited for fast transactions where append/remove operations are essential, like an online shopping cart.  Usually a persistent, in-memory data store, web applications that need heavy I/O operations are good candidates for Key Value vendors like:

An open source, advanced key-value store, often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.  Written in C/C++ this product is blazingly fast which makes it great for real-time data collection.

A powerful open-source, distributed database, that scales capacity predictably and simplifies development through features that quickly prototype, test, and deploy applications.  A ‘Master-less’ clustering (no node is special), no-sharding, most boring database you’ll ever run in production (their words, not mine).  Written in Erlang and C this product provides transparent fault-tolerant/fail-over capability and a robust and flexible API which makes it great for point-of-sale and factory control systems. 

A scalable in-memory database providing full ACID transactional consistency and ultra-high throughput self-referred to as the NewSQL (setting itself apart from other NoSQL vendors).  Using Java stored procedures this product relies upon partitioning and replication to achieve high-availability data snapshots and durable command logging (for crash recovery) which makes it great for capital markets, digital networks, network services, and online games.

  • Document Store

This variant is best suited for highly unstructured data where more functionality is needed.  Data is restructured into document objects using named-value pairs thus supporting more detailed information as yet undefined.  Usually data is grouped into collections with simple query mechanisms, so web applications that need performance and have frequently changing data structures are good candidates for Document Store vendors like:

From “humongous” this scalable, high-performance, open-source NoSQL database features document-oriented storage (JSON-like), full index support, replication, and fast in-place updates.  Written in C/C++ this product is best for dynamic queries, dynamic data structures, and if you prefer indexes over Map/Reduce.

An open-source database that focuses on ease of use storing data in a collection of JSON documents each maintaining its own schema definitions.  ACID semantics implement eventual consistency which avoids locking database files during writes.  Written in Java, this product is ideal for web based applications that handle huge amounts of loosely structured data.

  • Column Store

This variant is focused upon massive amounts of unstructured data across distributed systems (think Facebook and Google), trading ACID semantics for advantages in performance, availability, and operational manageability.  Also called ‘Extensible Record Store’, applications that write more than read data are good candidates for Column Store vendors like:

An open-source distributed database management system designed to handle very large amounts of data spread across many servers while providing a highly available service with no single point of failure.  Written in Java, with linear scalability and proven fault-tolerance coupled with column indexes, this product is best for no-transactional real-time data analysis.

HBase is the Hadoop database which is a distributed, scalable, Big Data Store modeled after Google’s BigTable technology.  Running on top of HDFS (Hadoop Distributed File System) accessed with Map/Reduce, compression, in-memory operation, and space-efficient probabilistic data structures are some key features.  Written in Java, this product is great when you need random, real-time read/write access to your Big Data.

Remember that NoSQL is not competitive to traditional RDBMS technologies (either row or column based), it is complementary.  And as such have both strengths and weaknesses.  Let’s look at these too:

  • NoSQL Strengths
    • The clear winner when you need the ability to store and look up Big Data
    • Is Application focused
    • Supports HUGE data capacity
    • Fast Data Ingestion (loads)
    • Fast Lookup Speeds (across clusters)
    • Streaming Data
  • NoSQL Weakness
    • Conceivably an expensive infrastructure
    • Is very complex
    • Engineering talent still hard to find
    • Generally there is no SQL interface
    • Limited programmatic interfaces
    • Inadequate for Analytic Queries (aggregations, metrics, BI)

There you have it; in a nutshell, right??  Well, I for one believe there will be tremendous growth in this marketplace for several years to come, misinformation will likely proliferate, and many Big Data projects may unknowingly risk failure depending upon how well informed the architects are and how well business expectations are set.

Perhaps having the right tool is agreeably the right thing, but having the right data modeling and methodology is also critical.  Think Data Vault; more on that in a future blog!

Other future blogs will include a deep dive into Hadoop/Hive/HBase and Column based databases.  So come back for more…

NoSQL .vs. Row .vs. Column

Posted by | On May 30th, 2012 | In Big Data -->
 

The hype and disinformation that grudgingly prevails in the data warehouse world today brings me to raise the debate to a rational level. Let’s set aside 3NF and STAR schemas for a moment and the many flavors of analytics along with all their technologies. Let’s temporarily ignore e-commerce, database migrations, business intelligence, and data collection and processing systems. Instead let’s look at three different data storage methodologies. These are:

  • NoSQL - very new, lots of hype, and which really means ‘NOT ONLY SQL’
  • ROW - your traditional record database, well known and loved
  • COLUMN - still relatively new, widely misunderstood, yet still feels like normal SQL

To look at these three together I think we must first look at them separately.  So here goes…

The ROW based database storage methodology is one most of us are already familiar with.  Depending upon your vendor of choice (like Oracle, Microsoft, MySQL, DB2, etc…) DDL and DML syntax creates tables that stores and retrieves records. Largely based upon some form of key, be it natural or surrogate (let’s debate the many issues of schema design another time). The relational data model thrives upon the ROW based database and is widely used for many OLTP and OLAP system and/or applications.  Highly efficient in complex schema designs and SQL queries, ROW based database engines offer a tried and true way to build solid solutions. We should not throw this away, I won’t!

The COLUMN based database storage methodology has been around for a while as an alternative to ROW based databases from various new vendors (like InfoBright, Vertica, Sybase IQ, etc…). Generally the DDL and DML syntax is similar to ROW based databases, yet under the hood things are usually radically different, and much more efficient for processing aggregations. This is the main thing that sets it apart from ROW based engines.  Some of these column based technologies also provide high data storage compression which allows for a much smaller disk footprint.  In some cases as much as 10/1 over their row based counterpart.  We should adopt this where appropriate, I am!

The NoSQL based storage methodology (notice I don’t call it a database) is the new kid on the block which many vendors vying for your attention (like Hadoop, Cassandra, MongoDB, etc…). Many people view the NoSQL technology as the replacement to ROW or COLUMN based databases, but let me say right off, this is the wrong way to think of NoSQL.  Instead, as a highly optimized, highly scalable, high performance Distributed File System.  Yet the NoSQL storage capabilities offer striking features simply not practical with ROW or COLUMN databases.

Let’s however be very clear about what NoSQL is.

While there are three main variants (which I will cover shortly), NoSQL technologies address narrow yet important business needs.  Most NoSQL vendors support structured, semi-structured, or non-structured data which can be very useful indeed. The real value, I believe, comes in the fact that NoSQL can ingest HUGE amounts of data, very fast.  Forget Gigabytes, and even Terabytes, we are talking Petabytes!  Gobs and gobs of data!  With clustering support and multi-threaded inner-workings, scaling to the future expected explosion of data will seem a no-brainer with a NoSQL environment in play.  Let’s get excited, but temper it with the understanding that NoSQL is COMPLIMENTARY and not COMPETITIVE to ROW and COLUMN based databases. And also note that NoSQL is NOT A DATABASE but a high performance distributed file system and really great at dealing with lots and lots of data; did I say BIG DATA!

I mentioned that there are three main variations of NoSQL.  These include:

  • Key Value  – which support fast transaction inserts (like an internet shopping cart); Generally stores data in memory and great for web. applications that need considerable in/out operations
  • Document Store - which stores highly unstructured data as named value pairs; great for web traffic analysis, detailed information, and applications that look at user behavior, actions, and logs in real time.
  • Column Store  – which is focused upon massive amounts of unstructured data across distributed systems (think Facebook & Google); great for shallow but wide based data relationships yet fails miserably at ad-hoc queries

So to bring these three very different database storage technologies into a conjoined perspective, I think it behooves us all to consider that essentially we need all three.  Regardless of what type of system being built, I’ve always subscribed to the notion that one should use the right tool for the job.  You just gotta know what those are!

In summary let me say this:  I believe, generally speaking, that each of these three data storage technologies offer specific features and therefore should be used in specific ways.

  1. ROW based databases should prevail when you want a complex, but not too-huge data set that requires efficient storage and retrieval for OLTP and even some OLTP usage;
  2. COLUMN based database are clearly aimed at analytics; optimized for aggregations coupled with huge data compression and should be adopted for most business intelligence usage;
  3. NoSQL based data solutions step in when you need to ingest BIG DATA,  fast, Fast, FAST… and when you only really need to make simple correlations across the data quickly;

Well, there you have it…  most of it, in my humble opinion anyway!