Tuesday, February 14, 2017

RDBMS to Big Data Hadoop Via Cloudera




Consumer Electronics Database

A manufacturer of consumer electronics (local - so safe from Donald) was concerned about his supply chain. Why? As with most supply chain systems, there are multiple risk factors. Geo-political risk is usually one of them - tracking risk to the supply chain from other regions (riots, coups d'état, etc.). But now, with the new White House administration, who would have thought that a "far away" geo-political risk factor would originate from us - Donald potentially blocking or heavily taxing parts imports. The company needed to track all parts, their suppliers, and the regions of those suppliers.


Graduate To RDBMS From Excel

They had a list of their suppliers stored in an Excel spreadsheet - first on a local disk drive, then later "upgraded" to the cloud (Google & Microsoft 365). But soon the need to write extensive queries outgrew what pivot tables, sorts, filters, and macros can do. The decision was made to move the XLS data into an RDBMS. The first attempt used Oracle MySQL Workbench CE running on Windows 10. The data cleaning and ingestion were done via Python (another topic). Once the data is in the RDBMS, standard MySQL queries can be used. Here is an example:


A supplier database on an RDBMS running on W10
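
A query of this kind can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the table and column names (suppliers, parts, region), the sample rows, and the use of the standard-library sqlite3 module as a stand-in for MySQL so that the snippet runs without a database server. The SELECT itself is plain SQL that MySQL should also accept.

```python
import sqlite3

# Hypothetical schema: the real supplier database's tables are not shown in this post.
SCHEMA = """
CREATE TABLE suppliers (
    supplier_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    region      TEXT NOT NULL
);
CREATE TABLE parts (
    part_id     INTEGER PRIMARY KEY,
    description TEXT NOT NULL,
    supplier_id INTEGER NOT NULL REFERENCES suppliers(supplier_id)
);
"""

# Made-up sample rows, for illustration only.
SUPPLIERS = [
    (1, "Acme Capacitors", "Asia"),
    (2, "Borealis Boards", "North America"),
    (3, "Cobalt Connectors", "Asia"),
]
PARTS = [
    (10, "capacitor 10uF", 1),
    (11, "capacitor 22uF", 1),
    (12, "PCB rev A", 2),
    (13, "USB connector", 3),
]

# How many parts depend on each region? Busiest region first.
PARTS_PER_REGION = """
SELECT s.region, COUNT(p.part_id) AS part_count
FROM suppliers AS s
JOIN parts AS p ON p.supplier_id = s.supplier_id
GROUP BY s.region
ORDER BY part_count DESC, s.region
"""


def parts_per_region():
    """Load the sample rows into an in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        conn.executemany("INSERT INTO suppliers VALUES (?, ?, ?)", SUPPLIERS)
        conn.executemany("INSERT INTO parts VALUES (?, ?, ?)", PARTS)
        return conn.execute(PARTS_PER_REGION).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for region, part_count in parts_per_region():
        print(region, part_count)
```

Grouping parts by supplier region is the sort of question that is awkward in a spreadsheet but takes one JOIN and one GROUP BY once the data sits in tables.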


Big Data = Volume. Velocity. Variety.

The CEO was happy - he had finally graduated from a spreadsheet to an RDBMS. But he also wanted to deploy it as "Big Data". The list of suppliers (and parts) will grow as we start to migrate other product lines into the RDBMS. He also wanted to use all the goodies of Big Data - deployment in the cloud, advanced analytics, different data types (like pictures). Volume. Velocity. Variety.


Cloudera - First Dip Into "Big Data" Apache Hadoop

So using the same trusty W10 machine, I decided to prototype a Big Data deployment for the supplier RDBMS. Cloudera offers a VM (in multiple flavors) and a container as prototyping vehicles. So after installing Oracle VirtualBox (I have had better luck with it than with VMware), my own Cloudera instance was running.

Cloudera VM (CentOS Guest OS), Running on W10 VirtualBox



Cloudera Allows Steps Of Migration 

The neat thing about the Cloudera setup is that an RDBMS is already set up in the VM. So you can check out your RDBMS in the Cloudera VM before you migrate to Hadoop.

MySQL In The Cloudera VM

Apache Sqoop : RDBMS -> Hadoop

Once I had confidence that the RDBMS setup was good in the Cloudera VM, I started down the unknown path of converting it to Hadoop. Apache Sqoop supports this endeavor - but Cloudera made it super easy.

A Little C Shell Script To Convert RDBMS Into Hadoop
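
For readers who want a starting point, here is a minimal sketch of how a single-table Sqoop import could be assembled. It is written in Python rather than C shell, so it is not the script from this post. The JDBC URL, user name, password file, table name, and HDFS directory are all hypothetical placeholders. The flags are standard `sqoop import` options, but check them against the Sqoop version that ships with your Cloudera VM.

```python
import shlex


def build_sqoop_import(jdbc_url, username, password_file, table,
                       target_dir, num_mappers=1):
    """Return the argument list for a single-table `sqoop import`.

    Nothing is executed here; the caller decides whether to print the
    command or hand it to subprocess.
    """
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC URL of the source RDBMS
        "--username", username,
        "--password-file", password_file,  # file holding the password
        "--table", table,                  # source table to copy
        "--target-dir", target_dir,        # HDFS directory for the output
        "--num-mappers", str(num_mappers), # number of parallel map tasks
    ]


if __name__ == "__main__":
    # All values below are placeholders, not the settings used in this post.
    cmd = build_sqoop_import(
        jdbc_url="jdbc:mysql://localhost/supplier_db",
        username="etl_user",
        password_file="/user/cloudera/.mysql_password",
        table="suppliers",
        target_dir="/user/cloudera/suppliers",
        num_mappers=2,
    )
    print(shlex.join(cmd))
```

Building the command as a list keeps quoting problems away. On a machine where Sqoop is installed, the same list can be passed to `subprocess.run(cmd, check=True)` to start the import.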


Once the script is launched, MapReduce takes over for hours. You can check on the progress using a web browser.

Using A Web Browser To Check On Apache Sqoop


In Big Data Land!

After the process completes, we can use Cloudera Hue (its web interface, which includes an Apache Hive query editor) to reuse many of the MySQL queries.

The Supplier Database, Originally In RDBMS, Is Now In "Big Data"!


Conclusion:

The move from RDBMS to Hadoop is manageable if taken in baby steps. Cloudera's environment makes that easy to do. For my next task, I will create clusters.

