Friday, December 6, 2013

SSIS Package to extract data from Hortonworks to SQL Server

I am in the midst of comparing a few different architectural scenarios, one of which led me to test extracting data from Hortonworks (running in a local VirtualBox VM) using an SSIS package and the Hive ODBC Driver 1.2.

To get you up to speed on the high-level architecture options I am considering, here is an overview:


The goal is to be able to ingest nearly any data source, join it with internal metadata, aggregate it, and expose it to our users via our web-based application and/or export slices back into any format customers may require. To determine the best ETL solution for our needs, we are comparing Talend and SQL Server Integration Services (SSIS). We researched other solutions thoroughly as well, but for our particular needs these two options seemed the most viable.

I like that the Talend Data Integration tool comes with connectors to over 400 different data types, but to use it in an enterprise setting with shared source control we'd need to license it properly, which becomes costly because it is priced as a per-user subscription. The alternative is to use a tool we are already paying for with our SQL Server licensing: Integration Services, which I've been using since SQL Server 2005 (before 2005 it was DTS).

If you haven't already played with Hortonworks, I thoroughly encourage you to do so. It can be done on Windows if you need; the easiest way is to download their Sandbox and run it in VirtualBox. Their tutorials are extremely easy to follow. Technical disclosure: all of the pieces I'm testing initially are running on my laptop on Windows 7 Pro (64-bit) with 32GB RAM. I set the VirtualBox VM to use 8GB RAM. The goal is to test the functionality first, and then create a working prototype to benchmark and fully test before making a final decision.

One of the tutorials had me install the Hortonworks ODBC 1.2 Driver (I installed the 32-bit version for this test) to pull data into Excel. I then uploaded a test file into HCatalog on Hortonworks: a .csv file with 16 columns and approximately 48K rows, which serves as my initial dataset.

I configured SSIS with the ODBC connection created earlier and an OLE DB connection to a local SQL Server 2012 database installation. Because I installed the 32-bit ODBC driver, I needed to update the project's debugging configuration and set Run64BitRuntime to False, so that the package runs in the 32-bit runtime where the 32-bit driver is available.
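The reason behind that setting is that a 32-bit ODBC driver can only be loaded by a 32-bit process. As a rough illustration only (this is not part of the SSIS package, and the function names are hypothetical), the Python sketch below reports the bitness of the running process and checks it against a driver's bitness:

```python
import struct


def process_bits():
    """Return 32 or 64, based on the pointer size of the running interpreter."""
    return struct.calcsize("P") * 8


def can_load_driver(driver_bits, host_bits=None):
    """A driver library loads only into a process of the same bitness.

    host_bits defaults to the bitness of the current process.
    """
    if host_bits is None:
        host_bits = process_bits()
    return driver_bits == host_bits


if __name__ == "__main__":
    print("This process is", process_bits(), "bit")
    print("Can load a 32-bit driver:", can_load_driver(32))
```

The same logic applies to SSIS: with Run64BitRuntime left at True on a 64-bit machine, the host process is 64-bit and the 32-bit driver is not usable.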


The SSIS Data Flow task ran successfully, and fairly quickly considering that all of the components are running locally on the same machine, which is far from an optimal setup.
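For anyone who wants to see what the data flow amounts to outside of SSIS, here is a minimal sketch of the same source-to-destination copy in Python. It is not what the package does internally. It assumes two DB-API connections whose drivers use `?` parameter markers; the function name, table names, and batch size are all hypothetical, and the demo uses in-memory SQLite databases as stand-ins for Hive and SQL Server so that it runs anywhere.

```python
import sqlite3


def copy_rows(src_conn, dst_conn, select_sql, insert_sql, batch_size=1000):
    """Stream rows from a source query into a destination table in batches."""
    src_cur = src_conn.cursor()
    dst_cur = dst_conn.cursor()
    src_cur.execute(select_sql)
    total = 0
    while True:
        batch = src_cur.fetchmany(batch_size)
        if not batch:
            break  # source is exhausted
        dst_cur.executemany(insert_sql, batch)
        total += len(batch)
    dst_conn.commit()
    return total


def demo():
    """Copy 2500 generated rows between two in-memory SQLite databases."""
    src = sqlite3.connect(":memory:")  # stand-in for the Hive source
    dst = sqlite3.connect(":memory:")  # stand-in for the SQL Server target
    src.execute("CREATE TABLE sample (id INTEGER, name TEXT)")
    src.executemany(
        "INSERT INTO sample VALUES (?, ?)",
        [(i, f"row{i}") for i in range(2500)],
    )
    dst.execute("CREATE TABLE sample_copy (id INTEGER, name TEXT)")
    copied = copy_rows(
        src,
        dst,
        "SELECT id, name FROM sample",
        "INSERT INTO sample_copy VALUES (?, ?)",
    )
    landed = dst.execute("SELECT COUNT(*) FROM sample_copy").fetchone()[0]
    return copied, landed


if __name__ == "__main__":
    print(demo())
```

Reading with fetchmany and writing with executemany keeps memory use bounded, which matters more as the dataset grows beyond a 48K-row test file.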



Next I'll be attempting a similar test in Talend. I'll post the results of that one soon.
