Thursday, December 19, 2013

Unofficial benchmark results comparing SSIS to Talend

As a follow up to my previous post, here is the output of my various tests comparing the extraction only processing rates between SSIS and talend.

Flat to Flat
I created a flat file (.csv) with 10M rows with the following fields (sizes to indicate approx size of expected data)
Id - Bigint
OtherId - Int
SomeOtherNumber - Int
RandomName - VarChar(250)
AnotherNumber - Int
BitField - true or false
EmailAddress - VarChar(250)

Three runs to verify consistency were done, results are below:

SSIS: Average duration 44.5 seconds
Talend: Average duration 1 minute 9 seconds

Large Flat File to SQL Server
Same flat file used above, but loaded into SQL Server 2012 located on my laptop. So Flat File -> SQL Server 2012, with no transformations just straight data columns to data fields. Again, three runs to verify consistency, results are as follows:

SSIS: Average duration 59.6 seconds
Talend: Average duration 6 minutes 56 seconds
(nope, that's not a typo, that's nearly 7 minutes)

Many Files to SQL Server
Next I attempted ingesting from a folder containing 3 subfolders containing 79 files, which contained a total of 2,187,842 records. Three runs, results are below:

SSIS: Average duration 22.8 seconds
Talend: Average duration 42.2 seconds

This is a very unofficial benchmark testing in trying to determine if we should start focusing on Talend as an ETL tool over SSIS. As Talend boasts over 450+ connectors out of the box and is platform independent, it was worth a look. I did not expect for SSIS to outperform in all scenarios. I expected better integration and speed with MSFT products (Flat File to SQL Server), which is what SSIS was designed for and which performance meets my expectations. 

This is not an indication that Talend is or may be a poor solution, but in our environment, SSIS is the clear winner so far.
To be completely fair, and if I'd had time, I would have installed Talend on linux and run similar tests, but we are a mostly MSFT shop so I went with what we have access to, locally.

Hardware/Software configurations:
Dell Precision M4700, 32GB Ram, Win7 64-bit
[Flat file storage, Talend and SSIS installed on laptop]
Virtual Server with 32GB Ram, Server 2012 64-bit
[SQL Server 2012]
Talend Open Studio - Data Integration v 5.4.1
I'm being pretty vague, as the title states: unofficial benchmark.

I attempted to ONLY have minimal processes running on both, etc. I also ran tests without connecting to SQL Server, i.e., local flat file to flat file ETL, small files, large files, etc., trying to vary the scenarios we may have at work. 

Friday, December 6, 2013

SSIS Package to extract data from Hortonworks to SQL Server

I am in the midst of comparing a few different architectural scenarios, one if which led me to test out the functionality of extracting data from Hortonworks (running on a local VirutalBox) using an SSIS Package and the Hive ODBC Driver 1.2. 

 To get you up to speed on high level architecture options I am considering, here is an overview:

The goal is to be able to ingest nearly any data source, join it with internal metadata, aggregate and expose it to our users via our application (web based) and/or export slices back into any format customers may require. To determine the best ETL solution for our needs, we are comparing talend and SQL Server SSIS. This was after some thorough research into other solutions as well, but for our particular needs, these two options seemed viable.

I like the ability of the talend Data Integration tool that comes with connectors to over 400 different data types, but to use it in an enterprise setting, with shared source control, we'd need to license it correctly which becomes costly as it's on a per-user subscription basis. The other alternative is to utilize tools that we are already paying for with our SQL Server licensing, Integration Services which I've been using since SQL Server 2005 (and before 2005 it was DTS).

If you haven't already played with Hortonworks (which can be done on Windows if you need, easiest way is to download their Sandbox on VirtualBox), I thoroughly encourage you to do so. Their tutorials are extremely easy to follow. Technical disclosure: all of these pieces I'm testing initially are running on my laptop on Windows 7 Pro (64-bit) with 32GB RAM. I set the VirtualBox to use 8GB RAM. The goal would be to initially test the functionality and then create a working prototype to benchmark and fully test to make a final decision.

One of the tutorials resulted in installing the Hortonworks ODBC 1.2 Driver (I installed the 32-bit for this test) to pull data into Excel. I then uploaded a test file into Hortonworks, HCatalog which consisted of a 16 column, approx. 48K row .csv file for my initial dataset.

I configured SSIS with the ODBC connection previously created, and an OLE DB Connection to a local SQL Server 2012 database installation. Because I installed the 32-bit ODBC driver, I needed to update the debug configuration and set the Run64BitRuntime to False:

SSIS Data Flow task successfully ran, fairly quickly considering all the objects are locally on the same machine which is far from an optimal solution:

Next I'll be attempting to do a similar test in talend. I'll post the results of that one soon.