Thursday, December 19, 2013

Unofficial benchmark results comparing SSIS to Talend

As a follow up to my previous post, here is the output of my various tests comparing the extraction only processing rates between SSIS and talend.

Flat to Flat
I created a flat file (.csv) with 10M rows with the following fields (sizes to indicate approx size of expected data)
Id - Bigint
OtherId - Int
SomeOtherNumber - Int
RandomName - VarChar(250)
AnotherNumber - Int
BitField - true or false
EmailAddress - VarChar(250)

Three runs to verify consistency were done, results are below:

SSIS: Average duration 44.5 seconds
Talend: Average duration 1 minute 9 seconds

Large Flat File to SQL Server
Same flat file used above, but loaded into SQL Server 2012 located on my laptop. So Flat File -> SQL Server 2012, with no transformations just straight data columns to data fields. Again, three runs to verify consistency, results are as follows:

SSIS: Average duration 59.6 seconds
Talend: Average duration 6 minutes 56 seconds
(nope, that's not a typo, that's nearly 7 minutes)

Many Files to SQL Server
Next I attempted ingesting from a folder containing 3 subfolders containing 79 files, which contained a total of 2,187,842 records. Three runs, results are below:

SSIS: Average duration 22.8 seconds
Talend: Average duration 42.2 seconds

This is a very unofficial benchmark testing in trying to determine if we should start focusing on Talend as an ETL tool over SSIS. As Talend boasts over 450+ connectors out of the box and is platform independent, it was worth a look. I did not expect for SSIS to outperform in all scenarios. I expected better integration and speed with MSFT products (Flat File to SQL Server), which is what SSIS was designed for and which performance meets my expectations. 

This is not an indication that Talend is or may be a poor solution, but in our environment, SSIS is the clear winner so far.
To be completely fair, and if I'd had time, I would have installed Talend on linux and run similar tests, but we are a mostly MSFT shop so I went with what we have access to, locally.

Hardware/Software configurations:
Dell Precision M4700, 32GB Ram, Win7 64-bit
[Flat file storage, Talend and SSIS installed on laptop]
Virtual Server with 32GB Ram, Server 2012 64-bit
[SQL Server 2012]
Talend Open Studio - Data Integration v 5.4.1
I'm being pretty vague, as the title states: unofficial benchmark.

I attempted to ONLY have minimal processes running on both, etc. I also ran tests without connecting to SQL Server, i.e., local flat file to flat file ETL, small files, large files, etc., trying to vary the scenarios we may have at work. 

Friday, December 6, 2013

SSIS Package to extract data from Hortonworks to SQL Server

I am in the midst of comparing a few different architectural scenarios, one if which led me to test out the functionality of extracting data from Hortonworks (running on a local VirutalBox) using an SSIS Package and the Hive ODBC Driver 1.2. 

 To get you up to speed on high level architecture options I am considering, here is an overview:

The goal is to be able to ingest nearly any data source, join it with internal metadata, aggregate and expose it to our users via our application (web based) and/or export slices back into any format customers may require. To determine the best ETL solution for our needs, we are comparing talend and SQL Server SSIS. This was after some thorough research into other solutions as well, but for our particular needs, these two options seemed viable.

I like the ability of the talend Data Integration tool that comes with connectors to over 400 different data types, but to use it in an enterprise setting, with shared source control, we'd need to license it correctly which becomes costly as it's on a per-user subscription basis. The other alternative is to utilize tools that we are already paying for with our SQL Server licensing, Integration Services which I've been using since SQL Server 2005 (and before 2005 it was DTS).

If you haven't already played with Hortonworks (which can be done on Windows if you need, easiest way is to download their Sandbox on VirtualBox), I thoroughly encourage you to do so. Their tutorials are extremely easy to follow. Technical disclosure: all of these pieces I'm testing initially are running on my laptop on Windows 7 Pro (64-bit) with 32GB RAM. I set the VirtualBox to use 8GB RAM. The goal would be to initially test the functionality and then create a working prototype to benchmark and fully test to make a final decision.

One of the tutorials resulted in installing the Hortonworks ODBC 1.2 Driver (I installed the 32-bit for this test) to pull data into Excel. I then uploaded a test file into Hortonworks, HCatalog which consisted of a 16 column, approx. 48K row .csv file for my initial dataset.

I configured SSIS with the ODBC connection previously created, and an OLE DB Connection to a local SQL Server 2012 database installation. Because I installed the 32-bit ODBC driver, I needed to update the debug configuration and set the Run64BitRuntime to False:

SSIS Data Flow task successfully ran, fairly quickly considering all the objects are locally on the same machine which is far from an optimal solution:

Next I'll be attempting to do a similar test in talend. I'll post the results of that one soon. 

Sunday, July 7, 2013

Columnstore Indexes: the best thing since cake (and I love cake!)

Most times with software upgrades comes a few "oh, that's great I think I may be able to utilize that" going through my mind. When I read and dove deeper into SQL Server 2012's new feature, Columnstore Indexes, my heart actually started to race. It takes a true love of data warehouses to feel this way, but it's as if SQL Server answered some of my questions and frustrations over the past few years. These frustrations grew around dealing with TBs worth of data and how normal row indexes (normal custered and non-clustered indexes are actual row indexes) just were not optimized for VLDB optimized querying. 

From msdn online, here is a summary of the difference: 

"An xVelocity memory optimized columnstore index, groups and stores data for each column and then joins all the columns to complete the whole index. This differs from traditional indexes which group and store data for each row and then join all the rows to complete the whole index."

The biggest difference is that data is grouped and stored one column at a time. The benefits are: because only the columns needed are read, this results in less disk reads, better compression, improved buffer pool usage (reducing I/O), and utilizes batch processing which reduces CPU.

Wah? Yes! Wah? I'll say it again, YES! And yes, it works with table partitions. Of course, there are logical limitations to it, like the most important being the table must be Read Only. But, for normal data warehouse architecture, data is not transactional and is loaded at certain times, most normally once or maybe a few times a day. So dropping and recreating indexes are something most data warehouse engineers are familiar with, in detail. Or you can always implement partitions and then switch, as well as other options to get around the read only requirement. There are other limitations, which I encourage you to explore, as well as some gotchas.

A fantastic, and thorough walk through of this new, exciting, feature in SQL Server 2012 can be found here: What's new in SQL Server 2012: Using the new Columnstore Index, it's a YouTube video by Kevin S. Goff.

My hat's off to the SQL Server development team, warms my heart that you addressed the growing needs of us data warehouse minions.

Friday, June 21, 2013

SharePoint 2013 - allow others to edit a discussion (not just reply)

One of many SharePoint tips...

Some sites may want more collaboration, the ability to edit discussion topics, etc. I would highly recommend turning on version control for the list you modify in order to see who changed what. Here are the steps:

You need to be in the Site Owner group to make these changes. If you don't have the appropriate permissions, contact your SharePoint administrator to  assist.

Go to Site Contents, click on the discussion list name and in the ribbon click on List, then List Settings.

Once in List Settings, click on Versioning Settings and make sure Create a version each time you edit is set to Yes. I typically do not require content approval for submitted items, but this is an option. It would not allow anything to show until one member of the Approval Members group approves the content. 

Once you click "OK" on the Versioning Settings screen, go into Advanced Settings.

Under the Create and Edit access, select the "Create and edit all items" instead of the "...created by the user" like so, then be sure to click the OK button.

Next, we're going to alter the Site Members contribute rights. Click on the Gear in the upper right hand corner and select Site Settings, then under Users and Permissions, select Site permissions. Check the box next to the [Site Name] Members name, and the "Edit User Permissions" option on the ribbon will become available.

Click the Edit User Permissions, the default rights should be set to only Contribute. Check the boxes next to Edit and Approve like so:

Then click the "OK" button. Now have a site member test the ability to "Edit" a discussion but finding a discussion topic, clicking the ellipses and seeing the "Edit" ability like so:

Hope this helps. 

Tuesday, June 18, 2013

SharePoint 2013 - How to make the discussion edit window wider without altering the Master Pages.

One of many SharePoint tips...

When creating or editing a discussion, the text field for the body of the discussion is way too narrow. Can it be changed?

There are a few ways, the easiest is to implement JQuery in a web part on the page, that way no-one is mucking with the templates in SharePoint - which can be overwritten in patches, etc.

This following way would need to be done at each site or sub-site as the editform.aspx is unique (copied from the site collection level once a site is created). Which means implementing this at one site will not affect this form on any other site.

Note: You need design rights or site ownership rights in order to perform this modification.

Go to the site you would like to implement this on.
Go to a discussion, click the Ellipses (...) and then the Edit choice and you should see a layout similar to this where the "Body" textarea is fairly narrow.

You'll want to edit the page this form displays in, which is why I've highlighted the gear in the upper right hand corner, click on it and then select "Edit Page":

Now you will want to click the "Add a Web Part", we're going to be adding a Content Editor

After you click "Add a Web Part", chose Media and Content, then Content Editor and click "Add":

Now you will see your page again with your newly added Content Editor web part. Click the "Click here to add new content":

It will then place your cursor in the Content Editor, but what we want to do is edit the HTML, so select the Edit Source button as shown here:

In the HTML Source editing window, type the following code:

Your window should now look like this:

Click the "OK" button at the bottom of the HTML Source window. Then you are back on the edit page, click the "Page" on the ribbon and select "Stop Editing" as shown here:

It will redirect you to the discussion list, so go back into a discussion, click the Ellipses and Edit, and now your text areas should be wider like this:

Happy Customizing.