Comparing Simple Efficiencies: T-SQL UDF vs SQCLR UDF for Splitting Strings
Recap of T-SQL vs SQLCLR (pseudo) Debate
There have been numerous posts about benefits of using T-SQL vs SQLCLR, and vice versa. And we all know the mantra – T-SQL for data access, SQLCLR for anything that is computationally intense tasks.
SQL Server is a relational database, and works best with set based operations and direct data access. If you need to do straightforward INSERTs, UPDATEs, DELETEs, SELECTs, stick with T-SQL unless you want to do lots of overtime trying to figure out why you’re data access suddenly became slow.
So when do we use SQLCLR instead of T-SQL?
There are already lots of discussions – even books – about SQLCLR advantages. I will defer you to them (check out the references section), but I will provide a very brief list of scenarios when you might want to consider SQLCLR:
- Interaction outside SQL Server If you need to work with the OS, files, registry etc.
- Validation If you need to validate phone numbers, email addresses, postal codes, or any patterns
- Complex computations If you need running aggregates, complex math equations (what is the square root of x to the nth power divided by 2 * pi?), financial analytics maybe?
- Custom Data Types If you need to create your own custom business-specific data types. We also have to remember that XML and GEOGRAPHY/GEOMETRY are great additions to SQL Server, and these are technically CLR data types.
The debate between T-SQL vs SQLCLR sometimes can be taken out of context. I call it a pseudo debate, because sometimes it’s made to seem that SQLCLR is meant to replace T-SQL, when it’s not. SQLCLR is *not* evil. It’s just another tool to help you do your job. It *can* become evil though, if you misuse it.
While there are some overlaps in scenarios where you can use both, these two should really be complementary. Where T-SQL is slow or lacking, SQLCLR should at least be considered and tested. Again, consider the right tool for the right job.
In cases where some projects will clearly benefit from using SQLCLR, project teams still hesitate using it. Some of the common reasons I’ve heard that causes SQL Server DBAs and Developers to hesitate using SQLCLR are:
- It is unfamiliar ground. Let’s face it. Not every SQL Server professional will know how to program in .NET. It’s becoming increasingly popular, but we still have traditional database professionals
- It compromises security. Well, it’s yes and no. If you code it properly and if you deploy it properly, it should be as secure as your most secure database objects. However if you take the easy route, such as
Examining Simple Efficiencies
Let’s take a deep dive into a specific scenario, and let’s analyze simple performance differences between a T-SQL user defined function (UDF) vs a SQLCLR UDF. I emphasize simple differences for now, because sometimes, that’s all it takes to convince someone
This is one scenario where technically, either the T-SQL way or the SQLCLR way will work.
You have an incoming string that contains character delimited set of files. Let’s assume it’s a pipe delimited set of strings for employees, like this:
This is a legacy system, and you have no control over how the incoming data is formatted. The legacy system will also need to call a stored procedure, which in turn will parse the strings and output a table.
Of course you can argue there are other options to parse the file, such as pushing this functionality to an intermediate app etc, but for now let’s focus on two feasible options to parse the strings and output a table using SQL Server:
- Transact-SQL UDF
- SQLCLR UDF
Using T-SQL UDF – dbo.fn_Split
What we can do is create a Table Valued UDF that parses the incoming string based on a delimiter, and output a table.
IF OBJECT_ID('fn_Split') IS NOT NULL DROP FUNCTION dbo.fn_Split GO CREATE FUNCTION dbo.fn_Split ( @items NVARCHAR(max), @mainDelimiter CHAR(1) = '|' ) RETURNS @itemTable table ( [item] VARCHAR(4000) ) as BEGIN DECLARE @tempItemList NVARCHAR(max), @i int, @item NVARCHAR(max) SET @tempItemList = @items -- escape all single quotes SET @tempItemList = REPLACE(@tempItemList, ''' ''', '') -- get index where our delimiter was found SET @i = CHARINDEX(@mainDelimiter, @tempItemList) -- loop while all the characters in the list have not been traversed yet WHILE (LEN(@tempItemList) > 0) BEGIN IF @i = 0 -- if there are no delimiters, then this is the only item in our list SET @item = LTRIM(RTRIM(@tempItemList)) ELSE -- get the first word (from the left) less the delimiter character SET @item = LTRIM(RTRIM(LEFT(@tempItemList, @i - 1))) INSERT INTO @itemTable([type],[item]) VALUES('outer', LTRIM(RTRIM(@item))) IF @i = 0 SET @tempItemList = '' ELSE -- remove the word we just added to the table SET @tempItemList = LTRIM(RTRIM(RIGHT(@tempItemList, LEN(@tempItemList) - @i))) -- lather, rinse, repeat SET @i = CHARINDEX(@mainDelimiter, @tempItemList) END RETURN; END GO
Using SQLCLR UDF – dbo.fn_CLRSplit
For the SQLCLR UDF version, again there are a few variations. Normally we would use a String.Split method, or even Regex to split strings based on character delimiters (please see links below for additional samples)
In my case I will be borrowing Adam Machanic’s fast and scalable string splitting UDF. If you’re interested in his code, it’s here: Faster, More Scalable SQLCLR String Splitting.
In this article, Adam also discusses differences between different ways of parsing and splitting strings in the SQLCLR world – differences between the String.Split, Regex, and his way which traverses characters one by one. He does a great job explaining why and where one breaks down, and why his particular version is more scalable.
The tests for this experiment are really simple. Take a string that contains n items, then pass it to both the T-SQL and SQLCLR UDFs, measure CPU time and elapsed time, and compare. Ours will be a simple metric comparison – time – which is sometimes the most effective metric to use when battling to use SQLCLR with business users, sometimes even DBAs.
-- start with a string of 10 items DECLARE @str10 VARCHAR(MAX) = 'EMP1~Doe~John~F.~Supervisor|EMP2~Smith~Mary~~Manager|EMP3~Cargan~Jeff~~|EMP4~Ricks~Minnie~~Accountant|EMP8~Lowe~Bill~D.~Accountant|EMP10~Crowe~Laura~G.~Accountant|EMP5~Hall~Suzy~~Accountant|EMP5~Hall~Suzy~~Accountant|EMP5~Hall~Suzy~~Accountant|EMP5~Hall~Suzy~~Accountant|' -- this is how I am generating the massive strings to be split DECLARE @str100 VARCHAR(MAX) = REPLICATE(@str10, 10) DECLARE @str1000 VARCHAR(MAX) = REPLICATE(@str10, 100) DECLARE @str10000 VARCHAR(MAX) = REPLICATE(@str10, 1000) DECLARE @str100000 VARCHAR(MAX) = REPLICATE(@str10, 10000) SET STATISTICS TIME ON SET STATISTICS IO ON -- this I have repeated for each of the strings above -- lather, rinse, repeat SELECT * FROM dbo.fn_Split(@str100000, DEFAULT, DEFAULT) SELECT * FROM dbo.fn_CLRSplit(@str100000, DEFAULT, DEFAULT) SET STATISTICS TIME OFF SET STATISTICS IO OFF GO
They say pictures paint a thousand words, so I’ll let these graphs do the initial talking.
There is clearly a performance difference between the T-SQL way and the SQLCLR way. It’s not so noticeable on a small scale, and perhaps some of you would bear to have this rather than turning on SQLCLR. However, on a bigger scale, the difference is exponential. My T-SQL UDF seems to have noticeably slowed down at the 1,000 item string, and died at the 100,000 item string (or maybe I was just impatient and cancelled the query altogether).
Right now we’re just considering simple efficiencies – CPU and execution times. We haven’t even considered I/Os, memory clerks etc. And clearly, SQLCLR offers the more efficient way to complete the task at hand.