Processing JSON in MS SQL 2014 / 2012

One of the new additions to MS SQL 2016 is support for JSON data. It’s pretty much similar to what we have in SQL 2014 for XML processing – similair functions and processing features.

But what if we need to process JSON data but only SQL 2014 server is available? Well, for simple reading of properties we could just use some string manipulation – it will be fast and more or less readable:

SELECT *
FROM tb1
JOIN tb2 on tb2.bvin = 
    SUBSTRING(
        tb1.json
        ,CHARINDEX('"bvin":"', tb1.json) + LEN('"bvin":"')
        ,CHARINDEX('"', tb1.json, CHARINDEX('"bvin":"', tb1.json) + LEN('"bvin":"')) - CHARINDEX('"bvin":"', tb1.json) - LEN('"bvin":"')
    )

-- from: https://stackoverflow.com/questions/23723473/query-json-inside-sql-server-2012-column

However, when having to extract several records to create rows from them, and / or when structure of fields is complex usage of OPENJSON might be neccessary. This was kind of obstacle, until I’d figured that if I’m able to use OPENXML pretty much the same. If only I could convert that JSON to XML.

My first idea was to do it in API layer, hower it appeared, that there are some services that write data directly to the DB (using stored procedures). Then I tried to look for a solution to convert Json to XML in tSQL. Short googling and I’ve found this article:

It worked like a charm for simple, well formatted JSON, however during extensive testing with various JSONs I’d to my disposal I’d figured several problems:

  • Whitespaces placed after commas or brackets caused invalid parsing of the subobjects
  • Some special characters in property names (like $) or properties with numbers only as a name caused invalid XML – I had to remove these characters or add „i” before digits to make it proper XML. However, if you need to replace more bad characters in your JSON – you would have to update script even more.
  • In some property values we had embeded XML. Yeah… Or partos of XML. This caused havoc to the output XML – threfore I’d decided to just wrap such values with CDATA to properly esacep it and treat like a plain text.

Script after fixes looks like that:

CREATE FUNCTION [dbo].[fn_parse_json2xml](
    @json    nvarchar(max)
)
RETURNS xml
AS

BEGIN;
    DECLARE @output varchar(max), @key varchar(max), @value varchar(max),
        @recursion_counter int, @offset int, @nested bit, @array bit,
        @tab char(1)=CHAR(9), @cr char(1)=CHAR(13), @lf char(1)=CHAR(10),
		@tjson varchar(max), @hax INT;

    --- Clean up the JSON syntax by removing line breaks and tabs and
    --- trimming the results of leading and trailing spaces:
    SET @json=LTRIM(RTRIM(
        REPLACE(REPLACE(REPLACE(@json, @cr, ''), @lf, ''), @tab, '')));

    --- Sanity check: If this is not valid JSON syntax, exit here.
    IF (LEFT(@json, 1)!='{' OR RIGHT(@json, 1)!='}')
        RETURN '';

    --- Because the first and last characters will, by definition, be
    --- curly brackets, we can remove them here, and trim the result.
    SET @json=LTRIM(RTRIM(SUBSTRING(@json, 2, LEN(@json)-2)));

    SELECT @output='';
    WHILE (@json!='') BEGIN;

        --- Look for the first key which should start with a quote.
        IF (LEFT(@json, 1)!='"')
            RETURN 'Expected quote (start of key name). Found "'+
                LEFT(@json, 1)+'"';

        --- .. and end with the next quote (that isn't escaped with
        --- and backslash).
        SET @key=SUBSTRING(@json, 2,
            PATINDEX('%[^\\]"%', SUBSTRING(@json, 2, LEN(@json))+' "'));
		
        --- Truncate @json with the length of the key.
        SET @json=LTRIM(SUBSTRING(@json, LEN(@key)+3, LEN(@json)));

	-- fix for improperly named nodes with numbers only
        select @key = (case when try_convert(int, @key) is not null then 'i' + REPLACE(@key, '$', '') else REPLACE(@key, '$', '') end)

        --- The next character should be a colon.
        IF (LEFT(@json, 1)!=':')
            RETURN 'Expected ":" after key name, found "'+
                LEFT(@json, 1)+'"!';

        --- Truncate @json to skip past the colon:
        SET @json=LTRIM(SUBSTRING(@json, 2, LEN(@json)));

        --- If the next character is an angle bracket, this is an array.
        IF (LEFT(@json, 1)='[')
            SELECT @array=1, @json=LTRIM(SUBSTRING(@json, 2, LEN(@json)));

        IF (@array IS NULL) SET @array=0;
        WHILE (@array IS NOT NULL) BEGIN;

            SELECT @value=NULL, @nested=0;
            --- The first character of the remainder of @json indicates
            --- what type of value this is.

            --- Set @value, depending on what type of value we're looking at:
            ---
            --- 1. A new JSON object:
            ---    To be sent recursively back into the parser:
            IF (@value IS NULL AND LEFT(@json, 1)='{') BEGIN;
                SELECT @recursion_counter=1, @offset=1;
                WHILE (@recursion_counter!=0 AND @offset<LEN(@json)) BEGIN;
                    SET @offset=@offset+
                        PATINDEX('%[{}]%', SUBSTRING(@json, @offset+1,
                            LEN(@json)));
                    SET @recursion_counter=@recursion_counter+
                        (CASE SUBSTRING(@json, @offset, 1)
                            WHEN '{' THEN 1
                            WHEN '}' THEN -1 END);
                END;

                SET @value=CAST(
                    dbo.fn_parse_json2xml(LEFT(@json, @offset))
                        AS varchar(max));
                SET @json=SUBSTRING(@json, @offset+1, LEN(@json));
                SET @nested=1;
            END

            --- 2a. Blank text (quoted)
            IF (@value IS NULL AND LEFT(@json, 2)='""')
                SELECT @value='', @json=LTRIM(SUBSTRING(@json, 3,
                    LEN(@json)));

            --- 2b. Other text (quoted, but not blank)
            IF (@value IS NULL AND LEFT(@json, 1)='"') BEGIN;
                SET @value=SUBSTRING(@json, 2,
                    PATINDEX('%[^\\]"%',
                        SUBSTRING(@json, 2, LEN(@json))+' "'));
                SET @json=LTRIM(
                    SUBSTRING(@json, LEN(@value)+3, LEN(@json)));
            END;

          --- 3. Blank (not quoted)
            IF (@value IS NULL AND LEFT(@json, 1)=',')
                SET @value='';

            --- 4. Or unescaped numbers or text.
            IF (@value IS NULL) BEGIN;
                SET @value=LEFT(@json,
                    PATINDEX('%[,}]%', REPLACE(@json, ']', '}')+'}')-1);
                SET @json=SUBSTRING(@json, LEN(@value)+1, LEN(@json));
            END;

            -- workaround for invalid XML/HTML nested in JSON
            IF (@nested=0 AND @value LIKE '%<%')
				SET @value = '<![CDATA[' + @value + ']]>'

            --- Append @key and @value to @output:
            SET @output=@output+@lf+@cr+
                REPLICATE(@tab, @@NESTLEVEL-1)+
                '<'+@key+'>'+
                    ISNULL(REPLACE(
                        REPLACE(@value, '\"', '"'), '\\', '\'), '')+
                    (CASE WHEN @nested=1
                        THEN @lf+@cr+REPLICATE(@tab, @@NESTLEVEL-1)
                        ELSE ''
                    END)+
                '</'+@key+'>';


            --- And again, error checks:
	    SET @tjson = LTRIM(@json)			
	    ---
            --- 1. If these are multiple values, the next character
            ---    should be a comma:
            IF (@array=0 AND @tjson!='' AND LEFT(@tjson, 1)!=',')
                RETURN @output+'Expected "," after value, found "'+
                    LEFT(@json, 1)+'"!';

            --- 2. .. or, if this is an array, the next character
            --- should be a comma or a closing angle bracket:
            IF (@array>=1 AND LEFT(@tjson, 1) NOT IN (',', ']'))
                RETURN @output+'In array, expected "]" or "," after '+
                    'value, found "'+LEFT(@json, 1)+'"!';

            --- If this is where the array is closed (i.e. if it's a
            --- closing angle bracket)..
            IF (@array>=1 AND LEFT(@tjson, 1)=']') BEGIN;
                SET @array=NULL;
                SET @json=LTRIM(SUBSTRING(@tjson, 2, LEN(@tjson)));

                --- After a closed array, there should be a comma:
                IF (LEFT(@json, 1) NOT IN ('', ',')) BEGIN
                    RETURN 'Closed array, expected ","!';
                END;
            END;

            SET @json=LTRIM(SUBSTRING(@json, 2, LEN(@json)+1));
            IF (@array=0) SET @array=NULL;

        END;
    END;

    --- Return the output:
	RETURN CAST(@output AS xml);
END;

It works like a charm with simple XPath snippets like:

DECLARE @xml  XML
DECLARE @documentsCount INT

SELECT @xml = dbo.fn_parse_json2xml(@json))

SELECT @documentsCount = @xml.value('count(/root/data/documents)', 'int')

However, when we try to use XPATH to extract multple values, or whole objects like blow – it occurs to be VERY SLOW. I mean XML processing itself, not conversion.

SELECT 
		x.Rec.value('clientId[1]', 'nvarchar(100)') as ClientId,
		x.Rec.value('documentId[1]', 'nvarchar(100)') as DocumentId,
		x.Rec.value('documentType[1]', 'nvarchar(100)') as DocumentType,		
		x.Rec.value('documentFlags[1]/meta1[1]', 'nvarchar(100)') as Meta1,
		x.Rec.value('documentFlags[1]/meta3[1]', 'nvarchar(100)') as Meta3,
		x.Rec.value('documentFlags[1]/dateUpdate[1]', 'datetime') as UpdateDate
	INTO #documents
	FROM @xml.nodes('/data/documents') as x(Rec)

And I mean it – simply reading one row of those few fields into this temporary table consumed over 95% time of the whole procedure.

I had to research for something faster, and it appears that there is such thing – it’s OPENXML. However it has some issues:

  1. It’s sensitive to XML being well formed, and JSON converter tended to produce multiple roots when multiple objects were present on first level of JSON.
  2. It would not work with temporary tables – I had to use temporary table variables for some reasons.

I solved first problem by just manually adding root element to the returned XML:

SELECT @xml = CONVERT(XML, '<root>' + CONVERT(nvarchar(max), dbo.fn_parse_json2xml(@Json)) + '</root>')

Then I had to load and parse whole document – this still requires some time, and you must explicitely release handler after finished using it (need to simulate FINALLY block) to avoid memory waste (or even leaking?).

DECLARE @idoc INT
EXEC sys.sp_xml_preparedocument @idoc OUTPUT, @xml;

-- ......................

EXEC sys.sp_xml_removedocument @idoc;

And then I could read my data:

INSERT INTO @documents (ClientId, DocumentType, DocumentId, Meta1, Meta3, UpdateDate) 
SELECT *
FROM OPENXML(@idoc, '/root/data/documents') 
WITH (
	ClientId  nvarchar(100) 'clientId[1]', 
	DocumentType   nvarchar(100) 'documentType[1]',
	DocumentId  nvarchar(100) 'documentId[1]', 
	DocumentId  nvarchar(100) 'inputSource[1]',
	Meta1  nvarchar(100)  'documentFlags[1]/meta1[1]', 
	Meta3  nvarchar(100)  'documentFlags[1]/meta3[1]',
	UpdateDate datetime 'documentFlags[1]/dateUpdate[1]'
)

Note – I had to load data to the temporary table because there were several posible consecutive operations of INSERT/UPDATE on subsets of records, but you of course could directly insert rows from this QUERY directly into some table or use as part of cursor.