Sunday, July 7, 2013

How to handle flat file row delimiter and column delimiter issues in SSIS?

There are scenarios where a source file is generated with different row delimiters by different systems, which causes errors when the package reads it. Handling the row delimiter dynamically gives a generic, permanent fix: if the row delimiter changes in the future, no package needs to be modified or re-tested. However, this can only be built in at initial design time. Once a package has already been developed against one specific row delimiter (CRLF, LF, or CR), we need a different approach to provide a robust solution and avoid reworking each package. There are three approaches to manage this issue:

Approach ‘A’


Use the Bulk Insert Task instead of a Data Flow with a Flat File Source. In the Bulk Insert Task you can set the row terminator and field terminator through expression properties, and this will work correctly.
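The task wraps the T-SQL BULK INSERT statement, so it may help to see what the equivalent statement looks like (the table name and file path below are purely illustrative):

```sql
BULK INSERT dbo.StageTable            -- illustrative target table
FROM 'C:\Data\test.txt'               -- illustrative source file
WITH (
    FIELDTERMINATOR = '\t',           -- column delimiter (TAB here)
    ROWTERMINATOR   = '\n'            -- row delimiter; use '\r\n' for CRLF files
);
```

Because the terminators are plain string options, driving them from package variables via expressions is straightforward.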

Approach ‘B’


In our case we do some fairly intense transformation of the data, so BULK INSERT was not an option and a different approach was needed: first identify the delimiter, and then read the file with the matching Data Flow Task. Let's see an example:

Example: A file is generated with a variable row delimiter which is sometimes "LF" and sometimes "CR". The column delimiter of this file is always fixed, i.e. "TAB". To read this file successfully with both delimiters we need to follow the steps below:

Step 1: Create a variable "varRowDelimiter" to store the delimiter value and a variable "varSourceFile" which contains the path of the file.

Step 2: Create a Script Task which reads the source file and identifies the row delimiter used in it. The task will use both of the variables declared above: 'varSourceFile' as a ReadOnly variable and 'varRowDelimiter' as a ReadWrite variable.
Inside the editor we write the code to read the source file, identify the delimiter, and store its value in the variable. One thing to consider while writing this code is how large your source files are going to be. Reading a file consumes memory, so huge files will affect performance and may result in errors such as Out of Memory. I have often seen developers simply call "ReadToEnd" even when the purpose is not to read the complete data. ReadToEnd reads the whole file and holds all of it in memory, resulting in high memory usage and slower performance; it should be used only when you genuinely need all of the input up to the end of the stream.
Here, I will use a simple read in a loop: read a few characters at a time and exit the loop as soon as the delimiter is found. You can also read the file in fixed-size blocks using the ReadBlock method and exit the loop once you have the delimiter value.
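As a standalone sketch of that logic (written in Python for brevity; the Script Task would implement the same loop in VB.NET with StreamReader), the detection might look like this:

```python
def detect_row_delimiter(path, block_size=1024):
    """Return "CRLF", "LF", or "CR" for the first row delimiter in the
    file, or None if no delimiter is found.

    The file is read in small blocks and the loop exits at the first
    delimiter, so even a huge file is never loaded fully into memory.
    """
    prev = ""
    # newline="" stops Python from translating \r and \r\n into \n
    with open(path, "r", newline="") as f:
        while True:
            block = f.read(block_size)
            if not block:
                # end of file: a trailing bare \r still counts as CR
                return "CR" if prev == "\r" else None
            for ch in block:
                if prev == "\r":
                    # a \r followed by \n is one CRLF delimiter
                    return "CRLF" if ch == "\n" else "CR"
                if ch == "\n":
                    return "LF"
                prev = ch
```

The returned string would be written into 'varRowDelimiter' so the precedence constraints can route to the right Data Flow Task.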

Step 3: Now create two Flat File connection managers – one with the "LF" row delimiter and one with the "CR" row delimiter – and use them in the respective Data Flow Tasks.

Step 4: Create two precedence constraints from the Script Task to the Data Flow Tasks, each with the respective expression.
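Assuming the script stores the literal strings "LF" and "CR" in the variable (an assumption; it could equally store the delimiter characters themselves), the two constraint expressions would look like:

```
To the "LF" Data Flow Task:  @[User::varRowDelimiter] == "LF"
To the "CR" Data Flow Task:  @[User::varRowDelimiter] == "CR"
```

Set each constraint's evaluation operation to "Expression" so only the matching Data Flow Task runs.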

Now run the package against source files with each of the delimiters and use the results as required.
The example above can be modified to check for other delimiters as well, or to cater for other scenarios that require delimiter identification at run time.

Approach ‘C’


This is an agile approach and very useful when we have already delivered a flat-file-loading solution built around one delimiter, yet can still receive flat files with different row delimiters.
We create a new package that executes before the set of packages that load the flat file into the destination database. This package identifies the row delimiter of the flat file and replaces it with the delimiter the packages use by default. For example, if the packages were developed for {CR}{LF} but the files received contain {LF}, we replace {LF} with {CR}{LF}.

Step 1: Create a variable "varReplaceRowDelimiter" holding the replacement delimiter, a variable "varFindRowDelimiter" holding the delimiter to look for, and a variable "varSourceFile" which contains the path of the file.

Step 2: Create a Script Task which reads the source file, identifies the row delimiter, and replaces it. The task will use all of the above variables – 'varSourceFile', 'varReplaceRowDelimiter', and 'varFindRowDelimiter' – as ReadOnly variables.

Inside the editor we write the code to read the source file, find the existing delimiter, and replace it with the new one. See the code below:


    'This method is called when this script task executes in the control flow.
    'Before returning from this method, set the value of Dts.TaskResult to indicate success or failure.
    'To open Help, press F1.

    Public Sub Main()

        ' Pull the values configured on the task (all three as ReadOnly variables)
        Dim sourceFile As String = Dts.Variables("User::varSourceFile").Value.ToString()
        Dim findDelimiter As String = Dts.Variables("User::varFindRowDelimiter").Value.ToString()
        Dim replaceDelimiter As String = Dts.Variables("User::varReplaceRowDelimiter").Value.ToString()

        ' Read the file, swap the row delimiter, and write the file back.
        ' Note: this assumes the file does not already contain the
        ' replacement delimiter (e.g. replacing {LF} with {CR}{LF} in a
        ' file that already has {CR}{LF} would produce {CR}{CR}{LF}).
        Dim data As String
        Using reader As New System.IO.StreamReader(sourceFile)
            data = reader.ReadToEnd()
        End Using

        data = data.Replace(findDelimiter, replaceDelimiter)

        Using writer As New System.IO.StreamWriter(sourceFile, False)
            writer.Write(data)
        End Using

        Dts.TaskResult = ScriptResults.Success
    End Sub


As before, one thing to consider while writing this code is how large your source files are going to be: reading a whole file consumes memory, and huge files will hurt performance and may cause errors such as Out of Memory. For large files, read and replace in buffered chunks rather than with ReadToEnd, and manage the buffer size through the package configuration file.
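As a standalone sketch of that buffered approach (in Python for brevity; the Script Task would implement the same loop in VB.NET, and the same caveat applies about the file not already containing the replacement delimiter):

```python
def replace_row_delimiter(src_path, dst_path, find=b"\n", replace=b"\r\n",
                          buffer_size=64 * 1024):
    """Stream-replace a row delimiter without loading the whole file.

    buffer_size is the knob you would expose through the package
    configuration file. Trailing bytes that could be the start of a
    delimiter are carried into the next buffer, so a multi-byte
    delimiter split across a buffer boundary is still matched.
    """
    carry = b""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(buffer_size)
            if not chunk:
                # flush whatever partial delimiter was held back
                dst.write(carry.replace(find, replace))
                break
            data = carry + chunk
            # hold back the longest proper prefix of the delimiter
            # that ends this buffer (it may complete in the next one)
            keep = 0
            for k in range(len(find) - 1, 0, -1):
                if data.endswith(find[:k]):
                    keep = k
                    break
            carry = data[-keep:] if keep else b""
            if keep:
                data = data[:-keep]
            dst.write(data.replace(find, replace))
```

Writing to a separate destination path and renaming afterwards is safer than rewriting the source file in place, since a failure mid-write cannot then corrupt the original.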

2 comments:

  1. Hi Krishna,
    Thanks for your information.
    I have a scenario: my source folder contains some 10-15 flat files. Some have "tab" as the column delimiter and some have "comma". How can the column delimiter be handled in this scenario?

    Please help me,
    Thanks in advance

  2. Hi Krishna,

    I am not very comfortable writing code to identify delimiters of a flat file inside a Script Task. Could you please share some inputs on this?

    Need your help!
    Thanks in advance.