This is a followup to my first question "Porting “SQL” export to T-SQL".
I am working with a 3rd party program that I have no control over and I can not change. This program will export it's internal database in to a set of .sql
each one with a format of:
INSERT INTO [ExampleDB] ( [IntField] , [VarcharField], [BinaryField])
VALUES
(1 , 'Some Text' , 0x123456),
(2 , 'B' , NULL),
--(SNIP, it does this for 1000 records)
(999, 'E' , null);
(1000 , 'F' , null);
INSERT INTO [ExampleDB] ( [IntField] , [VarcharField] , BinaryField)
VALUES
(1001 , 'asdg', null),
(1002 , 'asdf' , 0xdeadbeef),
(1003 , 'dfghdfhg' , null),
(1004 , 'sfdhsdhdshd' , null),
--(SNIP 1000 more lines)
This pattern continues till the .sql
file has reached a file size set during the export, the export files are grouped by EXPORT_PATH\%Table_Name%\Export#.sql
Where the # is a counter starting at 1.
Currently I have about 1.3GB data and I have it exporting in 1MB chunks (1407 files across 26 tables, All but 5 tables only have one file, the largest table has 207 files).
Right now I just have a simple C# program that reads each file in to ram then calls ExecuteNonQuery. The issue is I am averaging 60 sec/file which means it will take about 23 hrs for it to do the entire export.
I assume if I some how could format the files to be loaded with a BULK INSERT instead of a INSERT INTO it could go much faster. Is there any easy way to do this or do I have to write some kind of Find & Replace and keep my fingers crossed that it does not fail on some corner case and blow up my data.
Any other suggestions on how to speed up the insert into would also be appreciated.
UPDATE:
I ended up going with the parse and do a SqlBulkCopy method. It went from 1 file/min. to 1 file/sec.
INSERT INTO
per 1000 rows, as if you attempt to insert more than that you will get a error The number of row value expressions in the INSERT statement exceeds the maximum allowed number of 1000 row values.
. My distilled question is Is there any easy way to convert to CSV or do I have to write some kind of Find & Replace and keep my fingers crossed that it does not fail on some corner case and blow up my data. - Scott Chamberlain 2012-04-03 22:31
'
, and terminated with a '
not followed by another '
). It should take about 10 minutes to write - NoName 2012-04-03 22:55
Well, here is my "solution" for helping convert the data into a DataTable or otherwise (run it in LINQPad):
var i = "(null, 1 , 'Some''\n Text' , 0x123.456)";
var pat = @",?\s*(?:(?<n>null)|(?<w>[\w.]+)|'(?<s>.*)'(?!'))";
Regex.Matches(i, pat,
RegexOptions.IgnoreCase | RegexOptions.Singleline).Dump();
The match should be run once per value group (e.g. (a,b,etc)
). Parsing of the results (e.g. conversion) is left to the caller and I have not tested it [much]. I would recommend creating the correctly-typed DataTable first -- although it may be possible to pass everything "as a string" to the database? -- and then use the information in the columns to help with the extraction process (possibly using type converters). For the captures: n
is null, w
is word (e.g. number), s
is string.
Happy coding.
Apparently your data is always wrapped in parentheses and starts with a left parenthesis. You might want to use this rule to split
(RemoveEmptyEntries
) each of those lines and load it into a DataTable. Then you can use SqlBulkCopy
to copy all at once into the database.
This approach would not necessarily be fail-safe, but it would be certainly faster.
Edit: Here's the way how you could get the schema for every table:
private static DataTable extractSchemaTable(IEnumerable<String> lines)
{
DataTable schema = null;
var insertLine = lines.SkipWhile(l => !l.StartsWith("INSERT INTO [")).Take(1).First();
var startIndex = insertLine.IndexOf("INSERT INTO [") + "INSERT INTO [".Length;
var endIndex = insertLine.IndexOf("]", startIndex);
var tableName = insertLine.Substring(startIndex, endIndex - startIndex);
using (var con = new SqlConnection("CONNECTION"))
{
using (var schemaCommand = new SqlCommand("SELECT * FROM " tableName, con))
{
con.Open();
using (var reader = schemaCommand.ExecuteReader(CommandBehavior.SchemaOnly))
{
schema = reader.GetSchemaTable();
}
}
}
return schema;
}
Then you simply need to iterate each line in the file, check if it starts with (
and split that line by Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
. Then you could add the resulting array into the created schema-table.
Something like this:
var allLines = System.IO.File.ReadAllLines(path);
DataTable result = extractSchemaTable(allLines);
for (int i = 0; i < allLines.Length; i++)
{
String line = allLines[i];
if (line.StartsWith("("))
{
String data = line.Substring(1, line.Length - (line.Length - line.LastIndexOf(")")) - 1);
var fields = data.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
// you might need to parse it to correct DataColumn.DataType
result.Rows.Add(fields);
}
}
INSERT INTO
per file but... that is, make sure the issue is caused by not using TDS first. It might be easiest to take data and turn it into CSV first as most tools (including bulk data/merge) understand CSV. Also ensure the chosen cluster is not silly and thrashing IO on inserts - NoName 2012-04-03 22:26