Project Description
Import the RDF data from Open Directory Project (ODP) (http://rdf.dmoz.org) straight
into a Microsoft SQL Server 2005 database.

DMOZ Data Importer License
----------------------
DMOZ Data Importer is released under the GNU LESSER GENERAL PUBLIC LICENSE. You can always find the latest GNU LGPL license at:
http://www.gnu.org/licenses/lgpl.html


System Requirements for runtime only
--------------------------------
SQL Server Management Studio/SQL Server Management Studio Express
Microsoft SQL Server2005+ and about 4 Gb of diskspace for the database and the temporary import file
Microsoft .Net 2.0
ODP RDF dump files

System Requirements for developers (source code)
--------------------------------------------
Visual Studio 2005
SQL Server Management Studio/SQL Server Management Studio Express
Microsoft SQL Server2005+ and about 4 Gb of diskspace for the database and the temporary import file
Microsoft .Net 2.0
ODP RDF dump files

-------------------------
Installing & Getting Started
-------------------------
1. First download the ODP RDF dump files at http://rdf.dmoz.org/.
There are 2 files compressed with GZip :
Structure.rdf.u8.gz (68 Mb compressed, 664 Mb uncompressed), and content.rdf.u8.gz (317 Mb compressed).

Right now, only the structure is imported, so no need to download the content file.
Uncompress the .gz file.

2. In DMOZ Data Importer directory, edit the file configuration file "DMOZ_Importer.exe.config" using a text editor.

Locate the <DMOZ_Importer.Properties.Settings> section at the end of the file.
Modify the settings "DMOZstructurepath" and "OutputPath" which respectively contains the full path to the structure file,
and the path to the folder that will contain the intermediate import file (Import.txt).

Locate the line "<add name="netTiersConnectionString" ..." and modify the connection string with your SQL Server instance name
as the data source.

Save the file and close the editor.

3. Open the SQL Server Management Studio and open the file "Script DMOZ Database.sql". Execute it to create the database "dmoz".
Then open the file "Data for DMOZ Database.sql" and execute it.

4. Now, you're ready to go! Just launch the executable "DMOZ_Importer.exe" and wait a little.

On October 6th, 2007, there are 738165 topics (table "Category"), and 2890983 resources of different types (narrow, editor,
altlang, newsgroup, related, symbolic)(the resources types are in the table "resourcetype").

From altlang elements, I extracted too 2954 referenced languages (table "Language") but there are many languages which have
their names described in different languages. For example, "swedish" appears also as "suédois" in french, "ruotsi" in finnish,
or "svensk" in swedish.

All of these information takes about 40 minutes to be processed, optimization process included, on a dual-core laptop
having 2Gb of memory and SQL Server 2005 Express SP2.


----------------------
Description of the process
----------------------

First of all, the program parses the RDF file to create a more adequat XML import file dubbed as "Import.txt".
The challenge here was about being able to process this big file as quick as possible without having to load it
entirely into memory.

So I used the .NET class XmlTextReader which acts like a forward-only cursor on the file.

Why do I create an XML import file and not a simple plain-text file ? Well, I first used a plain text file to bulk load it,
and found some bugs from SQL Server 2005 bulk functions that deals with unicode encoded files.

As I said in the previous chapter, there are a lot of different languages in these data, and so a lot of different characters
sets such as russian, chinese, japanese, arabian and so on.

It appears that the bulk load functionalities of SQL Server 2005 are older than the unicode standard, and they seem to need
a refresh update.
There is an article in MS Knowledge base about this problem : http://support.microsoft.com/?scid=kb%3Ben-us%3B942660&x=8&y=12
A patch is scheduled but not yet available.

So I decided to perform the bulk import using the .NET 2.0 new class : SqlBulkCopy (http://support.microsoft.com/?scid=kb%3Ben-us%3B942660&x=8&y=12).
This class needs to deal with an IDataReader interface as input. Usually, it's described to be used with a SqlDataReader instance.
So I implemented an XmlTextReader inherited class which implements the IDataReader interface. This allows me to have an efficient bulk import
with the advantage of the XmlTextReader to read my import file.

Before importing the data, I execute the stored procedure "importtruncateMainTables" which will remove data from all the tables,
except the table "resourcetype" which is a static referential.
This procedure also switches the database recovery model to "BULK_LOGGED" in order to optimize the bulk import logs.

Then the data are imported to the table "Import", using the SqlBulkCopy class used with the XmlImportDataReader class.

Once the data are imported, the stored procedure "ImportPostProcess" is executed.
This procedure will put the data from the table "Import" to the tables "Language", "LocalizedName" that will keep all the distinct labels associated
to their language ("english" by default if not specified), "Category" for topics, "Resource" for the other resources,
"resourcepercategory" that is the link between "category" and "resource".
Then it calculates the parent/son relation between topics and fills the "parentid" column in the table "Category".

Eventually, it optimizes the indexes, rebuild them if necessary, updates the statistics, shrinks the files after having emptied the "Import" table
and sets the recovery model to "full".
So you can start working with a "ready to go" optimized unicode database!



More Information
----------------
There is a pretty good data access model object generated by NetTiers templates (http://www.nettiers.com) and CodeSmith.
You can use it by referencing the assemblies dmoz.dll, dmoz.Data.dll and dmoz.Data.Sqlclient.dll. Then use the namespaces dmoz and dmoz.Data.
For more information about using the NetTiers code, please consult the sections "Getting Started", "Database and SQL", "Entity Layer",
"Data Layer" in the NetTiers documentation at http://docs.nettiers.com .

You may also have a look at the database schema in the file "DMOZ database diagram.jpg".

Last edited Oct 15, 2007 at 9:10 AM by bodzebod, version 2