First off, a couple of definitions. When I refer to columns in this post, I am not referring to the actual fields in a database, but rather to the columns you would see if the database were exported to fixed-width (space delimited) ASCII text. When I mention records, I am referring to the number of rows the database contains. I bring up records because these database files are sample representations of certain geographical areas. For example, a database may contain 4,000 records of Republican voters in a certain county in a certain state. The information required in every database is a phone number and a geographical identifier (a FIPS number, the city or county name, or other criteria). Other information commonly found in these databases includes name, address, state, gender, race, party, voting history and a plethora of other details.
I work with databases daily. Most databases we handle are in dBase format and run roughly 5,000 to 10,000 records with about 400-500 columns. Luckily, these databases aren't too terrible to manage. Using Paradox as our database software, we can get what we need done quickly and without any hassle. However, every once in a while (about once or twice a month), I will receive a dBase file with hundreds of thousands of records, a few times even millions, and 800-900 columns of data. These files are HUGE. The last database I worked on held 3.4 million records with 850 columns of data. The file was 1.6GB and contained the voting history of the entire state of Tennessee.
So my question is this: what database software would you use to get the job done the fastest? There is a lot that needs to be done with the database. I first need to run counts based on the client's request. For example, on the Tennessee voter file with 3.4 million records, I needed to find out how many people with valid phone numbers voted in either the 1998 GOP primary, the 2000 Presidential preference primary or the 2002 GOP primary, broken down by county. I also needed to find out how many people who voted in any of those elections fell into certain age groups. All in all, the client wanted 7 different counts based on different criteria. Once the client decides what sample representation he needs for his project, I need to remove any records without valid phone numbers and any duplicate phone numbers, then re-index the table. After that, I need to append two new fields to the end of the table, populate one of them with a random number, then sort the table by that random number (and, of course, re-index again). Finally, I need to pull the records matching the criteria he selected at a 30:1 ratio (30 records for every quota he is looking to fill). This means copying those records to another table and setting the other newly appended field in the original table to 1 so I know which records were used. If you don't know your SQL, you will drown when trying to process this data.
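To make the count step concrete, here is roughly what the first Tennessee count looks like in SQL. This is only a sketch: the table name, the flag-style election fields and the age brackets are all made up for illustration, since the real file's field names and the client's actual age groups aren't shown here.

```sql
-- Hypothetical table/field names; the real dBase file's fields differ.
-- Count of valid-phone voters in any of the three primaries, by county.
SELECT county, COUNT(*) AS qualifying_voters
FROM tn_voters
WHERE phone IS NOT NULL AND LTRIM(RTRIM(phone)) <> ''
  AND (gop_primary_98 = 'Y'
       OR pres_pref_primary_00 = 'Y'
       OR gop_primary_02 = 'Y')
GROUP BY county
ORDER BY county;

-- Same voters broken down by age group (brackets are placeholders).
SELECT CASE WHEN age BETWEEN 18 AND 34 THEN '18-34'
            WHEN age BETWEEN 35 AND 54 THEN '35-54'
            ELSE '55+' END AS age_group,
       COUNT(*) AS qualifying_voters
FROM tn_voters
WHERE phone IS NOT NULL AND LTRIM(RTRIM(phone)) <> ''
  AND (gop_primary_98 = 'Y'
       OR pres_pref_primary_00 = 'Y'
       OR gop_primary_02 = 'Y')
GROUP BY CASE WHEN age BETWEEN 18 AND 34 THEN '18-34'
              WHEN age BETWEEN 35 AND 54 THEN '35-54'
              ELSE '55+' END
ORDER BY age_group;
```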
I ask my question again: what database software would you use to get the job done the fastest?
These databases are always tied to projects that almost always have daily deadlines. And that isn't the only project running for the day. Usually, we have 6 or 7 projects going on any given day for clients, sometimes more and sometimes fewer, and most of them have to be finished by 3pm. With the shift starting at 9am, that doesn't give us a lot of time to get these projects done. With databases that sit around 5,000 to 10,000 or even 20,000 records, it only takes 10-15 minutes from start to finish using Paradox and a simple text editor. With this voter file of 3.4 million records, however, it takes a lot longer.
The system we have set up is to run the counts the client wants using SPSS for Windows. Importing the Tennessee voter file takes about 2-3 minutes, running each selection criterion takes about 1-2 minutes, and processing the frequencies for the counts takes another 1-2 minutes. When a client is looking for 7 counts, it takes about 25-30 minutes to get all the numbers he is after. We then have another utility that removes the duplicates, appends the new fields, populates the random field and sorts by the random number. That process takes about another 10-15 minutes on the 3.4 million record Tennessee file. Lastly, we pull the sample and populate the "used" field with a 1 for each record pulled, which only takes 2-3 minutes. All in all, you are looking at about 45 minutes to an hour for a database this size using SPSS, Paradox and this in-house utility we built.
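For what it's worth, the cleanup that utility does is nothing exotic. It isn't SQL, but in SQL terms the duplicate and invalid-phone removal amounts to something like the sketch below (assuming a unique record id called rec_id and a character phone field; both names are made up for illustration).

```sql
-- Not our actual utility; just the same cleanup logic expressed in SQL.
-- Drop records with no usable phone number.
DELETE FROM tn_voters
WHERE phone IS NULL OR LTRIM(RTRIM(phone)) = '';

-- Keep one record per phone number and delete the rest.
DELETE t
FROM tn_voters t
WHERE EXISTS (SELECT 1
              FROM tn_voters d
              WHERE d.phone = t.phone
                AND d.rec_id < t.rec_id);
```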
These tools work great, but we couldn't help wondering whether a full-fledged commercial database engine could process the requests much faster. We decided to give SQL Server 2000 a try. It should be mentioned at this point that I have an Intel Pentium 4 2.66GHz processor on a 533MHz FSB, a 7200 RPM 80GB SATA hard drive and 1GB of DDR333 SDRAM. She'll keep up with anything you throw at her.
First, the import. Because the Tennessee voter file was sent in dBase format, we needed to import it into SQL Server. With 3.4 million records and over 850 columns, this alone took 45 minutes. Once the data was loaded into SQL Server, we needed to run the counts. Each count took roughly 8-10 minutes to return a result, and because the client was looking for 7 different counts, crunching the numbers took over an hour. Once the counts were done, removing the duplicate and invalid phone numbers and re-indexing the table took about 25 minutes. Appending the two new fields, "used" and "random", to the table took about another 10 minutes. Populating the "random" field with random numbers took about 5 minutes. Sorting the table by the random field and re-indexing took another 12-15 minutes. Finally, pulling the sample and populating the "used" field took about 25 minutes. Overall, it took me about three and a half to four hours from start to finish, and I even had two other data engineers sharing the workload with me. There is still a great deal of work left to be done after that point, but I was shocked that it took SQL Server 2000 so long to process the data.
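For anyone curious what those post-count steps looked like on the SQL Server side, the sketch below is roughly the shape of it in T-SQL (SQL Server 2000 syntax). The table and field names, the rec_id key, the criterion field and the quota of 500 are placeholders rather than the real job; also note that plain RAND() won't work for the random field, since it is evaluated once per statement.

```sql
-- Placeholder names throughout; the real table, fields and quotas differ.

-- Append the two working fields.
ALTER TABLE tn_voters ADD used tinyint NULL, rand_key int NULL;

-- RAND() alone returns the same value for every row in a single UPDATE,
-- so CHECKSUM(NEWID()) is used to get a different number per row.
UPDATE tn_voters
SET rand_key = ABS(CHECKSUM(NEWID())),
    used = 0;

-- Index the random key so ordering by it is cheap to repeat.
CREATE INDEX ix_tn_voters_rand ON tn_voters (rand_key);

-- Pull a 30:1 sample for one criterion (a quota of 500 -> 15,000 records)
-- into its own table, then flag the originals as used.
SELECT TOP 15000 *
INTO tn_sample_gop98
FROM tn_voters
WHERE gop_primary_98 = 'Y'
  AND used = 0
ORDER BY rand_key;

UPDATE v
SET used = 1
FROM tn_voters v
JOIN tn_sample_gop98 s ON s.rec_id = v.rec_id;
```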
There are still a couple of other database options we can look at: MySQL and PostgreSQL. The network admin recommends Postgres, so that will probably be the next database software tested on this monster. I will give SQL Server 2000 credit for its wide range of powerful GUI tools. The Enterprise Manager and Query Analyzer are awesome, and it was really easy to pick up the software without much trouble. However, the time it takes just won't cut it.
But the question still remains: what is the fastest? Right now, it seems to be SPSS and Paradox, which make working on a database table of that size a fairly quick and painless job. Is there something else faster? Am I not using SQL Server 2000 to its fullest potential? Are there shortcuts I am not aware of? I am not finished looking at other options. When work is slow, I will keep trying to find ways to speed up the process, and I will probably publish my findings here.