<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Aaron Toponce &#187; Databases</title>
	<atom:link href="http://pthree.org/category/databases/feed/" rel="self" type="application/rss+xml" />
	<link>http://pthree.org</link>
	<description>Linux.  GNU.  Freedom.</description>
	<lastBuildDate>Sun, 05 Feb 2012 14:33:06 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.4-alpha</generator>
		<item>
		<title>MySQL Rocks</title>
		<link>http://pthree.org/2006/07/16/mysql-rocks/</link>
		<comments>http://pthree.org/2006/07/16/mysql-rocks/#comments</comments>
		<pubDate>Sun, 16 Jul 2006 16:12:35 +0000</pubDate>
		<dc:creator>Aaron Toponce</dc:creator>
				<category><![CDATA[Databases]]></category>

		<guid isPermaLink="false">http://www.pthree.org/2006/07/16/mysql-rocks/</guid>
		<description><![CDATA[Well, it looks like I have finally reached an end to managing massive DBF tables (500MB+), and MySQL provides the answer. Even if I do have to make a couple hacks to get it to work. For those of you unaware, this has been a plaguing problem of mine for a while now. For starters, [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://www.pthree.org/wp-content/uploads/2006/07/Screenshot.png" align="right" hspace="15" vspace="15" alt="MySQL Screenshot" />Well, it looks like I have finally reached the end of managing massive DBF tables (500MB+), and MySQL provides the answer, even if I do have to make a couple of hacks to get it to work.  For those of you unaware, <a href="http://www.pthree.org/2006/01/22/managing-large-databases/">this has been a plaguing problem of mine for a while now</a>.</p>
<p>For starters, MySQL does not support DBF natively.  There isn&#8217;t a way to import a dBase format table directly into MySQL, which is unfortunate, given that DBF is one of the oldest database file formats still around and used to be one of the most common.  So, short of purchasing proprietary software that handles the import directly, I need another way.</p>
<p>MySQL handles importing comma-separated (*.csv) and tab-separated (*.tsv) files through the LOAD DATA INFILE command (technically speaking, any delimiter is acceptable as long as it is used consistently and unambiguously separates each column).  So, I can export the table to a tab-separated file, and import the data as follows:</p>
<pre>
LOAD DATA [LOCAL] INFILE '/home/aaron/somedata.tsv'
INTO TABLE some_table
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
</pre>
<p>[LOCAL] (without the brackets) is optional and only needed when administering a remote MySQL database.  With LOCAL, MySQL loads the file from your client machine; without it, MySQL looks for the file on the server.  Also, MySQL expects tab-separated fields and Unix line feeds by default, so FIELDS TERMINATED BY &#8216;\t&#8217; and LINES TERMINATED BY &#8216;\n&#8217; aren&#8217;t needed if that is the case.  If you are working on a Windows machine, then your file could be saved in DOS format with carriage returns and line feeds, in which case LINES TERMINATED BY &#8216;\r\n&#8217; would need to be used.  Check the documentation for further info.  Of course, the table structure would need to be built and defined before loading the data.</p>
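<p>As a minimal sketch of that last point, building the table first and then loading a DOS-formatted file from the client machine might look something like the following.  The three columns are made up for illustration (the real table has far more fields), and the column list at the end assumes the file holds exactly those three values per line, in that order:</p>
<pre>
-- Hypothetical table definition; column names and sizes are assumptions.
CREATE TABLE some_table (
    phone   VARCHAR(10),
    county  VARCHAR(30),
    party   VARCHAR(30)
);

-- LOCAL reads the file from the client machine instead of the server.
-- '\r\n' matches DOS line endings; use '\n' for a Unix file.
LOAD DATA LOCAL INFILE '/home/aaron/somedata.tsv'
INTO TABLE some_table
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\r\n'
(phone, county, party);
</pre>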
<p>So, now I have my data loaded.</p>
<p>I should mention that loading a table with 2 million rows and 250 fields (with each field containing ~30 characters of data and each row with ~200 of the fields populated on average) actually went fairly quickly.  My database server is a Pentium 4 2.8 GHz with 1GB DDR2 RAM, 533MHz FSB and a SATA 2 100GB HDD (I got an upgrade).  It took about 10 minutes to export the DBF to *.tsv (we had to break up the table, or Paradox would not do it) and 2 minutes to import the data into MySQL.  When I get the data I need, I can then export my result set to *.csv in less than a minute.  Needless to say, much much MUCH faster than SQL Server 2000 (45-90 minutes to import the DBF directly and another 20-40 minutes to export it to *.csv, not including the queries and populating data).</p>
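<p>For the export step, one way to write a result set straight to *.csv from SQL is SELECT ... INTO OUTFILE.  This is only a sketch: the output path, the column names and the WHERE clause are assumptions for illustration.  Note that the file is written on the server, not the client, and the statement fails if the file already exists:</p>
<pre>
-- Hypothetical query; swap in whatever result set you actually need.
SELECT phone, county, party
INTO OUTFILE '/tmp/result.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM some_table
WHERE party = 'R';
</pre>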
<p>The great thing about MySQL is the sheer speed.  This is important, because we get very busy at work, and I don&#8217;t have time to sit and wait.  You may remember an earlier post (link above) where I mentioned that SQL Server 2000 took 3-4 hours from start to finish to get done what I need.  Well, with MySQL the whole process from start to finish takes less than 1 hour.  Exactly what I am looking for.</p>
<p>There is only one hiccup, and it is a fairly large one.  When running some generic queries, like SELECT * FROM table, on a table that size, it can take anywhere from 20 minutes to over an hour, which is expected.  If I am running the query on a server that resides on a Linux machine, it completely crashes X and locks up the machine and the query obviously fails, which isn&#8217;t expected.  The only thing I can think of is allocating more disk space to the swap partition.  Has anyone else noticed this issue, and does anyone know how to fix it?  It has to be a size issue.  Other than that, MySQL rocks!</p>
<p>Now, here is the great thing.  MySQL has a slew of GUI tools available for Windows, Linux, Mac, etc. to make this process as painless as possible.  For example, rather than using the command line to execute my queries, I can use the MySQL Query Browser.  It can connect to multiple databases and servers (MySQL and others) simultaneously, it has <a href="http://dev.mysql.com/doc/">great documentation</a> built right in, and it is easy to use (see screenshot above).  There are MySQL Administrator, MySQL Workbench (Windows only), MySQL Query Browser and MySQL Migration Toolkit (Windows and Linux only), all available for download and completely open source.</p>
]]></content:encoded>
			<wfw:commentRss>http://pthree.org/2006/07/16/mysql-rocks/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Linux Is Fun and Stuff, but&#8230;</title>
		<link>http://pthree.org/2006/02/14/linux-is-fun-and-stuff-but/</link>
		<comments>http://pthree.org/2006/02/14/linux-is-fun-and-stuff-but/#comments</comments>
		<pubDate>Tue, 14 Feb 2006 15:17:58 +0000</pubDate>
		<dc:creator>Aaron Toponce</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[Wordpress]]></category>

		<guid isPermaLink="false">http://www.pthree.org/2006/02/14/linux-is-fun-and-stuff-but/</guid>
		<description><![CDATA[&#8230;setting up a Linux server, and even worse, being the system administrator is a pain in the butt!!!111!!!11!one Okay, here&#8217;s the thing.  I am now hosting the OALUG site on my Ubuntu web server.  No big deal.  I am hosting 3 other sites as well.  Being a web admin is not that bad.  Lately, however, [...]]]></description>
			<content:encoded><![CDATA[<p>&#8230;setting up a Linux server, and even worse, being the system administrator is a pain in the butt!!!111!!!11!one</p>
<p>Okay, here&#8217;s the thing.  I am now hosting the OALUG site on my Ubuntu web server.  No big deal.  I am hosting 3 other sites as well.  Being a web admin is not that bad.  Lately, however, I need to set up a mail and DNS server for the OALUG group.  I&#8217;ll get to that in another post as to why.</p>
<p>First, I asked a friend what packages I needed and what needed to be done, as he has done it before.  He gave me the link to an excellent howto, and for the most part, everything went smoothly, until I began installing packages I already had installed, like MySQL.  Don&#8217;t ask.</p>
<p>Everything started breaking.  First it was MySQL, then PHP, then Apache2.  Then, after updating the repositories in my sources.list file, 280 packages broke, including the Linux kernel, the Ubuntu desktop, and a ton more.</p>
<p>As you can see, MySQL, PHP and Apache2 are working just fine now, I hope (I&#8217;ll really find out if this post doesn&#8217;t submit).  All the packages that I broke, I have fixed.  But I was up until 2:30 in the freaking morning fixing them.  And that was before I even had the chance to get my web server back to the way it was before all this mess.</p>
<p>Now, before all you Microsoft advocates come out of the woodwork saying, &#8220;We told you so Aaron.  Windows is easier and less cryptic to use than Linux&#8221;, I have to say that I have learned <em>a ton</em>.  And, despite the lack of sleep, the 300 new gray hairs, and the ulcers that I developed, it was actually, in an odd sort of way, fun.  I got to see deep inside the server.  I edited config files, changed all sorts of permissions, added, removed and re-added packages, and really got to know a lot better how my Linux server works.  I wouldn&#8217;t trade that for an easy GUI that handles everything for me.</p>
<p>Am I finished?  No, I still have a little more to go.  And I will probably break more packages on the way, but I will learn more than if I purchased a program to do the job for me.  And I will be more qualified as a Linux/UNIX system admin.  I would never apply for such a job.  The stress would kill me, but my qualifications have just jumped through the roof.</p>
]]></content:encoded>
			<wfw:commentRss>http://pthree.org/2006/02/14/linux-is-fun-and-stuff-but/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Managing LARGE Databases Continued</title>
		<link>http://pthree.org/2006/01/26/managing-large-databases-continued/</link>
		<comments>http://pthree.org/2006/01/26/managing-large-databases-continued/#comments</comments>
		<pubDate>Thu, 26 Jan 2006 15:21:54 +0000</pubDate>
		<dc:creator>Aaron Toponce</dc:creator>
				<category><![CDATA[Databases]]></category>

		<guid isPermaLink="false">http://www.pthree.org/2006/01/26/managing-large-databases-continued/</guid>
		<description><![CDATA[Well, I don&#8217;t have much to report so far, other than OpenOffice.org still has a lot of bugs when it comes to its Base program.  First off, before throwing the 3.4 million record Tennessee database at it, I thought I would try a much more compact database in dBase format with only a few thousand [...]]]></description>
			<content:encoded><![CDATA[<p>Well, I don&#8217;t have much to report so far, other than <a target="_blank" title="OpenOffice.org" href="http://www.openoffice.org">OpenOffice.org</a> still has a lot of bugs when it comes to its Base program.  First off, before throwing the 3.4 million record Tennessee database at it, I thought I would try a much more compact database in dBase format with only a few thousand records.  Needless to say, it c&#8230;r&#8230;a&#8230;w&#8230;l&#8230;s.  If Paradox is like a cheetah running at 45 miles / hour, then SQL Server 2000 is like a man lightly jogging at 11 miles / hour, while OpenOffice.org Base is a three-toed sloth taking a holiday. It is <strong>painfully</strong> slow.  And, when up and running, it constantly crashes, which means more time spent re-importing the database and starting over.  It blows!  Of course, the culprit is the Java Virtual Machine (JVM).</p>
<p>While the programmers at Sun Microsystems want to keep the application platform independent, deciding to base the entire OpenOffice.org infrastructure on the JVM is a poor decision.  While it runs at almost native speed when compared to Brand M Office, it has its limitations.  Base is one LARGE limitation.  Sure, I can develop robust, light and scalable database applications complete with reports, forms and queries.  Heck, it can even keep up with commercially driven database applications if you know what you are doing.  But begin to increase the size of the database anywhere past 1,000 records, and you have nothing but a headache on your hands.  Which is unfortunate, as I <em>love</em> this application suite.</p>
<p>So, I won&#8217;t even bother trying to import the Tennessee database using Base.  Instead, I am on to MySQL and PostgreSQL, as they are much more robust SQL applications.  Honestly though, I never expected to use OpenOffice.org Base to begin with.  I was just curious about its speed, and boy did I find out.</p>
]]></content:encoded>
			<wfw:commentRss>http://pthree.org/2006/01/26/managing-large-databases-continued/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Managing LARGE Databases</title>
		<link>http://pthree.org/2006/01/22/managing-large-databases/</link>
		<comments>http://pthree.org/2006/01/22/managing-large-databases/#comments</comments>
		<pubDate>Sun, 22 Jan 2006 15:33:37 +0000</pubDate>
		<dc:creator>Aaron Toponce</dc:creator>
				<category><![CDATA[Databases]]></category>

		<guid isPermaLink="false">http://www.pthree.org/2006/01/22/managing-large-databases/</guid>
		<description><![CDATA[First off a couple of definitions. When I refer to columns in this post, I am not referring to the actual field in a database, but rather I am referring to a column if the database was exported to fixed (space delimited) ASCII text. Also, when I mention records, I am referring to the number [...]]]></description>
			<content:encoded><![CDATA[<p>First off, a couple of definitions.  When I refer to <em>columns </em>in this post, I am not referring to the actual field in a database, but rather I am referring to a column if the database was exported to fixed (space delimited) ASCII text.  Also, when I mention <em>records</em>, I am referring to the number of rows that the database contains.  The reason I mention records is because these database files are sample representations of certain geographical areas.  For example, a database may contain 4,000 records of Republican voters in a certain county in a certain state.  Required info that must be in every database is phone number and geographical representation (a FIPS number, the city or county name, or other criteria).  Other info commonly found in these databases includes name, address, state, gender, race, party, voting history and a plethora of other information.</p>
<p>I work with databases daily.  Most databases that we handle are in <a title="dBase" target="_blank" href="http://www.dbase.com/">dBase</a> format and are roughly 5,000 to 10,000 records with about 400-500 columns.  Luckily, these databases aren&#8217;t too terrible to manage.  Using <a title="Paradox" target="_blank" href="http://www.corel.com">Paradox</a> as our database software, we can get what we need done quite quickly without any hassle.  However, every once in a while (I mean about once or twice a month), I will receive a dBase file with hundreds of thousands, and a few times even millions, of records with 800-900 columns of data.  These files are HUGE.  The last database I worked on held 3.4 million records with 850 columns of data.  The database was 1.6GB, and was a voting history of the entire state of Tennessee.</p>
<p>So my question is this: what database software would you use to get the job done the fastest?  There is a lot that needs to be done on the database.  I first need to run counts based on the client&#8217;s request.  For example, on the Tennessee voter file with 3.4 million records, I needed to find out how many people with valid phone numbers voted in either the 1998 GOP primary, 2000 Presidential preference primary or the 2002 GOP primary elections, broken down by county.  I also needed to find out how many people who voted in any of those elections fell into certain age groups.  All in all, the client wanted 7 different counts based on different criteria.  Once the client finds out what sample representation he needs for his project, I need to remove any records without valid phone numbers and any duplicate phone numbers, then re-index the table.  After that, I need to append two new fields at the end of the table, populate one of them with a random number, then sort the table by the random number (and, of course, re-index).  Finally, I need to pull the records based on the criteria he selected at a 30:1 ratio (30 records for every quota he is looking to fill).  This means copying those records to another table, and populating one of the newly appended fields in the original table with a 1 so I know which records were used.  If you don&#8217;t know your SQL, you will drown when trying to process this data.</p>
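<p>To make those steps concrete, here is a rough sketch in MySQL-flavored SQL.  Every table and column name (voters, id, phone, county, voted_1998_gop and so on) is hypothetical, &#8220;valid phone&#8221; is simplified to &#8220;not empty&#8221;, the county is just an example, and the sample size of 3000 stands in for 30 times a quota of 100.  I am not claiming this is the fastest way to do any of it:</p>
<pre>
-- One count: valid phones, voted in any of the three elections, by county.
SELECT county, COUNT(*) AS voter_count
FROM voters
WHERE phone IS NOT NULL AND phone != ''
  AND (voted_1998_gop = 'Y' OR voted_2000_pres = 'Y' OR voted_2002_gop = 'Y')
GROUP BY county;

-- Remove records without a phone number.
DELETE FROM voters WHERE phone IS NULL OR phone = '';

-- Remove duplicate phone numbers, keeping the row with the lowest id
-- (assumes a unique integer id column exists).
DELETE v1 FROM voters v1
JOIN voters v2 ON v1.phone = v2.phone AND v1.id &gt; v2.id;

-- Append the two new fields and fill one with a random number per row.
ALTER TABLE voters
    ADD COLUMN random DOUBLE,
    ADD COLUMN used TINYINT NOT NULL DEFAULT 0;
UPDATE voters SET random = RAND();

-- An index on the random field stands in for "sort, then re-index".
CREATE INDEX idx_random ON voters (random);

-- Pull the sample at 30:1 into another table, in random order.
CREATE TABLE sample AS
SELECT * FROM voters
WHERE county = 'Shelby'
ORDER BY random
LIMIT 3000;

-- Flag the pulled records in the original table.
UPDATE voters v
JOIN sample s ON v.id = s.id
SET v.used = 1;
</pre>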
<p>I ask my question again: what database software would you use to get the job done the fastest?</p>
<p>These databases are always associated with a project that almost always has daily deadlines.  And that isn&#8217;t the only project that will be running for the day.  Usually, we have 6 or 7 projects running on any given day for clients, sometimes more and sometimes less, and most of them have to be finished by 3pm.  With the shift starting at 9am, this doesn&#8217;t give us a lot of time to get these projects done.  With databases that sit around 5,000 to 10,000 or even 20,000 records, it only takes 10-15 minutes from start to finish using Paradox and a simple text editor.  However, with this voter file of 3.4 million records, it takes a lot longer.</p>
<p>The system we have set up is to run the counts that the client wants using <a target="_blank" title="SPSS for Windows" href="http://www.spss.com/">SPSS for Windows</a>.  Importing the Tennessee voter file database takes about 2-3 minutes, then running each selection criteria takes about 1-2 minutes and finally processing the frequencies for the counts also takes about 1-2 minutes.  When a client is looking for 7 counts, it will take about 25-30 minutes to get all the numbers he is looking for.  We then have another utility that removes the duplicates, appends the new fields, populates the random field and sorts by the random number.  This process will take about another 10-15 minutes on the 3.4 million record Tennessee file.  Lastly, we pull the sample and populate the &#8220;used&#8221; field with a 1 for each record pulled.  This only takes 2-3 minutes.  All in all, you are looking at about 45 minutes to an hour with a database this size using SPSS, Paradox and this in-house utility we built.</p>
<p>These tools work great, but we couldn&#8217;t help but wonder if a full-fledged commercial database engine could process the requests much faster.  We decided to give <a target="_blank" title="SQL Server 2000" href="http://www.microsoft.com/sql/default.mspx">SQL Server 2000</a> a try.  It should be mentioned at this point that I have an Intel Pentium 4 2.66 GHz processor with a 500MHz FSB, a 7200 RPM SATA 80GB hard drive and 1GB of SD333 DDR RAM.  She&#8217;ll keep up with anything you throw at her.</p>
<p>First, the import.  Because the Tennessee voter file database sent is in dBase format, we need to import it into SQL Server.  With 3.4 million records and over 850 columns, this took 45 minutes alone.  Once the new database was populated into SQL Server, we needed to run the counts.  For each count, it took roughly 8-10 minutes to return the result.  Because the client was looking for 7 different counts, this took over an hour to crunch the numbers.  Once the counts were reached, removing the duplicate and invalid phone numbers then re-indexing the table took about 25 minutes.  It took about another 10 minutes to append the two new fields &#8220;used&#8221; and &#8220;random&#8221; to the table.  Populating the &#8220;random&#8221; field with random numbers took about 5 minutes.  Sorting the table by the random field and re-indexing the table took about another 12-15 minutes.  Finally, pulling the sample and populating the &#8220;used&#8221; field took about 25 minutes.  Overall, it took me about three and a half to 4 hours from start to finish.  I even had two other data engineers sharing the workload with me.  There is still a great deal of work left to be done after this point, but I was shocked that it would take SQL Server 2000 so long to process the data.</p>
<p>There are still a couple of other database options that we can look at: <a target="_blank" title="MySQL" href="http://www.mysql.com/">MySQL</a> and <a target="_blank" title="PostgreSQL" href="http://www.postgresql.com">PostgreSQL</a>.  The network admin recommends PostgreSQL, so I think that will be the next database software tested on this monster.  I will give it to SQL Server 2000 for its wide range of powerful GUI tools.  The Enterprise Manager and Query Analyzer are awesome.  It was really intuitive to use the software without much trouble.  However, the time it takes just won&#8217;t cut it.<br />
But the question still remains: what is the fastest?  Right now, it seems to be SPSS and Paradox, which make working on a database table of that size a fairly quick and painless job.  Is there something else faster?  Am I not using SQL Server 2000 to its fullest potential?  Are there some shortcuts that I am not aware of?  I am not finished with looking for other options.  When work is slow, I will be trying to find ways to speed up the process.  I will probably publish my findings here.</p>
]]></content:encoded>
			<wfw:commentRss>http://pthree.org/2006/01/22/managing-large-databases/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

