So, I needed something to search for a string in one text file and compare it with a string in a second text file. If there is a match, I print the result to a new text file. Because I'm new to Perl, but have a programming background, I figured this would be fairly easy, even if I'm not fluent with the language. Rather than get into the nitty-gritty details, I'll just show you what I did starting off:
```perl
#!/usr/bin/perl -w

open(IN1, "phones.dat");
$counter = 1;
while (<IN1>)
{
    $phone2 = substr($_, 0, 10);
    print "Match: 0" . "\r";
    open(IN2, "/kdata/2006/1514/data/P1514.FIN");
    while (<IN2>)
    {
        $phone1 = substr($_, 1384, 10);
        if ($phone1 == $phone2)
        {
            print "Match: " . $counter . "\r";
            $counter++;
        }
    }
    close(IN2);
}
print "\n";
close(IN1);
```
As you can see, this is a very inefficient algorithm. If "phones.dat" is n lines long, and "P1514.FIN" is m lines long, then for every n-record, I am looking at EVERY m-record. That means my search algorithm is O(n*m). Now, I'll grant that this works for small files. Looking at "phones.dat", which only has 330 lines, and "P1514.FIN", which only has 500 lines, what am I worried about? It takes less than a second to execute the code and spit out the result. Heh. If only every file I dealt with was that small. Today, I was looking at two files that had about 12,000 lines. It took ~20 seconds to finish the script. Still not really a length of time I have to be concerned about, but I can make it better.
First, thanks go to Levi Pearson and Jason Hall for helping me realize the errors of my ways. I can increase the efficiency of the algorithm greatly by using hashes. In fact, I can take the order of the algorithm to O(n+m) by traversing each file only once. Here's the code I came up with for that:
```perl
#!/usr/bin/perl -w

open(IN, "P1514.FIN");
while (<IN>)
{
    chomp;
    my $layout = '@1384 A10';
    my $number = unpack $layout, $_;
    $Phone{$number} = $_;
}
close(IN);

open(IN, "p1514.afo");
$counter = 0;
while (<IN>)
{
    my $layout = '@0 A10';
    my $number = unpack $layout, $_;
    if ($Phone{$number})
    {
        print "Match: " . $counter . "\r";
        $counter++;
    }
}
print "\n";
close(IN);
```
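The unpack templates do the heavy lifting here: '@1384 A10' means "seek to byte offset 1384, then read 10 ASCII characters." A minimal, self-contained sketch of the same idea, using a made-up fixed-width record rather than the real P1514.FIN layout:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up record layout: name in columns 0-9, phone in columns 10-19.
my $record = "JOHN DOE  8015551234ZZZZ";

# '@10 A10' = skip to offset 10, take 10 ASCII chars (trailing spaces trimmed).
my $phone = unpack '@10 A10', $record;
print "$phone\n";   # prints 8015551234
```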
In this case, "P1514.FIN" still has the 500 lines it used to have, and "p1514.afo" has well over 12,000. Executing the code is noticeably faster than before. We should expect this: rather than traversing the m-file n times to make my comparison, I traverse each file only once. The first pass stores the key and value in a hash, and the second pass makes the comparison. The only thing that might make the program a bit more efficient would be to store the smaller of the two files in the hash, and compare the larger of the two against the hash. But in that case, we're really just concerned about memory, which isn't a worry for me.
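If memory ever did become a concern, the tweak is small: build the hash from whichever input is smaller, and scan the larger one against it. A rough sketch of the idea, using in-memory lists of illustrative phone numbers standing in for the two files (the data is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-ins for the phone columns extracted from the two files.
my @small = qw(8015551234 8015559999);
my @large = qw(8015551234 8015550000 8015558888 8015559999 8015551234);

# Hash the smaller list so the hash stays small; scan the larger one.
my ($keys, $scan) = @small <= @large ? (\@small, \@large) : (\@large, \@small);

my %seen;
$seen{$_} = 1 for @$keys;

# grep in scalar context counts the matching records.
my $matches = grep { $seen{$_} } @$scan;
print "Matches: $matches\n";   # prints Matches: 3
```

The running time is still O(n+m); only the hash's memory footprint changes, which is why it didn't matter for files this size.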
The funny thing about this little programming example? I'm applying what I learned in Discrete Math to a very real everyday problem. And who'd-a-thunk any of that stuff was important?! 🙂