Image of the glider from the Game of Life by John Conway
Skip to content

Optimizing Searches In Perl

So, I needed something to search a string in one text file, and compare it with a string in a second text file. If there is a match, I need to print the results to a new text file. So, because I'm new to Perl, but have a programming background, I figured this would be fairly easy, even if I'm not fluent with the language. Rather than get into the nitty-gritty about the details, I'll just show you what I did starting off:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
#!/usr/bin/perl -w

open(in1, "phones.dat");

$counter = 1;

while(<in1>) # </in1>
{
        $phone2 = substr($_, 0, 10);
        print "Match: 0" . "\r";

        open(in2, "/kdata/2006/1514/data/P1514.FIN");

        while(<in2>) #</in2>
        {
                $phone1 = substr($_, 1384, 10);

                if($phone1 == $phone2)
                {
                        print "Match: " . $counter . "\r";
                        $counter++;
                }
        }
        close(in2);
}

print "\n";

close(in1);

As you can see, this is a very inefficient algorithm. If "phones.dat" is n-lines long, and "P1514.FIN" is m-lines long, then for every n-record, I am looking at EVERY m-record. This means, my search algorithm is O(n*m). Now, I'm going to say, that this works for small files. Looking at "phones.dat" which only has 330 lines, and "P1514.FIN" which only has 500 lines, what am I worried about? It takes less than a second to execute the code, and spit out the result. Heh. If only every file I dealt with was that small. Today, I was looking at two files that had about 12,000 lines. It took ~20 seconds to finish the script. Still, not really a length that I have to be concerned about, but I can make it better.

First, thanks go to Levi Pearson and Jason Hall for helping me realize the errors of my ways. I can increase the efficiency of the algorithm greatly by using hashes. In fact, I can take the order of the algorithm to O(n+m) by only traversing each file only once. Here's the code I came up for that:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
#!/usr/bin/perl -w
       
open(in, "P1514.FIN");
       
while(<in>) # </in>
{
        chomp;
        my $layout = '@1384 A10';
        my $number = unpack $layout, $_;
       
        $Phone{$number} = $_;
}
       
close(in);
open(in, "p1514.afo");

$counter = 0;
       
while(<in>) #</in>
{
        my $layout = '@0 A10';
        my $number = unpack $layout, $_;

        if ($Phone{$number})
        {
            print "Match: " . $counter . "\r";
            $counter++;
        }
}

print "\n";

close(in);

In this case, "P1514.FIN" still has the 500 lines it used to have, and "p1514.afo" has well over 12,000. Executing the code is actually faster than previous. We should expect this as rather than traversing the m-file n times making my comparison, I traverse each file only once. The first file to store the key and value in a hash, and the second file to make the comparison. The only thing that may make the program a bit more efficient, would be to store the smaller of the two files in the hash, and use the larger of the two for comparing against the hash. But really, in that case, we're just concerned about memory, which isn't a worry for me.

The funny thing about this little programming example? I'm applying what I've learned in Discrete Math to a very real everyday problem. And who'd-a-thunk that any of the stuff was important?! :)

{ 7 } Comments

  1. Harley Pig using BonEcho 2.0 on GNU/Linux 64 bits | December 18, 2006 at 3:20 pm | Permalink

    #!/usr/bin/perl -w

    use strict;

    # I realize a lot of these are not
    # specifically related to your problem
    # but I'm pedantic.

    # First, filehandles in perl are
    # idiomatically uppercase, just to
    # help distinguish them.
    #
    # Second, use variable filehandles. It
    # makes error handling easier.
    #
    # Third, predeclare as much as possible.
    # It'll save a little bit of time. Right
    # now you're declaring a variable every
    # time through both loops, but it's not
    # changing.

    # I don't use (un)pack that often so I
    # can't comment on that.

    my %phone;

    my $layout1 = '@1384 A10';
    my $layout2 = '@0 A10';

    if ( open my $IN1, "P1514.FIN" ) {

    while( ) {

    chomp;

    # You don't need to assign a variable
    # for temporary use, it just slows you
    # down and takes up memory.

    # Why are you saving the line if you're
    # not using it later?
    $phone{ unpack $layout1, $_ } = '';

    }

    # $IN1 loses scope here. Perl will
    # automatically close the file.

    } else {

    # It's confusing when you get bogus data,
    # or no data.

    die "Unable to open P1514.FIN: $!\n";

    }

    if ( open my $IN2, "p1514.afo" ) {

    my $counter = 0;

    while( ) {

    # I'm not sure what you're trying to
    # accomplish here, but taking what you
    # have, you can increase the efficiency
    # quite a bit by using a little more
    # idiomatic perl.

    print "Match: " . $counter++ . "\r"
    if exists $phone{ unpack $layout2, $_ };

    }
    } else {

    die "Unable to open p1514.afo: $!\n";

    }

    print "\n";

    # Not having a copy of the data I can't test this
    # but it passes perl -c

  2. Aaron using Firefox 2.0 on Ubuntu | December 18, 2006 at 4:04 pm | Permalink

    harleypig- thx for your response. A couple of things:

    1) WordPress likes to lowercase anything that sits between < and >. As in the case with this code, it got lower case, but I do use uppercase filehandles. Also, you’ll notice “# </in1>”. This is because WordPress will automatically close my tags if I don’t, so I just put it in the code.
    2) I store the line of the key in the hash, because I actually am using it later. I probably shouldn’t have put it in my example code.
    3) I think some of your example is preference to the programmer, but at any case, goes to show that there is more than one way to skin a cat! But I agree that it could get a bit more efficient.

  3. Matthew Kimber using Firefox 2.0 on Windows XP | December 19, 2006 at 8:01 am | Permalink

    Couldn't you have just used regular expressions?

  4. Aaron using Firefox 2.0 on Ubuntu | December 19, 2006 at 8:06 am | Permalink

    Matthew-

    In this case, no. The files that I am searching have no rhyme and reason to the layout. Well, they do, but it differs from file to file, and I don't know before hand what the layout will be. I only know where certain variables are in the layout. So, with that, I can just use substr() or unpack(), as seen here, to get right to where I need to be in the file.

  5. Jayce^ using Firefox 2.0 on Ubuntu | December 19, 2006 at 10:18 am | Permalink

    Matthew - In fixed width data especially, unpack is much faster, it has no need to save state, load a complex engine, or other background tasks it would need. He still *could* have used them, but this is much more efficient.

    Harleypig - as for : print “Match: ” . $counter++ . “\r”

    just to be pedantic to you :) dont' use the concat operator (.) in a print. If you think about it the print operator is set to handle arrays, so why not just give it one. Benchmark the difference to verify, but using concat will force it to concat a new string first, then print, instead of print array.

  6. Harley Pig using BonEcho 2.0 on GNU/Linux 64 bits | December 19, 2006 at 10:37 am | Permalink

    Jayce--arghh ... yeah ... I'm still trying to internalize that one.

    > <

  7. Harley Pig using BonEcho 2.0 on GNU/Linux 64 bits | December 19, 2006 at 10:45 am | Permalink

    That first if block, where you're reading in the first file would be much faster with this:

    %phone = map { ( unpack $layout1, $_ ) , $_ } <$IN1>

{ 1 } Trackback

  1. Aaron Toponce » Blog Archive » Using GeSHi | December 19, 2006 at 8:49 am | Permalink

    [...] One of the strong features in GeSHi, is the ability to recognize keywords and functions built into the language, and provide a link to the official documentation about that keyword. You may have noticed this with my last post about optimizing searches in Perl. Clicking on one of the functions will take to the Perl documentation site about that function. Try ‘print’ or ‘open’ to see what I am talking about. [...]

Post a Comment

Your email is never published nor shared.

Switch to our mobile site