In this article I show how to use Perl to extract keywords from a text file, build an index of those keywords, and reassemble the text in a simplified form. Storing each word as a number that points into the index gives you a form of compression, as well as control over the stored data. I used The Snow Queen with the simplified character set of the Odyssey 2 in mind, but the technique has other applications for searching and extracting data. It also strictly controls non-word characters, which is useful for security reasons when accepting form text. For more information on my Odyssey 2 projects, see this page. Here is the script:
    %found = ();
    %index = ();
    $i = 0;

    # Pass 1: build the index file, one unique upper-cased word per line.
    open (SQ, "< snowqueen");
    open (SQI, "> snowqueeni");
    while (<SQ>){
        # Replace paragraph breaks and punctuation with placeholder words.
        if (/^$/ ) { s/^/ endofparagraph/g; }
        s/\./ endofsentence/g;
        s/--/ doubledash /g;
        s/-/ singledash /g;
        s/\"/ doublequote /g;
        s/\,/ commainsentence/g;
        s/:/ coloninsentence/g;
        s/\?/ questioninsentence/g;
        s/!/ banginsentence/g;
        while ( /(\w['\w-]*)/g ){
            if (!(exists $found{uc $1})){
                $found{uc $1}=$i;
                print SQI (uc $1),"\n";
                $i++;
            }
        }
    }
    close (SQ);
    close (SQI);

    # Pass 2: write the compressed file, one index number per word.
    open (SQC, "> snowqueenc");
    open (SQ, "< snowqueen");
    while (<SQ>){
        if (/^$/ ) { s/^/ endofparagraph/g; }
        s/\./ endofsentence/g;
        s/--/ doubledash /g;
        s/-/ singledash /g;
        s/\"/ doublequote /g;
        s/\,/ commainsentence/g;
        s/:/ coloninsentence/g;
        s/\?/ questioninsentence/g;
        s/!/ banginsentence/g;
        while ( /(\w['\w-]*)/g ){
            print SQC $found{uc $1},"\n";
        }
    }
    close SQC;
    close SQ;

    # Read the index back in, keyed by line number.
    open (SQI, "< snowqueeni");
    $i=0;
    while (<SQI>){
        chomp;
        $index{$i}=$_;
        $i++;
    }
    close SQI;

    # Render the text from the compressed file and the index.
    open (SQC, "< snowqueenc");
    while (<SQC>){
        chomp;
        use Switch;
        switch ($index{$_}){
            case "COMMAINSENTENCE" { print "\/"; }
            case "COLONINSENTENCE" { print ":"; }
            case "QUESTIONINSENTENCE" { print "?"; }
            case "BANGINSENTENCE" { print "!"; }
            case "ENDOFSENTENCE" { print "\."; }
            case "DOUBLEQUOTE" { print "\>"; }
            case "DOUBLEDASH" { print "--"; }
            case "SINGLEDASH" { print "-"; }
            case "ENDOFPARAGRAPH" { print "\n\n"; }
            else { print " ".$index{$_}; }
        }
    }
    close SQC;
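Both passes in the script repeat the same substitutions before pulling out words. As a rough sketch of how that shared step might be factored out, the block below wraps it in one subroutine. The name `tokenize_line` is hypothetical and not part of the original script; the substitutions and the word pattern are copied from it.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): turn one line of text
# into the list of upper-cased tokens that both passes operate on.
sub tokenize_line {
    my ($line) = @_;

    # An empty line marks a paragraph break.
    $line =~ s/^/ endofparagraph/ if $line =~ /^$/;

    # Replace punctuation with placeholder words, as in the script.
    $line =~ s/\./ endofsentence/g;
    $line =~ s/--/ doubledash /g;
    $line =~ s/-/ singledash /g;
    $line =~ s/"/ doublequote /g;
    $line =~ s/,/ commainsentence/g;
    $line =~ s/:/ coloninsentence/g;
    $line =~ s/\?/ questioninsentence/g;
    $line =~ s/!/ banginsentence/g;

    # Same word pattern as the script: a word character followed by any
    # run of word characters, apostrophes, or hyphens.
    my @tokens;
    while ($line =~ /(\w['\w-]*)/g) {
        push @tokens, uc $1;
    }
    return @tokens;
}

my @tokens = tokenize_line(qq{"Many thanks!" said little Gerda\n});
print join(' ', @tokens), "\n";
```

For that sample line the tokens come out as DOUBLEQUOTE MANY THANKS BANGINSENTENCE DOUBLEQUOTE SAID LITTLE GERDA. Each pass could then loop over `tokenize_line($_)` instead of repeating the substitutions.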
Here is what the compressed file looks like:
    u-1@srv-1 sq $ head snowqueenc -n 20
    0
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    9
    0
    13
    3
    14
    15
    16
Here is what the index file looks like:
    u-1@srv-1 sq $ head snowqueeni -n 20
    THE
    SNOW
    QUEEN
    ENDOFPARAGRAPH
    FIRST
    STORY
    ENDOFSENTENCE
    WHICH
    TREATS
    OF
    A
    MIRROR
    AND
    SPLINTERS
    NOW
    THEN
    COMMAINSENTENCE
    LET
    US
    BEGIN
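To see how the two files relate, the first twenty numbers from snowqueenc can be looked up against the first twenty entries from snowqueeni. The sketch below hard-codes the values from the two listings above rather than reading the files. It also uses an array where the script uses a hash keyed by line number, which should be equivalent for this purpose.

```perl
use strict;
use warnings;

# First 20 index entries and first 20 codes, copied from the listings above.
my @index = qw(THE SNOW QUEEN ENDOFPARAGRAPH FIRST STORY ENDOFSENTENCE WHICH
               TREATS OF A MIRROR AND SPLINTERS NOW THEN COMMAINSENTENCE
               LET US BEGIN);
my @codes = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 9, 0, 13, 3, 14, 15, 16);

# Each code is simply a position in the index.
my @words = map { $index[$_] } @codes;
print join(' ', @words), "\n";
```

Repeated words such as OF (9) and THE (0) reuse their earlier index numbers, which is where the space saving comes from.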
Here is a section of text in the original story:

"Oh, how long I have stayed!" said the little girl. "I intended to look for Kay! Don't you know where he is?" she asked of the roses. "Do you think he is dead and gone?"

"Dead he certainly is not," said the Roses. "We have been in the earth where all the dead are, but Kay was not there."

"Many thanks!" said little Gerda; and she went to the other flowers, looked into their cups, and asked, "Don't you know where little Kay is?" But every flower stood in the sunshine, and dreamed its own fairy tale or its own story: and they all told her very many things, but not one knew anything
Here is the rendered text using the bit of code at the end of the script:

    > OH/ HOW LONG I HAVE STAYED!> SAID THE LITTLE GIRL.> I INTENDED TO LOOK FOR KAY! DON'T YOU KNOW WHERE HE IS?> SHE ASKED OF THE ROSES.> DO YOU THINK HE IS DEAD AND GONE?>

    > DEAD HE CERTAINLY IS NOT/> SAID THE ROSES.> WE HAVE BEEN IN THE EARTH WHERE ALL THE DEAD ARE/ BUT KAY WAS NOT THERE.>

    > MANY THANKS!> SAID LITTLE GERDA AND SHE WENT TO THE OTHER FLOWERS/ LOOKED INTO THEIR CUPS/ AND ASKED/> DON'T YOU KNOW WHERE LITTLE KAY IS?> BUT EVERY FLOWER STOOD IN THE SUNSHINE/ AND DREAMED ITS OWN FAIRY TALE OR ITS OWN STORY: AND THEY ALL TOLD HER VERY MANY THINGS/ BUT NOT ONE KNEW ANYTHING
The characters on the Odyssey 2 are limited, so I’ve had to do some interesting things with punctuation: commas are rendered as "/" and double quotes as ">", and semicolons are dropped. In this article I used a different color for quoted strings, and rendered The Snow Queen.
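The punctuation mapping can also be written as a plain lookup table. This is an alternative to the `switch` statement in the script, not the method the script uses; it may be handy where the Switch module is not installed, since recent Perl releases no longer bundle it. The helper name `render_tokens` is hypothetical, and the table entries mirror the script's `case` statements.

```perl
use strict;
use warnings;

# Placeholder token => text printed for it, matching the case
# statements in the script.
my %render = (
    COMMAINSENTENCE    => '/',
    COLONINSENTENCE    => ':',
    QUESTIONINSENTENCE => '?',
    BANGINSENTENCE     => '!',
    ENDOFSENTENCE      => '.',
    DOUBLEQUOTE        => '>',
    DOUBLEDASH         => '--',
    SINGLEDASH         => '-',
    ENDOFPARAGRAPH     => "\n\n",
);

# Hypothetical helper: render a list of tokens to a string.
sub render_tokens {
    my @tokens = @_;
    my $out = '';
    for my $token (@tokens) {
        # Punctuation tokens map to their symbol; ordinary words get a
        # leading space, as in the script's else branch.
        $out .= exists $render{$token} ? $render{$token} : " $token";
    }
    return $out;
}

print render_tokens(qw(DOUBLEQUOTE MANY THANKS BANGINSENTENCE DOUBLEQUOTE SAID LITTLE GERDA)), "\n";
```

Adapting the output to a different character set then only means editing the table.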