A Perl Grammar Corrector
Hello everyone.
A few people around the forum might know that I've been learning Perl. My brain, being big and bold, decided to set itself a challenge: write an English (but potentially any other Latin-alphabet language, with a few modifications) grammar correction script.

Perl has an odd syntax. I don't like it for its cryptic variable names like $$, $_, @_ and others (I think it has one variable for every printable symbol!). For those who do not know what Perl is, I suggest you look it up and read something else: this tutorial is not really for you. It is for the person who knows what they're doing. If you're really keen, go to your library or bookstore and pick up the 'Perl for Dummies, 3rd edition' book. I have to say the book is excellent, because I've learned to do almost everything in Perl, and touched up* on my regular expressions (see below for an explanation of what exactly these are), in just one week.

*OK, I'll level with you. My regex was horrible. Since reading the book I've learned much more, and I have to thank everyone involved in it.

Regular Expressions: Huh?
Regular expressions (or regex, as I will refer to them occasionally) are a pattern matching system. They are one of Perl's bigger strengths among its occasional weaknesses. I won't cover how to write regular expressions in this tutorial, but I will show you an example so you won't be left in the dust.

Think about this: you work for a company. The company wants you to write a search system that also supports more specific searches, like only results that contain one to five letters followed by a number. You could, if you wanted to, list every letter and number combination, but that would be stupid and take up unnecessary space in the final application. Instead, you would use regular expressions. The following regex would do the trick:
Code:
/([a-zA-Z]){1,5}([0-9])+/
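To see what that pattern actually accepts, here is a minimal sketch in Perl. The sample strings, the helper names matches_loose and matches_strict, and the anchored variant in $anchored are assumptions added for illustration and are not part of the original post. Because the original pattern has no anchors, it also matches inside longer strings; the anchored variant shows one possible way to require that the whole string fits.

```perl
use strict;
use warnings;

# The pattern from the post: one to five letters, then one or more digits.
my $pattern = qr/([a-zA-Z]){1,5}([0-9])+/;

# Hypothetical stricter variant: ^ and $ force the whole string to fit.
my $anchored = qr/^[a-zA-Z]{1,5}[0-9]+$/;

# Helper names are made up for this sketch.
sub matches_loose  { return $_[0] =~ $pattern  ? 1 : 0 }
sub matches_strict { return $_[0] =~ $anchored ? 1 : 0 }

for my $candidate ('abc1', 'hello42', 'abcdefgh9', '12345') {
    printf "%-10s loose=%d strict=%d\n",
        $candidate, matches_loose($candidate), matches_strict($candidate);
}
```

Note that 'abcdefgh9' passes the loose check (the pattern finds "defgh9" inside it) but fails the anchored one.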
__________________
User of Ubuntu Linux (Intrepid Ibex 8.10) and Windows XP. Laptop: Core 2 Duo 2.0 GHz / 3 GB RAM / 256 MB Radeon HD 3450 / 1280x800 15.4" LCD WLED / Backlit KB / Ubuntu 8.10 / 250G SATA HDD / Studio 15 Media PC: Celeron 3.06 GHz / 1.5 GB RAM / Intel 915GMA / Windows XP / 250G SATA HDD / Scaleo E Last edited by iTom; 04-14-2007 at 08:33 PM. |
So how does this come together?
Think about this now (yes, again!): don't you think you could write a grammar correction script using regular expressions? I will be providing the whole script as well as snippets of it. In fact, I'll give you the whole script first and then walk you through it, along with any changes that may need to be made. I am releasing this code under the GPL.
Code:
# grammar.pl: Tom's English grammar correction tool (TEGCT)
# Corrects grammar in a string.
# Ok the string is quite horrible
$string = "jerry ( who was a carpenter ) had 12 4x 3 plank s., he\"d bought them the other yesterday from a plank sellin' guy.. he decided to build a 3 lrg houses with the planks . ";
# Print out the intro text...
print "Tom's Grammar Corrector\n\n";
# New string
$newstring = $string;
$tests = $regex = 0;
# First get rid of unnecessary extra punctuation (correct multiple commas, dots, new lines and apostrophes)
print "Correcting extra punctuation... "; $tests++;
$newstring =~ s/\n{3,}/\n\n/; $regex++;
$newstring =~ s/,{2,}/,/; $regex++;
# Two or more dots collapse to one ({1,2} would also match a single dot and change nothing)
$newstring =~ s/\.{2,}/./; $regex++;
$newstring =~ s/'{2,}/'/; $regex++;
print "Done!\n";
# Now get rid of spaces next to the brackets
print "Removing spaces near brackets... "; $tests++;
$newstring =~ s/\(\s/(/; $regex++;
$newstring =~ s/\s\)/)/; $regex++;
print "Done!\n";
# Get rid of spaces left of dots and commas
print "Removing spaces left of dots and commas... "; $tests++;
$newstring =~ s/\s\././; $regex++;
$newstring =~ s/\s,/,/; $regex++;
print "Done!\n";
# Second pass over spaces left of dots (each substitution only fixes the first occurrence it finds)
print "Removing spaces right of dots and commas... "; $tests++;
$newstring =~ s/\s\././; $regex++;
print "Done!\n";
# Change "in'" and "in`" to "ing"
print "Changing \"in'\" to \"ing\"... "; $tests++;
$newstring =~ s/(in'|in`)/ing/; $regex++;
print "Done!\n";
# Change " @ " to "at"
print "Changing \" @ \" to \" at \"... "; $tests++;
$newstring =~ s/ @ / at /; $regex++;
$newstring =~ s/ @/ at /; $regex++;
print "Done!\n";
# Change "9x 3" to "9 x 3"
print "Changing 9x 3, etc. to 9 x 3, etc... "; $tests++;
$newstring =~ s/(\d)x (\d)/$1 x $2/; $regex++;
print "Done!\n";
# Change " i " to " I "
print "Correcting captialisation of i to I... "; $tests++;
$newstring =~ s/( )+i( )+/ I /; $regex++;
print "Done!\n";
# Change "a,b" to "a, b"
print "Correcting comma spacing... "; $tests++;
$newstring =~ s/([a-zA-Z]),([a-zA-Z])/$1, $2/; $regex++;
print "Done!\n";
# Change "un-..." to "un..."
print "Correcting un-... to un... "; $tests++;
$newstring =~ s/un-(.*)/un$1/; $regex++;
print "Done!\n";
# Change "# 10" etc to "#10"
print "Correcting # number formatting... "; $tests++;
$newstring =~ s/# +(\d)/#$1/; $regex++;
print "Done!\n";
# Correcting minor grammar errors...
print "Correcting minor grammar errors... "; $tests++;
$newstring =~ s/\.!/!/; $regex++;
$newstring =~ s/\.\?/?/; $regex++;
$newstring =~ s/,\?/?/; $regex++;
$newstring =~ s/,!/!/; $regex++;
$newstring =~ s/!\./!/; $regex++;
$newstring =~ s/\?\./?/; $regex++;
$newstring =~ s/\?,/?/; $regex++;
$newstring =~ s/!,/!/; $regex++;
$newstring =~ s/ !/!/; $regex++;
$newstring =~ s/ \?/?/; $regex++;
$newstring =~ s/!(\w)/! $1/; $regex++;
$newstring =~ s/\?(\w)/? $1/; $regex++;
print "Done!\n";
# Correcting punctuation near quotes...
print "Correcting punctuaction near quotes... "; $tests++;
$newstring =~ s/"\?/?"/; $regex++;
$newstring =~ s/"!/!"/; $regex++;
print "Done!\n";
# Changing the 'let"s', etc. grammar error...
print "Change the 'let\"s', etc. grammar error... "; $tests++;
$newstring =~ s/([A-Za-z])"(s|d)/$1'$2/; $regex++;
print "Done!\n";
# Changing the "him / her self" error
print "Changing the 'him / her / it / my self' error... "; $tests++;
$newstring =~ s/(him|her|it|my)( |-)self/$1self/; $regex++;
print "Done!\n";
# Changing "how ever" to "however"
print "Changing 'how ever' to 'however'... "; $tests++;
$newstring =~ s/how(\s|-)ever/however/; $regex++;
print "Done!\n";
# Correcting spacing
print "Correcting spacing... "; $tests++;
$newstring =~ s/ {2,}/ /; $regex++;
print "Done!\n";
# Correcting ',.' to '.', etc.
print "Correcting ',.' to '.', etc... "; $tests++;
$newstring =~ s/,\././; $regex++;
print "Done!\n";
# Correcting '50 p' to '50p', etc.
print "Correcting '50 p' to '50p', etc... "; $tests++;
$newstring =~ s/([0-9]) p( |\.|,)/$1p$2/; $regex++;
print "Done!\n";
# Fix multiple question marks and exclamation marks
print "Fix multiple question marks and exclaimation marks... "; $tests++;
$newstring =~ s/(!|\?){2,}/$1/; $regex++;
print "Done!\n";
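One detail of the script above is worth illustrating: none of its substitutions use the /g modifier, so each one only changes the first match it finds. The following is a minimal sketch of the difference, using a made-up sample string that is not from the original post.

```perl
use strict;
use warnings;

# Hypothetical sample text with two runs of repeated commas.
my $text = "one,, two,,, three";

# Without /g only the first match is replaced.
(my $first_only = $text) =~ s/,{2,}/,/;

# With /g every match is replaced.
(my $all = $text) =~ s/,{2,}/,/g;

print "first only: $first_only\n";   # first only: one, two,,, three
print "all:        $all\n";          # all:        one, two, three
```

Adding /g everywhere is not automatically safe, though. A loose pattern such as `(\w) s` would, if applied globally, also turn "plank sellin'" into "planksellin'", so each rule would need to be checked before making it global.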
__________________
Last edited by iTom; 04-15-2007 at 02:43 PM.
The code has to be broken up to be posted. Here is part 2.
Code:
# Fix common grammar errors
print "Fix common grammar errors... "; $tests++;
$newstring =~ s/other yesterday/other day/; $regex++;
$newstring =~ s/yester-day/yesterday/; $regex++;
$newstring =~ s/\.,/./; $regex++;
$newstring =~ s/(\w) s/$1s/; $regex++;
$newstring =~ s/\.{2}/./; $regex++;
print "Done!\n";
# Fix abbreviations of lrg, sml, etc.
print "Fix abbreviations and common misspellings of lrg, sml, etc... "; $tests++;
$newstring =~ s/lrg/large/; $regex++;
$newstring =~ s/med /medium /; $regex++;
$newstring =~ s/sml/small/; $regex++;
$newstring =~ s/larg /large /; $regex++;
$newstring =~ s/smal /small /; $regex++;
print "Done!\n";
# Fix "build a 3", "read a 4", etc.
print "Fix 'build a 3 ', 'read a 4', etc... "; $tests++;
$newstring =~ s/a ([0-9])/$1/; $regex++;
print "Done!\n";
# Fix capitalization
print "Fix capitalization... "; $tests++;
@s = split(/\. /, $newstring); $stringnew = ""; $regex++;
foreach $s (@s) {
$stringnew .= ucfirst($s) . ". ";
}
$newstring = $stringnew;
print "Done!\n";
# Fix certain currency spacing
print "Fix certain currency spacing... "; $tests++;
$newstring =~ s/([\$£€]) ([0-9])/$1$2/; $regex++;
print "Done!\n";
# Fix percentage spacing
print "Fix percentage spacing... "; $tests++;
$newstring =~ s/([0-9]) +%/$1%/; $regex++;
print "Done!\n";
# Fix incorrect use of the percentage sign
print "Fix incorrect use of the percentage sign... "; $tests++;
$newstring =~ s/%( )([0-9])/$2%/; $regex++;
$newstring =~ s/%%/%/; $regex++;
print "Done!\n";
# Change '!.' to '!'
print "Change `!.' to `!'... "; $tests++;
$newstring =~ s/!\./!/; $regex++;
print "Done!\n";
# Print new lines and test info
print "\n$tests types of tests & corrections performed ($regex regular expressions). \n\n";
open(FILE, '>', 'grammar.txt') or die "Cannot open grammar.txt: $!";
print "Original string: \n$string\n\n";
print "New string: \n$newstring\n\n";
print FILE $newstring;
close(FILE);
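The "Fix capitalization" step above splits on ". " and then appends ". " to every piece, including the last one. As an alternative, here is a minimal sketch that capitalizes sentence starts with a single substitution instead. The function name capitalize_sentences and the sample text are assumptions for illustration, and this is not the method the script uses. It would also capitalize the word after an abbreviation such as "e.g. ", so treat it as a starting point only.

```perl
use strict;
use warnings;

# capitalize_sentences() is a hypothetical helper, not part of the script.
sub capitalize_sentences {
    my ($text) = @_;
    # Uppercase a lowercase letter at the very start of the text, or one
    # that follows '.', '!' or '?' plus whitespace. \u uppercases the next
    # character of the replacement.
    $text =~ s/(^|[.!?]\s+)([a-z])/$1\u$2/g;
    return $text;
}

print capitalize_sentences("jerry had planks. he built houses! they stood."), "\n";
# prints: Jerry had planks. He built houses! They stood.
```

Because the pattern requires whitespace after the punctuation mark, a number such as 12.50 is left alone.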
__________________
Last edited by iTom; 04-15-2007 at 02:04 PM.
The String
I've used the string assigned to $string at the top of the script, which will make most English teachers die of internal hemorrhaging around the heart at the sight of it, let alone the mention. I can't disagree. I hate bad grammar. I decided to give people no choice: anyone can integrate this script into their own, and hopefully people will start using it and distributing it, ad infinitum. You can modify the string however you want.
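If you do want to integrate the corrections into another program, one possible approach is to collect the substitutions into a table and apply them from a function. This is a minimal sketch under stated assumptions: the name correct_grammar, the @rules table layout, and the sample sentence are all made up for illustration, and only a handful of the script's rules are reproduced.

```perl
use strict;
use warnings;

# Hypothetical rule table: each entry is [pattern, replacement-sub].
# The replacement sub receives the first two capture groups, if any.
my @rules = (
    [ qr/\(\s/,            sub { '(' } ],               # "( x" -> "(x"
    [ qr/\s\)/,            sub { ')' } ],               # "x )" -> "x)"
    [ qr/(\d)x (\d)/,      sub { "$_[0] x $_[1]" } ],   # "4x 3" -> "4 x 3"
    [ qr/how(?:\s|-)ever/, sub { 'however' } ],         # "how ever" -> "however"
    [ qr/ {2,}/,           sub { ' ' } ],               # collapse repeated spaces
);

# correct_grammar() is a made-up name. It applies each rule once, in order,
# mirroring the first-match-only behaviour of the original script.
sub correct_grammar {
    my ($text) = @_;
    for my $rule (@rules) {
        my ($pattern, $replace) = @$rule;
        $text =~ s/$pattern/$replace->($1, $2)/e;
    }
    return $text;
}

print correct_grammar("jerry ( a carpenter ) had 4x 3 planks"), "\n";
# prints: jerry (a carpenter) had 4 x 3 planks
```

Keeping the rules in a table means the calling program can add, remove, or reorder corrections without touching the loop that applies them.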
Executing the script produced the following results.
Code:
C:\Perl>grammar.pl
Tom's Grammar Corrector

Correcting extra punctuation... Done!
Removing spaces near brackets... Done!
Removing spaces left of dots and commas... Done!
Removing spaces right of dots and commas... Done!
Changing "in'" to "ing"... Done!
Changing " @ " to " at "... Done!
Changing 9x 3, etc. to 9 x 3, etc... Done!
Correcting captialisation of i to I... Done!
Correcting comma spacing... Done!
Correcting un-... to un... Done!
Correcting # number formatting... Done!
Correcting minor grammar errors... Done!
Correcting punctuaction near quotes... Done!
Change the 'let"s', etc. grammar error... Done!
Changing the 'him / her / it / my self' error... Done!
Changing 'how ever' to 'however'... Done!
Correcting spacing... Done!
Correcting ',.' to '.', etc... Done!
Correcting '50 p' to '50p', etc... Done!
Fix multiple question marks and exclaimation marks... Done!
Fix multiple spaces... Done!
Fix common grammar errors... Done!
Fix abbreviations and common misspellings of lrg, sml, etc... Done!
Fix 'build a 3 ', 'read a 4', etc... Done!
Fix capitalization... Done!
Fix certain currency spacing... Done!
Fix percentage spacing... Done!
Fix incorrect use of the percentage sign... Done!
Change `!.' to `!'... Done!

28 types of tests & corrections performed (56 regular expressions).

Original string:
jerry ( who was a carpenter ) had 12 4x 3 plank s., he"d bought them the other yesterday from a plank sellin' guy @ $ 12 .50 a pop.. he decided to build a 3 lrg houses with the planks !!!!!!

New string:
Jerry (who was a carpenter) had 12 4 x 3 planks. He'd bought them the other day from a plank selling guy at $12.50 a pop. He decided to build 3 large houses with the planks!

C:\Perl>
__________________
Last edited by iTom; 04-15-2007 at 02:03 PM.
So, what exactly does this correct?
Maybe you're wondering what this corrects. Maybe not. If you are, I invite you to read this section where I walk through every test and correction performed in detail.