Lissa Explains it All:  Web Design Forums  

  #1  
Old 04-14-2007, 08:28 PM
iTom iTom is offline
Linux + web geek
 
Join Date: May 2005
Posts: 1,592
iTom is on a distinguished road
A Perl Grammar Corrector

Hello everyone.

A few people around the forum might know that I've been learning Perl. My brain, being big and bold, decided to set itself a challenge: write an English (but potentially any other Latin-alphabet language, with a few modifications) grammar correction script.

Perl has an odd syntax. I don't like its cryptic variable names such as $$, $_, @_ and others (I think it has one variable for every printable symbol!). If you do not know what Perl is, I suggest you look it up and read something else: this tutorial is not really for you. It is for the person who knows what they're doing. If you're really keen, go to your library or bookstore and pick up the 'Perl for Dummies, 3rd edition' book. I have to say the book is excellent, because in just one week I've learned to do almost everything in Perl and touched up* on my Regular Expressions (see below for an explanation of what exactly these are).

*OK, I'll level with you. My regex was horrible. Since reading the book I've learned much more, and I have to thank everyone involved in it.

Regular Expressions: Huh?

Regular expressions (or regex, as I will refer to them occasionally) are a pattern matching system. They are one of Perl's bigger strengths among its occasional weaknesses.

I won't cover how to do regular expressions in this tutorial, but I will show you an example so you won't be left in the dust.

Think about this: You work for a company. The company wants you to write a searching system, but integrate specific other searches, like only results that contain one to five letters followed by a number. You could, if you wanted to, list every letter and number combination, but that would be stupid and take up unnecessary space in the final application. Instead, you would use regular expressions. The following regex would do the trick:

Code:
/([a-zA-Z]){1,5}([0-9])+/
Simpler, eh?
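
To see what that pattern accepts, here is a small sketch. The search terms and the two helper subs (matches_loose, matches_strict) are made up for illustration, and the anchored variant is my own assumption about what a stricter search might want, not part of the original example.

```perl
use strict;
use warnings;

# The pattern from the post: one to five letters, then one or more digits.
# Without anchors it may match anywhere inside a longer string.
my $loose = qr/([a-zA-Z]){1,5}([0-9])+/;

# Hypothetical stricter variant: ^ and $ force the whole string to fit.
my $strict = qr/^[a-zA-Z]{1,5}[0-9]+$/;

sub matches_loose  { my ($s) = @_; return $s =~ $loose  ? 1 : 0 }
sub matches_strict { my ($s) = @_; return $s =~ $strict ? 1 : 0 }

# Made-up search terms, just to exercise both patterns.
for my $term ('abc1', 'abcdefgh1', '12345', 'hello 42') {
    printf "%-10s loose=%d strict=%d\n",
        $term, matches_loose($term), matches_strict($term);
}
```

Note that 'abcdefgh1' passes the loose pattern (it contains "defgh1") but fails the anchored one, so which variant you want depends on whether the whole search term has to fit the shape.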
__________________
User of Ubuntu Linux (Intrepid Ibex 8.10) and Windows XP.
Laptop: Core 2 Duo 2.0 GHz / 3 GB RAM / 256 MB Radeon HD 3450 / 1280x800 15.4" LCD WLED / Backlit KB / Ubuntu 8.10 / 250G SATA HDD / Studio 15
Media PC: Celeron 3.06 GHz / 1.5 GB RAM / Intel 915GMA / Windows XP / 250G SATA HDD / Scaleo E

Last edited by iTom; 04-14-2007 at 08:33 PM.
  #2  
Old 04-14-2007, 08:30 PM
So how does this come together?

Think about this now (yes, again!): don't you think you could write a grammar correction script using regular expressions? I will be providing the whole script and snippets of it. In fact, I'll give you the whole script first and then walk you through it and the changes that may need to be made. I am releasing this code under the GPL.

Code:
# grammar.pl: Tom's English grammar correction tool (TEGCT)
#  Corrects grammar in a string. 
 
# OK, the string is quite horrible
$string = "jerry  ( who was a carpenter ) had 12 4x 3 plank s., he\"d bought them the other yesterday from a plank sellin' guy \@ \$ 12 .50 a pop.. he decided to build a 3 lrg houses with the planks !!!!!!";
 
# Print out the intro text... 
print "Tom's Grammar Corrector\n\n";
 
# New string
$newstring = $string;
$tests = $regex = 0;
 
# First get rid of unnecessary extra punctuation (correct multiple commas, dots, new lines and apostrophes)
# The /g modifier makes each substitution apply to every match, not just the first one
print "Correcting extra punctuation... "; $tests++;
$newstring =~ s/\n{3,}/\n\n/g; $regex++;
$newstring =~ s/,{2,}/,/g; $regex++;
$newstring =~ s/\.{2,}/./g; $regex++;   # two or more dots become one
$newstring =~ s/'{2,}/'/g; $regex++;
print "Done!\n";
 
# Now get rid of spaces next to the brackets
print "Removing spaces near brackets... "; $tests++;
$newstring =~ s/\(\s+/(/g; $regex++;
$newstring =~ s/\s+\)/)/g; $regex++;
print "Done!\n";

# Get rid of spaces left of dots and commas
print "Removing spaces left of dots and commas... "; $tests++;
$newstring =~ s/\s+\././g; $regex++;
$newstring =~ s/\s+,/,/g; $regex++;
print "Done!\n";

# Get rid of spaces right of dots
# Note: this pattern is the same as the one above, so it removes spaces
# to the LEFT of a dot again; it does not touch spaces to the right.
print "Removing spaces right of dots and commas... "; $tests++;
$newstring =~ s/\s+\././g; $regex++;
print "Done!\n";
 
# Change "in'" and "in`" to "ing"
print "Changing \"in'\" to \"ing\"... "; $tests++;
$newstring =~ s/in['`]/ing/g; $regex++;
print "Done!\n";

# Change " @ " to "at"
print "Changing \" @ \" to \" at \"... "; $tests++;
$newstring =~ s/ @ / at /g; $regex++;
$newstring =~ s/ @/ at /g; $regex++;
print "Done!\n";

# Change "9x 3" to "9 x 3"
print "Changing 9x 3, etc. to 9 x 3, etc... "; $tests++;
$newstring =~ s/(\d)x (\d)/$1 x $2/g; $regex++;
print "Done!\n";

# Change " i " to " I "
print "Correcting captialisation of i to I... "; $tests++;
$newstring =~ s/ +i +/ I /g; $regex++;
print "Done!\n";

# Change "a,b" to "a, b"
# The lookahead leaves the second letter unconsumed, so "a,b,c" becomes "a, b, c"
print "Correcting comma spacing... "; $tests++;
$newstring =~ s/([a-zA-Z]),(?=[a-zA-Z])/$1, /g; $regex++;
print "Done!\n";

# Change "un-..." to "un..."
# \b stops words that merely end in "un" (such as "run-time") from matching
print "Correcting un-... to un... "; $tests++;
$newstring =~ s/\bun-(?=\w)/un/g; $regex++;
print "Done!\n";

# Change "# 10" etc to "#10"
print "Correcting # number formatting... "; $tests++;
$newstring =~ s/# +(\d)/#$1/g; $regex++;
print "Done!\n";
 
# Correcting minor grammar errors...
print "Correcting minor grammar errors... "; $tests++;
$newstring =~ s/\.!/!/g; $regex++;
$newstring =~ s/\.\?/?/g; $regex++;
$newstring =~ s/,\?/?/g; $regex++;
$newstring =~ s/,!/!/g; $regex++;     # ",!" becomes "!"
$newstring =~ s/!\./!/g; $regex++;
$newstring =~ s/\?\./?/g; $regex++;
$newstring =~ s/\?,/?/g; $regex++;
$newstring =~ s/!,/!/g; $regex++;     # "!," becomes "!"
$newstring =~ s/ +!/!/g; $regex++;
$newstring =~ s/ +\?/?/g; $regex++;
$newstring =~ s/!(\w)/! $1/g; $regex++;
$newstring =~ s/\?(\w)/? $1/g; $regex++;
print "Done!\n";
 
# Correcting punctuation near quotes...
print "Correcting punctuaction near quotes... "; $tests++;
$newstring =~ s/"\?/?"/g; $regex++;
$newstring =~ s/"!/!"/g; $regex++;
print "Done!\n";

# Changing the 'let"s', etc. grammar error...
# \b makes sure the s or d ends the word, so '"hello"said' is left alone
print "Change the 'let\"s', etc.  grammar error... "; $tests++;
$newstring =~ s/([A-Za-z])"([sd])\b/$1'$2/g; $regex++;
print "Done!\n";

# Changing the "him / her self" error
print "Changing the 'him / her / it / my self' error... "; $tests++;
$newstring =~ s/\b(him|her|it|my)[ -]self\b/$1self/g; $regex++;
print "Done!\n";

# Changing "how ever" to "however"
print "Changing 'how ever' to 'however'... "; $tests++;
$newstring =~ s/\bhow[\s-]ever\b/however/g; $regex++;
print "Done!\n";

# Correcting spacing
print "Correcting spacing... "; $tests++;
$newstring =~ s/ {2,}/ /g; $regex++;
print "Done!\n";

# Correcting ',.' to '.', etc.
print "Correcting ',.' to '.', etc... "; $tests++;
$newstring =~ s/,\././g; $regex++;
print "Done!\n";

# Correcting '50 p' to '50p', etc.
print "Correcting '50 p' to '50p', etc... "; $tests++;
$newstring =~ s/([0-9]) p(?=[ .,])/$1p/g; $regex++;
print "Done!\n";

# Fix multiple question marks and exclamation marks
# \1 refers back to the captured mark, so "!!!!????" becomes "!?"
print "Fix multiple question marks and exclaimation marks... "; $tests++;
$newstring =~ s/([!?])\1+/$1/g; $regex++;
print "Done!\n";
The code is continued in the next post.
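
A quick aside on the /g modifier, because it matters a lot for a script like this: without it, s/// changes only the first match in the string and leaves the rest alone. The two subs below (fix_commas_once and fix_commas_all) are hypothetical helpers written just to show the difference; they are not part of grammar.pl.

```perl
use strict;
use warnings;

# Replace only the first run of repeated commas (no /g modifier).
sub fix_commas_once {
    my ($text) = @_;
    $text =~ s/,{2,}/,/;
    return $text;
}

# Replace every run of repeated commas (/g modifier).
sub fix_commas_all {
    my ($text) = @_;
    $text =~ s/,{2,}/,/g;
    return $text;
}

my $sample = 'one,, two,,, three';
print fix_commas_once($sample), "\n";   # prints: one, two,,, three
print fix_commas_all($sample),  "\n";   # prints: one, two, three
```

If a correction seems to work on your test string but misses later occurrences in longer text, a missing /g is the first thing to check.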

Last edited by iTom; 04-15-2007 at 02:43 PM.
  #3  
Old 04-14-2007, 08:31 PM
The code has to be broken up to be posted. Here is part 2.

Code:
# Fix common grammar errors
print "Fix common grammar errors... "; $tests++;
$newstring =~ s/other yesterday/other day/g; $regex++;
$newstring =~ s/yester-day/yesterday/g; $regex++;
$newstring =~ s/\.,/./g; $regex++;
$newstring =~ s/(\w) s\b/$1s/g; $regex++;   # "plank s" becomes "planks"; \b keeps "plank selling" intact
$newstring =~ s/\.{2}/./g; $regex++;
print "Done!\n";

# Fix abbreviations of lrg, sml, etc.
# \b keeps the patterns from matching inside longer words (e.g. "named ")
print "Fix abbreviations and common misspellings of lrg, sml, etc... "; $tests++;
$newstring =~ s/\blrg\b/large/g; $regex++;
$newstring =~ s/\bmed /medium /g; $regex++;
$newstring =~ s/\bsml\b/small/g; $regex++;
$newstring =~ s/\blarg /large /g; $regex++;
$newstring =~ s/\bsmal /small /g; $regex++;
print "Done!\n";

# Fix "build a 3", "read a 4", etc.
print "Fix 'build a 3 ', 'read a 4', etc... "; $tests++;
$newstring =~ s/\ba ([0-9])/$1/g; $regex++;
print "Done!\n";

# Fix capitalization
print "Fix capitalization... "; $tests++;
@s = split(/\. /, $newstring); $stringnew = ""; $regex++;
foreach $s (@s) {
	$stringnew .= ucfirst($s) . ". ";
}
$newstring = $stringnew;
print "Done!\n";

# Fix certain currency spacing
print "Fix certain currency spacing... "; $tests++;
$newstring =~ s/([\$£€]) +([0-9])/$1$2/g; $regex++;
print "Done!\n";

# Fix percentage spacing
print "Fix percentage spacing... "; $tests++;
$newstring =~ s/([0-9]) +%/$1%/g; $regex++;   # "50 %" becomes "50%"
print "Done!\n";

# Fix incorrect use of the percentage sign
# The lookbehind skips a % that already follows a number, as in "50% 20"
print "Fix incorrect use of the percentage sign... "; $tests++;
$newstring =~ s/(?<!\d)% (\d+)/$1%/g; $regex++;   # "% 50" becomes "50%"
$newstring =~ s/%{2,}/%/g; $regex++;
print "Done!\n";

# Change '!.' to '!'
print "Change `!.' to `!'... "; $tests++;
$newstring =~ s/!\./!/g; $regex++;
print "Done!\n";

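# Print new lines and test info
print "\n$tests types of tests & corrections performed ($regex regular expressions). \n\n";
print "Original string: \n$string\n\n";
print "New string: \n$newstring\n\n";

# Save the corrected string to grammar.txt; stop with an error if the file cannot be opened
open(FILE, '>', 'grammar.txt') or die "Cannot open grammar.txt: $!";
print FILE $newstring;
close(FILE);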
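
If the long run of near-identical lines bothers you, the same idea can be sketched as a table of rules that one loop applies in order. This is only a rough sketch under my own assumptions: the @rules table, the correct() sub and the sample sentence are invented for illustration, it covers just five of the simpler corrections, and it only handles fixed replacement text (no $1-style captures).

```perl
use strict;
use warnings;

# Each rule: [ description, pattern, fixed replacement text ].
# The order matters, just as it does in the full script.
my @rules = (
    [ 'repeated commas',          qr/,{2,}/,       ','  ],
    [ 'space before dot or comma', qr/\s+(?=[.,])/, ''   ],
    [ 'space after "("',          qr/\(\s+/,       '('  ],
    [ 'space before ")"',         qr/\s+\)/,       ')'  ],
    [ 'repeated spaces',          qr/ {2,}/,       ' '  ],
);

# Apply every rule to the text and count how many substitutions were made.
sub correct {
    my ($text) = @_;
    my $applied = 0;
    for my $rule (@rules) {
        my ($name, $pattern, $replacement) = @$rule;
        # s///g returns the number of substitutions, or an empty string if none
        my $count = ($text =~ s/$pattern/$replacement/g);
        $applied += $count || 0;
    }
    return ($text, $applied);
}

my ($fixed, $count) = correct('jerry  ( who was a carpenter ) said hi ,, bye .');
print "$fixed\n";
print "$count substitutions made\n";
```

Adding a correction then means adding one line to the table rather than copying a block of print and counter statements.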

Last edited by iTom; 04-15-2007 at 02:04 PM.
  #4  
Old 04-14-2007, 08:34 PM
The String

I've used the following string, which will make most English teachers die of internal hemorrhaging around the heart at the sight, let alone the mention. I can't disagree. I hate bad grammar. I decided to give people no choice: anyone can integrate this script into theirs, and hopefully people will start using this and distributing it, ad infinitum. You can modify this string whichever way you want.

Quote:
jerry ( who was a carpenter ) had 12 4x 3 plank s., he"d bought them the other yesterday from a plank sellin' guy @ $ 12 .50 a pop.. he decided to build a 3 lrg houses with the planks !!!!!!
The Results?

The following results were produced when executing the script.

Quote:
Jerry (who was a carpenter) had 12 4 x 3 planks. He'd bought them the other day from a plank selling guy at $12.50 a pop. He decided to build 3 large houses with the planks!
The full output also allows you to debug and read some exciting information.

Code:
C:\Perl>grammar.pl
Tom's Grammar Corrector

Correcting extra punctuation... Done!
Removing spaces near brackets... Done!
Removing spaces left of dots and commas... Done!
Removing spaces right of dots and commas... Done!
Changing "in'" to "ing"... Done!
Changing " @ " to " at "... Done!
Changing 9x 3, etc. to 9 x 3, etc... Done!
Correcting captialisation of i to I... Done!
Correcting comma spacing... Done!
Correcting un-... to un... Done!
Correcting # number formatting... Done!
Correcting minor grammar errors... Done!
Correcting punctuaction near quotes... Done!
Change the 'let"s', etc.  grammar error... Done!
Changing the 'him / her / it / my self' error... Done!
Changing 'how ever' to 'however'... Done!
Correcting spacing... Done!
Correcting ',.' to '.', etc... Done!
Correcting '50 p' to '50p', etc... Done!
Fix multiple question marks and exclaimation marks... Done!
Fix multiple spaces... Done!
Fix common grammar errors... Done!
Fix abbreviations and common misspellings of lrg, sml, etc... Done!
Fix 'build a 3  ', 'read a 4', etc... Done!
Fix capitalization... Done!
Fix certain currency spacing... Done!
Fix percentage spacing... Done!
Fix incorrect use of the percentage sign... Done!
Change `!.' to `!'... Done!

28 types of tests & corrections performed (56 regular expressions).

Original string:
jerry  ( who was a carpenter ) had 12 4x 3 plank s., he"d bought them the other
yesterday from a plank sellin' guy @ $ 12 .50 a pop.. he decided to build a 3 lr
g houses with the planks !!!!!!

New string:
Jerry (who was a carpenter) had 12 4 x 3 planks. He'd bought them the other day
from a plank selling guy at $12.50 a pop. He decided to build 3 large houses wit
h the planks!


C:\Perl>

Last edited by iTom; 04-15-2007 at 02:03 PM.
  #5  
Old 04-15-2007, 02:41 PM
So, what exactly does this correct?

Maybe you're wondering what this corrects. Maybe not. If you are, I invite you to read this section where I walk through every test and correction performed in detail.
  1. Test 1 is complicated. Oh, you don't want to hear that. It corrects extra punctuation in the string, so "hello,, world" is turned into "hello, world". It does the same with dots and new lines, and two single quotes are turned into one quote.
  2. Test 2 removes the spaces next to the brackets. "hello ( world )" becomes "hello (world)". It does this for both the opening and the closing bracket.
  3. Test 3 removes spaces left of dots and commas. "hello , world!" becomes "hello, world!".
  4. Test 4 corrects a sort of slang: it corrects in' to ing. So, "i'm walkin' home" is corrected to "i'm walking home."
  5. Test 5 corrects " @ " to " at ". Notice the spaces. They are there because the test would otherwise mangle email addresses.
  6. Test 6 corrects "9x 3" to "9 x 3" and other variants. It does not correct "9 x3" because that could be 'the number 9 times 3'.
  7. Test 7 corrects the capitalization of "i" when it is used as a personal pronoun.
  8. Test 8 fixes spacing around commas. "a,b" becomes "a, b".
  9. Test 9 corrects "un-expectedly" (and others) to "unexpectedly". The 'un-' should be 'un'.
  10. Test 10 corrects the # spacing near the number. Example - "I ordered product ID # 3006" should be "I ordered product ID #3006".
  11. Test 11 is a monster. It corrects a variety of misuses of the punctuation. It corrects .! to !, and other misuses.
  12. Test 12 corrects a common mistake - '"Hello"! said Tom' will be corrected to '"Hello!" said Tom', and the same goes for the question mark.
  13. Test 13 corrects a not-so-common error, using the double quote instead of the single quote for 'let"s' and other similar words.
  14. Test 14 corrects another not-so-common error like test 13. It corrects spacing with 'him self', 'her self', 'my self' and 'it self'. The space between the him, etc. and the self part should not exist.
  15. Test 15 again corrects a not-so-common error. It corrects "how ever" to "however".
  16. Test 16 corrects spacing; however, it is slightly glitchy. It works but may not always fix spacing.
  17. Test 17 corrects ",." to ".".
  18. Test 18 corrects the incorrect use of the "<n> p" (to indicate British Pence). You can remove this if you don't want it. For example "50 p" will be corrected to "50p".
  19. Test 19 corrects multiple question marks and exclamation marks. "hello, world!!!!????" is corrected to "hello, world!?".
  20. Test 20 fixes a variety of common grammar errors which are listed below.
    • 'other yesterday' is changed to 'other day'.
    • 'yester-day' is changed to 'yesterday'.
    • '.,' is changed to '.'.
    • 'plank s' and other variants with a space before the s are corrected.
    • Two dots converted to one.
  21. Test 21 fixes abbreviations of large, medium and small.
  22. Test 22 fixes the "... a 4" error, i.e. "read a 4 books" is corrected to "read 4 books".
  23. Test 23 is a bit of a monster. It fixes up the capitalization. This is done later to avoid confusion and conflict with other functions. It works by splitting up all sentences by a dot and a space. A dot without a space will not be registered as a sentence (since there is the possibility of it being a decimal point or an email address).
  24. Test 24 corrects spacing near certain currency.
  25. Test 25 corrects percentage spacing.
  26. Test 26 corrects incorrect use of the percentage sign.
  27. Test 27 corrects '!.' to '!'.
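
The capitalization step (test 23) is the one most worth experimenting with. As an alternative sketch, and this is a swapped-in technique rather than the split-and-rejoin approach the script uses, a pair of substitutions can capitalize in place. The capitalise_sentences sub is a hypothetical helper; it assumes a sentence starts with a lowercase a-z letter after ".", "!" or "?" followed by whitespace.

```perl
use strict;
use warnings;

# Capitalise the first letter of the text and of each sentence.
# "12.50" and "a@b.com" are left alone because no whitespace follows the dot.
sub capitalise_sentences {
    my ($text) = @_;
    # First letter of the whole text (leading whitespace is preserved)
    $text =~ s/^(\s*)([a-z])/$1\u$2/;
    # First letter after sentence-ending punctuation plus whitespace;
    # \u uppercases the next character of the replacement
    $text =~ s/([.!?]\s+)([a-z])/$1\u$2/g;
    return $text;
}

print capitalise_sentences('jerry paid $12.50 a pop. he left! why not?'), "\n";
# prints: Jerry paid $12.50 a pop. He left! Why not?
```

Because nothing is split and rejoined, a sentence that ends in "!" or "?" keeps its own punctuation and no extra ". " is appended at the end of the text.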