This lesson is essentially an extended defense of the CGI.pm module. In it, we take a diversion from learning CGI programming, and instead go "under the hood". If you are already convinced of the virtues of CGI.pm, you can skip this lesson, but it may be worth reading anyway. By the time you are finished with this lesson, you should be able to glance at most "hand-rolled" methods of parsing CGI data and explain how they are broken.
If you've been messing around with Perl long enough, then you already know the mantra: "There is more than one way to do it" (TIMTOWTDI). So why does everyone "in the know" say you shouldn't be writing CGI scripts without CGI.pm? Shouldn't we be able to do CGI handling our own way, if we want to? The answer, quite simply, is that most scripts which attempt to parse CGI data are horribly broken. Even those in published books are often broken. To understand why, we need to understand how CGI data is encoded.
Here is a little "behind the scenes" information you need in order to understand the following discussion. It is somewhat oversimplified, but you will have what you need to know for basic CGI programming. If you'd like to learn more, see the World Wide Web Consortium's documentation on HTTP.
Here's how the web works in a nutshell: your browser makes a request, and a web server figures out what to do with it and eventually sends information back to the browser. Sometimes the server passes the information your browser sent to a CGI script; that script creates a response, and the web server returns the response to you. Pretty simple, huh? For our purposes, we'll only need to worry about two types of requests: GET and POST. There are a variety of other request methods, such as HEAD, PUT, TRACE, etc., but we're pretty much going to stick to the basics. How the CGI script responds to a browser request depends in part upon the request method.
Most of the time, when your browser requests a web page, it's issuing a GET request. What it's doing behind the scenes is sending out something like the following:
GET /~ovid/index.html HTTP/1.1
Host: perlmonk.org
more stuff...
For this example, the browser is requesting Ovid's home page on perlmonk.org. The web server receives the GET request and sends back the appropriate document, if found. If the document is not found, you will probably receive the ever-popular "404 Not Found" error.
Every once in a while, you'll see a URL like the following:
http://www.somedomain.com/cgi-bin/somescript.cgi?name=BillGates&position=monopolist
Breaking this down, we have the host: www.somedomain.com, the path: /cgi-bin/, and the name of a CGI script, in this case, somescript.cgi. But what's that stuff afterwards? The question mark signifies the beginning of what's called a query string (in your Perl program, you could access this through $ENV{'QUERY_STRING'}). In the case of a GET request, this string is embedded in the URL, which makes it easy to modify and/or bookmark. The "name=BillGates" and "position=monopolist" are referred to as name/value pairs and they are separated by ampersands (&). These name/value pairs are typically used by CGI scripts to determine how the script should respond to the browser request.
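To make the breakdown above concrete, here is a minimal sketch that pulls the host, path, and query string out of that URL. The helper name split_url is my own invention for illustration, and the regular expression is deliberately simple; it is not a complete URL parser.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): break a URL into
# host, path, and query string with a deliberately simple regex.
sub split_url {
    my ($url) = @_;
    my ($host, $path, $query) =
        $url =~ m{^\w+://([^/?#]+)([^?#]*)(?:\?([^#]*))?}
        or return;
    return {
        host  => $host,
        path  => $path,
        query => defined $query ? $query : '',   # no "?" means no query string
    };
}

my $parts = split_url(
    'http://www.somedomain.com/cgi-bin/somescript.cgi?name=BillGates&position=monopolist'
);
printf "%-6s %s\n", $_, $parts->{$_} for qw(host path query);
```

Inside a real CGI script you would not need any of this, since the web server hands you the query string in $ENV{'QUERY_STRING'}.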
The example above is simple, but often, these get very complicated. Here's the URL for deciphering a "geek code", with a rather complicated query string (I've broken it into separate lines so it doesn't force horizontal scrolling):
http://www.ebb.org/cgi-bin/ungeek.cgi?geekCode=++GCS+d+s%3A+a+
C%2B%2B+UL%2B+P%2B%2B%24+L%2B+%21E+W%2B%2B+N%2B+o%3F+%0D%0A++K
-+w%2B+%21O+M-+%21V+PS%2B%2B+PE+Y%2B+PGP+%21t+%215+%21X+%0D%0A
++R+tv--+b%2B%2B%2B+DI%2B%2B+D---+G+e%2B+h+r%2B+y%2B%2B
If you're curious, you can follow that URL to see where it takes you.
The URL above can be hardcoded into an anchor tag (<a href="http://www.ebb.org/cgi-bin/ungeek.cgi?geekCode=++GCS...">), or it can be created with a form, which we'll discuss later in this course. When you see a URL like that, you know that it's probably passing that information off to a CGI script somewhere for processing. One advantage of a GET request is that it is easy to bookmark.
What the browser does to create that query string is fairly straightforward. It takes all of your data and converts some characters which might have a special meaning in a query string (equal signs, ampersands, percent signs and others) into their hexadecimal equivalent. These values have a percent sign in front of them. For example, an exclamation point becomes %21. Spaces should be converted to %20, but are often converted to plus signs. The name/value pairs are joined with an equals sign and the pairs are joined with ampersands. For a complete list of how characters are converted, see Appendix 2.
For example, if you have the following HTML for an input box: <input type="text" name="Enter something"> and the user types in "Ovid deserves an A+ for this course!" (without the quotes), the resulting name/value pair in the query string is:
Enter%20something=Ovid%20deserves%20an%20A%2B%20for%20this%20course%21
Alternatively, if the spaces are replaced with plusses, it would appear as follows:
Enter+something=Ovid+deserves+an+A%2B+for+this+course%21
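The encoding rules above can be sketched in a few lines of Perl. The three subroutine names below are hypothetical, and the sketch assumes plain ASCII input; it is meant only to show the mechanics, not to replace a tested module.

```perl
use strict;
use warnings;

# Percent-encode every character except letters, digits, and - _ . ~
sub percent_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord $1)/ge;
    return $text;
}

# The variant browsers commonly use for form data: spaces become plus signs.
sub form_encode {
    my ($text) = @_;
    my $encoded = percent_encode($text);
    $encoded =~ s/%20/+/g;
    return $encoded;
}

# Reverse either form: plus signs back to spaces first, then %XX
# back to characters (so an encoded %2B survives as a real plus sign).
sub form_decode {
    my ($text) = @_;
    $text =~ tr/+/ /;
    $text =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
    return $text;
}

my ($name, $value) = ('Enter something', 'Ovid deserves an A+ for this course!');
print percent_encode($name), '=', percent_encode($value), "\n";
print form_encode($name),    '=', form_encode($value),    "\n";
```

Run as written, the two print statements produce the %20 form and the plus-sign form shown above.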
It should also be noted that the web server will often have a length limit for the query string. This limit is configurable and will vary from server to server. This is usually the problem if you get the rare 414 Request-URI Too Large error. Consult your server's administrator if you feel you are having an issue with this. If you are using Apache 1.3.2 or later, you can adjust this limit with the LimitRequestLine directive.
For a POST request, the browser sends the name/value pairs separately from the request headers, in what is known as the "entity-body" (or just "body") of the request. Normally, this data is in the same name/value pair format discussed above, but instead of being in $ENV{'QUERY_STRING'}, it is read from standard input (STDIN). So far, so good. We don't have a problem here. Further, if this were all there was to CGI, writing your own version might not be too hard. However, we still have issues.
Let's say you want to upload a file. You will use enctype="multipart/form-data" in a <form> tag in your HTML (again, this will be discussed later in the course). Since it's difficult to toss an entire file into a query string (and there's a good chance that binary data might break it), this data is separated out differently. This is easier to show than to explain. Consider the following form in a web page:
<form action="/cgi-bin/display.cgi" method="post" enctype="multipart/form-data">
    <input type="file" name="file"><br />
    <input type="hidden" name="hiddenData" value="nothing to see here, folks"><br />
    <input type="checkbox" name="someCheckbox" value="someValue">Checkbox<br />
    <input type="submit"><br>
</form>
Notice that three "names" are being sent, along with three values. The value of "file" would be the contents of whatever file was selected. (In this example, "temp.pl"). Also, the checkbox was checked (otherwise, no value is sent for it). Here's what the CGI script receives in standard input:
-----------------------------7d03ca41c0
Content-Disposition: form-data; name="file"; filename="C:\WINDOWS\Desktop\temp.pl"
Content-Type: text/plain

#!C:/perl/bin/perl.exe -w
use strict;
$SIG{INT} = 'IGNORE';
my $count=0;
while (){
    print $count++;
}
-----------------------------7d03ca41c0
Content-Disposition: form-data; name="hiddenData"

nothing to see here, folks
-----------------------------7d03ca41c0
Content-Disposition: form-data; name="someCheckbox"

someValue
-----------------------------7d03ca41c0--
This data has been sent in MIME format (Multipurpose Internet Mail Extensions; see RFC 1521 and RFC 1522 for more details). Notice that there aren't any name/value pairs separated by an equals sign! Needless to say, it's much tougher to read that data and parse it out. We'll cover why this is an issue later.
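To give a feel for why this format is tougher to handle, here is a rough sketch that splits such a body into its parts. Everything in it is an assumption made for illustration: the helper name split_multipart is hypothetical, the boundary token is passed in by hand (in a real request it comes from the Content-Type request header), and the sketch ignores many cases a real parser must handle, such as a file that happens to contain the boundary text.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: split a multipart/form-data
# body into parts. $boundary is the bare token; each boundary line in the
# body carries an extra leading "--", and the last one a trailing "--".
sub split_multipart {
    my ($body, $boundary) = @_;
    my @parts;
    for my $chunk (split /(?:\r\n)?--\Q$boundary\E(?:--)?(?:\r\n)?/, $body) {
        next unless length $chunk;
        # Part headers and part content are separated by one blank line.
        my ($head, $content) = split /\r\n\r\n/, $chunk, 2;
        my ($name)     = $head =~ /\bname="([^"]*)"/;
        my ($filename) = $head =~ /\bfilename="([^"]*)"/;
        push @parts, { name => $name, filename => $filename, content => $content };
    }
    return @parts;
}

# A small body built by hand with a made-up boundary token.
my $demo = join "\r\n",
    '--BOUNDARY',
    'Content-Disposition: form-data; name="hiddenData"',
    '',
    'nothing to see here, folks',
    '--BOUNDARY',
    'Content-Disposition: form-data; name="someCheckbox"',
    '',
    'someValue',
    '--BOUNDARY--',
    '';
print "$_->{name} = $_->{content}\n" for split_multipart($demo, 'BOUNDARY');
```

Even this toy version is already longer than the code needed to read a simple query string, and it still handles only the easy cases.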
If, for some reason, you need to debug the information being sent via a POST with a form "enctype" of "multipart/form-data", you can use the following script, which is what I used to produce the output above. It doesn't use CGI.pm as I need to read from STDIN directly, which is not possible if CGI.pm is used.
#!c:/perl/bin/perl.exe -wT
use strict;

my $buffer;
read (STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
print "Content-type: text/plain\n\n";
print $buffer;
The discussion up to this point has been laying the groundwork for this section. Without it, much of what you're about to read won't make sense. The following subroutine is a typical example of how many people try to parse CGI data themselves. We'll take a close look at it and examine what's wrong with it. There are more problems than you might think.
 1: sub parseCGI {
 2:
 3:     my (@data, %data, $buffer, $item, $key, $value);
 4:
 5:     # Read in text
 6:     if ( $ENV{ 'REQUEST_METHOD' } eq "GET" ) {
 7:
 8:         # Split the name/value pairs into separate array entries
 9:
10:         @data = split /&/, $ENV{ 'QUERY_STRING' };
11:     } elsif ( $ENV{ 'REQUEST_METHOD' } eq "POST" ) {
12:         read ( STDIN, $buffer, $ENV{ 'CONTENT_LENGTH' } );
13:         @data = split /&/, $buffer;
14:     } else {
15:         print "Content-type: text/plain\n\n";
16:         print "Only GET and POST methods are allowed!";
17:     }
18:
19:     foreach $item ( @data ) {
20:         # Split name value pairs and convert plus signs to spaces
21:         ($key, $value) = split /=/, $item;
22:         $key =~ tr/\+/ /;
23:
24:         # Convert %XX from hex numbers to alphanumeric
25:         $key =~ s/%[\da-f][\da-f]/pack("c",hex($1))/gei;
26:
27:         $value =~ tr/\+/ /;
28:         $value =~ s/%[\da-f][\da-f]/pack("c",hex($1))/gei;
29:         $data{$key} = $value;
30:     }
31:     return %data;
32: }
At first glance, this code seems okay. Lines 6 through 13 figure out if we have a GET or POST request and get the data the user submitted. Lines 10 and 13 split along the ampersands, thus getting the individual name/value pairs. Line 21 splits each pair at the equals sign into the name and value, respectively. Lines 22 and 27 convert the plus signs to spaces and lines 25 and 28 convert the hexadecimal characters back to their ASCII form. Finally, line 29 stuffs each name and value into a hash. What could be simpler?
The first problem should be obvious. If you look up at the POST request for multipart/form-data, you'll see that this bit of code won't return any name/value pairs! So much for uploading files. There are a variety of reasons why we want to be able to upload files, but we've just shot ourselves in the foot. If you insist upon creating an alternative to CGI.pm, you're going to have to figure out how to parse multipart/form-data.
The following query string is quite valid:
color=blue&color=red&color=chartreuse
Unfortunately, the code above will return a hash with the "color" key set to "chartreuse." You lose the other colors. Having multiple values for the same key is quite common. For example, if you sign up for an online news service which customizes the news which gets e-mailed to you every day, you might get to choose which sports you want to have reported. "NBA" and "WNBA" can reasonably be expected to be values associated with "sports". When the CGI script processes these values, it's much easier to iterate over an array containing the sports chosen as opposed to searching a hash for "sport1", "sport2", etc.
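The data structure that avoids this loss is a hash whose values are array references. The sketch below shows only that idea; the helper name collect_pairs is hypothetical, and it assumes the name/value pairs have already been separated and decoded by something trustworthy.

```perl
use strict;
use warnings;

# Collect already-decoded [name, value] pairs into a hash of array
# references, so that a repeated name keeps every value in order.
sub collect_pairs {
    my (@pairs) = @_;
    my %data;
    push @{ $data{ $_->[0] } }, $_->[1] for @pairs;
    return %data;
}

my %form = collect_pairs(
    [ color => 'blue' ],
    [ color => 'red' ],
    [ color => 'chartreuse' ],
);
print join(', ', @{ $form{color} }), "\n";    # blue, red, chartreuse
```

With the values in an array, iterating over every chosen color (or sport) is a simple loop over @{ $form{color} }.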
When a browser requests a web page, it's usually making a GET request. However, let's take a look at the following <FORM> tag:
<FORM ACTION="somescript.cgi?launch+nuclear+weapons=false" METHOD="POST">
The parseCGI subroutine shown earlier will detect that the POST method has been used and read all parameters from STDIN. But "launch+nuclear+weapons=false" is in $ENV{QUERY_STRING}, which the subroutine never looks at! That's just great. You've reversed global warming by inducing nuclear winter. You're going to have some explaining to do when the FBI comes knocking.
To be perfectly fair, CGI.pm also has this problem. However, Lincoln Stein anticipated this and made it easy to change, if necessary. To fix this, open up the CGI.pm module and look for the following code:
if ($meth eq 'POST') {
    $self->read_from_client(\*STDIN,\$query_string,$content_length,0)
        if $content_length > 0;
    # Some people want to have their cake and eat it too!
    # Uncomment this line to have the contents of the query string
    # APPENDED to the POST data.
    # $query_string .= (length($query_string) ? '&' : '') . $ENV{'QUERY_STRING'}
    #     if defined $ENV{'QUERY_STRING'};
    last METHOD;
}
To allow yourself to use the query string with the POST method, uncomment the line beginning with:
# $query_string .=
Sometimes, problems happen on the web. A web server may hiccup, a network connection may go down, or the user might hit the "Stop" button on their browser before all of the data is sent. If this happens on a POST, the $ENV{'CONTENT_LENGTH'} may not match the length of STDIN. In that case, the code which parses the CGI data should check for this and have some method of dealing with it. Rarely, if ever, will you see someone's personal parsing routine handle this correctly. CGI.pm does.
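As a rough sketch of what "checking for this" might look like, the hypothetical read_body subroutine below keeps reading until it has the promised number of bytes or the input runs out, and reports which of the two happened. This is my own illustration of the idea, not the code CGI.pm uses.

```perl
use strict;
use warnings;

# Read exactly $expected bytes from $fh, looping because a single read()
# may return fewer bytes than requested. Returns the data plus a flag
# saying whether the full length actually arrived.
sub read_body {
    my ($fh, $expected) = @_;
    my $body = '';
    while (length($body) < $expected) {
        my $got = read($fh, $body, $expected - length($body), length($body));
        die "read failed: $!" unless defined $got;
        last if $got == 0;    # input ended before the promised length
    }
    return ($body, length($body) == $expected);
}

# Simulate a truncated POST with an in-memory filehandle: the client
# promised 20 bytes but only 7 arrived.
open my $fh, '<', \ 'a=1&b=2' or die "cannot open in-memory handle: $!";
my ($body, $complete) = read_body($fh, 20);
print $complete ? "complete\n" : "short read: got " . length($body) . " bytes\n";
```

In a CGI script, the filehandle would be \*STDIN and the expected length would come from $ENV{'CONTENT_LENGTH'}; what to do on a short read (reject the request, log it, and so on) is a decision the sketch leaves open.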
The other problems with hand-rolled parsing are too numerous to document here. Anyone not fully convinced of the utility of CGI.pm is strongly encouraged to read through the module at some point. It's actually rather interesting reading. The author has documented it extremely well, and even if you don't understand everything he's doing, you'll be very impressed.
Once you get used to the syntax of this module, you'll wonder why anyone argues against its virtues. Here's a quick script which will print out all of the name/value pairs you enter:
#!C:\perl\bin\perl.exe -wT
use strict;
use CGI;

my $query = CGI->new();
my @names = $query->param;            # Yup. That's all it takes to get all names
                                      # from name/value pairs.
print $query->header( "text/plain" ); # This is just to show that we can
                                      # specify our Content-type

foreach my $name ( @names ) {
    # If we know that a particular name only has one value, we can use
    # my $value = $query->param( $name );
    # However, if we assign the param to a scalar and there are multiple
    # values, we'll only get the first one.
    my @values = $query->param( $name );
    print $name . "=" . (join ", ", @values) . "\n";
}
See how easy it is? The CGI.pm module also allows for easy debugging from the command line. To test this, don't run the script as a CGI script. Either remove the -T switch from the shebang line or pass -T on the command line as shown here (Perl raises an error if -T appears only on the shebang line; see Lesson One), and enter:
perl -T myscriptname.cgi name=Ovid color=red color=blue
This should output the following:
Content-type: text/plain

name=Ovid
color=red, blue
Some of you may have wondered why I didn't give an HTML form to test this with. Frankly, I don't want to jump the gun here. If you're fairly well-versed in either Perl, or programming in general, you'll probably see plenty of ways you could take this newfound knowledge and start creating web pages. Don't! While this particular script does not have any security holes, it is very dangerous to start blindly accepting user input like this.
This lesson was inspired, in part, by a security hole found in someone's script which would allow anyone else to execute any program on that web server (as long as the CGI script had rights to execute it). This was because the programmer was blindly trusting the data which the user supplied. Further into the course, you'll learn how to protect against dangers like this.
We only have one exercise for this lesson. The following code is a complete CGI script which attempts to read name/value pairs passed to it. In theory, it should function exactly (except for the wording of the output) as the code I listed above to read name/value pairs.
Read through the following code. Use the comments to prompt you, in case you need help remembering what each part is for, and explain why it's broken. Try to find as many problems with this code as you can. Don't answer it blindly, though, as this code has some subtle differences from the bad example provided earlier. You'll need information from both lessons one and two for this. I have taken this code directly from the book "PERL and CGI for the World Wide Web" by Elizabeth Castro. The only change I made was to add some comments (she had none).
This code has more flaws than the average hand-rolled parser. Consider finding 5 flaws a passing grade, but be sure you understand the other issues.
#!/usr/local/bin/perl

# GET method
if ($ENV{'REQUEST_METHOD'} eq 'GET') {
    @pairs = split(/&/, $ENV{'QUERY_STRING'});
# POST method
} elsif ($ENV{'REQUEST_METHOD'} eq 'POST') {
    read (STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
    @pairs = split(/&/, $buffer);
# Error message
} else {
    print "Content-type: text/html\n\n";
    print "<P>Use Post or Get";
}

foreach $pair (@pairs) {
    ($key, $value) = split (/=/, $pair);
    # Convert plusses to spaces
    $key =~ tr/+/ /;
    # Convert Hex values to ASCII
    $key =~ s/%([a-fA-F0-9] [a-fA-F0-9])/pack("C", hex($1))/eg;
    $value =~ tr/+/ /;
    $value =~ s/%([a-fA-F0-9] [a-fA-F0-9])/pack("C", hex($1))/eg;
    # Eliminate comments? What the heck is she doing here???
    $value =~s/<!--(.|\n)*-->//g;
    # If we already have a key with this name, allow for
    # multiple values!!!
    if ($formdata{$key}) {
        $formdata{$key} .= ", $value";
    } else {
        $formdata{$key} = $value;
    }
}

print "Content-type: text/html\n\n";
foreach $key (sort keys(%formdata)) {
    print "<P>The field named <B>$key</B> contained <B>$formdata{$key}</B>";
}