Website Setup and Configuration
Basic Principles
In a nutshell, all you have to do is create a public_html directory within your own home directory, put your web page files in it, and set the files to be publicly accessible. This document tells you how and why.
Our configuration
Each user on the Computer Science network has a directory stored on a central public fileserver. This directory is accessible both from the Windows machines (your F: drive) and from the Linux machines (your home directory).
Our webserver is configured in such a way that if there exists a user named username, then the URL http://students.cs.byu.edu/~username/ will point to the directory called public_html under username's home directory, assuming, of course, that the public_html directory actually exists.
Furthermore, an indexing script runs once a week and goes through the list of all valid users, checking their home directories. If the user has a public_html directory, the script adds their name to the list found at the site http://students.cs.byu.edu. If the user doesn't have a public_html directory, their name is removed from the index.
As long as you have the file permissions set correctly, your page will be visible on the Internet. If you try to access your web page and get an error saying, "Access forbidden! You don't have permission to access the requested object. It is either read-protected or not readable by the server," then your permissions are not set correctly. See the section on permissions later on in this document for more information. In a nutshell, run chmod 644 filename on any file that you want to be publicly accessible.
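The setup described above can be sketched as a small shell function. This is only a sketch under stated assumptions: the function name setup_site and the placeholder page contents are made up for illustration, and the modes used (execute added on the home directory, 755 on public_html, 644 on the page) are one common choice that satisfies the requirements in this document rather than an official procedure.

```shell
# setup_site is a hypothetical helper, not a command provided by the system.
# Usage: setup_site /path/to/home
setup_site() {
  local home_dir="$1"
  local site_dir="$home_dir/public_html"

  mkdir -p "$site_dir" || return 1

  # Create a placeholder page only if there is no index.html yet.
  if [ ! -e "$site_dir/index.html" ]; then
    printf '<html><body><p>Hello, world!</p></body></html>\n' > "$site_dir/index.html"
  fi

  chmod a+x "$home_dir"              # let the webserver pass through the home directory
  chmod 755 "$site_dir"              # directory: listable and accessible by everyone
  chmod 644 "$site_dir/index.html"   # page: readable by everyone, writable only by you
}
```

On a Linux machine you would call it as setup_site "$HOME" and then visit http://students.cs.byu.edu/~username/ with your own user name in place of username.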
How Your Website Works
A website is nothing more than a directory structure that is visible over the Internet. The URL you type in is simply the location of the file you're looking for. So, for example, the URL http://students.cs.byu.edu/~username/pages/mypage.html points to the file at this location:
~username/public_html/pages/mypage.html
where ~username/ is the user's home directory.
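The mapping from URL to file can be written out as a short shell function. This is a sketch that assumes the public_html rule described under "Our configuration"; the function name url_to_path is invented for this example, and the result is printed in the same ~username notation used in this document rather than as an expanded path.

```shell
# url_to_path is a hypothetical helper that mirrors the mapping described above.
# Usage: url_to_path 'http://students.cs.byu.edu/~username/pages/mypage.html'
url_to_path() {
  local rest user path
  rest="${1#*://}"         # drop the scheme
  rest="${rest#*/}"        # drop the host name
  user="${rest%%/*}"       # first path segment, e.g. ~username
  path="${rest#"$user"}"   # whatever follows it, e.g. /pages/mypage.html
  printf '%s/public_html%s\n' "$user" "$path"
}

url_to_path 'http://students.cs.byu.edu/~username/pages/mypage.html'
```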
If a directory is specified instead of a file, the server will search through a list of default filenames. If none of those filenames match, the webserver will then decide whether directory listings are allowed. If listings are allowed for this directory, the server will build a directory listing and display it as the webpage. Otherwise, it will display a "permission denied" error.
CGI
In certain cases, displaying the contents of a file is not good enough. Some applications, such as search pages, have to generate the contents of the page on the fly. For this, you use a CGI program.
CGI stands for Common Gateway Interface. CGI is simply a standard that programs that generate webpages all follow. In a nutshell, a CGI program writes the contents of a webpage to its standard output when it is run. The server captures that output and sends it to the client.
The server distinguishes normal files (which get read and displayed) from CGI programs (which must be executed instead) by the file's name or by its location. We talk more about CGI later on.
Writing HTML
This document is not about how to write a webpage; it's about how to post a webpage that you've already written (or will write later). I will therefore not go into the details of how to create your own webpage here.
However, I will give you some direction as to where to go to find the information you're looking for.
- http://www.htmlhelp.com/ - In my not-so-humble opinion, this is one of the best HTML and CSS references on the Internet. This site is, in fact, where I learned to write web pages. You should have to go no further than this site for almost all your HTML questions.
- University of Kansas's HTML Quick Reference: http://www.ku.edu/~acs/docs/other/HTML_quick.shtml - This site is quite popular, though I couldn't tell you why. It is, however, very simple and informative.
- http://www.webreference.com/html/ - Another popular reference. Not exactly a beginner site, but they have some useful information nonetheless.
Sooner or later an HTML authoring guide may appear here on this site (depending on whether or not anyone gets around to writing one). Until then, though, these should keep you busy.
Using CGI
As stated earlier, CGI programs are simply normal programs that output webpages (or graphics or whatnot) that the webserver captures and sends to the client. CGI programs are used to generate non-static webpages, which can vary according to a variety of factors, such as user input.
CGI itself is simply a set of standards that programs must follow in order to interface properly with webservers and correctly produce the dynamic content desired.
If you're looking for information on how to create CGI programs, you won't find it here. You might want to try, instead, this bland but informative CGI guide found at the NCSA: http://hoohoo.ncsa.uiuc.edu/cgi/overview.html. This is the original CGI guide that has been around since the beginning of time itself. You may want to read it simply for its historic value.
If, on the other hand, you just want to know how to set up your CGI programs on our server, you've come to the right place.
File names and locations
The webserver must know ahead of time which files are CGI programs and which ones are normal files so that it knows to execute the CGI program rather than print its contents. Our webserver figures this out by the file's name or its location.
Putting all your CGI programs in one cgi-bin directory is considered by some to be bad organizational design and is not as commonly done now as it used to be. Instead, CGI programs are distinguished by their name. By default, any file that ends with .cgi will be treated as a CGI executable.
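As an illustration of the naming rule, here is a minimal sketch of a CGI program written as a shell script. The file name hello.cgi and the function name render_page are arbitrary choices for this example. It prints a Content-type header, a blank line, and then the page body, which is the basic shape of output a CGI program is expected to produce.

```shell
#!/bin/bash
# hello.cgi: a minimal CGI sketch (the file name is chosen for illustration).

render_page() {
  # Header first, then one blank line to mark the end of the headers.
  printf 'Content-type: text/html\n\n'
  # Everything after the blank line is the page itself.
  printf '<html><body><p>Generated at %s</p></body></html>\n' "$(date)"
}

render_page
```

Saved under public_html with a name ending in .cgi and made executable (for example with chmod a+x hello.cgi), a script like this should be run by the server instead of being displayed as text.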
Custom names and locations
You can change any of these settings (or add your own) by creating an .htaccess file containing the appropriate directives.
The syntax to add a new CGI file extension is as follows:
AddHandler cgi-script <extension>
So, for example, if I add the following line to my .htaccess file:
AddHandler cgi-script exe
Then any file that ends in .exe (such as index.exe) will be treated as a CGI. Remember, though, that the file must be executable (in the permissions).
If you add the following line to your .htaccess file, the directory that the .htaccess file is in (and its subdirectories) will function like your cgi-bin directory:
SetHandler cgi-script
And remember, you can have as many .htaccess files as you need.
FastCGI
FastCGI keeps your program running as a persistent process that serves many requests, instead of starting a new process for every request as plain CGI does. If you want to use a web application framework (e.g., Django, Drupal, Ruby on Rails, etc.), then FastCGI might be the way to go.
To set up FastCGI, add the following to your .htaccess file:
AddHandler fastcgi-script fcgi
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^url_to_webapp/(.*)$ /~username/file.fcgi/$1 [QSA,L]
This will treat all files ending with .fcgi as FastCGI scripts. Of course, you'll have to modify url_to_webapp, username, and file.fcgi to reflect your web site.
Next, create a script that will load your web application (i.e., file.fcgi). Every web application framework is different, so refer to those sites for more information. Once you've created this file, execute the command chmod a+x file.fcgi.
Lastly, if you make any changes to your web application, they will NOT be reflected immediately on the student web server. You will have to wait a moment for the web server to notice a change has been made.
.htaccess
What is it
The .htaccess file is (or at least was at one time) a feature unique to Apache servers. And since Apache is one of the most widely used webservers on the Internet, knowing how to use an .htaccess file is an important skill to master in the world of web development.
This file is a per-directory configuration file that changes the way the webserver treats the directory that the .htaccess file resides in as well as any subdirectories. Many features, such as password protection, are added using .htaccess.
The .htaccess file can contain any number of Apache configuration directives, the majority of which are far beyond the scope of this document.
Syntax
The syntax of this file is as follows. You include one directive per line and as much whitespace as you like. Some things are case sensitive, so be careful to watch your capitalization. Below is an example of an .htaccess directive:
SetHandler cgi-script
The above line tells the server to treat all the files in the same directory as the .htaccess file and below as CGI programs (just like it does with the cgi-bin directory). All .htaccess files apply to both the directory in which the .htaccess file resides and all subdirectories below it.
Limiting the scope of your changes
Many (but not all) of the directives can be applied to only specific files by using one of the scope-limiting sections. Any directives within these sections will only apply to the files you specify. For example, in the following block, the SetHandler directive only applies to files named myprogram:
<Files myprogram>
SetHandler cgi-script
</Files>
The following two container sections are valid in your .htaccess file:
<Files pattern> ... </Files>
<FilesMatch regular-expression> ... </FilesMatch>
The difference between the two is simply that the Files block uses a simple file pattern like you'd pass to the ls command, while FilesMatch uses a regular expression.
You can also limit behavior according to the HTTP method used by using either of the following container sections:
<Limit HTTP-methods> ... </Limit>
<LimitExcept HTTP-methods> ... </LimitExcept>
The difference between the two is that Limit matches all requests that use one of the specified methods and LimitExcept matches the requests that don't use one of the specified methods. In the place of HTTP-methods you include a space-delimited list of the HTTP methods that apply. The most common methods are GET and POST, but you can find a more exhaustive (and descriptive) list at the following location: http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html.
Most people have absolutely no reason to limit directives by HTTP request method, but if you want to, the option is available.
Not all directives can be included in blocks like this, but usually you won't go wrong. If you get a 500 error (Misconfiguration), you know you did something wrong.
Changing file behavior
Here are some useful directives you can put in your .htaccess file.
AddHandler
AddHandler <file type> <extension>
This directive tells Apache to treat files that end with .<extension> in a way specified by <file type>.
The various file types you might use and their meanings are as follows:
- cgi-script: File should be executed as a CGI program.
- perl-script: File should be interpreted using mod_perl.
- send-as-is: File contains all the HTTP headers it needs and should be transmitted exactly as it stands.
- default-handler: Just a normal file.
In order to use mod_php you have to set the mime type instead.
SetHandler
SetHandler <file type>
Works the same as AddHandler except the changes apply to all files within the current scope. If this directive appears within a <Files> or <FilesMatch> block, it will only apply to the files specified by that block. Otherwise it will apply to all files in or below the current directory.
Adding mime types
AddType <mime-type> <extension>
This directive associates files that end with .<extension> with a given mime type (specified, of course, by <mime-type>).
Some useful mime types are as follows:
- text/html - a normal HTML web page
- text/plain - a plaintext file (printed as-is in a monospace font)
- application/octet-stream - file that will be saved to disk rather than opened in the browser
- x-application/php-script - file is executed server-side using mod_php
Warning
Microsoft Internet Explorer is not standards-compliant and will not pay any attention to the mime types. This browser instead looks at the file and decides on its own how to display it.
Redirects
You can easily redirect users to updated or moved pages using your .htaccess file.
If you use either of the following two directives, the browser will be sent a redirect response, which will tell it to go to some new URL as specified in the directive. The process is transparent to the user, but the URL displayed in the browser does reflect the fact that it has been redirected:
Redirect [<status>] <url-prefix> <new-url-prefix>
RedirectMatch [<status>] <url-regex> <new-url>
In the above directives, the <status> parameter is optional, and defines what status code will be passed to the browser. For the most part, which code you send doesn't make much of a difference, but different browsers may elect to treat the response codes differently. The one exception is the gone code, which behaves much like a 404 (not found) error. Below are the possible options:
- permanent - Status code 301: indicates that the resource has been permanently moved.
- temp - Status code 302: indicates that the resource is only temporarily moved. This is the default if you don't supply a status code.
- seeother - Status code 303: indicates that the resource has been replaced.
- gone - Status code 410: indicates that the resource used to exist but has been permanently removed. In this case, you leave off the new-url argument.
You can also return any other status code by putting its number in for <status>. So, for instance, putting 404 in its place will make it look like the file in question doesn't exist (regardless of whether or not it's there), and 500 will make it look like the server is mis-configured. For status codes between 300 and 399, the new-url argument must be present, otherwise it must be omitted.
RedirectMatch
The RedirectMatch directive is the newer of the two, and is a lot easier to work with. RedirectMatch uses a standard POSIX regular-expression-driven pattern matching and replacement engine. As a general rule, the pattern matched by <url-regex> will be replaced by the string at <new-url>. If <new-url> begins with a slash (/), then the whole path is replaced by <new-url>. If it begins with a transport protocol, such as http://, https:// or ftp:// then the whole URL is replaced by <new-url>. Below are a few examples:
RedirectMatch permanent file.html otherfile.php
RedirectMatch /x[12345]{1,4}/index.html http://domain.com/notfound.html
The first of the two examples simply redirects any request for files named file.html to a file within the same directory called otherfile.php. The second example is a bit more complicated; it matches any file named index.html within a directory whose name is the letter 'x' followed by any combination of one to four of the digits 1 through 5. If you request such a file, it will redirect you to http://domain.com/notfound.html. Phew!
Further functionality (and complications) can be added by back referencing substrings of the matched expression. You surround the expression to be referenced in parentheses, and then use a dollar sign followed by a number (the index of the matched string from the first expression) in the place where you want to insert the string matched. It makes more sense if you look at an example:
RedirectMatch /(.+)/([ABC]).html /otherdirectory/$1/newfiles/$2.php
The effect of this expression is as follows:
http://www.mydomain.tld/dir/dir2/B.html
becomes:
http://www.mydomain.tld/otherdirectory/dir/dir2/newfiles/B.php
and:
http://www.mydomain.tld/xyz/C.html
becomes:
http://www.mydomain.tld/otherdirectory/xyz/newfiles/C.php
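One way to experiment with a pattern like this before putting it in an .htaccess file is to try it on a few paths with sed. This is a rough sketch, not a reproduction of Apache's behavior: preview_redirect is a made-up helper name, sed's extended regular expressions are close to but not identical to the engine Apache uses, and sed writes back-references as \1 and \2 where Apache writes $1 and $2.

```shell
# preview_redirect is a hypothetical helper for experimenting with the pattern.
# It applies the same pattern and replacement as the RedirectMatch line above
# to a URL path given as its first argument.
preview_redirect() {
  printf '%s\n' "$1" |
    sed -E 's#/(.+)/([ABC]).html#/otherdirectory/\1/newfiles/\2.php#'
}

preview_redirect /xyz/C.html        # prints /otherdirectory/xyz/newfiles/C.php
preview_redirect /dir/dir2/B.html
```

Because .+ is greedy and also matches slashes, the second call shows the first parenthesized group covering more than one path segment.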
Remember that in regular expression land, a period matches any character, and a plus sign means "match one or more instances of the preceding pattern," so .+ means "match one or more of any character." With that in mind, the first expression matches files named A.html, B.html or C.html found in a directory whose name is made of one or more characters. Since all directories' names are made of one or more characters, adding that part doesn't change much about which files match and which ones don't. Note, though, that .+ also matches slashes, so the captured text can span more than one directory level. The captured information is used later on in the second expression.
The second expression takes the name of the file and changes the .html part to .php. It also takes the name of the directory matched before and puts it in the place marked $1 in the second expression, as you can see from the examples. The $1 always refers to the first parenthesized expression, the $2 refers to the second, and so forth.
Redirect
Redirect is a bit simpler, but not as powerful, as RedirectMatch. It instead does simple prefix matching and replacement. Here, again, is the syntax of the directive:
Redirect [<status>] <url-prefix> <new-url-prefix>
According to the Apache documentation, <url-prefix> MUST be an absolute path (i.e., begins with /), and <new-url-prefix> must be a complete URL (with the http:// and everything). Here's a simple example of the Redirect directive in use:
Redirect /classes http://www.cs.byu.edu/info/classes
The preceding example would redirect http://www.cs.byu.edu/classes/cs142/index.html to http://www.cs.byu.edu/info/classes/cs142/index.html. It's that simple.
Password protecting webpages
Password protecting your web pages is a two-step process. First you have to create a password file, then you set up your .htaccess file to require that password file to be used.
While there are a number of different authentication methods available, this piece will focus only on basic authentication.
When creating password and group files, keep the following in mind: files that begin with ".ht" can never be accessed over the web using Apache. There's a directive in the server configuration that specifically checks for files whose name matches that pattern. If you do try to access such a file, the server will return an "Access Denied" error. Therefore, using names like .htpasswd and .htgroup is a common (and recommended) practice.
Setting up the password file
The first thing you have to do is create a password file. This file links valid usernames with their corresponding password. Beware, though: while the password is stored encrypted in the file, it is transmitted unencrypted across the network when it is used to access a protected web page. (However, the password will be encrypted in transit if you use SSL.)
The (relevant) syntax for the command to create (or add to) the password file is the following:
htpasswd [-c] <passwordfile> <username>
In the syntax above, <passwordfile> refers to the file where you will store the password and <username> refers to the user name you want to add to the password file. The -c flag is optional and means the file <passwordfile> doesn't exist and must be created.
So for example, if I want to create a file called .htpasswd and add to it the username joeuser, I would issue the following command:
htpasswd -c .htpasswd joeuser
The computer will then prompt me for a password for that user. Then, if I want to add to it the user janeuser, I type this command:
htpasswd .htpasswd janeuser
Note that I left off the -c flag because the file .htpasswd already exists. If I chose the password "iguana" for joeuser and "skippy" for janeuser, the password file will have the following contents:
joeuser:Xo7ZA9CrnlEhM
janeuser:ciHAO2aQ3p9sU
To remove a user from the password file, simply delete the line with that user's name and password. To change a user's password, simply use the htpasswd command to add the user again to the file.
The .htpasswd file that you create needs to be readable by the webserver. The easiest way to do this is to run the following command:
chmod 644 .htpasswd
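The "delete the line" step for removing a user can also be scripted. The following is a minimal sketch; remove_htpasswd_user is a hypothetical helper name, not a command that ships with Apache, and it assumes the one-entry-per-line username:password layout shown above.

```shell
# remove_htpasswd_user is a hypothetical helper, not part of Apache.
# Usage: remove_htpasswd_user <passwordfile> <username>
remove_htpasswd_user() {
  local file="$1" user="$2" tmp
  tmp="$(mktemp)" || return 1

  # Keep every line whose first colon-separated field is not the user name.
  awk -F: -v u="$user" '$1 != u' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }

  # Copy the contents back so the original file keeps its permissions.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

For example, remove_htpasswd_user .htpasswd joeuser would leave only the janeuser line in the file from the example above. Some versions of htpasswd also accept a -D flag that deletes a user; check the manual page for the version installed on your machine.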
Warning
Do not use your Computer Science password or Route Y password. Doing so increases the possibility that someone else will be able to find out your username and password.
Setting up the group file (optional)
If you like, you can create groups of users using a group file. You don't need to, but some people like it better that way. So here's how to do it:
Create a file in your favorite text editor and add to it lines in the following syntax:
<groupname>: <username> <username> <username> ...
Where <groupname> is the name of the group, and the <username> parts are the users that belong to that group. You can have as many groups as you want, each on a different line in the file.
Configuring .htaccess
Below is a description of all the lines you'll need to add to your .htaccess file to enforce authentication:
AuthType Basic
This line tells Apache that we'll be using basic authentication. You can use other authentication systems if you like, but we only explain the basic system here.
AuthName "Some text goes here"
The string of text specified by the AuthName directive is the "Realm" displayed on the client when the user is prompted for a login name and password. This string can be anything you want but should have something to do with the section of the site that the client wants to access.
AuthUserFile <path-to-password-file>
With this directive, you specify the complete path (starting at /) to the password file you created.
AuthGroupFile <path-to-group-file>
With this (optional) directive, you specify the complete path (starting at /) to the group file that you may or may not have created.
Then you use one of the following three options:
Option 1: require valid-user
With this directive, any user in the password file is allowed.
Option 2: require user <username> <username> <username> ...
With this directive, you specify which usernames are allowed.
Option 3: require group <groupname> <groupname> <groupname> ...
With this directive, you specify which groups are allowed.
An example
Below is an example of an .htaccess file with some simple access control:
AuthType Basic
AuthUserFile /users/joe/.htpasswd
AuthGroupFile /dev/null
AuthName "Joe's restricted directory"
require valid-user
<Files ~ "^secret">
AuthName "Joe's super secret files"
require user joeuser
</Files>
This .htaccess requires any user who tries to access a file within the same directory as the .htaccess to have a valid user name and password in the password file. It further limits access to files beginning with "secret" to only the user named "joeuser".
Custom error (404, 500, etc.) pages
You can easily create and use your own custom error pages, which will be displayed rather than the server's default error document. To do this, you use the ErrorDocument directive. Below is the syntax for this directive:
ErrorDocument <error-code> <document>
In this directive, <error-code> refers to the 3-digit code for the error you want to handle. Of the possible status codes that the server can return, only codes 400 to 599 indicate an error. Of these, the 4xx codes refer to cases in which the client has erred and the 5xx codes refer to cases in which the server has erred. The more common error codes are listed below:
- 401 - Unauthorized: The client is not authorized to view the page in question. This is, for example, the error returned if the client fails to provide a valid name and password for password protected pages.
- 403 - Forbidden: The client has requested something that is forbidden. Authorization will not help. Trying to view an .htaccess file over the web will get you this error.
- 404 - Not Found: The server could not find what the client requested. If you think real hard, I'm sure you can come up with an example or two where you've seen this error.
- 500 - Internal Error: An internal error has prevented the server from fulfilling the request. This error is caused 99.999% of the time by a faulty CGI program. Furthermore, 99.998% of those cases involve a CGI that did not return a proper content-type header.
The <document> parameter tells the server what page to send to the client. This parameter is either a relative or absolute URL.
If you specify an absolute URL (i.e. beginning with something like http://), the server will issue a redirect to the specified URL instead of returning an error code to the client. This has the side effect of preventing the client from finding out that the error actually occurred. For this reason, you must never specify an absolute URL for a 401 (Unauthorized) error. If you do, the user will never get a chance to enter a password, because the browser only prompts the user for a password when it receives a 401 error code. You may also end up confusing webcrawlers (such as Google's indexing bot) if you use absolute URLs for error pages because the bots use the status code to understand what kind of document they're looking at. Since redirects use a 3xx status code, the bot won't know that an error has occurred and will treat the error page as a normal (indexable) web page.
You can find more information about this directive from Apache's own documentation. Furthermore, you can find some examples of how to use this directive right after this paragraph:
ErrorDocument 404 /~joeuser/file-not-found.html
ErrorDocument 500 http://www.somedomain.tld/howto/write-better-scripts/
## Do not use the line below
ErrorDocument 401 https://students.cs.byu.edu/~joeuser/access-denied.html
## Instead, use this line:
ErrorDocument 401 /~joeuser/access-denied.html
Common Problems
Permissions
Assigned to each file on the fileserver is a set of permissions. These permissions determine who can read the file, who can write to it, and so forth. This is a good thing: it keeps other people from messing with your stuff if you don't want them to. However, you can inadvertently lock people out of the stuff you actually wanted them to be able to access.
In the UNIX security model, there are three types of permissions: read, write and execute. These three permissions can be applied to three different classes of users: the owner, the group, and the rest of the world. Except in special cases, the webserver is going to be part of the "world" class, rather than user or group.
CGI executables will have to be world executable and normal web pages and images will have to be world readable. Directories work slightly differently: if a directory is readable then the user (webserver) can view a directory listing. If the directory is executable then the user (webserver) can access the files within that directory. Therefore, your public_html directory and all directories above it must be world-executable.
To change permissions on a file or directory, you use the chmod command (under Linux, not Windows). There are many different forms of this command, but the easiest syntax to remember is as follows:
chmod <who>[+/-]<what permission> <filename>
where <who> is any combination of u, g, and o (user, group, and other) or a for all. And <what permission> is any combination of r, w, and x for read, write, and execute.
So, for example:
chmod a+rwx *.html
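Gives read, write, and execute permission to the user, group, and world for all files ending in .html. (This only illustrates the syntax; web pages rarely need to be writable or executable by everyone.) Similarly: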
chmod go-w *
Takes write permission away from the group and world from all files and directories in the current directory.
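The permission rules in this section can be applied to a whole site in one pass. The following is a sketch only: publish_tree is a hypothetical helper name, and the choice of 755 for directories and CGI programs and 644 for everything else is an assumption that matches the requirements described here, not a required setting.

```shell
# publish_tree is a hypothetical helper name.
# Usage: publish_tree ~/public_html
publish_tree() {
  local root="$1"
  # Directories: world-readable (listing) and world-executable (access).
  find "$root" -type d -exec chmod 755 {} +
  # Regular files: world-readable, writable only by the owner.
  find "$root" -type f -exec chmod 644 {} +
  # CGI programs also need the execute bit; extend the -name test if you
  # use other script extensions (for example .fcgi).
  find "$root" -type f -name '*.cgi' -exec chmod 755 {} +
}
```

Note that this resets the mode of every file under the directory you give it, so it is best pointed at public_html rather than at your whole home directory.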
CGI and Internal Server Errors
If the server cannot make sense of the text returned by a CGI executable, it will return a status code 500, which translates to the dreaded "Internal Server Error" response from the client's point of view.
There are only a handful of conditions that will cause this problem. The most common is a CGI that is, for one reason or another, not returning a proper set of headers. One of the main causes for this situation is that the CGI program may actually be crashing halfway through execution (or not executing at all).
The easiest way to figure out what is going wrong is to print out the headers before executing the CGI program. You can do this with a simple wrapper script. Here's an example of such a script:
#!/bin/bash
# Send a plain-text header so the browser displays whatever follows.
echo -en "Content-type: text/plain\n\n"
# Replace this process with the CGI program being debugged.
exec ./myotherprogram.cgi
Simply copy the above text into a file called something like wrapper.cgi and replace the myotherprogram.cgi bit with the name of the CGI program you're trying to debug. Make sure that your wrapper script has execute permissions. Then, if you run the wrapper CGI script from your web browser, you'll see the actual output you get when the server runs your CGI program.
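If the program crashes without printing anything useful, a slightly extended wrapper can help. The sketch below is a variation on the same idea, not a replacement for it: debug_cgi is a hypothetical helper name, and it additionally merges the program's error messages into the page and reports its exit status.

```shell
#!/bin/bash
# debug_cgi is a hypothetical helper; pass it the path of the CGI program
# you are trying to debug.
debug_cgi() {
  local program="$1" status=0
  # Send a plain-text header first so the browser shows whatever follows.
  printf 'Content-type: text/plain\n\n'
  # Run the program with its error messages merged into the normal output.
  "$program" 2>&1 || status=$?
  printf '\n[exit status: %s]\n' "$status"
}
```

To use it as a wrapper, end the script with a line such as debug_cgi ./myotherprogram.cgi, with the name of your own program substituted.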
Alternate cause: .htaccess
Another reason why you can get "500" errors is because of a mis-configured .htaccess file. If there are any syntax errors at all in your .htaccess file, Apache will refuse to display any content from the directory containing the faulty .htaccess file or any directories beneath it.
This error is easy to spot because instead of one page not working, all of the pages in the directory won't display.
Other possible causes
You will also often get this error if your CGI program is not set to world executable in the file permissions.
Another variation on the HTTP headers problem discussed above appears if you place non-CGI files in your cgi-bin directory. Remember, Apache assumes that everything in the cgi-bin directory should be executed as a CGI program, so if you place a normal file (such as an HTML file or a JPEG file) in the cgi-bin directory, you will get an error if you try to view it.