- [Graphical Desktop Environment] From the NLTK Corpora page, download these two corpora: 2. Australian Broadcasting 2006, 20. Project Gutenberg Selections. Unzip them and place the two corpus directories somewhere convenient for you to access. Note their locations.
- Open up your terminal application. Move into one of the two corpus directories and examine its directory content. Move into the other and do the same. You will need these commands:
|pwd ||display current directory|
|cd dir ||change current directory to dir|
|cd .. ||change current directory to the parent directory|
|cd c: ||[cygwin only] move into C:|
|cd /cygdrive/c ||[cygwin only] move into C:|
|cd ||change current directory to your home (default) directory|
|cd ~ ||same as above ('~' refers to home directory)|
|cd ~/dir ||move into dir directory under your home directory|
|echo $HOME ||display your home directory (likely the directory your terminal starts in)|
|ls ||list the files and directories in the current directory|
|ls . ||same as above ('.' refers to current directory)|
|ls dir ||list the files and directories in dir|
|ls -l ||... and some information on each file|
|ls -a ||list hidden files (name starts with a ".", as in .bashrc) as well|
|ls -F ||indicate file attributes (directory with /, executable with *)|
|ls / ||list the content of your root directory (cygwin: C:/cygwin, OS-X: system disc root)|
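As a quick illustration of the commands in the table above, the following sketch builds a small throwaway directory and lists it in several ways. The directory and file names (corpus, notes.txt, .hidden) are made up for the demonstration, and it assumes mktemp is available on your system.

```shell
# Remember where we started so we can come back at the end.
start=$PWD

# Build a scratch directory with hypothetical contents.
demo=$(mktemp -d)
mkdir "$demo/corpus"
touch "$demo/notes.txt" "$demo/.hidden"

cd "$demo"
pwd       # display the current directory
ls        # regular entries only; .hidden is not shown
ls -a     # also shows ., .., and .hidden
ls -F     # marks the directory with a trailing /
ls -l     # long format with extra information on each entry

# Capture a few listings so they can be inspected later.
plain=$(ls)
all=$(ls -a)
marked=$(ls -F)

cd "$start"
```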
- If a directory name contains a space, you need to either (1) enclose the name argument in double quotes, as in: cd "Documents and Settings", or (2) put a backslash (\) before each space, as in: cd Documents\ and\ Settings
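Both ways of handling a space in a name can be tried safely in a scratch directory. This is a minimal sketch: the "Documents and Settings" directory is created only for the demonstration, and mktemp is assumed to be available.

```shell
start=$PWD
demo=$(mktemp -d)
mkdir "$demo/Documents and Settings"
cd "$demo"

# (1) Quote the whole name.
cd "Documents and Settings"
quoted=$(basename "$PWD")
cd ..

# (2) Put a backslash before each space.
cd Documents\ and\ Settings
escaped=$(basename "$PWD")

echo "quoted:  $quoted"
echo "escaped: $escaped"
cd "$start"
```

Without the quotes or backslashes, the shell would split the name at each space and pass several separate arguments to cd.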
- [cygwin] Windows uses the backslash to separate the parts of a path, as in: C:\WINDOWS\Fonts. In Unix and cygwin, the forward slash is used: C:/WINDOWS/Fonts
- [cygwin] Windows does not distinguish uppercase and lowercase letters in file and directory names; neither does cygwin. As a result, cd C:/WINDOWS and cd c:/windows achieve the same thing. However, cygwin is case-sensitive for command names and options: ls -F and ls -f work differently.
- Absolute vs. relative path: An absolute path starts from the root directory (/), as in /home/narae/documents and /cygdrive/c/windows. A relative path does not start with /; it is interpreted starting from your current directory. Therefore, cd documents only works when documents is under your current directory; cd /home/narae/documents works anywhere.
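The difference can be seen in a small experiment. The sketch below recreates a home/narae/documents tree inside a scratch directory as a stand-in for /home/narae/documents, so the "absolute" path here is whatever mktemp returns followed by that tree; mktemp itself is assumed to be available.

```shell
start=$PWD
demo=$(mktemp -d)
mkdir -p "$demo/home/narae/documents"   # hypothetical stand-in tree

# Relative path: works because documents is under the current directory.
cd "$demo/home/narae"
cd documents
rel=$PWD

# From somewhere else, the same relative path fails...
cd "$demo"
cd documents 2>/dev/null || echo "no 'documents' directly under $demo"

# ...but the absolute path works from anywhere.
cd "$demo/home/narae/documents"
abs=$PWD

# '..' is also a relative path: it names the parent directory.
cd ..
parent=$(basename "$PWD")

cd "$start"
```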
- [cygwin] Your home directory is typically /home/user_name. Since the root (/) is c:/cygwin, this home directory can actually be found here: c:/cygwin/home/user_name
- Tab completion: You will be relieved to know that you don't have to type in entire file names. Pressing TAB halfway through typing a name triggers auto-completion. The system beeps when there are 2 or more matches; type in a couple more characters and try TAB again.
- Command history: You can scroll through your previous commands by pressing the up and down arrow keys. Hit ENTER when you have found the one you want.
- man page: Unix systems include "manual pages" for most commands. Try man ls if you want to find out how to use ls.
- Move into the Gutenberg corpus directory. Examine the text content of the file carroll-alice.txt. What do the first few lines of the file look like? How about the end of the file? You will need these commands:
|cat file ||print file to Standard Output (i.e., terminal window)|
|more file ||print file, one screenful at a time (SPACE to forward, q to get out)|
|less file ||print file, one screenful at a time (SPACE/PageDown to forward, b/PageUp to go back, q to get out)|
|head file ||print first 10 lines of file|
|head -m file ||print first m lines of file|
|tail file ||print last 10 lines of file|
|tail -m file ||print last m lines of file|
|tail -n +m file ||print file starting from line m|
*NOTE: the old syntax tail +m is no longer supported in newer versions of tail.
Printing the 6th line and on, old syntax: tail +6; new syntax: tail -n +6.
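To see exactly which lines head and tail select, it helps to try them on a file whose line n simply contains the number n. The sketch below assumes the seq and mktemp utilities are available (they are on most Linux, OS-X, and cygwin installations); lines.txt is a made-up file name.

```shell
demo=$(mktemp -d)
seq 1 20 > "$demo/lines.txt"    # 20 lines; line n contains the number n

head "$demo/lines.txt"          # first 10 lines
head -3 "$demo/lines.txt"       # first 3 lines
tail -3 "$demo/lines.txt"       # last 3 lines
tail -n +6 "$demo/lines.txt"    # line 6 through the end of the file

# Capture two results for inspection.
first3=$(head -3 "$demo/lines.txt")
from6=$(tail -n +6 "$demo/lines.txt" | head -1)
```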
The commands above can be combined to carry out more complex tasks. How can we display lines 50-100? How can we extract certain lines and save them into a different file? Try:
|head -200 file | more ||print first 200 lines of file and print one screenful at a time|
|head -200 file | tail -10 ||print lines 191-200 of file|
|tail -n +191 file | head -10 ||same as above!|
|head -200 file > temp.txt ||print first 200 lines of file into another file called temp.txt |
|head -200 file | tail -10 > temp.txt ||print lines 191-200 of file into another file called temp.txt |
|rm temp.txt ||remove file temp.txt|
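One possible way to extract lines 50-100 follows the same pattern as the rows above. Lines 50 through 100 inclusive make 51 lines, which is the number given to the second command in each pipeline. This is a sketch on a generated test file, not the only solution; seq and mktemp are assumed to be available, and the file names are made up.

```shell
demo=$(mktemp -d)
seq 1 300 > "$demo/lines.txt"   # 300 lines; line n contains the number n

# Keep the first 100 lines, then the last 51 of those: lines 50-100.
head -100 "$demo/lines.txt" | tail -51 > "$demo/temp.txt"

# Equivalent: start at line 50, then keep the next 51 lines.
tail -n +50 "$demo/lines.txt" | head -51 > "$demo/temp2.txt"

# Check the boundaries and confirm both approaches agree.
first=$(head -1 "$demo/temp.txt")
last=$(tail -1 "$demo/temp.txt")
cmp "$demo/temp.txt" "$demo/temp2.txt" && same=yes
echo "first=$first last=$last same=$same"

rm "$demo/temp.txt" "$demo/temp2.txt"
```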
So how big is this file? How many characters, words, and lines are in this file? These are the commands to use:
|wc file ||print # of lines, words, and characters in the file, in that order|
|wc -l file ||print # of lines only|
|wc -w file ||print # of words only|
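A tiny hand-made file makes the counts easy to verify by eye. The sketch below uses a made-up sample.txt and also uses wc -c, which prints the character (byte) count only. Reading the file through < suppresses the file name in the output, and tr -d ' ' strips the padding spaces that some versions of wc add.

```shell
demo=$(mktemp -d)
printf 'Alice was beginning\nto get very tired\n' > "$demo/sample.txt"

wc "$demo/sample.txt"    # lines, words, characters, then the file name

lines=$(wc -l < "$demo/sample.txt" | tr -d ' ')
words=$(wc -w < "$demo/sample.txt" | tr -d ' ')
chars=$(wc -c < "$demo/sample.txt" | tr -d ' ')
echo "lines=$lines words=$words chars=$chars"
```

The character count includes the newline at the end of each line.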
- When you are done, you can close your terminal by typing: