Crawford's Compression Study Document 1

From JOCRAW@macc.wisc.edu Wed Feb 14 15:51:21 1996
Date: Thu, 2 Feb 1995 08:28:23 -0500
From: Josephine Crawford
Reply to: usmarc@loc.gov
To: Multiple recipients of list
Subject: Compression Study : File 1 of 3

 
File 1 of 3                                  Josephine Crawford
re: MARBI Proposal 95-2                      Univ of Wisconsin
1/31/95                                      jocraw@macc.wisc.edu

In preparation for the MARBI discussion on Proposal 95-2, I took a little time this weekend to work on the questions raised in the proposal in the section "Concept-based Display" (that is, a compressed OPAC headings display). At the bottom of page 5 of the Proposal, there is the following example and remarks:

  "Example of such a display:
 
     English literature -- SUBDIVIDED BY CHRONOLOGICAL PERIOD
     English literature -- SUBDIVIDED BY FORM OR TYPE OF MATERIAL
     English literature -- SUBDIVIDED BY GEOGRAPHIC AREA

It should be noted, however, that with LCSH strings, such an abbreviated display reflects only the nature of the subdivisions at the level of the first subdivision. Selecting and viewing English Literature -- SUBDIVIDED BY FORM OR TYPE OF MATERIAL does not retrieve all instances of form subdivisions in strings that begin with English literature. Strings with English literature subdivided by geographic or chronological subdivisions may themselves have additional form subdivisions...."

----------------------------------------------------------------------------

I felt that this issue could be clarified by some old-fashioned systems analysis using some real data. In addition, such an analysis might uncover other problems/issues which should be addressed. Given that the Univ of Wisconsin-Madison OPAC has the capability to download a subject heading list which I could then manipulate using a spreadsheet and word processor, I picked out a common keyword search and set to work. I analyzed the data statistically and I manipulated the headings by hand to achieve compression, so that I could show Before/After displays.

This report is divided into three files:

 
     File 1 :       Introduction and description of my methodology;
                    Contents of Dataset (categories, statistics);
                    An observation.
 
     File 2 :       Sample "uncompressed" OPAC display.
                         (what occurs now)
 
     File 3 :       Sample "compressed" OPAC display;
                    Some final observations.
 

Contents of Dataset
-------------------
(Please think of the stats below as preliminary only; double-checking not yet done.)

I chose to perform a topical subject heading keyword search on a single word: BIOTECHNOLOGY. This resulted in an alphabetical list of 480 subject headings, beginning with "Acremonimum--Biotechnology" and ending with "Yeast fungi--Biotechnology." The list includes a large number of headings which begin with the keyword Biotechnology. These 480 subject headings come from 1055 catalog records.

The dataset includes a mixture of LCSH and MeSH headings and the current display makes no attempt to differentiate between the two. In addition, please note that our automated authority control programs have not been extended to topical subject headings as of yet, so that no references or scope notes appear and some of the headings need correction.

I could have performed a left-to-right phrase search rather than a keyword search, thereby limiting my topical subject search to just those headings beginning with the word BIOTECHNOLOGY. However, I prefer to work with a "worst case" scenario in order to uncover issues and analyze solutions. Therefore, I chose the keyword search so that the resulting OPAC display has more content and complexity.
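The difference between the two search types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six headings are samples quoted in this report, and the function names `keyword_search` and `phrase_search` are hypothetical, not the OPAC's actual search logic.

```python
import re

# Sample headings quoted in this report; the real list has 480 entries.
HEADINGS = [
    "Acremonimum--Biotechnology",
    "Agricultural biotechnology",
    "Biotechnology industry",
    "Biotechnology--Computer programs",
    "Lignocellulose--Biotechnology--Congresses",
    "Yeast fungi--Biotechnology",
]


def words(heading):
    """Split a heading into lowercase words, ignoring '--' and punctuation."""
    return re.findall(r"[a-z0-9]+", heading.lower())


def keyword_search(headings, term):
    """Keep headings in which the term appears as a word anywhere."""
    term = term.lower()
    return [h for h in headings if term in words(h)]


def phrase_search(headings, term):
    """Keep headings whose first word is the term (left-to-right match)."""
    term = term.lower()
    return [h for h in headings if words(h)[:1] == [term]]


keyword_hits = keyword_search(HEADINGS, "BIOTECHNOLOGY")
phrase_hits = phrase_search(HEADINGS, "BIOTECHNOLOGY")
print(len(keyword_hits), "keyword hits;", len(phrase_hits), "phrase hits")
```

On this sample the keyword search keeps every heading, while the phrase search keeps only the two that begin with the word itself, which is why the keyword search gives the fuller "worst case" list.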

In my analysis, I assigned each of the 480 subject headings to one or more of the following categories:

 
A)   TOPICAL SUBJECT HEADING, NO SUBDIVISIONS
          e.g.      Agricultural biotechnology
                    Biotechnology industry
There are only 12 headings of this type but these 12 headings map to 289 catalog records. That is, if a user requests the display of these 12 headings separately, the user would see 289 items divided into twelve separate sets.

B)   ONE OR MORE TOPICAL SUBDIVISIONS PRESENT IN HEADING
          e.g.      Agricultural Biotechnology--Economic aspects
                    Biotechnology--Computer programs
                    Lignocellulose--Biotechnology--Congresses
                    Marine biotechnology--Research--United States
There are 360 headings of this type, mapping to 640 catalog records. Of these 360 headings, 82 also have geographic subdivisions and 143 also have form subdivisions, as in the last two examples above. These latter statistics may be important in getting a handle on the compression issue quoted above from the MARBI proposal.

C)   ONE OR MORE GEOGRAPHIC SUBDIVISIONS
          e.g.      Agricultural biotechnology--Kenya
                    Biotechnology industries--Wisconsin
                    Biotechnology industries--Wisconsin--Directories
There are 60 subject headings with a geographic subdivision, mapping to 305 catalog records. In addition, 47 of these are followed by a form subdivision, as in the Directories example above.
 
D)   ONE OR MORE CHRONOLOGICAL SUBDIVISIONS
 
     There are none of this type in the dataset.
 
 
E)   ONE OR MORE FORM/GENRE SUBDIVISIONS
          e.g.      Microbial Biotechnology--Periodicals
                    Pharmaceutical biotechnology--Congresses
There are 219 headings of this type, mapping to 563 catalog records. PLEASE NOTE: this category makes up just under half of the headings in the dataset, and just over half of the linked catalog records. (I guess I was lucky enough to hit on a search rich with form/genre data.)

F)   TWO FORM SUBDIVISIONS IN SAME SUBJECT HEADING
          e.g.      Biotechnology--Bibliography--Periodicals
 
     Three of these headings have two form subdivisions.
 
 
G)   ALL THREE TYPES OF SUBDIVISIONS IN THE SAME HEADING
          e.g.      Biotechnology--Databases--North America--Directories
                    Plant biotechnology--Research--Japan--Periodicals
There are 19 headings of this type; this may be important for the compression issue quoted from the MARBI proposal.
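The category assignment above, and the first-subdivision issue quoted from the proposal, can be sketched in Python. This is a rough illustration only: the nine headings are examples from this report, the subdivision type labels are my hand reading of those examples, and the names `SAMPLE` and `categories` are hypothetical. "All three types" is taken here to mean topical, geographic and form, as in the category G examples.

```python
from collections import Counter

# (main heading, [(subdivision, type), ...]); type labels are hand-assigned.
SAMPLE = [
    ("Agricultural biotechnology", []),
    ("Biotechnology", [("Computer programs", "topical")]),
    ("Lignocellulose", [("Biotechnology", "topical"), ("Congresses", "form")]),
    ("Marine biotechnology", [("Research", "topical"),
                              ("United States", "geographic")]),
    ("Agricultural biotechnology", [("Kenya", "geographic")]),
    ("Biotechnology industries", [("Wisconsin", "geographic"),
                                  ("Directories", "form")]),
    ("Microbial Biotechnology", [("Periodicals", "form")]),
    ("Biotechnology", [("Bibliography", "form"), ("Periodicals", "form")]),
    ("Biotechnology", [("Databases", "topical"),
                       ("North America", "geographic"),
                       ("Directories", "form")]),
]


def categories(subdivisions):
    """Return the set of category letters (A-G) that a heading falls into."""
    types = [kind for _, kind in subdivisions]
    if not types:
        return {"A"}
    cats = set()
    if "topical" in types:
        cats.add("B")
    if "geographic" in types:
        cats.add("C")
    if "chronological" in types:
        cats.add("D")
    if "form" in types:
        cats.add("E")
    if types.count("form") >= 2:
        cats.add("F")
    if {"topical", "geographic", "form"} <= set(types):
        cats.add("G")
    return cats


# A heading can count toward several categories at once.
tally = Counter(cat for _, subs in SAMPLE for cat in categories(subs))

# Form subdivision in first position only, versus anywhere in the string.
first_level_form = [h for h, subs in SAMPLE if subs and subs[0][1] == "form"]
any_level_form = [h for h, subs in SAMPLE
                  if any(kind == "form" for _, kind in subs)]

print(sorted(tally.items()))
print(len(first_level_form), "of", len(any_level_form),
      "form headings have the form subdivision first")
```

In this small sample only two of the five headings with a form subdivision carry it in first position, so a display keyed on the first subdivision alone would miss the other three.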

An observation
--------------

In an alphabetical, uncompressed display, the display of a main heading and all its subdivisions can be interrupted by an intervening term (and its subdivisions). This is the case in my sample dataset. This problem disappears in a compressed display, as long as the program logic is set up so that the computer searches through to the very end of the headings under the main term with which it is working. I see this as a helpful change and have therefore manipulated my "compressed" display along these lines.

 
     e.g. Biotechnology--History
          Biotechnology industries
          Biotechnology industries--Argentina
          Biotechnology--Information Services
          Biotechnology--Instrumentation
          Biotechnology laboratories
          Biotechnology--Latvia
[uncompressed display shows interruption of logical sequence]
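The interruption, and the grouping I have in mind, can be reproduced with a short Python sketch. Two assumptions are mine: that the current display files headings as if "--" were a space (which yields the order shown above), and that a compressed display would compare the main heading first and the subdivisions after. The function names `flat_key` and `grouped_key` are hypothetical.

```python
HEADINGS = [
    "Biotechnology--History",
    "Biotechnology industries",
    "Biotechnology industries--Argentina",
    "Biotechnology--Information Services",
    "Biotechnology--Instrumentation",
    "Biotechnology laboratories",
    "Biotechnology--Latvia",
]


def flat_key(heading):
    """Filing key that treats '--' like a space (assumed current behavior)."""
    return heading.lower().replace("--", " ")


def grouped_key(heading):
    """Filing key that compares the main heading first, then each subdivision."""
    return tuple(part.strip().lower() for part in heading.split("--"))


uncompressed = sorted(HEADINGS, key=flat_key)
grouped = sorted(HEADINGS, key=grouped_key)

for heading in grouped:
    print(heading)
```

With the grouped key, all four "Biotechnology--..." headings file together before "Biotechnology industries" and "Biotechnology laboratories", so the run under one main term is read through to its end before the next term begins.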
 

