From JOCRAW@macc.wisc.edu Wed Feb 14 15:51:21 1996
Date: Thu, 2 Feb 1995 08:28:23 -0500
From: Josephine Crawford
Reply to: usmarc@loc.gov
To: Multiple recipients of list
Subject: Compression Study : File 1 of 3
File 1 of 3
Josephine Crawford
re: MARBI Proposal 95-2
Univ of Wisconsin
1/31/95
jocraw@macc.wisc.edu
In preparation for the MARBI discussion on Proposal 95-2, I took a little time this weekend to work on the questions raised in the proposal in the section "Concept-based Display" (that is, a compressed OPAC headings display). At the bottom of page 5 of the Proposal, there is the following example and remarks:
"Example of such a display:

     English literature -- SUBDIVIDED BY CHRONOLOGICAL PERIOD
     English literature -- SUBDIVIDED BY FORM OR TYPE OF MATERIAL
     English literature -- SUBDIVIDED BY GEOGRAPHIC AREA

It should be noted, however, that with LCSH strings, such an abbreviated display reflects only the nature of the subdivisions at the level of the first subdivision. Selecting and viewing English Literature -- SUBDIVIDED BY FORM OR TYPE OF MATERIAL does not retrieve all instances of form subdivisions in strings that begin with English literature. Strings with English literature subdivided by geographic or chronological subdivisions may themselves have additional form subdivisions...."
----------------------------------------------------------------------------
I felt that this issue could be clarified by some old-fashioned systems analysis using some real data. In addition, such an analysis might uncover other problems/issues which should be addressed. Given that the Univ of Wisconsin-Madison OPAC has the capability to download a subject heading list which I could then manipulate using a spreadsheet and word processor, I picked out a common keyword search and set to work. I analyzed the data statistically and I manipulated the headings by hand to achieve compression, so that I could show Before/After displays.
This report is divided into three files:
File 1 : Introduction and description of my methodology;
         Contents of Dataset (categories, statistics);
         An observation.
File 2 : Sample "uncompressed" OPAC display (what occurs now).
File 3 : Sample "compressed" OPAC display;
         Some final observations.
Contents of Dataset
-------------------
(Please think of the stats below as preliminary only; double-checking not yet done.)
I chose to perform a topical subject heading keyword search on a single word: BIOTECHNOLOGY. This resulted in an alphabetical list of 480 subject headings, beginning with "Acremonimum--Biotechnology" and ending with "Yeast fungi--Biotechnology." The list includes a large number of headings which begin with the keyword Biotechnology. These 480 subject headings come from 1055 catalog records.
The dataset includes a mixture of LCSH and MeSH headings and the current display makes no attempt to differentiate between the two. In addition, please note that our automated authority control programs have not been extended to topical subject headings as of yet, so that no references or scope notes appear and some of the headings need correction.
I could have performed a left-to-right phrase search rather than a keyword search, thereby limiting my topical subject search to just those headings beginning with the word BIOTECHNOLOGY. However, I prefer to work with a "worst case" scenario in order to uncover issues and analyze solutions. Therefore, I chose the keyword search so that the resulting OPAC display has more content and complexity.
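To make the difference between the two search types concrete, here is a minimal sketch in Python. It is not how the Univ of Wisconsin-Madison OPAC is implemented; the function names and the small sample list (taken from headings quoted in this report) are assumptions for illustration only.

```python
import re

# Hypothetical sample: a few headings quoted elsewhere in this report.
HEADINGS = [
    "Agricultural biotechnology",
    "Biotechnology--Computer programs",
    "Biotechnology industries--Wisconsin",
    "Lignocellulose--Biotechnology--Congresses",
    "Yeast fungi--Biotechnology",
]


def words(heading):
    """Split a heading into lowercase words, ignoring '--' and punctuation."""
    return re.findall(r"[a-z0-9]+", heading.lower())


def keyword_search(headings, term):
    """Match headings that contain the term as a whole word anywhere."""
    term = term.lower()
    return [h for h in headings if term in words(h)]


def phrase_search(headings, term):
    """Match headings whose leading word(s) equal the term (left-to-right)."""
    target = words(term)
    return [h for h in headings if words(h)[:len(target)] == target]


keyword_hits = keyword_search(HEADINGS, "BIOTECHNOLOGY")
phrase_hits = phrase_search(HEADINGS, "BIOTECHNOLOGY")
```

On this sample the keyword search returns all five headings, while the phrase search returns only the two that begin with the word Biotechnology.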
In my analysis, I assigned each of the 480 subject headings to one or more of the following categories:
A) TOPICAL SUBJECT HEADING, NO SUBDIVISIONS

   e.g. Agricultural biotechnology
        Biotechnology industry

   There are only 12 headings of this type but these 12 headings map to 289 catalog records. That is, if a user requests the display of these 12 headings separately, the user would see 289 items divided into twelve separate sets.
B) ONE OR MORE TOPICAL SUBDIVISIONS PRESENT IN HEADING

   e.g. Agricultural Biotechnology--Economic aspects
        Biotechnology--Computer programs
        Lignocellulose--Biotechnology--Congresses
        Marine biotechnology--Research--United States

   There are 360 headings of this type, mapping to 640 catalog records. Of these 360 headings, 82 also have geographic subdivisions and 143 also have form subdivisions, as in the last two examples above. These latter statistics may be important in getting a handle on the compression issue quoted above from the MARBI proposal.
C) ONE OR MORE GEOGRAPHIC SUBDIVISIONS

   e.g. Agricultural biotechnology--Kenya
        Biotechnology industries--Wisconsin
        Biotechnology industries--Wisconsin--Directories

   There are 60 subject headings with a geographic subdivision, mapping to 305 catalog records. In addition, 47 of these are followed by a form subdivision, as in the Directories example above.
D) ONE OR MORE CHRONOLOGICAL SUBDIVISIONS

   There are none of this type in the dataset.

E) ONE OR MORE FORM/GENRE SUBDIVISIONS

   e.g. Microbial Biotechnology--Periodicals
        Pharmaceutical biotechnology--Congresses

   There are 219 headings of this type, mapping to 563 catalog records. PLEASE NOTE: this category makes up just under half of the headings in the dataset, and just over half of the linked catalog records. (I guess I was lucky enough to hit on a search rich with form/genre data.)
F) TWO FORM SUBDIVISIONS IN SAME SUBJECT HEADING

   e.g. Biotechnology--Bibliography--Periodicals

   Three headings in the dataset have two form subdivisions.

G) ALL THREE TYPES OF SUBDIVISIONS IN THE SAME HEADING

   e.g. Biotechnology--Databases--North America--Directories
        Plant biotechnology--Research--Japan--Periodicals

   There are 19 headings of this type; this may be important for the compression issue quoted from the MARBI proposal.
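The category assignment above was done by hand. As a rough sketch of the same logic in Python, the snippet below splits each heading on "--" and assigns it to one or more of categories A through G. The lookup sets GEOGRAPHIC and FORM are hypothetical hand-made word lists covering only the sample headings; a real system would need coded subdivision types rather than word matching. Category D is omitted because the dataset contains no chronological subdivisions, and "all three types" is read here as topical, geographic, and form.

```python
from collections import Counter

# Hypothetical lookup tables, sufficient only for the sample below.
GEOGRAPHIC = {"Kenya", "Wisconsin", "United States", "Japan", "North America"}
FORM = {"Congresses", "Periodicals", "Directories", "Bibliography"}


def subdivision_types(heading):
    """Return the type of each subdivision that follows the main heading."""
    types = []
    for part in heading.split("--")[1:]:
        if part in GEOGRAPHIC:
            types.append("geographic")
        elif part in FORM:
            types.append("form")
        else:
            types.append("topical")  # assumption: anything else is topical
    return types


def categories(heading):
    """Assign a heading to one or more of the report's categories."""
    types = subdivision_types(heading)
    if not types:
        return {"A"}  # no subdivisions at all
    cats = set()
    if "topical" in types:
        cats.add("B")
    if "geographic" in types:
        cats.add("C")
    if "form" in types:
        cats.add("E")
    if types.count("form") >= 2:
        cats.add("F")
    if {"topical", "geographic", "form"} <= set(types):
        cats.add("G")
    return cats


# Sample headings quoted in the category descriptions above.
SAMPLE = [
    "Agricultural biotechnology",
    "Agricultural biotechnology--Kenya",
    "Biotechnology--Computer programs",
    "Biotechnology--Bibliography--Periodicals",
    "Biotechnology industries--Wisconsin--Directories",
    "Plant biotechnology--Research--Japan--Periodicals",
]

# Count how many sample headings fall into each category.
tally = Counter(cat for h in SAMPLE for cat in categories(h))
```

Because a heading can fall into several categories at once, the per-category counts in such a tally add up to more than the number of headings, just as the figures above add up to more than 480.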
An observation
--------------
In an alphabetical, uncompressed display, the display of a main heading and all its subdivisions can be interrupted by an intervening term (and its subdivisions). This is the case in my sample dataset. This problem disappears in a compressed display, as long as the program logic is set up so that the computer searches through to the very end of the headings under the main term with which it is working. I see this as a helpful change and have therefore manipulated my "compressed" display along these lines.
e.g. Biotechnology--History
     Biotechnology industries
     Biotechnology industries--Argentina
     Biotechnology--Information Services
     Biotechnology--Instrumentation
     Biotechnology laboratories
     Biotechnology--Latvia

     [uncompressed display shows interruption of logical sequence]
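The "search through to the very end" logic can be sketched as follows. This is only an illustration in Python, assuming that the main heading is whatever precedes the first "--"; the function name is hypothetical and the compressed display in File 3 was produced by hand, not by this code.

```python
# Headings in the order of the uncompressed display quoted above.
DISPLAY = [
    "Biotechnology--History",
    "Biotechnology industries",
    "Biotechnology industries--Argentina",
    "Biotechnology--Information Services",
    "Biotechnology--Instrumentation",
    "Biotechnology laboratories",
    "Biotechnology--Latvia",
]


def group_by_main_heading(headings):
    """Collect every subdivision string under its main heading,
    however far down the alphabetical list it appears."""
    groups = {}  # dicts keep insertion order, so main headings stay in list order
    for heading in headings:
        main, _, rest = heading.partition("--")
        groups.setdefault(main, [])
        if rest:
            groups[main].append(rest)
    return groups


groups = group_by_main_heading(DISPLAY)
for main, subs in groups.items():
    print(main)
    for sub in subs:
        print("  --" + sub)
```

With this grouping, the four subdivisions of Biotechnology (History, Information Services, Instrumentation, Latvia) end up together, and Biotechnology industries and Biotechnology laboratories no longer interrupt them.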