Thus far, analysis of the AC has mixed numerical data and statistical analysis with necessary curation and opinion, due to the nature of the canon itself. This chapter relies far more on empirical data than previous entries, as we examine our texts strictly "by the numbers." Our content for this chapter includes book length (measured by word count) and reading level (using evidentiary formulas alongside reading evidence), examined through the lenses of our previous chapters (history, authorship, format, and genre).
How Many Words?
The Aggregate Canon totals a whopping 42,805,891 words across all texts. This mind-bendingly large figure is representative only of the fiction on the AC; if we were to expand to nonfiction, the number would grow considerably higher. Marc Brysbaert, writing for the Journal of Memory and Language in 2019, pins the average reading speed for silent reading at 238 words per minute, and reading aloud at 183 words per minute (par. 1). Both figures are representative of adults and are drawn from over 190 studies and 18,573 participants (par. 2). If we were to assume these numbers are accurate across all literature on the AC (perhaps an overstep, as children's literature, for example, is a far easier read than Pride and Prejudice, while Finnegans Wake is far harder), then the average reader could consume 14,280 words an hour. Assuming also an eight-hour reading day, equivalent to a typical American workday, the average person could consume 114,240 words/day. At this pace, reading every day with absolutely no breaks or pauses, a single individual could consume all of the fiction on the AC in 374.7 days: essentially, a full year of eight-hour reading days. More realistically, given the relative difficulty of text on the AC (as we will examine within the Reading Level criteria), and assuming most adults cannot dedicate a full eight hours a day to reading, this number is far higher. Assuming, for example, the reading level on the AC requires a words per minute calculation of 200 (Brysbaert's lowest estimate for fiction content (par. 3)), this puts the total number of words read per hour at 12,000. Assuming adults could spend, at most, 90 minutes a day reading AC texts, this equates to 18,000 words/day, or roughly 2,378 days...essentially, about six and a half years. Considering most adults don't begin reading at an adult reading level until 12th grade or post-secondary education, we can reasonably assume the first 18-19 years contain no AC progress.
Therefore, at this lowered pace and frequency, you could start reading for 90 minutes a day, every day, at eighteen, and you would finish the canon in your mid-twenties. However, considering the data we uncovered in Chapter Two, that in the 20th Century the AC grew at a rate of one new text every 2.25 years, roughly three additional texts would have been added to the canon between when you began and when you finished your endeavor. This rate of adoption is key, as is the number of words in each new text (which cannot be predicted); if these texts are on the longer side, or if the rate of adoption increases, the finish line moves further away even as you read.
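The reading-time arithmetic above can be checked with a short script. This is a minimal sketch, assuming only the figures quoted in the text (the total word count, Brysbaert's reading speeds, and the two daily reading budgets); the function and variable names are illustrative and not part of the original project.

```python
# Total words of fiction on the AC, as reported in this chapter.
AC_FICTION_WORDS = 42_805_891


def days_to_read(total_words: int, words_per_minute: float, minutes_per_day: float) -> float:
    """Days needed to read total_words at a steady pace with no days off."""
    words_per_day = words_per_minute * minutes_per_day
    return total_words / words_per_day


# Scenario 1: 238 words per minute for an eight-hour reading day.
full_time_days = days_to_read(AC_FICTION_WORDS, 238, 8 * 60)

# Scenario 2: 200 words per minute for 90 minutes a day.
part_time_days = days_to_read(AC_FICTION_WORDS, 200, 90)
part_time_years = part_time_days / 365.25

print(f"Eight hours a day at 238 wpm: {full_time_days:.1f} days")
print(f"Ninety minutes a day at 200 wpm: {part_time_days:.1f} days "
      f"({part_time_years:.1f} years)")
```

Changing the words-per-minute or minutes-per-day arguments shows how sensitive the estimate is to each assumption.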
This numerical value might seem too large to fathom, or it might seem like insignificant number-play. The truth, however, is that this figure has vast ramifications. Essentially, it means that the canon has grown so large that few readers could realistically read it in full; exactly the problem, you will no doubt remember from the project introduction, the Canon was introduced to combat. This feature, perhaps more than any other related to taste, genre, or content, suggests to me that the future of the canon is a fragmented one, where smaller canons dedicated to particular authors, genres, aesthetics, time periods, and the like begin to appear and become popular.
The average length of a canonical text is 124,705.05 words (at average novel page size of 5.5in x 8.5in, approximately 220 pages, depending greatly on font size). Note that we have utilized word count over page count for this specific reason: page count can change drastically based upon font size, page size, white space, chapter titles and headers, forewords, afterwords, indices, introductions, endnotes, and any number of other elements typically found within a text. Further, the word counts presented in this chapter are estimates based on research and extrapolation; differing versions of a text can also yield different total word counts, sometimes off by as many as ten thousand or more. A more accurate representation would likely be character count, though despite access to a number of electronic versions of AC texts, not all were available for this metric to be examined. This is perhaps an area for a future project.
Finding word counts was a two-part process. First, the text was researched, and any available resource that gave a number was averaged with all other sources (some sources were discarded as outliers if they were significantly distant from the other sources, either high or low (low was more common)). Then, a virtual text was selected from Project Gutenberg, loaded into Voyant Tools, and then again into Microsoft Word, and word counts were taken from each and averaged. This electronic average was then compared to, and averaged with, the figures found during research. In some cases, no electronic version was available, hence the presented information is limited to research only. In other cases, only estimates were available (which is why a multitude of texts contain round numbers of words, virtually a statistical impossibility). I believe this to be a rather exhaustive approach, even if individual counts are off by as many as a thousand or so. It is for this reason I have grouped the texts into tiers, which we will examine in detail later in the chapter.
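The two-part process described above might be sketched as follows. This is only an illustration under stated assumptions: the chapter does not specify how outliers were identified, so the 25% band around the median used here is a hypothetical rule, and the function names and sample figures are invented for the example.

```python
from statistics import mean, median


def research_average(source_counts, tolerance=0.25):
    """Average published word counts after discarding outliers.

    ASSUMPTION: a source is an outlier when it sits more than `tolerance`
    (25% by default) away from the median. The chapter gives no exact rule.
    """
    mid = median(source_counts)
    kept = [c for c in source_counts if abs(c - mid) <= tolerance * mid]
    # Fall back to all sources if the band happens to exclude everything.
    return mean(kept or source_counts)


def estimate_word_count(source_counts, electronic_counts=None):
    """Combine the research average with the electronic average, if one exists."""
    research = research_average(source_counts)
    if not electronic_counts:
        # No electronic version available: research figures only.
        return research
    electronic = mean(electronic_counts)  # e.g. the Voyant Tools and Word counts
    return mean([research, electronic])


# Hypothetical figures for a single text: three published counts (one a low
# outlier) and two counts taken from an electronic edition.
print(estimate_word_count([120_000, 122_000, 60_000], [123_400, 123_600]))
```

Averaging the research figure and the electronic figure with equal weight follows the description in the text; a different weighting would be an equally easy change.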
The Top and the Bottom
The longest text on the entire AC (and it's not particularly close) is the Indian epic The Mahabharata, which clocks in at an astounding 1.8 million words. With a conjectured six authors writing at various points over nearly six hundred years, the massive epic is perhaps the longest piece of fiction in human history. Perhaps unsurprisingly, a text this mammoth received only minimal AC support, garnering a single GR. In a trend that will likely not surprise you, the longest texts within this chapter (again, only fiction was analyzed here) receive, for the most part, strong support within the GR metric but comparatively little UR. UR, you will recall, is a metric that examines the recommendations of universities and syllabi; it would be a long class that enabled students to read 1.8 million words in a single semester! As discussed in Chapter Four (Format), the Short Story, the Novella, and Poetry reign supreme at the University, likely due to logistical concerns (or, perhaps, in an attempt to keep reading short to promote student interest and participation). To help put the size of the Mahabharata in perspective, it might help to know that it is roughly twice as long as the entire Harry Potter series combined, and more than three times as long as the entire Lord of the Rings trilogy. It is longer than reading the King James Bible...twice in a row. The second longest text on the AC, and the longest novel ever written, is Marcel Proust's In Search of Lost Time, at 1.26 million words. This text enjoys far more support than its longer predecessor, with 11CR (all 11 are GR). The only other text to total over one million words is the entire compendium of Middle Eastern tales, One Thousand and One Nights. This text also enjoys 11GR (and also 1UR, likely a single story from the anthology). The highest-scoring high-length text in terms of UR is Malory's Le Morte d'Arthur, which has 7UR and 8GR (15CR), and clocks in at 804,810 words.
The only text with double-digit UR over 270,000 words is the King James Bible (783,137 words, 12UR, 28CR).
Conversely, the shortest text on the AC is Lewis Carroll's "Jabberwocky," at only 167 words (and we use the term "words" loosely here)! As you might suspect, poetry utterly dominates the field at this short length, with five of the six shortest texts on the AC being poems. Just longer than "Jabberwocky" are "Dulce et Decorum Est" (222 words, 8CR), "Dover Beach" (244 words, 7CR), and "The Raven" (1,097 words, 11CR). The shortest prose on the AC is Ambrose Bierce's short story "An Occurrence at Owl Creek Bridge" (814 words, 5CR).
In attempting to find a trend within text length and popularity, I devised our tier list and ranked texts according to CR/text within each respective tier. Results are shown in the chart at right. As you will note, CR/text stays almost completely flat (demonstrated by the light blue trendline), even as text length fluctuates wildly. The largest group of AC texts falls within the 10k-49k word range (novellas, epic poetry, most drama), followed by a host of short novel-length works in the next tier (50k-99k words) and "typical" length novels in the tier of 100k-199k words. The count begins to drop thereafter, with long novels (200k-499k) representing just over half the number of texts in the previous tier, and then drops off a cliff to single-digit entries at the higher tier levels. This data indicates, interestingly, that there is little to no correlation between text length and popularity or support; longer texts do not necessarily receive more, or less, support based solely on length.
Length Tiers
The texts have been, as previously stated, grouped into tiers of length for easier analysis, which are represented on the chart above. They are:
Unnecessarily Long (more than 1 million words): 3 texts
Incredibly Long (750k-999k words): 4 texts
Marathon-Length (500k-749k words): 7 texts
Very Long Fiction (200k-499k words): 41 texts
Novel Length (100k-199k words): 73 texts
Short Novel Length (50k-99k words): 73 texts
Short Fiction (10k-49k words): 108 texts
Very Short Fiction (less than 10k words): 34 texts
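The tier boundaries listed above can be expressed as a simple lookup. This is a minimal sketch, assuming each tier is defined by its lower bound; the list does not say how a count of exactly 1,000,000 words should be treated, so placing it in the top tier is an assumption, as are the names `TIERS` and `length_tier`.

```python
# (lower bound in words, tier name), ordered from longest to shortest.
TIERS = [
    (1_000_000, "Unnecessarily Long"),
    (750_000, "Incredibly Long"),
    (500_000, "Marathon-Length"),
    (200_000, "Very Long Fiction"),
    (100_000, "Novel Length"),
    (50_000, "Short Novel Length"),
    (10_000, "Short Fiction"),
    (0, "Very Short Fiction"),
]


def length_tier(word_count: int) -> str:
    """Return the name of the length tier that a word count falls into."""
    if word_count < 0:
        raise ValueError("word count cannot be negative")
    for lower_bound, name in TIERS:
        if word_count >= lower_bound:
            return name
    raise AssertionError("unreachable: the final tier starts at zero")


# Word counts quoted in this chapter.
for title, words in [
    ("The Mahabharata", 1_800_000),
    ("Le Morte d'Arthur", 804_810),
    ("Jabberwocky", 167),
]:
    print(f"{title}: {length_tier(words)}")
```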
While popularity is seemingly not impacted by text length, length is a highly common critique when examining the most difficult texts to read. Anecdotal evidence from online comments suggests that famously long texts like War and Peace, Infinite Jest, and Atlas Shrugged are consistently listed as some of the more difficult to read, not because of their content or lexical complexity, but because of their length. We will further examine this later in the chapter when we discuss reading level, but for the moment, suffice it to say that the longer the text, the more difficult it is to read, a fact that should be self-evident.
Trends Throughout History
As discussed at length in Chapter Two, the number of texts adopted onto the AC has changed significantly over the centuries. The graph at right details the rate of adoption into the AC not based on text number or CR, but rather word count, further illustrating the absolute dominance of the 19th and 20th centuries in terms of literature added to the canon. We can see that contributions to the canon increased dramatically after the Industrial Revolution, not just by number of texts, but also by the number of words (a product, of course, of the number of texts). Conversely, however, as shown in the graph at right, the average length of a text actually decreased quite significantly, and continues to do so, through the 20th century and beyond. This may be at least partially explained by the relatively low sample size of earlier time periods (the 1st-12th centuries in particular, peaking thanks to Dante and the Divine Comedy), but as the pink trend line illustrates, this is a pattern that must be investigated. We are trending toward shorter and shorter texts, though at a healthy ~100,000 words, we are still comfortably in the realm of the novel, which might proffer another possible explanation for the downward trend. Specifically, as the novel grew in popularity, so too did the average expected length of a text begin to normalize around its typical word count. This is partially contradicted by the decrease from the 19th to the 20th centuries, however, as both centuries had their fair share of novels, and the preferred (canonized) text from the 20th century has about 4% fewer words than those from the 19th century (120,739.36 words/text in the 19th century, and 115,897.17 words/text in the 20th). Please note again that these statistics are limited to AC fiction only.
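The century-to-century comparison can be verified directly from the two averages quoted above. This is a small sketch; the helper name `percent_change` is illustrative.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, expressed as a percentage."""
    return (new - old) / old * 100


# Average words per canonized text, as reported in this chapter.
nineteenth_century = 120_739.36
twentieth_century = 115_897.17

change = percent_change(nineteenth_century, twentieth_century)
print(f"Change in average text length: {change:.1f}%")
```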
It may surprise most individuals to realize, especially when confronted with data like that above, that despite the seeming "waning" interest in longer texts, or, for the truly cynical, in reading literature in general, readers today consume far more material than at any point in human history. The proliferation of written communication like SMS, email, and, in the collegiate setting, the LMS, has given students more opportunity in the twenty-first century to read and write than at any point before. This is to say nothing of the entire institution of the internet itself, which is of course built upon the very idea that we can use HTML (HyperText Markup Language) to mark up and present text for user consumption. Despite the mathematical evidence above, which suggests that reading is dying away - at least, in the "classic" sense - or anecdotal evidence from anyone at or above a particular generational age group, reading and writing are both very much alive and well, albeit in a form that may not necessarily reflect the classic tradition.
This, in addition to promoting further the idea that there may in fact be a need for multiple specific canons of literature, each dedicated to a platform or medium, also highlights the importance of digital studies moving forward. No longer is "email and message boards" a suitable method for introducing the digital to the pedagogical; the evolution not only of technology but also of reading itself has changed that. We would do well to change along with it.
There is one final possible explanation for the dip in text length at the newest stage, and it is far more optimistic: with over two thousand years of practice, we've, quite simply, gotten better at communication through literature, and therefore need smaller amounts of text to accomplish the same goals. Perhaps this is due to the proliferation of things like metaphor and allusion, which, when working from a larger pool of resources (that is, the larger the canon becomes, the larger the pool of references from which we may draw also becomes), allow more to be said with less. Perhaps this is an example of linguistic rearrangement and language evolution: the waning and eventual near complete erasure of "whom" and "thus" and "thee" from the lexicon, for example. Perhaps, also, more poignant things may be said with less. This fact has been known for a long, long time, and was codified by our most prolific author, in fact, as Shakespeare wrote in Hamlet, "Brevity is the soul of wit." Do not look at the above trends and either despair or decry the death of literature; just the opposite, in fact, may in all likelihood be true.
Reading Level and Length
The second part of this chapter deals primarily with reading levels within the texts of the AC, but as reading level - specifically, difficulty of text - is closely tied to length of text, a few important caveats must first be made regarding the following calculations. Examination of reading level is a difficult and far from perfect science; despite various algorithmic techniques and a suite of advanced machine-reading tools, the true "readability" of a text remains elusive to us, as texts which are, colloquially speaking, considered difficult to the point of impossible earn "easy" scores, while far less challenging texts are scored as much harder. Because most of the algorithms tend to follow the expected rules of English (i.e., punctuation, etc.), many texts which eschew those rules, or else use them atypically, like modernist writing or poetry, tend to score far outside of their expected range, either high or low. This being said, every effort has been made to identify each text's reading level based on a number of formulas, and this information has been combined with additional research (a "hardest text AC," if you will) to generate a final Difficulty Score (DS), which is outlined below. The methodology for finding the DS of each text is as follows: first, a random snippet of the text (3000 words) was copied from a digital version of the book, and then processed using readability algorithms. Note that because of the required length of the sample, any text shorter than 3000 words was automatically disregarded, as was any poetry, owing to the shortcomings of the formulas. Also note that there is a significant difference, as any literature scholar can assure you, between reading and understanding a text. Finally, note that the formulas are, of course, utterly unable to accommodate the subtleties of literature, so things like content level, subtext, allusion, and the like are wholly ignored. In other words, please take these results with a grain of salt.
Once the 3000-word sample was acquired, the algorithm used to process the sample was the Gunning-FOG formula, defined as 0.4 * (ASL + PHW), where ASL is Average Sentence Length and PHW is the percentage of "hard words," that is, words of three syllables or more. Essentially, the formula looks at how long the sentences are and how frequently challenging words appear, then multiplies the sum by a constant. Though not a perfect measure, this number provides a good first step toward seeing the rough reading level of a text. Once this number was acquired, the next step in the methodology was to multiply by a "Length Mod," reflective of a text's difficulty according to length. The length mod is equivalent to the word count / 100,000. A text with a 5.15 length mod has 515,000 words (in this case, Stephen King's The Stand). Finally, each text was scored with a "Comp. Mod," otherwise known as a comprehension score. This is a corrective number that helps rein in particularly poor showings from the G-FOG algorithm, and is typically situated around 1 (although it is drastically higher for some texts).
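The scoring steps above might be sketched as follows. This is a rough illustration under several assumptions, not the tool actually used for this chapter: syllables are estimated by counting vowel groups, sentences are split on terminal punctuation, the length mod divisor of 100,000 is inferred from the example of The Stand (515,000 words, length mod 5.15), and the Comp. Mod is assumed to be applied as a multiplier. All function names are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count runs of vowels (a heuristic, not a dictionary)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def gunning_fog(text: str) -> float:
    """Gunning-FOG index: 0.4 * (ASL + PHW)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("sample needs at least one sentence and one word")
    asl = len(words) / len(sentences)  # Average Sentence Length
    hard_words = sum(1 for w in words if count_syllables(w) >= 3)
    phw = 100 * hard_words / len(words)  # Percentage of Hard Words
    return 0.4 * (asl + phw)


def length_mod(word_count: int) -> float:
    """Length modifier; ASSUMES word count / 100,000 (515,000 words -> 5.15)."""
    return word_count / 100_000


def difficulty_score(g_fog: float, word_count: int, comp_mod: float = 1.0) -> float:
    """Difficulty Score; ASSUMES the Comp. Mod multiplies the other two factors."""
    return g_fog * length_mod(word_count) * comp_mod


sample = "The cat sat. The dog ran."
print(gunning_fog(sample))
print(difficulty_score(6.5, 515_000))
```

Because the syllable counter is only a heuristic, scores from this sketch will differ somewhat from those of established readability tools, which is in keeping with the caveats already raised about the formulas themselves.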
The graphs at right outline the top twenty-five most difficult and the top twenty-five easiest texts on the AC, according to length, lexical level, and comprehension modifier, organized by DS (Difficulty Score). Keep in mind that some texts were necessarily excluded from the chart for "easiest" as they did not meet the G-FOG algorithmic threshold (insufficient data; where most poetry ended up), and no nonfiction texts were included.
Immediately, we can begin to draw some conclusions from this proffered data. First and foremost, note that difficulty appears to most often be a product of textual length. While putting this score together, I included the length mod as a nod to this potentiality, although it ended up absolutely dominating the difficulty of text. I can't necessarily argue against it, however; both Carroll's "Jabberwocky" and Joyce's Finnegans Wake maintain exceedingly difficult, almost nonsensical writing styles, yet the brevity of Carroll's poem permits it to be read by elementary school children, while Joyce's magnum opus is read only by those extremely dedicated to its scholarship. The G-FOG score, you will also notice, is not always an accurate predictor of difficulty. Joyce's two primary texts, Finnegans Wake and Ulysses, each scored extremely low on the G-FOG algorithm (6.5 each), meaning, theoretically, a sixth-grade student could read either of those texts. From a lexical standpoint, this may indeed be accurate, but there is absolutely no way a sixth-grade student would make any sense of either work. The same is true on the bottom half of the scale for surprising results like Marlowe's Doctor Faustus, a text accorded to tenth-grade students which, in all likelihood, is far more difficult. Marlowe likely sees this low result due to the format of the work (drama), which, with a lower word count, finds itself falling to the bottom of the rankings. This is the same reason that Shakespeare, for example, did not populate the top twenty-five. For reference, Shakespeare's most difficult work was Hamlet (also his longest), while his easiest work was The Comedy of Errors (also his shortest). This is likely not coincidental.
If, instead of Difficulty Score (DS), the G-FOG algorithm were used as our sole metric of difficulty, the most difficult texts on the AC would be, in order: Lolita (Nabokov, 15.4), One Hundred Years of Solitude (García Márquez, 15.1), Don Quixote (Cervantes, 15.05), The Hobbit (Tolkien, 14.4), On the Road (Kerouac, 13.5), and Midnight's Children (Rushdie, 13.2).
The easiest, according to G-FOG, are: The Color Purple (Walker, 3.0), The Sound and the Fury (Faulkner, 3.0), The Postman Always Rings Twice (Cain, 4.2), and Cathedral (Carver, 4.4). You'll undoubtedly note the presence of Faulkner here as tied for "the easiest text on the canon to read," an undeniable inaccuracy to go along with those found within the Joyce texts (though you'll also likely note the tendency of G-FOG to place little emphasis on the modernist writing style as difficult, given both of those authorial pedigrees). This is why the DS includes a number of different features, including length and lexical score, to account for these errors. Even so, I cannot guarantee the accuracy of these results - especially considering some texts are considered "difficult" due not to any empirical evidence, but rather to their content, an unmeasurable facet of literature. We could perhaps cross-reference this data with that found within Genre to look for patterns, but that analysis is outside the scope of this project.