It was, perhaps, a Herculean task to discover the "most popular texts ever within the English language." Given differences in style, vocabulary, literary era, authorship, difficulty, length, and genre, a quantitative analysis, as opposed to the qualitative analysis traditionally employed by the Humanities in general and literary studies in particular, was perhaps an impossibility. "What about text x?" I can hear critics say. "You forgot about this or that, your methodology is flawed, your analysis is meaningless!" To be quite frank, I expect those criticisms. As I stated in the introduction, one of the primary goals of this project was not to make "one canon to rule them all," to borrow from Tolkien, but rather to shed light on the processes that fuel what our culture and psychology consider worthwhile, important, and/or valuable. It was an attempt to refute the notion that the canon is concrete and unalterable. It demonstrates that technology (specifically, the industrial revolution, rather than advances in mass communication) plays perhaps a larger role in literature, in both production and consumption, than we may have once believed. It demonstrates that free time has perhaps as much to do with quality literature as does "literary genius." It provides a brief glimpse, a rough snapshot, if you will, into the world of literary studies as presently constituted, and it suggests a potential manner in which the field may evolve in the future, for better or for worse. Although attempting to quantify all of human literature through statistical analysis may have been an impossibility, the fact remains that invaluable data was discovered, useful research was performed, interesting conclusions were constructed, and thoughtful ideas were implemented.
Your personal perspective on the canon, including whether it should exist at all and which texts or authors should or should not comprise it, is irrelevant in the face of the traditional pressure and pragmatic implications the existence of such a list demands. Whether we should change its contents, add to or excise from it, divide it into its constituent components, or utterly annihilate it is an argument for another project; now, at least, we have data upon which to base those arguments.
Disclaimer: The research for this project began in late 2020 and has proceeded until the date of last revision, December 2022. Note that all gathered sources and data are from late 2020 to early 2021 by necessity. This project cannot, of course, generate findings or calculations about texts that were published after this period, nor does it take into account data generated after 4/5/2021 (the date of research conclusion). I anticipate and expect the results would be (at least slightly) altered if this project were performed again in the near future, and I further expect a "canon refresh" will have to be performed a number of times through each generation to get an accurate snapshot of texts as they are published and/or canonized.
A number of particularly poignant, and perhaps unexpected, results rise to the surface at the conclusion of this project, as well as a series of reactions and reflections I had while assembling the various components. We shall begin the conclusion there.
Unexpected Incidents
When I first conceived of the project, I expected it to be relatively simple, if quite labor intensive. I assumed that with enough searches and enough data, I could successfully map the information I was after. I was proven quite wrong, and numerous stages of the project took far longer than anticipated. Cleaning the data alone was a monumental task: assigning various stipulations to each text and ensuring all qualifications were met was quite a bit of work. I severely miscalculated the amount of time it would take to fill in nearly 3000 cells on a spreadsheet, even allowing for use of automatic equations. I was also unprepared for the level of prior research I needed to conduct before I could even begin to clean the data. The chapter on genre was a prime example of this effect, requiring a complete rethinking and overhaul of how I considered genre within text before I could accurately assign data values to my textual sources. Eventually, this resulted in my genre classification system, which I believe to be far more accurate than simply labeling texts in one category or another, though I will be the first to admit that it is still imperfect. Further, I was unprepared for how often I was unable to draw an accurate conclusion from my collected data. For the most part, given enough data (and enough time), a pattern or an outlier inevitably emerges, and conclusions may be based upon the presence of that datum. In many instances during this project, however, such was not the case. A good example may be found in our geographical data map, in which the only applicable conclusion I discovered - if it could even be considered as such - was the presence of coastlines and trade as impetus for the exchange of cultural ideas (and therefore literature).
Finally, and this is for the better, I was pleasantly surprised at the number of texts I had never heard of, or only heard of in passing, that I now have a much greater interest in or appreciation for. I absolutely consider myself a more well-rounded literary scholar after constructing this project.
Unexpected Results
Aside from my own learning experiences while assembling the AC, a number of surprising results emerged from the collected data. In no particular order, I will list the four that were most surprising to me. First, I was wholly unprepared for the outsized impact Shakespeare would have on the canon. Going into the project, I fully expected his works to stand alone at the top; he is considered the most ubiquitous for a reason. I was amazed, however, at the level to which his work permeated the scholarship. At nearly 10% of all texts within the top 100 (and, at the time they were written, a far greater percentage), it is not a stretch to state that more people are familiar with the works of the Bard than with the foundational documents of politics, theology, or society. The second surprise I had while assembling the AC was the historical impact of the industrial revolution. Again, while I expected the rate of adoption to climb (especially after Gutenberg's press), it was truly the ability of individuals to acquire the free time to write and read, via the quality of life improvements made by industrialization, that caused the canonical explosion of the 19th and 20th centuries. Given that the canon increased sixfold in size, this technological impact was astounding, especially considering the outsized attention modern mass communication technologies (internet, social media, smart phones, etc.) receive. My third surprise was the absolute domination of the Novel format. Colloquially and anecdotally, novels are seen as popular, yet in the internet age I expected them to prove perhaps even passé. Not so, says the data: they are by far the most popular format, and it's not particularly close. With more Recs than almost every other format combined, a canon consisting only of novels would not be a stretch; an argument could well be made that it is required. My fourth surprise was the text that took the top spot on the list.
While I was of course familiar with Pride and Prejudice (indeed, I suspect few literary scholars are not), its status as the most recommended text in English literature was astounding to me. Having spent many years in institutions both in pursuit of my degree and subsequently teaching, I expected a text like Frankenstein, Invisible Man, or Huck Finn to take the top spot. Of course, this is colored by my own anecdotal experiences, and it was both surprising and enlightening to note that the texts that comprised my education, while placed highly on the AC, were not the seminal texts of literary study, at least according to the data.
Wrapping Up
On the right you will find a series of discussion questions aimed at generating debate about the project, should you choose to have students read and/or respond to it. Each is categorized by chapter, and is designed to get minds turning. Feel free to use all, or none, of these, or to insert your own content. The appendices following this chapter include an exhaustive copy of the data I employed for this project, as well as my citations list. I would like to take the time to thank you for your time and interest in browsing through this project, and I invite you to ask these same discussion questions of yourself, your own educational background, and your own pedagogy. This is the end of the project - until the next one, perhaps years in the making, like this one was - keep blending math and literature and creating knowledge! For more Digital Humanities projects and tools, check out the mDh homepage (linked at the top of this page).
Introduction and Results
1) How accurate do you believe the Google Books compilation to be? Is this figure reliable? Why or why not? How might that change the data?
2) Which text appears in the list of top texts that is most surprising to you? Which did you expect to see, but did not?
3) How far down the list did you browse until you ran across a text you had never read? What about one you had never heard of?
4) At what position on the list does your favorite text appear, if at all? Why do you think that is?
5) What text do you despise that scored highly on the AC? Why do you think it scored highly? What position do you believe it deserves?
6) Are you a fan of fiction or nonfiction? Does the data on the AC concerning the domination of fiction surprise you?
7) Is there a "perfect amount" of canonized texts? What might that number be? Will the canon ever be "complete"? Do you wish there were an authoritative canonical list, or an institution that curates such a list?
History
1) Which time period was the biggest surprise to you, either because of its contributions or its lack thereof?
2) Would you consider this data to align with your expectations, or are you surprised at one of the results?
3) Given the number of texts already on the AC and its apparent accelerated rate of adoption, what is the best course of action for the canon moving forward? Should we consider a diasporic canon, an obliterated canon, or simply change our requirements for canonicity?
4) What impact will the internet and digital age have on the canon, for example, fifty years from now? Will a large number of texts from this era be adopted into it, or will relatively few? How many, and why?
5) Given the impact the industrial revolution had on textual adoption, do you believe the internet age will likewise be a boon? Why or why not?
Authors
1) Given this chapter's nominees, which author do you believe made the biggest impact on the canon?
2) Given the statistical demographic representations (gender, culture, and geography), which authors are under- and/or over-represented?
3) Which author or authors do you believe was most "snubbed" by the AC? Who should appear, but doesn't, or, who should appear more frequently?
4) Does even a single canonical work mean an author is worthy of study, or, like a "one-hit wonder," do you believe sustained success is more important?
5) Given that, as described in the introduction, canonicity is something of a popularity contest, why do you think the popular authors on this list are considered popular?
6) What criteria do you use for determining a "best" author? Number of Recs? CR, GR, or UR? Number of works? Most texts in high positions? Single strongest work? Success in multiple genres or formats? Sustained success over time? Stiffest competition? What metric (potentially including one not discussed in the AC) do you believe is most relevant and accurate?
Format
1) Why do you believe the novel is so crushingly popular in comparison to other formats?
2) What is your own favorite format? How did it fare on the canon? What is the seminal example of that format that was perhaps ignored?
3) How do you reconcile the favorites of the university (UR) with the favorites of the individual (GR)? Does one hold more weight than another? Should it? Why or why not?
4) To what extent do you think the extreme multimodality of contemporary literature, especially born-digital literature, impacts the popularity of format? Do you anticipate a warping of this data in the future, perhaps a new format added? If so, what format?
5) How might a pedagogical setting avoid an overreliance on short texts (owing to logistics and time)? Is such a thing possible, and should educators try?
Genre
1) Given the genre taxonomy present in this chapter, how would you classify your favorite text?
2) Which genre inclusion was most surprising to you, either because it was over- or under-represented?
3) Given our authorial demographics and time period, it is perhaps not surprising that the data demonstrates London is far and away the most popular urban setting (and setting in general). Is this what you expected? Has it simply become tradition, or is its reputation as a literary city very well deserved?
4) How important is genre to your personal enjoyment of a text?
5) Do you believe the genre taxonomy herein is too granular, not granular enough, or just right? What would you add or eliminate, if anything?
Conclusion
1) Have you read any of the longest texts ever written? Did you enjoy it/them? What was the experience like?
2) What was the most difficult text you've ever read? How do you quantify difficulty? Would you read that text again or recommend it to another reader or student?
3) What is your reading speed? How important is reading speed - that is, how important is it to "finish the list" - as opposed to understanding the content that's on it?
4) How many of the top ten texts have you read or studied? What about the top one hundred? What about the entire canon?
5) Is it necessary to read a majority of these 532 works to be considered "well-read" or "well-educated"? If so, how many? If not, why not?
6) What texts are on your own personal canon (a "canon of you")? How similar to or different from the AC is it? How much overlap is there between the two?