A quarter of the Internet Archive's books are going away
A citizen science investigation
The Internet Archive lost an early ruling in a case against major publishers, and it led to a settlement. Based on a review of 100 books, 25% of the books will be removed because there are Kindle copies. Another 60% have cheap hard copies, and 15% would be unavailable if removed from the Internet Archive.
As I mentioned in an earlier post, one of the amazing things that the Internet Archive does is “controlled digital lending”, where they buy up a bunch of books, digitize them, and then stick them in a warehouse somewhere. They then argue that gives them permission to lend out one copy of each book, just like a normal library would.
If you are familiar with your local library, you might know about Libby/Overdrive or about Hoopla. These are services that your library contracts with to provide digital copies of (mostly) popular books. However, the library doesn’t actually own those books. They have to buy them at rates that are much higher than commercial ebook purchases, and they’re really more licenses than purchases, since they can only be lent out so many times.
These services are severely limited because of the massive category of orphaned works/hostage works. Some works are still under copyright but it’s unclear who the copyright owner is. Others are under copyright, but it would be totally impractical for the rightsholder to make a digital copy (it’s expensive to produce and would bring in very little revenue).
The Internet Archive lost an early ruling in a case against major publishers, and it led to a settlement. The Internet Archive’s settlement says “The lawsuit only concerns our book lending program. The injunction clarifies that the Publisher Plaintiffs will notify us of their commercially available books, and the Internet Archive will expeditiously remove them from lending. Additionally, Judge Koeltl also signed an order in favor of the Internet Archive, agreeing with our request that the injunction should only cover books available in electronic format, and not the publishers’ full catalog of books in print.”
This is my citizen science estimation of how many books that actually is.
I am working on a book about the Chicago Conspiracy Trial, and have been doing research on the Internet Archive. I saved a list of favorites (books, magazines, and public records) for future reference. That’s the list that I’m working from. I do not claim it’s broadly representative of all books (the 1970s are way overrepresented as a decade, and there are a lot of non-fiction books), but some data is better than no data.
I ignored things that weren’t books, which were mostly magazines and the Congressional Record. I don’t know what’s going to happen to those.
How I checked
I searched for each of these books on Amazon. I looked at whether I could buy the book on Kindle, and at how much a physical copy would cost on Amazon. For the physical copies, I ignored paperback vs. hardback and the condition of the book. (The important thing is the knowledge, not the medium.) I picked $30 as my line for a “cheap” book, since that’s kind of the line between an expensive hardback from your local independent bookseller and a weird special order.
I split things into 3 categories:
Kindle available - there is a Kindle book
Available in hard copy - there is no Kindle book and the hard copy is less than $30
Unavailable - there is no Kindle book, and either there are no hard copies available or the only hard copy available is more than $30
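The three categories above can be written down as a small decision rule. This is only a sketch of the logic as I read it, not a script that was used for the survey: the function name `classify`, its parameters, and the constant `CHEAP_LIMIT` are made up for illustration, and treating a price of exactly $30 as "Unavailable" is an assumption, since the categories only cover "less than" and "more than" $30.

```python
from typing import Optional

# Cutoff for a "cheap" hard copy, in dollars (the $30 line from the post).
CHEAP_LIMIT = 30.00


def classify(has_kindle: bool, hard_copy_price: Optional[float]) -> str:
    """Sort one book into one of the three categories.

    hard_copy_price is the price of a physical copy on Amazon,
    or None if no physical copy is for sale.
    """
    # A Kindle edition wins regardless of what hard copies cost.
    if has_kindle:
        return "Kindle available"
    # No Kindle edition, but a physical copy under the cutoff exists.
    if hard_copy_price is not None and hard_copy_price < CHEAP_LIMIT:
        return "Available in hard copy"
    # No Kindle edition, and the hard copy is missing or too expensive.
    # Assumption: a price of exactly $30 also lands here.
    return "Unavailable"


if __name__ == "__main__":
    print(classify(False, 4.39))    # a cheap hard copy, no Kindle edition
    print(classify(False, 400.00))  # only a very expensive hard copy
    print(classify(False, None))    # no copies for sale at all
```

Checking the Kindle flag first matches the order of the categories: the hard copy price only matters once you know there is no Kindle edition.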
25/100 books are available on Kindle. Looking through the list, many of them are what you would expect: books with market appeal (even if only to a narrower slice of readers).
There is a disturbing chunk of very old books with extremely expensive Kindle copies. I only spotted a few of them (3-5), but I would worry that rightsholders would take the Internet Archive’s OCR, not even bother formatting it, and turn it into a very expensive Kindle book.
These aren’t textbooks or books that are likely used in courses; those have a much more obvious reason to have a $30+ Kindle book. The way you can tell is that physical copies are still very inexpensive (<$5).
Available in Hard Copy
60/100 books have no digital copy and an available hard copy for less than $30. This is the majority of these old books, and absolutely where the Internet Archive shines. No digital copy exists, and if you had one more bookshelf, you could fill it with cheap copies of Steal This Dream: Abbie Hoffman & the Countercultural Revolution in America ($6), Left At The Post ($4.39) and Countdown to chaos: Chicago, August, 1968 ($4.65). (I did not keep track of detailed prices for hard copies, since I knew they were going to change.)
It’s tough. Many of these books were purged from libraries because of lack of shelf space, and you’d find them on your block in the Little Free Library for nothing if you were lucky enough to live near a 1960s scholar or aficionado who is downsizing. You might be able to get them from a massive library system (like if you had a university library card or great interlibrary loan), but no normal-sized library is going to hang onto these barely circulating books in hope that someday, a scholar will come along to pluck Excalibur from their shelves.
15/100 books have no digital copy and either a hard copy for greater than $30 or no hard copy.
This was the category I worried most about in hearing the ruling. I found Bobby Seale’s autobiography A Lonely Rage via the Internet Archive and was curious what I could do if it was pulled offline by the ruling. The cheapest hard copy is $400.
This is still better than Trial by Tom Hayden, which has no copies for sale on Amazon at all.
A more typical example is The barnyard epithet and other obscenities; notes on the Chicago conspiracy trial ($39.03). It was not a luxury item at the time of purchase, but has now become quite rare.
I checked each of these books via the “Library Extension” Chrome extension, and looked at whether it was available on Hoopla or at any of the 7 libraries I have cards at (San Francisco, Seattle, King County, Alameda, San Mateo County, Santa Clara County, and Northern California).
9/100 books are available at one or more of my libraries, so it still leaves a pretty big gap that you’d have to fill. 2/100 were on Kindle Unlimited (so you can get them with that Amazon add-on subscription).
Other online booksellers
I looked to see whether the books I didn’t see on Amazon were on Google Play Books. (Note: I work at Google, but this whole thing is in my personal capacity, not my work capacity.) 3 out of the first 25 books which weren’t available on Kindle were available on Google Play Books, so I’d expect an extra 9 books which are available on Google Play Books but not on Kindle. That brings us to 34 books coming offline.
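For anyone who wants to check the arithmetic behind that 34, here is a short sketch. The function and variable names are made up for illustration. It bakes in the same assumption as the estimate above: that the first 25 non-Kindle books are representative of all 75 non-Kindle books, so the Google Play rate in the sample applies to the whole group.

```python
def extrapolate(total_books: int, on_kindle: int,
                sample_size: int, sample_hits: int) -> tuple[float, float]:
    """Scale a sample hit rate up to all non-Kindle books.

    Returns (expected extra books on Google Play Books,
             expected total books coming offline).
    """
    # Books with no Kindle edition: 100 - 25 = 75.
    not_on_kindle = total_books - on_kindle
    # Apply the sample rate (3 of 25) to all 75 non-Kindle books.
    # Multiplying before dividing keeps the result exact here.
    expected_extra = sample_hits * not_on_kindle / sample_size
    # Kindle books plus the expected Google Play Books-only titles.
    expected_offline = on_kindle + expected_extra
    return expected_extra, expected_offline


if __name__ == "__main__":
    extra, offline = extrapolate(total_books=100, on_kindle=25,
                                 sample_size=25, sample_hits=3)
    print(f"Extra books on Google Play Books: {extra:.0f}")
    print(f"Books expected to come offline:   {offline:.0f}")
```

With the numbers from this post, that works out to 9 extra books and 34 books in total.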
Between a quarter and a third of the Internet Archive’s books may be going offline as a result of the settlement. Most of these have reasonably priced Kindle options, but a worrying minority have very expensive Kindle books to replace them.
It’s not all bad news. The cheap books which are interesting to a subset of users are staying around, as are the really expensive or impossible to get ones.
The Internet Archive is still amazing, and if you can, please donate to keep them going.