If you’re anything like me, you read a lot. A lot. I’ve got a collection of around three thousand books. I’ve given up all hope of taking them with me when I move (semi-nomadic student; they don’t even fit in my car). I’ve got a smaller subset of my favorites that follow me around wherever I go–but we’re talking thousands of pounds of paper.
When I was younger, I used to read whole books at a time, but I find that now I primarily use them for reference. Being well-read and having a large supply of material to draw from lets me churn out thoughtful papers in record time. But most of my books aren’t really in the “reference” genre–I might remember a quote by Mark Twain, but who knows which of his twenty or so books I might have to read through before I find it.
Sure, I could bite the bullet and get a Kindle (or just use the iPhone app), but I’m no fan of DRM. I want to be able to grep and parse my books; I want to read them from various computers I have lying around the house, and I want the images and technical drawings to render well. I read a lot of old, obscure books that Amazon and Co. will never get around to digitizing. Plus, I’m a poor college student who doesn’t want to buy a digital copy of every book he already owns.
So, with a little trepidation, I decided to embark on a quest to scan and digitize every book I own. I’ve been at this in my spare time for several months now, and I’ve got it down to a science. It’s really not as bad as you might think once you know the pitfalls to avoid.
Step 1: Get the cover off. This is often the trickiest part of the entire enterprise. Paperback covers can be carefully peeled off (I’ve torn a few). Hardbacks will require you to slice through the cover lining.
Step 2: Take the bound pages to a printshop and get them to cut it into loose-leaf pages. Most printshops will cut a book for about a quarter; I found an on-campus bookshop that does it for free.
Step 3: Scan. I found an awesome scanner for my Mac, the ScanSnap S510M (no, I am not getting paid).

This thing has the smallest footprint I have ever seen, and it burns through my textbooks faster than can be described. I can get through a small paperback in 5-6 minutes, and large reference works like von Mises’ thousand-page Human Action over the evening news. I got it for a mere $350 refurbed on NewEgg, and it’s the best $350 I’ve ever spent. Don’t let its cute appearance fool you; this scanner is SERIOUS.
Step 4: OCR. OCR software for Mac is notoriously bad. The stuff that comes with the ScanSnap is actually quite decent, although rather slow. Fortunately it requires no interaction of any kind, so you can just let it run and come back later. For text-and-illustration kinds of things, I leave it on “text and images only.” There’s another option that embeds the text in an invisible layer under the page image, which creates much larger files, but you have the original page to refer to, which works well for complex diagrams and such. I’ve also looked at both ReadIris and OmniPage, and have been rather unimpressed.
Step 5: Binding. For a while, this was the slowest part; I would add ten pages or so and apply some glue, wait a few hours for it to dry, and repeat. Then I discovered Gorilla Glue. This stuff (not the wood glue crap, the real glue) is seriously strong–strong enough to bind everything with just one or two layers. I stack all the pages in between two phone books, clamp them with some C-clamps, and apply a layer of Gorilla Glue:

In an hour or so, my book is bound more firmly than it was the day it was printed. Another dab of glue puts the spine on, and that’s about it! Finished result:
Grepping through Harry Potter:
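
If you would rather script this kind of search than type it into a shell, here is a minimal sketch in Python. It assumes the OCR text of each book has been exported to a plain `.txt` file somewhere under one folder; that layout and the function name `grep_books` are placeholders made up for illustration, not something the ScanSnap software produces on its own.

```python
import re
from pathlib import Path


def grep_books(library_dir, pattern):
    """Yield (file name, line number, line text) for every matching line.

    library_dir is a hypothetical folder of plain-text OCR exports,
    one .txt file per book. Matching is case-insensitive.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    # Walk the folder recursively so books can be grouped in subfolders.
    for path in sorted(Path(library_dir).rglob("*.txt")):
        # OCR output is rarely clean, so replace undecodable bytes
        # instead of crashing on them.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield path.name, line_number, line.strip()
```

Something like `for name, n, text in grep_books("books", r"snow white"): print(f"{name}:{n}: {text}")` would then list every hit with its file and line number, roughly what `grep -rin` gives you.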

Reading on my iPod with Stanza:

Spotlighting for “Narnia” in my (still-incomplete) library:

Did C.S. Lewis ever say anything about the Snow White Disney movie?

Let’s hand my phone to a friend so he can see what books are on my shelf:

I should be on track to digitize EVERYTHING by this summer. What I’ve learned:
- Having my whole library on my iPhone vastly changed my definition of “reading time.” Now, I can catch a paragraph while I’m waiting at the checkout line. It’s one tap to bring up the last book I was reading with the correct page.
- I find myself reading–really reading–more than I’ve ever done since childhood.
- Spotlighting a personal library is ridiculously powerful. You feel almost giddy. “I don’t know the frequency of EM fields, but I did scan this electricity textbook…” BAM, there’s the answer.
- You start doing random searches, and the results are scary. How many books reference the woodchuck-chuck question? Let’s graph my books by publishing date. Can we use a Bayesian network to classify my books by genre? Can we write a script to rip cover art from Amazon.com? The possibilities are endless.
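
As one example of those random experiments, graphing books by publishing date can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the title-to-year mapping is a tiny hand-typed sample rather than a real catalog, and the function names `books_per_decade` and `text_histogram` are invented here. In practice the years would have to come from whatever metadata you keep for each scan.

```python
from collections import Counter


def books_per_decade(publication_years):
    """Return {decade: count}, sorted by decade, from a {title: year} mapping."""
    decades = Counter((year // 10) * 10 for year in publication_years.values())
    return dict(sorted(decades.items()))


def text_histogram(counts):
    """Render the decade counts as a plain-text bar chart, one line per decade."""
    return "\n".join(
        f"{decade}s {'#' * count} ({count})" for decade, count in counts.items()
    )


# Hand-typed sample data for illustration only, not a real library catalog.
sample_library = {
    "Adventures of Huckleberry Finn": 1884,
    "Human Action": 1949,
    "The Lion, the Witch and the Wardrobe": 1950,
    "Prince Caspian": 1951,
}

print(text_histogram(books_per_decade(sample_library)))
```

A text chart keeps the sketch dependency-free; the same counts could be handed to any plotting tool for a proper graph.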