If you’re anything like me, you read a lot. A lot. I’ve got a collection of around three thousand books. I’ve given up all hope of taking them with me when I move (semi-nomadic student; they don’t even fit in my car). I’ve got a smaller subset of my favorites that follow me around wherever I go–but we’re talking thousands of pounds of paper.
When I was younger, I used to read whole books at a time, but I find that now I primarily use them for reference. Being well-read and having a large supply of material to draw from lets me churn out thoughtful papers in record time. But most of my books aren’t really in the “reference” genre–I might remember a quote by Mark Twain, but who knows which of his twenty or so books I might have to read through before I find it.
Sure, I could bite the bullet and get a Kindle (or just use the iPhone app), but I’m no fan of DRM. I want to be able to grep and parse my books; I want to read them from various computers I have lying around the house, and I want the images and technical drawings to render well. I read a lot of old, obscure books that Amazon and Co. will never get around to digitizing. Plus, I’m a poor college student, who doesn’t want to re-buy every book he already owns in digital format.
So, with a little trepidation, I decided to embark on a quest to scan and digitize every book I own. I’ve been at this in my spare time for several months now, and I’ve got it down to a science. It’s really not as bad as you might think once you know the pitfalls to avoid.
Step 1: Get the cover off. This is often the trickiest part of the entire enterprise. Paperback covers can be carefully peeled off (I’ve torn a few). Hardbacks will require you to slice through the cover lining.
Step 2: Take the bound pages to a printshop and have them cut into loose-leaf pages. Most printshops will cut a book for about a quarter; I found an on-campus bookshop that does it for free.
Step 3: Scan. I found an awesome scanner for my Mac, the ScanSnap S510M (no, I am not getting paid).
This thing has the smallest footprint I have ever seen, and it burns through my textbooks faster than can be described. I can get through a small paperback in 5-6 minutes, and large reference works like von Mises’ thousand-page Human Action over the evening news. I got it for a mere $350 refurbed on NewEgg, and it’s the best $350 I’ve ever spent. Don’t let its cute appearance fool you; this scanner is SERIOUS.
Step 4: OCR. OCR software for Mac is notoriously bad. The stuff that comes with the ScanSnap is actually quite decent, although rather slow. Fortunately it requires no interaction of any kind, so you can just let it run and come back later. For text-and-illustration kinds of things, I leave it on “text and images only.” There’s another option that embeds the text in an invisible layer under the page image, which creates much larger files, but you have the original page to refer to, which works well for complex diagrams and such. I’ve also looked at both ReadIris and OmniPage, and have been rather unimpressed.
Step 5: Binding. For a while, this was the slowest part; I would add ten pages or so and apply some glue, wait a few hours to dry, repeat. Then I discovered Gorilla Glue. This stuff (not the wood glue crap, the real glue) is seriously strong–strong enough to bind everything with just one or two layers. I stack all the pages in between two phone books, clamp them with some C-clamps, and apply a layer of Gorilla Glue:
In an hour or so, my book is bound more firmly than it was the day it was printed. Another dab of glue puts the spine on, and that’s about it! Finished result:
Grepping through Harry Potter:
Reading on my iPod with Stanza:
Spotlighting for “Narnia” in my (still-incomplete) library:
Did C.S. Lewis ever say anything about the Snow White Disney movie?
Let’s hand my phone to a friend so he can see what books are on my shelf:
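For readers who want to try this kind of search themselves, here is a minimal Python sketch of a grep-style lookup across a whole library. It assumes the OCR output has also been saved as plain-text .txt files in a single folder; that folder layout and the function name search_library are illustrative assumptions, not the setup used for the searches above.

```python
from pathlib import Path


def search_library(library_dir, phrase):
    """Return (file name, line number, line) for every line containing phrase.

    Matching is case-insensitive. library_dir is a hypothetical folder of
    plain-text OCR output, one .txt file per book.
    """
    needle = phrase.lower()
    hits = []
    for path in sorted(Path(library_dir).rglob("*.txt")):
        # OCR output can contain odd bytes, so replace anything undecodable.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line.lower():
                    hits.append((path.name, number, line.strip()))
    return hits
```

Calling search_library("books", "Narnia") would list every matching line with the book's file name and line number, which is roughly what grep -rin does from the command line.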
I should be on track to digitize EVERYTHING by this summer. What I’ve learned:
Incredible article. I have been considering doing this with my college books but I didn’t know a hands-off nearly labor-free way to do it.
Thank you for sharing.
Inspiring idea. I don’t think I have the patience to capitalize all my books, but textbooks are another story. I wrote an article describing how this is a good idea: http://www.philbergeronburns.com/drupal/node/8
As the Director of Sales for a publishing company, I realize that it’s inevitable that the vast majority of books will be going digital. I believe textbooks in particular should be available first and foremost digitally; the benefits far outweigh the negatives. The problem again lies in the owner of the property, the college/school, and where they generate the vast majority of their income. College bookstores count on the funds from the textbooks – it’s very similar to the music industry and the decentralization of it. Publishing can and I believe should be decentralized. There is so much content yet to be discovered, savored and enjoyed. As for digitizing the sector – I believe titles such as textbooks, novels and trade material will be going digital, and bookstores, including the college-owned ones, will be shelving fewer and fewer of them. Whether it’s Amazon’s Kindle or not that we’ll be using is another question. As for transforming existing titles into a digital library, I would love to see hardware and apps that would allow someone to digitize their collection without destroying the book. A handheld wand perhaps?
Finally, I applaud you for the daunting task you have taken upon yourself. Being an absolute bibliophile, I cannot bring myself to undo the lovely work from my collection of beautiful titles. And I believe that is where current publishing will end up – with beautiful books that require actual tactile printing, and that is the place that you will find me some day!
Again, I bow to your tenacity!
Hi,
I also read a lot. But I don’t want to remove-covers/cut/glue my books, so I started working on a vertical solution using photos instead of a scanner; it’s fast. The only problem is that you need programs to correct the perspective and page curvature, and a 5-megapixel camera needs a much better OCR algorithm (I’m creating my own too) because letters are joined together in the photo.
I have to work with ScanSnap and they are really quite good despite their tiny appearance. Also their software is excellent.
What is the file format of the books? Can you give us an idea of some typical file sizes?
I guess you probably already understand the fact that you have opened yourself up to a world of litigation. You have just copied and transformed the format of a load of copyrighted works, and in many countries the publishers of those books would have every right to sue you.
In the UK it’s in breach of copyright to convert music from a CD to MP3 and copy it to an iPod. It’s done everywhere, but it’s not actually allowed based on the licence granted.
Your copied books are almost certainly in breach of that copyright so watch yourself.
Hey Paul,
In the country where I live (US), there’s a well-established right called timeshifting, recognized by the US Supreme Court in 1984 (see Sony v. Universal). Basically, changing the format of a copyrighted work from one form to another form is considered fair use under U.S. Copyright Law. Because of that precedent, it is no longer necessary to go through the four-factor test to show timeshifting as fair use.
I should add that this doesn’t necessarily authorize me to reverse-engineer DRM as per the DMCA. However, since all the books I’m scanning are paper (plaintext), the DMCA doesn’t apply.
The timeshifting right is the same legal right that is behind DVRs, VCRs, etc. I am not a lawyer, but I understand similar principles are at work in many countries, including the UK.
Neat article, with good disbinding tips. I’ll definitely have to consider this (I already have a ScanSnap so I’m all set).
I actually don’t mind the file sizes, really, so I would probably scan page images (with the OCRed text in the PDF for searching). Disk is cheap and gets cheaper fast.
A couple notes to go along with this:
1. Google does all of their scanning for Google Book Search using digital cameras. They have an automated process for determining the curvature of the book and compensating for it after taking the photo. Alas, those machines aren’t available.
2. I’ve noticed that technical publishers are increasingly putting out PDFs of their books. THANKS!
3. I’m actually interested in getting a Kindle DX, since it has native PDF reading. To me, the Kindle DX seems like it would be a much more pleasant reading experience than an iPhone.
Excellent article. Thanks!
I’m absolutely amazed at your ingenuity and patience. You’ve taken such a daunting task upon yourself, and you’ve found a way to do it efficiently (though I can’t imagine it’ll ever be easy).
I applaud you, sir (or madam).
@Paul: If everyone does it, then the law doesn’t apply. It’s like the dry law: politicians can’t put 95 percent of the people in jail in a democracy. If they try, they’ll be out of politics before they can blink an eye.
I don’t care if a publisher puts a message in MY BOOK that I purchased that says I can’t copy it for my own use. I can copy it, because it’s mine; I can’t give it to others, but I can copy it for my own use, and I will. They can try to put me in jail because I bought their book and wanted to use it on another device, like an ebook reader or a laptop like this:
http://jkkmobile.blogspot.com/2009/06/hands-on-with-pixel-qis-new-epaper.html
They will wake a giant if they do (>90% of the population think this is fair use).
I want to do this to my library of ~500 technical books, so thanks very much for writing this article. I just bought a ScanSnap S510M refurbished on Amazon. I have a few questions about Step 1. Could you include some photos of this step, with hardback and paperback? I can’t really imagine what it entails to remove the cover of a book without ripping the pages. Also, could you upload larger images? The ones you have are too small for me to really see what’s going on. For step 2, how do they actually cut it into loose-leaf pages; do they somehow cut it page by page? That’s kinda cool. For step 5, more photos would be helpful. How do you avoid getting a gooey mess? I don’t really understand where you’re applying the glue, either. Maybe this would be easier if I just found a book I didn’t care about and destroyed it and then put it back together…
Hi,
this is most fascinating! Why did you choose the ScanSnap? Was it just the price and a decent deal, or did you have some specs in mind, some experience with this particular model, whatever?
If it weren’t for having to manually put each page on the scanner, would you go for a flatbed scanner? Has positioning ever been a problem with the ScanSnap (which feeds itself the pages like a Xerox machine does, I assume)? Have you had paper jams and things like that?
Really, this is a great project. It makes me want to do the same. Maybe we could all persuade you to a sequel? Something like “The joy of making electronic books”??
Thanks for sharing!
I chose the ScanSnap because it was the highest-reviewed Mac-compatible ADF scanner at its price point. I wasn’t disappointed. This thing is truly pretty epic.
I did the flatbed scanner thing for a couple of books, but due to my crappy scanner and crappy OS X software I was working at like 30 seconds per page. With my new scanner and an ADF, I’m averaging about two pages per second, 60x the speed. If I only load about 12-15 sheets in the ADF at a time, I never have a jam. I believe you can put in about 25 sheets without many problems. For really thin paper that sticks together (bible-like paper, etc.) I will sometimes put in only 2-3 sheets at a time so they don’t stick together.
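The arithmetic behind that comparison can be sketched in a few lines of Python. The two rates are the rough figures quoted above; the 1,000-page book length is only an illustrative assumption, and the estimate ignores the time spent reloading the feeder.

```python
def scan_minutes(pages, seconds_per_page):
    """Estimated pure scanning time in minutes, ignoring reloading and jams."""
    return pages * seconds_per_page / 60


FLATBED_SECONDS_PER_PAGE = 30.0  # "like 30 seconds per page"
ADF_SECONDS_PER_PAGE = 0.5       # "about two pages per second"

speedup = FLATBED_SECONDS_PER_PAGE / ADF_SECONDS_PER_PAGE
print(f"speedup: {speedup:.0f}x")

pages = 1000  # illustrative book length, an assumption
print(f"flatbed: {scan_minutes(pages, FLATBED_SECONDS_PER_PAGE):.0f} min")
print(f"ADF: {scan_minutes(pages, ADF_SECONDS_PER_PAGE):.1f} min")
```

Under these assumptions a thousand-page book drops from hours of flatbed work to a matter of minutes in the feeder, before counting reloads.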
@Drew Crawford
Ah. Okay, I understand speed was one of your priorities. There the ScanSnap rulez.
The reason I was asking about the flatbed one is the option *not* to have to tear the books apart. Maybe bend them a little harder than usual.
Could you just name the OCR software that comes with the ScanSnap, please? Maybe it’s available for other models of scanners as well.
@Wolf
It’s a Mac port of ABBYY FineReader. The Mac port is only available with ScanSnap scanners.
Great article! Can you share the settings on the ScanSnap you find work best, such as DPI?
Neat idea. If folks are looking for a shortcut that has SOME of the benefit of having all your books scanned, then try creating a “My Library” over at Google Books and you can add all your books and search them. For most books, it will give you the page number(s) where the search term was found in any book in your “My Library.” This isn’t foolproof because 1) they don’t have every book and 2) some publishers restrict them from displaying the page number where the search term is found.
This does work for MOST books and doesn’t take too much time to set up.
Hi Drew, this article is awesome!!! I have just bought an iRex iLiad reader and I’m also trying to digitize my whole library… I also have Mises’ Human Action and Socialism… too much work.
I’m having trouble with the OCR… maybe it doesn’t work very well in Spanish. But this is not important.
What I’m really having trouble with is the size of the PDF files. For example, with the third volume of Stieg Larsson’s Millennium I got a PDF of 5 MB when there is only text in the file. This book shouldn’t weigh more than 1 MB… but I cannot manage to reduce the file size.
Do you have any tricks for this?
Thank you very much in advance…
@drew Drew, your comment about time shifting is out to lunch. New York Times Co. v. Tasini, 533 U.S. 483 (2001), blew it right out of the water. Check it out.
Writers own their work unless they specifically assign their copyright to someone else (e.g. a publisher). You can try and steal it if you want, but you’d better believe that’s actionable.
I’d like to direct you here.
With all the doggone snow we have gotten lately I am stuck inside; fortunately there is the internet. Thanks for giving me something to do.
The next step is to convert certain of those books to audiobooks so you can devour them when your eyes are elsewhere employed. Actually, I believe the Kindle will already do this for you, allowing you to pick up aurally where you left off. The voice is alright, but not wonderful. The best of the text-to-speech voices I’ve found, which I use daily, come from Ivona. Check out the British ones (Amy & Brian). I think the new American ones (Kendra & Joey) are pretty good too.
Oh, and if you want to speed up the scan process, forget the initial OCRing. Just OCR the whole folder using the batch processing feature in Acrobat, which comes with the ScanSnap (Document > OCR Text Recognition > Recognize Text In Multiple Files Using OCR).
Thanks for the good article!
I know this is a two-and-a-half-year-old post, but I was wondering if you still had the images around somewhere. I’ve been looking into doing this for a while myself, thanks.