Idealog Columns: TEXT TWEAKER

Dick Pountain/16 January 2011 12:54/Idealog 198

A couple of years back you'd still hear arguments about whether or not electronic readers could ever take over from print-on-paper. That already feels like a long time ago. I found myself in a mid-market hotel just before Christmas, and when I came down for breakfast in the morning, at *every* table was someone (or a whole family) reading news from an iPad, except for the one that had a Kindle. I had to make do with my Android phone and felt a bit out of place.

I've written here before about my Sony Reader, but it hasn't made the grade and is now gathering dust. Its page turning is just too slow and it's too much of a fag to download content to it, but the final straw was the way it handles different ebook formats, unpredictably and far from gracefully. The problem is simply that I'm not in the market for commercial ebooks and never buy novels from Amazon or publishers' websites. What novels I read, I still read on paper (possibly decades or centuries old) and the rest of the time I read non-fiction that's rarely if ever available as an ebook. Commercial books properly formatted in ePub look fine on the Sony - cover, contents and navigation - but I rarely read them. Perhaps a third of my reading is nowadays done on screen, but almost always on laptop or phone and off a web page: the Guardian website, Open Democracy, Arts & Letters, various blogs, and white papers from numerous tech sites.

I have however collected an extensive library of classic texts and reference works that I use a lot, stored on my laptop to be always available off-line, and it's there the Sony really fell down. I get most of these books from the Internet Archive where they're typically available in several formats: PDF and PDF facsimile (scanned page images), ePub, Kindle, Daisy, plain text and DjVu (an online reading format). However the Internet Archive is a non-profit organisation that relies on voluntary, mostly student, labour to scan works in, so inevitably most documents are raw OCRed output that hasn't been cleaned up manually. Really old books set in lovely letterpress typefaces like Garamond and Bodoni are the saddest, because OCR sees certain characters as numerals so the texts are peppered with errors like "ne7er" and "a8solute". Many such books also contain a lot of page furniture - repeating book and chapter titles in headers or footers for example - that scanning leaves embedded throughout the text, extremely irritating if you consult them often.
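To make the page-furniture problem concrete, here is a minimal Python sketch that guesses at running heads in a plain-text download by counting short lines that recur many times. It is not something the Internet Archive or any tool mentioned here provides: the function name `find_page_furniture` and both thresholds are illustrative assumptions, and a real book would need them tuned by hand.

```python
from collections import Counter


def find_page_furniture(text, min_repeats=5, max_length=60):
    """Return short lines that recur often enough to be likely running heads.

    Hypothetical helper: min_repeats and max_length are guesses, not
    values taken from any real OCR clean-up tool.
    """
    counts = Counter(
        line.strip()
        for line in text.splitlines()
        if line.strip() and len(line.strip()) <= max_length
    )
    # Keep only lines seen at least min_repeats times, most frequent first.
    return [(line, n) for line, n in counts.most_common() if n >= min_repeats]


if __name__ == "__main__":
    # Build a tiny fake scan: the same title heads each of six "pages".
    pages = []
    for i in range(1, 7):
        pages.append("EGOTISM IN GERMAN PHILOSOPHY")
        pages.append("Body text of page %d goes here." % i)
    print(find_page_furniture("\n".join(pages)))
```

Anything the function reports is only a candidate: a short line of verse or a repeated refrain would be flagged too, so the list wants a human glance before anything is deleted.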

One solution is to download a facsimile version, but that's glacially slow to read on the Sony Reader, taking ten seconds to turn each page and looking crap in black-and-white. On a laptop or iPad in colour it's a fine way to read (it even preserves pencilled margin notes), but it isn't searchable, which defeats half the purpose, so I always have to download a text-based PDF or plain text version too. Unfortunately the Sony Reader displays PDFs unpredictably: it only has three text sizes and if you're unlucky none of them will look right, being either too huge or too tiny.

I started cleaning up certain books myself, downloading a plain text version and using Microsoft Word (of all things), which actually has powerful regular expression and replacement expression facilities, though well hidden and with lamentably poor Help. I soon learned how to quickly bulk-remove all page numbers and titles, auto-locate and reformat subheads, and even cull improbable digits-in-the-middle words like "ne7er". However, outputting the cleaned-up results as PDFs proved a lottery on the Sony as regards text sizing, contents page and preserving embedded bookmarks (you need one per chapter for navigation purposes). For many books I found that an RTF version actually looks and works better than a PDF.
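The same kind of clean-up can be sketched with Python's standard `re` module instead of Word's own wildcard syntax, which differs from it. This is a rough sketch under stated assumptions: the function names, the sample running title and both patterns are illustrative, page numbers are assumed to sit alone on a line, and real scans will need patterns adjusted book by book.

```python
import re

# Assumption: a page number is a line holding nothing but 1-4 digits.
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")
# A digit wedged between letters, as in "ne7er" or "a8solute".
DIGIT_IN_WORD = re.compile(r"\b[A-Za-z]+\d+[A-Za-z]+\b")


def clean_text(text, running_titles=()):
    """Drop page-number lines and any line equal to a given running title.

    Note: every line matching a title is removed, so a genuine heading
    with the same wording would be lost as well.
    """
    titles = {t.strip().lower() for t in running_titles}
    kept = []
    for line in text.splitlines():
        if PAGE_NUMBER.match(line):
            continue
        if line.strip().lower() in titles:
            continue
        kept.append(line)
    return "\n".join(kept)


def suspect_words(text):
    """List words with digits in the middle, for manual review."""
    return sorted(set(DIGIT_IN_WORD.findall(text)))


if __name__ == "__main__":
    raw = (
        "EGOTISM IN GERMAN PHILOSOPHY\n12\nHe would ne7er admit it.\n"
        "EGOTISM IN GERMAN PHILOSOPHY\n13\nAn a8solute rule."
    )
    print(clean_text(raw, running_titles=["Egotism in German Philosophy"]))
    print(suspect_words(raw))
```

The sketch only lists the suspect words rather than correcting them, since deciding that "ne7er" should be "never" still needs a dictionary or a human eye.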

Someone tipped me off to try Calibre (http://calibre-ebook.com/), a free, open-source ebook library manager that converts between different ebook formats, and in particular can output in Sony's own LRF file format, which proved more reliable than PDF. It was already too late for me though. Calibre works well but is quite techie to use and, like Sony's proprietary Reader software, it maintains its own book database, so yet another file system to deal with. Eventually I just couldn't be bothered. I've checked Google Books' offer of a million free-to-download public domain titles, only to discover that they are of course the same hastily-scanned copies I already have from archive.org (which stands to reason as once some public-spirited volunteer has scanned an obscurity like Santayana's "Egotism in German Philosophy", no-one else is ever going to do it).

My own gut feeling is that, Kindle notwithstanding, none of the current ebook formats will be the eventual winner and that plain old HTML, in its HTML5 incarnation, will become the way we all read stuff on our tablets in a couple of years' time. Perhaps PDF too if Adobe puts its house in order in time. And we'll need to recruit a whole second generation of volunteer labour to clean up all those documents scanned by the first generation once the Google book project gets into its full stride: the bookworms' equivalent of toiling in the cane fields...

Tuesday, 3 July 2012
