Peter Sommer is Research Fellow at the Computer Security Research Centre, London School of Economics, earns most of his income from advising insurers and from expert witness work, and is Special Advisor to the Commons Trade and Industry Select Committee on E-Commerce. Details: http://csrc.lse.ac.uk/Sommer/sommer.html
Nearly all of the Optical Character Recognition packages available assume that the source material consists of individual detached sheets of print matter and that users will be operating from a fixed computer. But vast amounts of legal, academic, consultancy and many other forms of work rely heavily on the insertion of fragments of quotations from elsewhere; and these are often to be found in bound books and in locations remote from, or inconvenient to, the desk-top computer and flat-bed scanner.
Conventional OCR has developed apace – good quality scanners may now be had for under £100 (though industrial strength scanners with automatic page feeders cost rather more); the price typically includes reasonable quality ‘Light Edition’ versions of OCR software. Another £80 or less will buy the full versions of software like Textbridge, Pagis and Omnipage and for that you can expect the text to be read with a high degree of accuracy and to retain its formatting; some even convert on the fly into HTML for Web publishing or into Adobe PDF. Ten years ago, to achieve adequate OCR quality meant spending £7,500 or so on Xerox’s Kurzweil Discover.
But for quotes on the go one must turn to a far less publicised technology – the OCR pen. It is operated rather like a highlighter; in place of the tip dispensing the fluorescent ink is a very small TV camera which can suck in print (or rather, patterns of black and white images) one line at a time and then render them into computer-readable characters. At the moment there are three main contenders, the Primax/IRIS DataPen, the Siemens PocketReader, and the C-Pen. There are significant differences in design philosophy, effectiveness and price between them; some of the questions a potential purchaser ought to be asking may not initially be obvious. For example: Is complete portability and independence from an associated computer important? Will there be frequent need to insert small sections of OCRed text into existing documents, forms and tables? How is the OCR pen hardware connected to the desk-top? How ergonomic is the pen? How easy is the software to use?
OCR Pen Challenges
All OCR pen devices face a common set of challenges. Regular OCR has to be able to distinguish 1 from l, i and I, o from c, m from ni, h from li, d from cl. It must do so in a wide range of sizes and for fonts varying from elegant spindly Bodonis, Caslons and Baskervilles to robust modern Univers, Arials and Gills. For legal citations it must distinguish between round and square brackets and the digit 1, all of which may be adjacent to each other. It must cope with shiny, reflective magazine paper and absorbent newsprint which spreads and smudges the type image. It must not be fazed if the line of print has been scanned at a skew. Mere pattern recognition alone is not enough; for accuracy it must also consider characters in context based on dictionaries. But when an OCR pen is used, line and character skewing and stretching are more extreme as the wrist often twists while the hand moves across the page; the speed at which the pen moves along a line may vary widely; and the pen’s camera may be capturing not only the desired line but parts of those above and below it. The reddish light used to illuminate the line being scanned – and to which the camera (CCD – charge-coupled device) is particularly sensitive – is also hostile to print of an orange or red cast. All the reviewed devices can read sizes down to at least 8 pt with high levels of accuracy.
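The idea of checking characters in context against a dictionary can be illustrated with a minimal sketch. It is not how any of the reviewed products work internally; the confusion table, the function names and the tiny word list are all assumptions made for illustration, using only the look-alike pairs mentioned above.

```python
# Hypothetical look-alike table built from the pairs named in the text.
# Each raw character maps to the characters the scanner might have meant.
CONFUSIONS = {
    "1": ["1", "l", "I"],
    "l": ["l", "1", "I"],
    "I": ["I", "l", "1"],
    "o": ["o", "c"],
    "c": ["c", "o"],
    "m": ["m", "ni"],
    "h": ["h", "li"],
    "d": ["d", "cl"],
}
# Two-character readings that may really be a single letter.
MERGES = {"ni": "m", "li": "h", "cl": "d"}


def candidates(word):
    """Yield every spelling reachable by swapping look-alike glyphs."""
    if not word:
        yield ""
        return
    # Option 1: treat the leading pair as one misread letter.
    pair = word[:2]
    if pair in MERGES:
        for rest in candidates(word[2:]):
            yield MERGES[pair] + rest
    # Option 2: swap the first character for each of its look-alikes.
    for alt in CONFUSIONS.get(word[0], [word[0]]):
        for rest in candidates(word[1:]):
            yield alt + rest


def correct(word, dictionary):
    """Return the first dictionary word among the candidates, else the raw word."""
    if word in dictionary:
        return word
    for guess in candidates(word):
        if guess in dictionary:
            return guess
    return word  # nothing plausible found: leave the raw reading alone


WORDS = {"hello", "modern", "close"}  # stand-in for a real dictionary
print(correct("he11o", WORDS))
print(correct("niodern", WORDS))
```

The number of candidates grows quickly with word length, so a sketch like this is only sensible for short words; a real recogniser would weigh the alternatives rather than try them all.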
Primax/IRIS DataPen
The Primax/IRIS DataPen is the oldest of the trio: the hardware first appeared in 1995 and has been modified only slightly since. Unlike the other two it has no inbuilt processing power and must at all times be connected to a computer. But for some purposes this design approach is also a strength: all the processing is carried out by software on the desktop, which means that such software can be very sophisticated and easily upgraded – and it has been. IRIS is now in Version 3.0 and there is also a more feature-rich ‘Executive’ version. The product carries slightly different labels and features in different markets: Primax make the hardware and, in the version described here, the software comes from a Belgian company, IRIS. In fact all three products were developed in continental Europe. The pen hardware is connected to a PC’s printer (parallel) port with a pass-through device so that a printer can be used more or less at the same time. Although this type of arrangement has the virtue of cheapness, I have never liked it as there are often clashes between printer drivers and the software supporting the pass-through device. And so, every so often, it proved here. The solution – and it applies equally to flat-bed scanners, parallel ZIP drives and others which also use pass-through ports – is to install a second parallel port; that is, if you have a spare slot in your PC’s motherboard for one.
The basic software is designed as a substitute for keyboard input: type a section of regular text in a given font type and size, stop and move the DataPen over the desired segment of print from your source, and the words appear on screen at the cursor position and in the font type and size already in use. Depending on the precise software package bought, there is also a spreadsheet mode optimised for digits as opposed to alphanumerics and designed to ignore the uprights at cell boundaries, support for 28 languages including some which use non-Roman characters, bar-codes in various standards and speech synthesis (useful mostly to give confidence that the text is being accurately read in from the source). There is also limited sentence-based bilateral translation between English and French, German and Japanese. Unlike any of the other products it can also operate in image mode, capturing icons and signatures; it even has a go at reading hand-written text. Of all the devices the DataPen has easily the most sophisticated software.
But the product is not portable. Even when used with an A5 portable PC like the Toshiba Libretto, it is constrained to the duration of the Libretto’s battery life. In addition, the DataPen needs power and can’t receive it via a parallel port – so that an additional battery pack or DC to mains adapter is needed. (For fixed desktop use, the DataPen picks up power from the keyboard connector via a piggyback connector; some large portable PCs allow the use of an external keyboard, but some of these simultaneously disable the internal keyboard.)
Siemens PocketReader
For portable, offline reading, recourse must be had to the Siemens PocketReader, at £109 the cheapest of the trio. It is powered by two AAA batteries which keep things going for about 20 hours. Along the side of the device is a small LCD which has one scrollable line of 20 characters. The LCD is used principally to verify the accuracy of input and to view the stored characters. Total capacity is described as about 20 pages of A4 text.
When connection to a PC is desired it is done through a nine-pin RS-232 port, typically into COM1 or COM2. The result is read into a small notepad-like utility from where it can be moved via the clipboard into the actual final application. The PocketReader recognises English, German, French, Spanish and Italian.
Unfortunately the software, both on the pen and on the PC, lacks sophistication. All the data the pen picks up ends up in one large file, with only carriage returns to separate out discrete chunks. Thus if you are researching in a library from several sources, all the text appears in the same single file. It is not possible to use the device as a keyboard substitute and switch easily from material you have written to material you wish to quote. Repetitive clipboarding between applications can soon become tiring.
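One way to ease the single-file problem after the text has reached the PC is to split it back into its separate chunks. The following is only a sketch under assumptions: it supposes the text has been saved or pasted as plain text, and the function name and sample string are invented for illustration. Nothing here is part of the Siemens software.

```python
import re


def split_scans(dump):
    """Split one long dump of scanned text into its separate chunks."""
    # The exact line ending is not stated, so accept CR, LF or CRLF.
    pieces = re.split(r"\r\n|\r|\n", dump)
    # Drop empty pieces left by blank lines or trailing separators.
    return [piece.strip() for piece in pieces if piece.strip()]


# Hypothetical sample: three quotations gathered in one library session.
sample = "first quotation\rsecond quotation\r\n\r\nthird quotation\r"
for number, chunk in enumerate(split_scans(sample), start=1):
    print(number, chunk)
```

Each chunk can then be filed under its source by hand, which still leaves the pen itself unable to record where a quotation came from.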
C-Pen
The C-Pen is also portable, but it has a four-line LCD and much richer software features. It also costs more than twice as much. Siphoned-up data can be stored in a series of files, and indeed it is possible to arrange the files into hierarchical folders or directories just as on a disk. The connection to the PC is via Infra Red; once the C-Pen software on the PC has been fired up and the IRDA port on the C-Pen hardware comes into range, the C-Pen appears in Windows 95/98 as ‘My C-Pen’ and it can be addressed as though it were an additional disk. In this respect the C-Pen is a little like the Siemens PocketReader in that data has to be clipboarded into the final application on the PC; however the filing system on the C-Pen makes data management much easier – up to 6 MB is available, enough for 3 or 4 whole books stored as ASCII. It is even possible to edit data while on the C-Pen, though obviously with only a very limited number of buttons on the pen this is not exactly a slick process. Dictionaries and an address book application give the C-Pen some Personal Organiser-like qualities.
The C-Pen also has a keyboard substitute mode like the IRIS DataPen; it is called C-Direct. When active, the C-Pen is used to siphon up print one line at a time and can then be aimed at the PC’s IRDA port, at which point it is sucked into the PC to appear at the cursor of whichever application is active.
But of course, by convention most desk-top PCs don’t have inbuilt IRDA ports and those on portables tend to face the rear. Thus for many potential users a further expenditure of £70 or so will be needed to purchase a moveable unit like Extended Systems’ JetEye PC which provides industry-standard IRDA from a conventional COM port.
Overview
None of these devices is perfect but all in the right circumstances are sufficiently reliable to be very useful. The C-Pen has the most features and, in very informal testing, the highest accuracy rates, and is ergonomically the most satisfying, but the dependence on Infra Red is a little eccentric (a lower cost/lower specification version is expected to be released later this year). The IRIS DataPen is fine for use at a fixed desk-top. The Siemens PocketReader is inexpensive. If you have an Apple Mac, you are confined to the IRIS DataPen.
IRIS DataPen Executive £235 incl VAT
From http://www.scan.co.uk/
Siemens PocketReader £109 incl VAT
C-Pen £290 incl VAT
Both from http://www.mobiletech.co.uk
JetEye PC Adapter £80 incl VAT
From http://www.exended-systems.com/uk/