US to have 30m newspaper pages online by 2006

Posted on November 19, 2004
Filed Under Digitisation, Newspapers

The US announced this week that it will have 30m newspaper pages on the net by 2006.

This article mentions that …

The span of the joint project is limited because type faces of printers used before 1836 are too difficult for optical scanners to read, and copyright restrictions are in force on papers published after 1923.

They have developed a prototype at the Library of Congress site - The Stars and Stripes, 1918-1919. It is a pretty basic interface - they are clearly focusing on getting the basics right before developing the front end. If you go to the above link and look at the bottom left of the page, you will see that you can view “the OCR-generated text transcription of this page”. This gives an OCR text version of the page, and from a quick look both the PDFs and the OCR accuracy appear to be excellent. However, they don’t seem to have done as well at “segmenting” the page into its constituent elements, though to be fair this is clearly an early version.

Reuters slows NewsML roll-out

Posted on November 19, 2004
Filed Under Paid Content
