'nother neato google project
Dec. 14th, 2004 16:05
MOUNTAIN VIEW, Calif. - December 14, 2004 - As part of its effort to make offline information searchable online, Google Inc. (NASDAQ: GOOG) today announced that it is working with the libraries of Harvard, Stanford, the University of Michigan, and the University of Oxford as well as The New York Public Library to digitally scan books from their collections so that users worldwide can search them in Google.
this is an extension of the google print project, and i am excited about it.
no subject
on 2004-12-15 01:20 (UTC)
Sure, being able to do text search on books in the library is interesting and useful, but it's not the entirety of the problem. The long-term goal really should be to get all of the information digitized and accessible in digital form so that we don't lose it when the books decompose. Obviously that can't be fully public in the case of things that are in copyright, but I also don't want to leave the content owned by a commercial company, even a nice commercial company.
The Harvard project said that they were going to put the text of public domain books on-line as well, so maybe I'm just being too paranoid and the library will end up owning the data -- but does that include the search metadata? Or does that all stay with Google? And if so, what happens if Google changes their business model or for whatever reason doesn't want to provide free access to this any more?
Anyway, if this is just something Google is doing for free to be nice, hey, I don't want to look a gift horse in the mouth. But if this ends up being used as a reason to not get search indexes and digital information owned and controlled by the non-profit university and library systems, and instead just settling for having that information controlled by a for-profit company, that worries me.
Re: 'nother neato google project
on 2004-12-15 01:46 (UTC)
https://print.google.com/publisher/terms
(man, i wish LJ would copy the subject header; it pisses me off that they don't. subject headers are a good thing.)
Re: 'nother neato google project
on 2004-12-15 09:10 (UTC)
This article also raises the other things that I'm concerned with, though, and I'm sure that the search metadata is being kept private to Google since I'm sure it's part of their proprietary search technology.
The free software advocate in me always looks at these sorts of things as interesting technological demonstrations, but not actually the real thing yet. It's not real until it can be done by anyone without paying Google money. (But the great thing about a commercial company doing it, provided that the intellectual property land grab in the US doesn't get any worse, is that it means that everyone will be able to do it at least in 20 years or so, which is still within my lifetime. Hopefully by the time that we all have good access to this sort of digitized information, they will have perfected the technology that can print me a real book from electronic data, or at least will have improved the electronics of hand-held readers a lot over what they're like now.)
Re: 'nother neato google project
on 2004-12-15 18:46 (UTC)
Re: 'nother neato google project
on 2004-12-16 04:06 (UTC)
but i don't like reading books on the currently available palm-size thingies. they're too damn small. the screens are not friendly to my eyes. i also don't like reading books on my laptop because that's too large, *heh*. i like reading in bed, or other comfortable positions, and a paperback (not too thick, please) is just the right size for that. so, until there's an ebook reader like that, i am not shelling out any money for suboptimal tech. oh, and i don't want to store all my reading material on it either, *shudder*. must have some alternative backup so if i need to, i can print.
no subject
on 2004-12-15 18:20 (UTC)
Stuff that I personally have knowledge of involves things google's been doing a good job of issuing press releases about, that are actually hosted, in many cases, where I work, which isn't google. Those things are all subject to access control and the terms are dictated by the publisher, who owns the publishing rights. The whole "information wants to be free" debate is, of course, one that I've always been interested in personally, and one that has really interesting twists now that I've spent 6+ years dealing with not-for-profit academic publishing.
It's WAY more than I can readily distill down to an LJ comment, but, it turns out there are a lot of not-for-profit (which doesn't mean free) and commercial operations which already have indexing metadata as major pieces of business, a lot of rivalries and politics about who gets to do what, and who wants to make what free. It's all quite the morass of confusing stuff, frankly.
Anyway. Bringing stuff back to what I know about specifically, for example, a lot of the stuff indexed in scholar.google.com is actually hosted here -- and we don't own the content, we just put it online. And there are slews of metadata-searchers who link to the sites we host, but permission to access the full content is still up to the owners of the content. A lot of those content-owners, however, choose to have metadata deposited not only in places like NIH-type repositories, but other commercial operations that then link back, and so forth.
All of that to say, a lot of publisher type people, at least in the not-for-profit scholarly world, seem to be less concerned about the metadata than you or I might be, so long as the actual content is controllable. It's interesting.
no subject
on 2004-12-15 02:25 (UTC)
no subject
on 2004-12-15 11:53 (UTC)
The burning question (from my PG Distributed Proofreaders perspective) is how much, if any, they will clean up the raw OCR. Old books present special problems to standard OCR programs--caused by such things as yellowing and speckling, non-modern typefaces, book dust, and old typesetting practices.
Has the masterful Google developed improvements on OCR processing that will make browsing and reading the public-domain texts that they will make available a non-painful experience? Will a human actually look at the OCR output to correct major errors?
Do people care as long as they're getting something for nothing?