Friday, October 30, 2009

The New York Times Blunders Into Linked Data, Pillages Freebase and DBPedia

Notwithstanding Larry Lessig, when you try to use the precision of code to express the squishiness of the legal system, you are bound to run into problems, as I've explored in my posts on copyright.

This Thursday, the New York Times took advantage of the International Semantic Web Conference to make good on their previous promise to begin releasing the New York Times subject index as Linked Data. No matter how you look at it, this is a big advance for the semantic web and the Linked Data movement. It's also a potential legal disaster for the New York Times.

To understand what the New York Times did wrong, you have to understand a little bit about the workings of RDF, the data model underlying the semantic web. In particular, you have to understand entailment. Entailments are the sets of facts that can be deduced from the meaning of semantic web data. The crucial difference between plain-old data and Linked Data is that Linked Data includes these entailments.

Consider the English-language statement "apples are red". Because it is expressed in a language, it has meaning in addition to the single fact that apples are red. If we also assert that a specific object is an apple, then there is an entailment that the object is also red.

The New York Times Linked Data is expressed in the RDF language and uses vocabularies called OWL, SKOS, Dublin Core, and Creative Commons (denoted here by the prefixes "owl:", "skos:", "dc:" or "dcterms:", and "cc:"). You can download it yourself at http://data.nytimes.com/people.rdf (an 11.9 MB download).

Here's a simplified bit of the New York Times Linked Data. It defines a concept about C. C. Sabathia, a baseball pitcher who lost a game on Wednesday for the New York Yankees:
<rdf:Description rdf:about="http://data.nytimes.com/N24334380828843769853">
<skos:prefLabel>Sabathia, C C</skos:prefLabel>
<owl:sameAs rdf:resource="http://dbpedia.org/resource/CC_Sabathia"/>
<owl:sameAs rdf:resource="http://rdf.freebase.com/rdf/en.c_c_sabathia"/>

<dc:creator>The New York Times Company</dc:creator>
<cc:License>http://creativecommons.org/licenses/by/3.0/us/</cc:License>
<dcterms:rightsHolder>The New York Times Company</dcterms:rightsHolder>
<cc:attributionName>The New York Times Company</cc:attributionName>
</rdf:Description>
The first thing this does is create an identifier, "http://data.nytimes.com/N24334380828843769853", for the "C. C. Sabathia" subject concept. The New York Times uses this set of subjects to create topic pages, and the main purpose of releasing this data set is to help people link concepts throughout the internet to the appropriate New York Times topic pages.

Next, it gives a label for this concept, "Sabathia, C C". So far so good. The next two statements say that the New York Times topic labeled "Sabathia, C C" is the same concept previously identified by DBPedia, a Linked Data version of Wikipedia, and by Freebase, another large collection of Linked Data. This is even better, because it tells us that we can use information from Wikipedia and Freebase to help us infer facts about the New York Times C. C. Sabathia topic. "sameAs" is a term defined as part of the OWL standard vocabulary, which specifies how machines should process these assertions of sameness.

The last four property lines assert that the C. C. Sabathia concept was created by "The New York Times Company", which is the rights holder for the C. C. Sabathia concept, and that if you want to use the C. C. Sabathia concept, The New York Times Company will license it to you under the terms of a particular Creative Commons License.

There are two separate blunders in those four lines. The first blunder is that the New York Times is attempting to say that the C. C. Sabathia concept is a work "PROTECTED BY COPYRIGHT AND/OR OTHER APPLICABLE LAW." This is complete rubbish. The information provided by the New York Times about the C. C. Sabathia concept consists of a few facts that cannot be protected by copyright or any other law that I know of. (The entire 5,000-entity collection, however, is probably protectable in countries other than the US.)

The second blunder is much worse. Where the first blunder is merely silly, the second blunder is akin to attempted property theft. Because the New York Times has asserted that it holds the rights to the C. C. Sabathia topic, and further, that the C. C. Sabathia topic is the same as the Freebase "c_c_sabathia" topic and the Wikipedia "CC_Sabathia" topic, by entailment the New York Times is asserting that it is the rights holder for those concepts as well.

You might argue that this is a harmless error. But in fact, there is real harm. Computers aren't sophisticated enough to deal with squishy legal concepts. If you load the New York Times file into an OWL-aware data store, the resulting collection will report that the New York Times Company is the rights holder for 4,770 concepts defined by Wikipedia and 4,785 concepts defined by Freebase.
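To make the mechanics concrete, here's a minimal sketch of the kind of reasoning an OWL-aware store performs, using the rdflib and owlrl Python packages and a tiny hand-made stand-in for the New York Times data (the little graph below is my illustration, not the actual download):

# Sketch: owl:sameAs entailment propagating a rights assertion.
from rdflib import Graph, Namespace, URIRef
from owlrl import DeductiveClosure, OWLRL_Semantics

data = """
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://data.nytimes.com/N24334380828843769853>
    owl:sameAs <http://dbpedia.org/resource/CC_Sabathia> ;
    dcterms:rightsHolder "The New York Times Company" .
"""

g = Graph()
g.parse(data=data, format="turtle")

# Compute the OWL-RL closure; its rules include owl:sameAs entailment.
DeductiveClosure(OWLRL_Semantics).expand(g)

DCTERMS = Namespace("http://purl.org/dc/terms/")
dbpedia_topic = URIRef("http://dbpedia.org/resource/CC_Sabathia")

# After reasoning, the DBPedia URI has "inherited" the rightsHolder triple.
for holder in g.objects(dbpedia_topic, DCTERMS.rightsHolder):
    print(holder)  # prints: The New York Times Company

Nothing in the reasoner is malfunctioning here; it is faithfully drawing the conclusion that the Times' triples entail.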

Now before you start bashing the New York Times, it's important to acknowledge that RDF and Linked Data don't make it particularly easy to attach licenses or attributions to semantic web data. The correct ways to do this are all ugly and not standardized. You would think that this would be a requirement for the commercial viability of the semantic web.

People trying to use New York Times Linked Data can deal with this in three ways. They can decide not to use data from the New York Times, they can ignore all licensing and attribution assertions that the Times makes, or they can hope that the problem goes away soon.

A fourth way would be to sue the New York Times Company for damages. At long last there's a lucrative business model for Linked Open Data.

Update: I have two follow-up posts: The Blank Node Bother and the RDF CopyMess and The New York Times Gets It Right; Does Linked Data Need a Crossref or an InfoChimp?

Wednesday, October 28, 2009

Ralph Waldo Emerson on URI Mysticism

It's hard for me not to think of Linked Data and the Semantic Web when reading this part of Ralph Waldo Emerson's The Poet (quoted from RWE.org). (Not that I make a habit of reading Emerson, but I just finished Neal Stephenson's Anathem.):
The poet did not stop at the color, or the form, but read their meaning; neither may he rest in this meaning, but he makes the same objects exponents of his new thought. Here is the difference betwixt the poet and the mystic, that the last nails a symbol to one sense, which was a true sense for a moment, but soon becomes old and false. For all symbols are fluxional; all language is vehicular and transitive, and is good, as ferries and horses are, for conveyance, not as farms and houses are, for homestead. Mysticism consists in the mistake of an accidental and individual symbol for an universal one.


Tuesday, October 27, 2009

Rehashing the Copyright Salami

I got a lot of feedback on my post on "Copyless Crowdscanning" from a variety of people. The comments, taken as a whole, do an excellent job of illuminating the ambiguities and difficulties of copyright law as applied to digitization.

Some people who read the article, such as Robert Baruch, assumed that I was simply trying to evade copyright by distributing the copying among many people and then enabling users to reassemble books in their browsers. Many critics doubt that a salami strategy is much of a defense against copyright infringement, because a judge will see a copy appearing in a user's browser, and won't be impressed by the details of how the copy was assembled. James Grimmelmann wrote me:
The problem is in your assumptions; two of them depend on a belief that copying small portions (or making small portions available) is categorically not infringement. It probably isn't, taken alone. But that doesn't mean we can aggregate those small amounts and have them stay non-infringing, any more than we can integrate infinitesimals and have them stay infinitesimal. I think a court would wind up saying that the activities of the various people in the system could be treated as part of a common plan of action, at which point there's full-text copying going on, and thus potential infringement.
I agree with this assessment, and thus a "copyless crowdscanning" organization would have to figure out how to make the reconstitution of the fulltext impossible. And as I explained, engineers hate to hear that something is impossible!

My main interest, however, was not to make salami by reassembling slices, but to make hash. I was assuming that an index of a book is not a copy of a book or a derivative work, but rather a collection of facts about the book. I am willing to grant that if a copy of a significant fraction of a book can be generated from a collection of facts, then that collection is equivalent to a copy, but if not, then it is neither a copy nor a derivative work.

I recognize, however, that there are lawyers willing to argue that any index is inherently a derivative work and that rights holders should even be able to control the indexing of their work. It's clear from looking at cases such as the Harry Potter Lexicon case and the Seinfeld Aptitude Test case that judges use a variety of tests to determine whether a work is indeed a collection of facts. In the latter case, infringement was established because the "facts" were fictional. Because judges look at a variety of factors, we can't remove copyright law from the picture just because there's no copying. Still, the judge in the Harry Potter Lexicon case was pretty clear that an index per se is a transformative and thus allowed use of the work:
... the Lexicon identifies more than 2,400 elements from the Harry Potter world, extracts and synthesizes fictional facts related to each element from all seven novels, and presents that information in a format that allows readers to access it quickly as they make their way through the series. Because it serves these reference purposes, rather than the entertainment or aesthetic purposes of the original works, the Lexicon’s use is transformative and does not supplant the objects of the Harry Potter works.
Another set of criticisms said that the slices in my salami were too big. Grimmelmann says "A scan of a single page is probably enough to infringe." Richard Nash asserts that "copying a sentence *is* prima facie infringement." And Wes Felter, in a comment, pointed at this article which goes into the rather deep philosophical and abstract problems created by information reducible to a number that can be "coloured" with infringing intent.

We could modify the crowdscanning software to use an even smaller slice, of course, but here we get to an interesting question: what is the smallest bit of text that can be protected by copyright? I think it would be silly to argue that single words could be copyrighted. As poetry, I'm sure a thousand-word sentence could be. But as a practical matter, I'm guessing almost any single sentence could be reused fairly. In the US, at least. And it would be really difficult for anyone to prove that a particular sentence had not been previously used in a copyrighted work. But unfortunately there's no obvious rule.

Is there any objective principle that could be used to determine the copyrightability of text fragments? A computer scientist might argue that a relevant measure should be the probability that a given text fragment could be generated at random, and if there were a lot of math-major judges, that might work. It seems to me that copyright law should at the very least be built using solid blocks. A legal apparatus built on a foundation of copyrightable 3-word sequences, for example, would quickly melt into uselessness.
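Just to show how such a measure might behave, here's a toy version of it (my own illustration, nothing any court uses), assuming each word of a fragment is drawn independently from a 50,000-word vocabulary:

# Toy "random generation" measure: the chance of producing a given
# n-word fragment by drawing words uniformly from a 50,000-word vocabulary.
import math

def log10_probability(num_words, vocab_size=50_000):
    return -num_words * math.log10(vocab_size)

for n in (3, 10, 100):
    print(f"{n} words: about 1 in 10^{-log10_probability(n):.0f}")

Even a 3-word sequence comes out as astronomically improbable under this crude model, which is why the threshold would have to be set well above mere improbability.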

The squishiness of today's copyright system imposes a huge cost on both users and owners of copyright. If the rules were clearer, rightsholders would find it easier to monetize their work, and society would benefit from the increased non-infringing use. For example, there could be a decision that said that reuse of fewer than 10 words cannot by itself be infringing and that reuse of 100 words is by default infringing. You don't need any calculus to enjoy a good bite of salami!

Friday, October 23, 2009

Copyless Crowdscanning: How to Legally Index the World's Books

Here's how I know that I have engineering in my DNA. Whenever I hear something labeled as impossible, impractical or unlawful, I can't restrain myself from trying to think of ways around the physical, logistical and legal constraints that supposedly imply impossibility. "That", "is" and "impossible" are fighting words to an engineer. And that's why I've admired the proposed Google Books Settlement. By way of a spectacular feat of legal engineering, it has suggested a way to do the seemingly impossible- to build a database of all the world's books- in the face of the tremendous obstacle posed by an extremely messy legal situation.

But despite my admiration for the "engineering" involved in the settlement, there have always been some things I didn't like about it. And despite all that's been written about it, and the many aspects that people have objected to, I've never seen anyone voice my particular misgivings, perhaps because of their peculiar engineer's orientation.
  1. The settlement uses a legal innovation to accomplish its goals. I don't like that (the "legal" part, not the "innovation" part). Many people have objected to the particular innovation that is used, arguing that this precedent could lead to a reign of tyranny and/or other cataclysm, but I've not seen any objection to the use of legal apparatus in the first place. I've often made the disclaimer here that I Am Not A Lawyer, but I've generally downplayed my ingrained bias for using technology rather than law to solve the world's problems.
  2. The settlement seems to be based on a presumption that Google's database of all the world's books cannot be built without making copies. I don't like to assume things are impossible. I should also note that several of the arguments opposing the Google Books Settlement rely on exactly the same presumption!
As the months have dragged on and the postponements pile up, I'm thinking that my first objection is starting to make more and more sense. After thinking it over for more than 6 months, I'm starting to think that my second objection is also valid. The rest of this post describes how it might be possible to build a full-text database of all the world's books without doing any copyright-infringing copying. I'll call this scheme "Copyless Crowdscanning".

What got me started on this line of thought were some simple cost calculations I presented in my article on Dan Reetz' DIY book scanner. It made me realize that the idea of having hundreds of thousands of people scanning their books with cheap scanners was not out of the realm of possibility. The main barrier to assembling a database of all the world's books will no longer be the scanning, but rather the laws governing copyright. So my focus is on how to do crowdscanning so that copyrights are not infringed; the easiest way to do that is to not make any copies.

Here are the assumptions I start with. As I've been learning about copyright, I've learned that there will always be a copyright lawyer somewhere willing to contest any common-sense assumption about copyright, so it's important to start somewhere. First, I'm assuming that scanning a small number of pages of a book (suppose that number is 1% of the book) for the purpose of indexing those pages is not a violation of copyright, as long as I don't redistribute the scans and destroy them after I finish my indexing. The indices are things I should be able to keep and redistribute.

Second, I'm assuming that it is not a violation of copyright to redistribute single sentences from a book. So, for example, publishing the following sentence:
The punishment lay in knowing that you were putting all of that effort into letting a kind of intellectual poison infiltrate your brain down to its very roots.
is not a violation of Neal Stephenson's copyright to the book Anathem. A corollary of that is that if I shuffle the order of all the sentences in a book, I can redistribute that jumble without violating copyright.

Finally, I'm assuming that scanning and distributing the title page of a book and its verso cannot be a violation of copyright; such distribution would be necessary in many cases just to convey statements of fact, which as such are not subject to copyright. I recognize that artwork on these pages may need excision.

Let's suppose that we had a large number of people participating in our database building project. Suppose for example, that 100,000 people participated. Each person would scan a small fraction of each book they owned, along with its title pages. The title pages would be submitted to a book identity server, which would return a book identifier. The rest of the page scans would be processed by software, and the scans would then be destroyed. The software would digitize the scans, then chop the pages into individual sentences. An index of the pages would be generated and submitted to an "index aggregation" service. The sentences would be shuffled and submitted to a "sentence serving" service.
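Here's a minimal sketch of what the per-page software step might look like (my own illustration: the function and the service hand-offs are imaginary, and OCR is assumed to have already produced the page text):

# Toy sketch of the per-page processing step in copyless crowdscanning.
import random
import re
from collections import defaultdict

def process_page(book_id, page_number, page_text):
    """Turn one scanned page into an index contribution and shuffled sentences."""
    # Crude sentence splitter; real software would use a proper tokenizer.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', page_text) if s.strip()]

    # Inverted index: word -> locations. Only facts about the page are kept.
    index = defaultdict(set)
    for word in re.findall(r"[A-Za-z']+", page_text.lower()):
        index[word].add((book_id, page_number))

    # Shuffle the sentences so neither the page nor the book can be rebuilt from them.
    random.shuffle(sentences)

    # The index would go to the index aggregator, the sentences to the sentence
    # server, and the scan itself would then be destroyed.
    return index, sentences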

After many people have made partial scans and submitted partial indices to the index aggregator, a complete index would emerge that can be used just as Google Book Search is used. The complete sentences would be provided by the sentence server to provide the context of the result sets.

Note that neither the index aggregator nor the sentence server would be able to reconstitute a book or even the pages from a book. It seems to me that it should be possible to add some encrypted information and send the keys to yet another party so as to allow reconstitution of the pages in authorized circumstances, such as for use by people with disabilities. If you can't use the information to reconstitute the book, then it seems to me that no copy exists and no copyrights have been infringed.

If my assumptions are incorrect, then I should expect that Harper-Collins will soon be suing me for copyright infringement. I'll be sure to let you know. If they are correct, but there's some theory that would expose any of the crowdscanning participants to liability, then perhaps someone who Really-Is-A-Lawyer could elaborate in the comments. I recognize that copyless crowdscanning wouldn't be applicable without modification to things like art books, artwork in books, poetry collections, sheet music, periodicals, and reference works, but it would be a start. And it would make some engineers happy.

Update: Several people (including real lawyers) have commented to me that crowdscanning would not help much as an infringement defense if the result of the entire system had the effect of making the entire text available. I just want to emphasize that I think a system can be engineered so as to enable indexing while preventing text reconstruction and avoiding the use of copies.

Sunday, October 18, 2009

My Optimized Baseball Media Diet and Why Motoko Rich Can't Count

40 years ago I started following the Philadelphia Phillies. I think that it started the month that my family rented a beach house on Long Beach Island. Every morning I would walk to the store to buy a newspaper- the Philadelphia Inquirer- because I wanted to read everything about Apollo 11. After the astronauts got home safely I continued my morning newspaper ritual, and that's when I started reading about baseball.

Between then and now, there were some years when it was hard to follow my team, and I don't mean because they were bad. When I lived in California, the local newspapers barely covered my team, even though it was a National League city. I would study the box scores and the three sentences in the AP summaries to retain an emotional connection to my team. That's when I first imagined a newspaper of the future that could be customized and printed for me so that I could have the New York Times front page along with an Inquirer sports page with my morning coffee.

Then the internet happened, and all of a sudden I could track the Phillies games on Yahoo Sports and read articles on the Inquirer's web site, Philly.com, even though I was living in Mets and Yankees-land. I barely read the sports section of my local paper any more. With cable television, I could watch games whenever the Phils played Atlanta or the Mets. My baseball media diet had reverted to what it was growing up, except now I got it over wires instead of the airwaves and on paper.

Over the past three years, however, my baseball media diet has changed profoundly, and not just because the Phillies won the World Series. This season, I was able to watch most games on my iPhone or on my laptop via MLB.com. Every day I read the blog of the best sports writer covering the Phillies, Jason Weitzel. I get breaking news via Twitter from Scott Lauber, a writer for some paper in Wilmington, Delaware. I read game summaries from Todd Zolecki and other writers who work for MLB.com. I read news about Phillies prospects at PhuturePhillies.com, and I read stat-head analysis (partly enabled by huge volumes of game transactional data released by major league baseball) at the Hardball Times.

In my optimized Phillies media diet, there's not much role for traditional media, or even for transitional media aggregators like Yahoo or Philly.com. The media providers I've ended up with have all specialized in areas of strength. I don't have to endure sports writers I don't like just because they've managed to gain special access to the flow of information.

The same sort of change is happening all over the landscape of news reporting. Two weeks ago, I had a chance to see first hand how the professional media reported a rather minor event in a story I'd been following quite closely. I went to a federal courtroom in New York and witnessed a meeting of a judge and the parties of a lawsuit involving Google, copyright and ebooks. At the end of my report, I added links to other reports published about the same event. It's interesting to read these reports, and think about how they fit into an optimized media diet.

The most knowledgeable report was by James Grimmelmann, a professor at New York Law School. Those of us who have followed the lawsuit closely have come to rely on Prof. Grimmelmann's blog for insight into the relevant law. The best written coverage, in my opinion, was that of Motoko Rich of the New York Times. She condensed the event down to its bare essence, and chose exactly the right story lines. At the event, I watched her in action. After its conclusion, she made a beeline for a publishing executive, sitting two seats away from me, and asked him exactly the right question.

But it seems Motoko Rich made a small mistake. If you compare her well-written story with my notes-dump, you'll note a tiny discrepancy. She reports that there were "fewer than 70 people" in the courtroom. I was amazed to see so many people, and it was my very first time in a Federal courtroom, so I decided to make a careful count. There were four rows of benches, filled with 12 people each. There were 8 members of the press seated in the jury box. There were 8 attorneys for the parties and the Department of Justice at the lawyers' tables. Seated along the back wall were 16 people, eight on each side. So not including the Judge and his staff or the courtroom official and security, there were 80 people in the courtroom.

Where did Motoko Rich's "fewer than 70" number come from? Perhaps she meant to write "more than 70". Perhaps an editor or fact-checker couldn't believe that so many people could fit in the courtroom. I don't know. I left a comment on the Times' website, but for whatever reason it was not approved. Perhaps the correction was considered so trivial that it was better to leave the mistake in the story. In fact, one version of the story put the number at "approximately 70 people", and the print version omitted any reference to the audience size.

This episode got me thinking about the proper role of professional reporters in my media diet. I don't expect a reporter to have the expertise of a law professor, but I really want people like Motoko Rich to be asking the right people piercing questions. Although I can go to the same event that she can, it's just not my job to badger people with questions, even if I do happen to know them. But having been accustomed to the accountability of sports reporting that has to stand up to hundreds of reader comments, I would really like to see similar accountability in the news reporting I read. It should matter more, not less.

I'm also worried about the business models that support my news sources. I hope that Jason Weitzel is making enough from his blog to support himself- he's probably made only a few dollars from me (I bought his book last year). I'm glad Scott Lauber is supported by his newspaper, but it has close to zero revenue from me. Major League Baseball is getting significant revenue from me- I hope they're smart enough to add to the media that they support.

With the whole news industry experiencing the wholesale rearrangement of roles that has already happened for me for baseball, what is a reporter to do? Should she focus on developing contacts, asking questions and crafting stories, or should she focus more on building a reader constituency? Should a "newspaper" business focus on aggregating news or nurturing reporters? Should it be building an information access platform, or should it be developing a community news resource? Maybe it should be contributing to the cloud of linked data.

I don't have answers for these questions, but I can tell you why you should trust me to count courtroom spectators more accurately than Motoko Rich. I'm taller than she is. I can see better over people's heads. And somehow we should figure out a way for Motoko Rich's physical stature to not be relevant to her stature as a reporter.

Thursday, October 15, 2009

Normal and Inverse Network Effects for Linked Data

The human brain has an amazing capacity to recognize familiar patterns in unfamiliar environments. One manifestation of this is pareidolia, the phenomenon of seeing an image of the Virgin Mary in a grilled cheese sandwich or a mesa on Mars that looks like a face. Another manifestation is our tendency to apply newly popularized or trendy concepts to totally inappropriate circumstances. For example, once Clayton Christensen popularized "disruptive innovation", any situation where technology brought about change was all of a sudden being labeled as "disruptive".

My latest peeve is what I perceive to be pareidolic use of the term "network effect" to describe almost any example of positive feedback in markets. For example, here's an example that Tim O'Reilly thinks is a network effect:
Google is better at spidering that network than their competitors. They thus benefit more powerfully from the network that we are all collectively building via our web publishing and cross-linking.
While there is definitely a network that enables Google's spidering, it's not a "network effect" that makes Google a good spiderer. Economies of scale are what make Google a good spiderer, even if that scale has resulted in part from network effects.

The originator of the term "network effect" was Bob Metcalfe, the co-inventor of Ethernet. He used it to refer to a mathematical description of how the value of a network scaled with the number of nodes it connected. His reasoning was that the value of each networked node is proportional to the number of other nodes it can connect with, so that the total value of the network scales with the square of the number of nodes.

It's pretty silly to expect that a scaling rule that works for small networks would continue to apply for large networks, and Andrew Odlyzko and Benjamin Tilly have pointed out that more modest scaling laws are a much better fit to market valuations of networks. Still, their suggestion that inappropriate application of Metcalfe's law was to blame for the internet bubble and its subsequent collapse bears reflection.

Recently there's been some discussion of how to apply Metcalfe's law for the network effect to Linked Data and the Semantic Web. Linked Data is information published using standards so that machines can understand its meaning and make inferences from the totality of data that has been collected. One argument says that any set of Linked Data increases in value with every new bit of linked data that is added to the world wide cloud of Linked Data. So how does the value of this Linked Data "Network" really scale with the links it contains?

Since I don't know of any way to value an arbitrary bit of Linked Data, I'll pick a simple system where I can compute utility. I'll focus on the direct effects and benefits of linking data together, and ignore for now indirect benefits such as those which result from the use of standards.

Suppose we have two sets of Linked Data entities, Movies and Actors. Let's also assume that both of these sets are essentially complete. We'll then consider the effect on the system utility of adding random "actedIn" links between Actors and Movies. In our utility computation, we'll assume that answering questions about which actors acted in which movies is the primary utility of our set of links, and the number of these questions the set can answer will be the utility measure.

For the questions "What movies did X act in?" and "Who acted in the movie Y?" the value of the link collection scales linearly with the number of "actedIn" links. There's no network effect at all for these questions, because the fact that Marlon Brando acted in On the Waterfront adds no utility to the fact that Humphrey Bogart acted in Casablanca.

For the question "Who else acted in movies that X acted in?" the result is different. For this question, the ability of our collection of links to answer usefully scales as the square of the number of links, just as in the classic network case. For this question, there clearly is a network effect.

For the "Kevin Bacon" question, ("how many acted-with degreees of separation are there between X and Kevin Bacon?") the Network effect is even stronger, with the network value scaling as a higher power of the number of actedIn links. Notice that for Linked Data, the network effect is not inherent in the data, but rather is implicit in the types of queries that are made on the data.

What we really wanted to know was the total value of the set of links. We might guess that the value is proportional to the total number of questions that the set of links can answer. That number grows exponentially with the number of links. We can see that an exponentially increasing value doesn't make sense, however, by considering the total value as a power series. We've already discussed the first two terms of that series, but we've not considered the relative coefficient. It's really hard to argue that a system that can answer only the acted-with questions is hugely more valuable than one that answers only the acted-in questions, despite the fact that it answers N² questions compared to only N for the acted-in answering system. It seems to me that the Kevin Bacon answering system is less valuable than the other two systems, despite an even larger number of questions (about N¹²) that it would be able to answer; they're just really stupid questions.

Even if network effects are not inherent in Linked Data, threshold effects can result in the existence of a "critical mass", above which positive reinforcement kicks in to drive the entire system. In our toy system, we can easily imagine that a collection of links might be worthless until there were a sufficient number of links to exceed critical mass. A system that can tell me 90% of the movies that someone has acted in is a lot more than nine times as valuable as a system that can tell me only 10%. That's because an acted-in answering system is worthless unless it's better than the random guy sitting next to me at the bar! So this is kinda-sorta a network effect, but really it's a threshold effect.

It's rather easy to confuse threshold effects for network effects. I'll put it this way: it's not a network effect that causes me to avoid doing my laundry until I have a full load, it's a threshold effect! Never mind that it's really three loads.

Rod Beckstrom, currently CEO of ICANN, has described the "inverse" network effect, which occurs in situations where the addition of nodes reduces the value of a network to each participant. Golf clubs are cited as examples- they have an optimum size of about 500 members because additional members make it more difficult for existing members to get playing time. I see two types of inverse network effects at play in the Linked Data world. The first is the cost of expanding a database; the second is the law of diminishing returns.

In an optimally designed database, the cost of accessing any given record is proportional to the logarithm of the number of records. This is a slowly increasing function- if you have this scaling, you can go from a million records to 10 million records and only increase your cost by about 17%. Alas, optimal design is rarely achieved, and to get to that optimum, you have costs that scale much less gently. The result has been that most practical applications of Linked Data use only the most relevant subsets of available data. If data network effects were pervasive and stronger than inverse network effects, this would generally not happen.
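The arithmetic behind that figure, under the log-cost assumption:

# Relative cost increase when a log-cost database grows from 1M to 10M records.
import math
print(math.log(10_000_000) / math.log(1_000_000) - 1)   # ≈ 0.167, about a 17% increase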

In my movie and actor example, I made the assumption that any particular query was equal in value to any other query. In practice this is not true. Most people would agree that there's more utility in knowing that Humphrey Bogart acted in Casablanca than in knowing that Michael Ripper acted in The Reptile. In many data sets, the 80/20 rule applies (also known as the Pareto Principle) - 80% of the real-world queries would exercise only 20% of the links. The least useful data is typically the most expensive data to acquire, so if we start by adding the most valuable actedIn links, then every additional link reduces the average value of the links in the collection. This mimics an inverse network effect, as the total value of the collection grows more slowly with every added link, rather than growing more quickly.

The main take-away from this is that you can't look at Linked Data objectively and conclude that it exhibits strong network effects without taking into account the application it's being used for. Some applications will exhibit strong, even exponential network effects, others may exhibit inverse network effects. And sometimes a grilled cheese sandwich is just a sandwich.

Tuesday, October 13, 2009

The Revolution Will Be Digitized (By Cheap Book Scanners)

It's always a good sign when you meet a literary character at a conference. Last June, I wrote about meeting a Bilbo Baggins at the Semantic Technology Conference; on Friday I met a character out of a Neal Stephenson novel at D is for Digitize.

D is for Digitize was a small conference organized by James Grimmelmann of New York Law School. It brought together legal luminaries with people from publishing, business, academia, advocacy, technology, and the press. It had been organized to coincide with the scheduled Fairness Hearing for the Google Book Search Settlement. As it turned out, the Fairness Hearing was postponed, to be replaced by a brief "status conference". The effect of the postponement on the conference was beneficial- with the Google settlement officially on the shelf, the participants were able to have real discussions on the future of book digitization without getting too bogged down in legal argument.

That future was brought into very clear focus by the two digital cameras in Daniel Reetz' do-it-yourself book scanner. Reetz's presentation and demonstration blew away everyone in the room. Like Stephenson's Waterhouse characters in Cryptonomicon and the Baroque Cycle, Reetz is a tinkerer and a liberator of information. He spent some time in Russia and became accustomed to the conveniences of digital books in a society that doesn't pay much attention to copyright laws. On his return home to North Dakota, he was shocked at the high price of textbooks and the low price of digital cameras. He resolved to build himself a book scanner and went dumpster diving for materials, then posted instructions for how to make the scanner online.

In May, he was awarded the Grand Prize (a laser cutter) in the Epilog Challenge, a competition sponsored by the manufacturer of a laser cutter to promote "open design" manufacturing. The laser cutter has enabled Reetz to refine his scanner design to use precision-cut plywood. His first third-generation scanner, which folds up neatly for portability, was finished just in time for him to bring to the conference. (He had fun getting it through airport security!).

Compared to robotic scanners such as the one manufactured by Kirtas, the DIY Book Scanner is strikingly simple. It is built with rubber bands, drawer sliders, white LEDs and two commercial off-the-shelf digital cameras. Some Russian friends of Reetz's have figured out how to hook into the cameras' firmware so that scan acquisition can be triggered by pressing a single button. Open source software is used to do image management and post-processing. An operator turns the pages and average throughput is about a thousand pages per hour. The total cost of the scanner parts is under $300, including cameras. For more pictures of Reetz's new scanner, he's posted some here.

Reetz is not the only one building cheap scanners based on his design. A small but vital community is growing around the open-source design. Although book publishers might unthinkingly assume that this group is primarily interested in book piracy, they would be wrong. Several people just want to read books they've purchased in print on their iPhones or Kindles. An engineering student in Arizona is reading disabled and must digitize to be able to read his textbooks. One Indonesian man built a scanner with donated cameras because his town's property records had been damaged in a flood. More than one book aficionado has turned to scanning in response to a too-many-books spousal ultimatum.

For other perspectives on Reetz's presentation, see Harry Lewis' post at Blown to Bits and Robin Sloan's post at The Millions.

In my article on the impact of the Americans with Disabilities Act on selling non-accessible books, I speculated that as the cost of digitizing books drops, society's expectations for the bookselling industry would change. Now that I've come face-to-face with a cheap book digitizer, I realize that much will be transformed. For example, let's assume that an effective book digitizer can be built and deployed for $500. (Even if DIY turns out not to be the way this happens, commercial manufacturers such as ATIZ are likely to be able to meet similar price points.) Then the cost of putting a book scanner in 20,000 libraries would be $10,000,000. If these libraries digitized an average of even one book per day, they could digitize 10,000,000 books in two years. Since 10 books per day should be well within the capabilities of an inexpensive digitizer, the libraries should have no technical difficulty digitizing 4 million books per month.
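The back-of-the-envelope arithmetic, with my working-day assumptions (roughly 250 scanning days a year and 20 a month) made explicit:

# Back-of-the-envelope numbers for a $500 scanner in each of 20,000 libraries.
libraries = 20_000
print("hardware cost:", 500 * libraries)               # $10,000,000

print("books in two years:", libraries * 1 * 250 * 2)  # 1 book/day  -> 10,000,000
print("books per month:", libraries * 10 * 20)         # 10 books/day -> 4,000,000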

If libraries acquired the capability of digitizing millions of books per month, then Google's erstwhile monopoly on digitized out-of-print books could evaporate quickly in an appropriate legal environment. Rightsholders who have been angry at Google for working with libraries on digitization should think ahead to a future in which their works can be ripped, mixed, and burned by cheap book digitizers in millions of homes and offices. The world will be different.

In Stephenson's Cryptonomicon, Randy Waterhouse develops a data haven in a Pacific island country to evade crude laws governing cryptography. I hope that Daniel Reetz doesn't have to retreat to a digitization haven country to be able to bring the sensible benefits of book digitization to people who need it.



Sunday, October 11, 2009

Wave is Better on the iPhone

The more I play with Google Wave, the less I like the user interface. In contrast, the more I learn about the Wave technology stack, the less I care about the user interface, because it's clear to me that the important innovations are underneath the skin. That became very clear to me when I got on the train on Friday morning and decided to see whether Wave worked on the iPhone. I've decided that Wave is better on the iPhone than it is in a full screen browser.

On a full screen browser, Wave takes up 3 columns. When you get to try Wave, the first thing you should do is get rid of the leftmost column. This gives the rest of Wave a bit of room to breathe. On the iPhone, by contrast, you get only one column of content per screen. The result is much, much easier to digest. You don't have waves yipping at you while you read another wave.

The iPhone version of Wave is implemented as a web app running inside Safari. Inexplicably, when you log in, there's a message that says that the iPhone browser is not fully supported. This is inexplicable for 2 reasons.
  1. The iPhone implementation is quite a bit more mature than the Firefox 3.5 implementation I run on my laptop.
  2. The main bug in the iPhone implementation is that the screen telling you the iPhone browser is not fully supported breaks links into Wave.
I would expect to see an iPhone-native Wave app very soon.

I think most people running Wave will eventually choose to run Wave in native applications that talk the Wave protocol, just as most people (i.e. me) use native email clients to do their email. I just hope that Google's choice to launch Wave inside browsers is not just a plan to establish Chrome as a new web operating system. Twitter owes its success in no small part to the ecosystem of client applications that have sprung up around it. I use the Nambu client myself; I'm sure that many satisfied Tweetdeck users would be unhappy if the only way they could use Twitter was through Nambu.

I've read the assertion that Google Wave is the Segway of email, but I think that's wrong. The better analogy is the videophone. Although it's often thought that AT&T's videophone failed because there was no one to call, that's only half of the story. The reason that there was no one to call was that the videophone didn't fit into any social practice- no one really wanted to have people popping up on little screens in their living rooms.

I'm betting that Wave will turn out to be more popular than the Picturephone, but the real turning point will be third-party clients.

This article is also posted inside Wave.

Wednesday, October 7, 2009

Judge Chin Confronts the Scanning Problem

Although our judicial system may often seem to be prehistoric in its use of technology, there is at least one District Judge who is eager to confront the problem of turning print into digital.

Over seventy spectators, including lawyers, professors, students, publishing executives, authors, journalists, and at least one technologist coming to grips with the fact that mobile phones and laptops are not even allowed in the building, packed a lower Manhattan courtroom today to witness a "status conference" on the settlement of the lawsuit between Google, the Authors Guild, and the Association of American Publishers. The lawsuit began when Google began scanning and indexing out-of-print books without getting permissions from rightsholders.

Originally, today's hearing had been scheduled to be the fairness hearing for the settlement, which partly explains why so many people attended such a mundane proceeding. A few even came all the way from Japan to attend. After a pointed "Statement of Interest" by the US Department of Justice, the parties to the settlement had asked for a postponement of the fairness hearing to make changes to the agreement, but Judge Denny Chin (who was just yesterday officially nominated for a seat on the court of appeals) was eager to keep the case on track and decided to schedule this "status conference".

Here are my notes; I posted them quickly (before the Phils game) and updated them later (after they crushed the Rockies, 5-1):

The judge's practical manner contrasted with the odd formality of the whole affair. He went straight to business, asked to "see where we are" and verified that the settlement agreement which had been submitted was in fact not on the table any more. He then asked for a status report on the deliberations. Michael Boni, lead attorney for the author's subclass, gave a prepared statement on behalf of all the parties. The parties have been working "assiduously", "around the clock", with the Justice department and representatives of all the parties. They intend to come up with amendments to the existing Settlement Agreement, and expect to have these ready by early November.

Although there was substantial work to do, they had the "full attention" of the Justice Department and hope to be able to satisfy its concerns. They intend to seek approval for a supplemental notice program substantially shorter than the original notice program, given that the amendments would be providing additional benefits to the settling classes. There will have to be time for class members to opt in, opt out or object, but they hoped objections could be limited to the amendments. A motion for final approval would occur in late December or early January. They acknowledge this to be an ambitious schedule. Boni noted that there is a deadline of January 5 in the current agreement for authors and publishers to claim books to be eligible for lump-sum compensation; the parties have agreed that this should be extended to June 5, 2010.

William Cavanaugh from the Justice Department then spoke briefly. He repeated that they had been in discussion with the parties, but would need to see the amendments before giving their support. He also requested that the government be given a week to 10 days after the deadline for objections to the amendments to prepare their position.

Judge Chin then said "I like this schedule. I think I agree with the concept that limited or supplemental notice is all that will be required", given that he'd received a "large body of thoughts for and against". Anything else would result in a delay of many months, which would not be acceptable to the court.

At this point, the dreary scheduling and status reporting having concluded, Judge Chin pointed out that he had not been on the case at its beginning, and hoped that the whole process could be made smoother than it had been. He complained about all the hard copies being sent around, over-burdening his staff who had access to just one small scanner. "Really. In this case of all cases, couldn't the submissions be electronic?" he suggested, to general amusement and relief. Maybe there could be an email address set up for submitting objections.

Michael Boni noted that there are requirements to "serve" the parties, and that his firm could be served via e-mail. Judge Chin interjected that no one had envisioned having over 400 objections, and expressed a desire to see only electronic documents.

Judge Chin wanted to cover all eventualities, and asked whether the parties had contemplated a possible breakdown in the negotiations, and asked whether discovery had occurred prior to the settlement talks. Boni reported that millions of pages of documents had been produced, but that no depositions had been done. Daralyn Durie, representing Google, reiterated that it expects that the parties will be able to present a settlement agreement.

Judge Chin then set November 9th as the date for submission of the amended settlement.

A final matter that I don't know the background of was that the American Society of Media Photographers had submitted a motion for consideration. Judge Chin would allow the parties a week to respond before he rules on the motion.

Update: The ASMP had filed a motion to intervene in the case. They represent a class that is in an odd position- they were originally part of the plaintiff class, but abruptly found themselves excluded from the settlement. So they want to have their bases covered so they can be sure to have standing to object to the settlement. Judge Chin ruled against all the motions to intervene, citing timeliness, but the ASMP asked for reconsideration, and filed an appeal. You can find all the motions at the Public Index.


Tuesday, October 6, 2009

Telephone Dread and Wave Inbox Puppies

It took me 30 years to get over telephone dread. Before my cure, I could only make a telephone call in a quiet environment without anyone to disturb me. Having decided to call someone, I would screw up my courage, pre-decide what to say to every possible answerer, and then pick up the phone. The dial tone would taunt me until I started dialing. With every digit of the phone number, I would experience doubt and anxiety. The touch-tone made it better; in the days of pulse dialing, I hated numbers with lots of zeroes. Maybe the number was wrong. Maybe the person I was calling would be out. Maybe I would have to talk with someone's mother. Maybe the person wouldn't know who I was, and I'd have to give a humiliating explanation of who I was. Half the time I would click the hook part way through the number after thinking of some unconsidered terror. By the time I got through, I would be an emotional wreck.

It was the mobile phone that cured me. I could just key in the number, double check it, stare at it for a minute, and then I could press the send button and I was launched into the call. Ten buttons of anxiety was reduced to a single button, and that, combined with a small bit of emotional maturity, has enabled me to use the phone like a normal human being.

I've always been comfortable with email. There's a single send button, and once an email has been sent, I can forget about it completely until I get a reply. And there's no need to reply immediately- the ease with which email can get lost is perhaps its best attribute!

The tele-presence afforded by Instant Messaging is quite comfortable. You can see who's "around" and strike up casual conversation. The intermittent immediacy can be very stimulating.

Twitter was very un-threatening to start out with. No one was ever going to read that first status message, or so you thought. Gradually, Twitter will reel you in because once in a while, without rhyme or reason, people will react to what you say. Retweets are like the random rewards that kept pigeons pecking in the famous random reinforcement schedule experiment.

Google Wave is different, and I still can't tell if it will terrorize me as the telephone dial did or whether it will comfort me the way IM status messages do. Although it's conceived as email re-invented, it moves beyond familiar modes of communication. My first impressions post got a commenter who pointed to a blog post from June that included a note about the unexpectedness of one of Wave's features.
Wave is changing paradigms. People can no longer take back what is released. Even if someone deletes part of the document, the deleted part can be seen in playback. While this "permanent memory" was there almost since the beginning of the Internet, it was never before real-time. How could we take back an information from a Wave? Imagine you have misplaced your password to the wave instead of password input box. It will always be visible. OK, I could change my password, but what about unfortunate copy&paste event with a credit card number?
I had never really considered the importance of forgettability in communication.

Wave's waves are described as "living things" because they can be continually added to. They also live somewhere that's not on your computer. Google's conception is that there will eventually be many Wave Service Providers running the Wave open-source service platform, but for now, the waves all reside on Google's servers. I can't think of any form of human communication that "lives" in remotely the same way. Maybe people playing music or games together is the closest example.

Email was easy to start out with because it wasn't so different from the kind of mail that requires stamps and envelopes. You wrote a message and sent it off. Instant messaging was not so different from having a real-world or telephone chat. Wave, by contrast, maps most closely to other things you do on the internet, like IM and Wiki editing, each of which is a degree removed from things you do in real life. Wave is two degrees removed from normal human intercourse.

Since it's a living thing you can't just send a wave and then forget it. Your "inbox" is like a litter of puppies that are yelping for attention and growing in front of your eyes. Or maybe they just sit there dead, waiting to be "archived". It will take a while before I'll know whether the Wave window will be filling me with dread or delight. At least I've learned how to give my puppies a bit more breathing room.

After four days of Wave, I'm most excited to see how the Wave ecosystem is developing, even though many features are still hidden because important things aren't working yet. A sorely needed feature has been a way to do ranking of public waves. Today, that feature is being born. This is a great example of the benefits to having a platform open to developers from the very beginning. Needed functionality is being quickly added using the robot mechanism. I'm starting to think about how to use Wave robots to do useful things in libraries and scholarly communication.

This post is posted in Wave.


Friday, October 2, 2009

Google's Wave is a Tsunami of Conversation

Human communication is a magical thing, and over the last decade, we've acquired some super-powers. We've had the telephone for a hundred and thirty three years and it is still developing. Nowadays you can hardly set your twelve-year-old loose in the mall without packing a cell phone, and people are dying because they just have to send texts while driving. Just think about all the "conversation" tools we've added since the birth of the internet: e-mail, instant messaging, chat rooms, mailing lists, bulletin boards, blogs, various and sundry social networks, Facebook, Skype, Twitter. Our use of these tools will continue to evolve, but at some point there's going to be a consolidation. Do we really need both Twitter and Facebook?

Yesterday, I was lucky enough to get an invitation to the Google Wave Preview. I have no idea whether Wave will be the next big thing, but I can tell you it won't fail for lack of ambition. Unlike Twitter, which started with a concept so simple that it sounded really stupid, Wave starts by imagining what email would look like if its designers were to start from scratch, knowing what we know today. The result is a daunting attempt to roll all of our communication superpowers into a single user interface. At first, I couldn't really figure out how to get Wave to do what I wanted it to do. Part of the problem was that some of the features mentioned in the help videos had not been activated in the Preview version, though they are still working in the "Sandbox" version which has been available to developers for a few months now. Other things just don't work yet- the contacts module seems very buggy, which is a big problem because you need that to connect with other "Wavers". Luckily, my multitool communications mishmash is still working, and I was able to find some friends (via Twitter and Facebook) who were in the same situation as myself, still stumbling around a dark room of functionality, looking for someone to connect with.

With some help here and there, I gradually figured things out. Wave introduces several new (to me) user-interface widgets; it struck me that web applications rarely introduce new interface widgets, unlike applications like Excel or Photoshop that put powerful capabilities in interface objects. I still haven't figured out the sliders. The threaded discussions have little colored boxes that indicate who is typing something. The overall effect is that gMail has been tricked out and turbocharged. After a day of playing with it, I'm not sure I like the user interface, but I can't really think of any way to make it better given the scope of what Wave is trying to do.

The focus of Wave is editable, threaded conversations, i.e. waves. (There's already a convention to use lower case w for the conversations and upper case W for the platform as a whole.) These are quite well done. They support hierarchical threading, styled text, auto-linking, distributed editing, history, history playback, insertion of images, videos, and software objects known as extensions (there is a poll widget pre-installed). The one thing that's missing is undo. I can think of all sorts of uses for these capabilities; I intend to try a few over the next week or so.

The intro videos plug a tool called Bloggy that allows you to publish a wave to your blog. Bloggy is a robot that you add as a participant in your conversation. I spent about an hour trying to figure out how to get Bloggy to work, until I discovered (via Twitter) that Bloggy had not been activated on the Preview version yet.

Having failed with Bloggy, I started thinking "Is that all there is?" and envying Google's white-hot overhype machine. But then @jillmwo pointed me to the way to search on public waves (put "with:public" before a search). Immediately the little ripply waves I was seeing turned into a tsunami, and I began to see the enormous possibilities of Wave. To make a wave public, you add a participant called "Public" (public@a.gwave.com) to the wave; "Bloggy" does the same thing.

The ability to search public waves (and to make your own public wave) is a feature qualitatively different from anything I've seen before. A public wave is sort of the bastard offspring of a Twitter hashtag mated with a Wikipedia topic page. When it matures, this creature will surely be a monstrous beast; what no one can tell yet is whether the beast will be a tame workhorse or whether it will be a velociraptor requiring a good strong cage. New technology is always easy to create and control compared to new social practice.

If it were me trying to invent email from scratch, I would spend 80% of my effort on spam prevention. It looks to me as though many of Wave's features have been hidden in the Preview because the functions needed for spam prevention are not ready yet. This is not to say that the Wave team is not spending 80% of its effort on spam prevention- these are incredibly hard things to get right. As an example, it's currently not possible to remove a participant from a wave, and any public wave participant can see the addresses of all the other participants. The problems with the contacts module are also probably related- waves from people you might not know just appear in your inbox, even though it looks like the intent is for them to first appear in the "Requests" folder in the "Navigation" module. Groups are also not yet implemented. It's difficult to know how this will play out until we see everything working.

Since the Preview roll-out, there have been quite a number of negative reviews by people pointing to all the problems in Google Wave. I think these are missing the point to some extent. Sure, Wave could end up being a complete failure, but even if that happens, Wave is giving us today a glimpse of what the future could look like. If not Wave, then surely there will be a SuperTwitter or a SpaceBook or maybe even a MagicForce that will contend for the consolidation of our one hundred flowers of conversation.

Note: this post will also be published as a public wave.
