Wednesday, December 31, 2014

The Year Amazon Failed Calculus

In August, Amazon sent me a remarkable email containing a treatise on ebook pricing. I quote from it:
... e-books are highly price elastic. This means that when the price goes down, customers buy much more. We've quantified the price elasticity of e-books from repeated measurements across many titles. For every copy an e-book would sell at $14.99, it would sell 1.74 copies if priced at $9.99. So, for example, if customers would buy 100,000 copies of a particular e-book at $14.99, then customers would buy 174,000 copies of that same e-book at $9.99. Total revenue at $14.99 would be $1,499,000. Total revenue at $9.99 is $1,738,000. The important thing to note here is that the lower price is good for all parties involved: the customer is paying 33% less and the author is getting a royalty check 16% larger and being read by an audience that’s 74% larger. The pie is simply bigger.
As you probably know, I'm an engineer, so when I read that paragraph, my reaction was not to write an angry letter to Hachette or to Amazon - my reaction was to start a graph. And I have a third data point to add to the graph. At Unglue.it, I've been working on a rather different price point, $0.  Our "sales" rate is currently about 100,000 copies per year. Our total "sales" revenue for all these books adds up to zero dollars and zero cents. It's even less if you convert to bitcoin.

($0 is terrible for sales revenue, but it's a great price for ebooks that want to accomplish something other than generate sales revenue. Some books want more than anything to make the world a better place, and $0 can help them do that, which is why Unglue.it is trying so hard to support free ebooks.)

So here's my graph of the revenue curve combining "repeated and careful measurements" from Amazon and Unglue.it:

I've added a fit to the simplest sensible algebraic form that fits the data, Ax/(1+Bx²), which suggests that the optimum price point is $8.25. Below $8.25, the increase in unit sales won't make up for the drop in price, and even if the price drops to zero, only twice as many books are sold as at $8.25 - the market for the book saturates.
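(For the curious, here's a minimal sketch of that fit in Python. The 1.74 ratio comes from Amazon's email above; the overall sales scale is arbitrary, and the unit-sales model S(x) = A/(1+Bx²) is my assumption, not Amazon's.)

    from math import sqrt

    # Unit-sales model S(x) = A / (1 + B*x^2), so revenue R(x) = A*x / (1 + B*x^2).
    # B is fixed by Amazon's sales ratio of 1.74 between $14.99 and $9.99:
    # S(9.99) / S(14.99) = (1 + B*14.99^2) / (1 + B*9.99^2) = 1.74
    ratio = 1.74
    B = (ratio - 1) / (14.99**2 - ratio * 9.99**2)
    optimum = 1 / sqrt(B)      # revenue peaks where dR/dx = 0, i.e. x = 1/sqrt(B)
    print(round(optimum, 2))   # about 8.3 - the $8.25 sweet spot above
    # At that price S(optimum) = A/2, exactly half the unit sales at $0 -
    # the "market saturates" observation.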

But Amazon seems to have quit calculus after the first semester, because the real problem has a lot more variables than the one Amazon has solved for. This is because they've ignored the effect of changing a book's price on sales of ALL THE OTHER BOOKS. For you math nerds out there, Amazon has measured a partial derivative when the quantity of interest is the total derivative of revenue. Sales are higher at $10 than at $15 mostly because consumers perceive $15 as expensive for an ebook when most other ebooks are $10. So maybe your pie is bigger, but everyone else is stuck with pop-tarts.

While any individual publisher will find it advantageous to price their books slightly below the prevailing price, the advantage will go away when every publisher lowers its price.

Some price-sensitive readers will read more ebooks if the price is lowered. These are the readers who spend the most on ebooks and are the same readers who patronize libraries. Amazon wants the loyalty of customers so much that they introduced the Kindle Unlimited service. Yes, Amazon is trying to help its customers spend less on their reading obsessions. And yes, Amazon is doing their best to win these customers away from those awful libraries.

But I'm pretty sure that Jeff Bezos passed calculus. He was an EECS major at Princeton (I was, too). So maybe the calculation he's doing is a different one. Maybe his price optimization for ebooks is not maximizing publisher revenue, but rather Amazon's TOTAL revenue. Supposing someone spends less to feed their book habit, doesn't that mean they'll just spend it on something else? And where are they going to spend it? Maybe the best price for Amazon is the price that keeps the customer glued to their Kindle tablet, away from the library and away from the bookstore. The best price for Amazon might be a terrible price for a publisher that wants to sell books.

Read Shatzkin on Publishers vs. Amazon. Then read Hoffelder on the impact of Kindle Unlimited. The last Amazon article you should read this year is Benedict Evans on profits.

It's too late to buy Champagne on Amazon - this New Year's at least.

Sunday, December 7, 2014

Stop Making Web Surveillance Bugs by Mistake!

Since I've been writing about library websites that leak privacy, I figured it would be a good idea to do an audit of Unglue.it to make sure it wasn't leaking privacy in ways I wasn't aware of. I knew that some pages leak some privacy via referer headers to Google, to Twitter, and to Facebook, but we force HTTPS and make sure that user accounts can be pseudonyms. We try not to use any services that push ids for advertising networks. (Facebook "Like" button, I'm looking at you!)

I've worried about using static assets loaded from third party sites. For example, we load jQuery from https://ajax.googleapis.com (it's likely to be cached, and should load faster) and Font Awesome from https://netdna.bootstrapcdn.com (ditto). I've verified that these services don't set any cookies and allow caching, which makes it unlikely that they could be used for surveillance of unglue.it users.
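(That check is easy to automate. A minimal sketch using the requests library; the exact asset versions and paths below are illustrative, not necessarily what unglue.it loads.)

    import requests

    # Third-party assets to audit: do they set cookies? do they allow caching?
    assets = [
        "https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js",
        "https://netdna.bootstrapcdn.com/font-awesome/4.2.0/css/font-awesome.min.css",
    ]

    for url in assets:
        r = requests.get(url)
        print(url)
        print("  Set-Cookie:", r.headers.get("Set-Cookie", "none"))      # want "none"
        print("  Cache-Control:", r.headers.get("Cache-Control", "missing"))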

It turned out that my worst privacy leakage was to Creative Commons! I'd been using the button images for the various licenses served from https://i.creativecommons.org/. I was surprised to see that id cookies were being sent in the request for these images.
In theory, the folks at Creative Commons could track the usage for any CC-licensed resource that loaded button images from Creative Commons! And it could have been worse. If I had used the HTTP version of the images, anyone in the network between me and Creative Commons would be able to track what I was reading!

Now, to be clear, Creative Commons is NOT tracking anyone. The reason my browser is sending id cookies along with button image requests is that the Creative Commons website uses Google Analytics, and Google Analytics sets a domain-wide id cookie. Google Analytics doesn't see any of this traffic- it doesn't have access to server logs. But without anyone intending it, the combination of Creative Commons, Google Analytics, and websites like mine that want to promote use of Creative Commons have conspired to build a network of web surveillance bugs BY MISTAKE.

When I asked Creative Commons about this, I found out they were way ahead of the issue. They've put in redirects to the HTTPS versions of their button images. This doesn't plug any privacy leakage, but it discourages people from using the privacy-spewing HTTP versions. In addition, they'd already started the process of moving static assets like button images to a special-purpose domain. The use of this domain, licensebuttons.net, will ensure that id cookies aren't sent, so nobody can use them for surveillance.

If you care about user privacy and you have a website, here's what you should do:
  1. Avoid loading images and other assets from 3rd party sites. Consider self-hosting them.
  2. When you use 3rd party hosted assets, use HTTPS references only!
  3. Avoid loading static assets from domains that use Google Analytics and set id domain cookies.
For Creative Commons license buttons, use the buttons from licensebuttons.net. If you use the Creative Commons license chooser, replace "i.creativecommons.org" in the code it makes for you with "licensebuttons.net". This will help the web respect user privacy. The buttons will also load faster, because the "i.creativecommons.org" requests will get redirected there anyway.
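(If you have a pile of existing pages to fix, the swap is mechanical. A rough sketch; the "templates" directory is an assumption about your site's layout.)

    from pathlib import Path

    # Point CC button references at the cookieless licensebuttons.net domain.
    for page in Path("templates").rglob("*.html"):
        html = page.read_text()
        if "i.creativecommons.org" in html:
            page.write_text(html.replace("i.creativecommons.org", "licensebuttons.net"))
            print("rewrote", page)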

Saturday, November 22, 2014

NJ Gov. Christie Vetoes Reader Privacy Act, Asks for Stronger, Narrower Law

According to New Jersey Governor Chris Christie's conditional veto statement, "Citizens of this State should be permitted to read what they choose without unnecessary government intrusion." It's hard to argue with that! Personally, I think we should also be permitted to read what we choose without corporate surveillance.

As previously reported in The Digital Reader, the bill passed in September by wide margins in both houses of the New Jersey State Legislature and would have codified the right to read ebooks without letting the government and everybody else know about it.

I wrote about some problems I saw with the bill. Based on a California law focused on law enforcement, the proposed NJ law added civil penalties on booksellers who disclosed the personal information of users without a court order. As I understood it, the bill could have prevented online booksellers from participating in ad networks (they all do!).

Governor Christie's veto statement pointed out more problems. The proposed law didn't explicitly prevent the government from asking for personal reading data, it just made it against the law for a bookseller to comply. So, for example, a local sheriff could still ask Amazon for a list of people in his town reading an incriminating book. If Amazon answered, somehow the reader would have to:
  1. find out that Amazon had provided the information
  2. sue Amazon for $500.
Another problem identified by Christie was that the proposed law imposed privacy burdens on booksellers stronger than those on libraries. Under another law, library records in New Jersey are subject to subpoena, but bookseller records wouldn't be. That's just bizarre.

In New Jersey, a governor can issue a "Conditional Veto". In doing so, the governor outlines changes in a bill that would allow it to become law. Christie's revisions to the Reader Privacy Act make the following changes:
  1. The civil penalties are stripped out of the bill. This allows Gov. Christie to position himself and NJ as "business-friendly".
  2. A requirement is added preventing the government from asking for reader information without a court order or subpoena. Christie gets to be on the side of liberty. Yay!
  3. It's made clear that the law applies only to government snooping, and not to promiscuous data sharing with ad networks. Christie avoids the ire of rich ad network moguls.
  4. Child porn is carved out of the definition of "books". Being tough on child pornography is one of those politically courageous positions that all politicians love.
The resulting bill, which was quickly reintroduced in the State Assembly, is stronger but narrower. It wouldn't apply in situations like the recent Adobe Digital Editions privacy breach, but it should be more effective at stopping "unnecessary government intrusion". I expect it will quickly pass the Legislature and be signed into law. A law that properly addresses the surveillance of ebook reading by private companies will be much more complicated and difficult to achieve.

I'm not a fan of his by any means, but Chris Christie's version of the Reader Privacy Act is a solid step in the right direction and would be an excellent model for other states. We could use a law like it on the national level as well.

(Guest posted at The Digital Reader)

Wednesday, November 5, 2014

If your website still uses HTTP, the X-UIDH header has turned you into a snitch

Does your website still use HTTP? If so, you're a snitch.

As I talk to people about privacy, I've found a lot of misunderstanding. HTTPS applies encryption to the communication channel between you and the website you're looking at. It's an absolute necessity when someone's making a password or sending a credit card number, but the modern web environment has also made it important for any communication that expects privacy.

HTTP is like sending messages on a postcard. Anyone handling the message can read the whole message. Even worse, they can change the message if they want. HTTPS is like sending the message in a sealed envelope. The messengers can read the address, but they can't read or change the contents.

It used to be that network providers didn't read your web browsing traffic or insert content into it, but now they do so routinely. This week we learned that Verizon and AT&T were inserting an "X-UIDH" header into your mobile phone web traffic. So for example, if a teen was browsing a library catalog for books on "pregnancy" using a mobile phone, Verizon's advertising partners could, in theory, deliver advertising for maternity products.
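(If you run a site and want to see whether your visitors' requests are arriving with the header attached, it's easy to look for. A minimal WSGI middleware sketch; where you log it is up to you.)

    # Note (but don't store) requests that arrive carrying a carrier-injected
    # X-UIDH tracking header. The carrier can only inject it into plain HTTP traffic.
    def uidh_detector(app):
        def middleware(environ, start_response):
            if "HTTP_X_UIDH" in environ:
                print("X-UIDH present on request for", environ.get("PATH_INFO"))
            return app(environ, start_response)
        return middleware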

The only way to stop this header insertion is for websites to use HTTPS. So do it. Or you're a snitch.

Sorry, Blogger.com doesn't support HTTPS. So if you mysteriously get ads for snitch-related products, or if the phrase "Verizon and AT&T" is not equal to "V*erizo*n and A*T*&T" without the asterisks, blame me and blame Google.

Here's more on the X-UIDH header.

Tuesday, November 4, 2014

Reading Privacy Enables Reader Sharing

Digital privacy is a weird thing. People confuse it for digital security, but it's much more than that. Privacy isn't keeping secrets, it's controlling the information we share. What we think of as privacy depends on trusting that the people we share with won't do bad things. Privacy isn't digital at all. Maybe instead of "digital privacy" we should talk about "digital discretion".

The recent revelations of how Adobe Digital Editions was spewing the users' reading activity, unencrypted, to a logging server are an instructive example of poor digital discretion. I thought that Adobe was working on an ebook synchronization system, but it now looks like ADE was doing the logging "to support new business models" rather than for ebook sync.  It got me thinking about how ebook synchronization can and should be done.

Synchronization is a useful function. I'd like to be able to start reading a book on my iPhone while on the train in the morning, then pick up reading where I left off in the evening using my iPad. But to accomplish this function, I need to trust someone with information that discloses what I'm reading. It's easy to design a centralized sync system that requires a reader to register who they are,  what book they're reading and an activity stream of what pages are being read.

But a sync system designed for privacy doesn't need all that information. The central server doesn't even need to know the identity of the book! As Jason Griffey pointed out in his article on Adobe's spyware, the book's identifier could be hashed with a password, effectively hiding its identity from the central server.
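(A minimal sketch of that idea: each device derives the same opaque key locally, so positions sync, but the server never learns the title. The key-derivation parameters and identifiers below are illustrative.)

    import hashlib

    def sync_key(book_id, reader_secret):
        # The sync server stores reading positions under this opaque key.
        # It can match "same reader, same book" across devices without
        # ever learning which book is being read.
        return hashlib.pbkdf2_hmac(
            "sha256", book_id.encode(), reader_secret.encode(), 100000
        ).hex()

    print(sync_key("urn:isbn:9780140447576", "a secret only my devices know"))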

I wrote about how Bluefire is doing sync for their apps while trying their best to respect user privacy. Rather than obscuring the identity of the book, they focus on making it hard to identify users in their system.

Adobe was justifiably criticized for sending lots of information back to its central server without encryption. Although their version 4.0.1 sends less information, mostly Adobe is just encrypting the stream and claiming the privacy problem is solved. The core privacy problem remains- when a DRM ebook is read, an encrypted activity stream is sent back to Adobe. If the information is sensitive or useful, why should Adobe get the benefit of this information at all? At the very least, providing your activity stream to Adobe should be opt-in.

There's a second privacy problem that hasn't been discussed anywhere. It may seem contradictory, but central-server synchronization systems impose TOO MUCH privacy. In many situations, a reader will want to share their reading stream. Look at GoodReads - you can share your opinions with friends. Look at Kobo Reading Life - you get awards and statistics in return for your stream. In classroom situations, students could sync their readers with the instructors'. These sorts of affordances can't be developed without access to the reading-activity stream, and won't work unless everyone participating in the stream is in the same reading ecosystem, using the same central server.

If instead of encrypting the reading-event stream, encryption were applied to the events themselves, the events could be shared over most any messaging system, and distributed according to the user's choices and desired application. In fact, you could use Twitter.

Every user of a Twitter-reading-sync system would create a Twitter feed to publish their reading activity. Other users could subscribe to the event stream. Direct messages could be used to send decryption data for private reading streams. The system could be engineered so that even Twitter would be unable to know what's being read privately. And the whole world would have access to reading that's being done publicly. In addition to page turning,  bookmarking and annotation activity could be of interest.
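(Here's a sketch of "encrypt the events, not the channel". The event format is made up, and the Fernet recipe from the Python cryptography package stands in for whatever scheme would actually be used; the point is that the ciphertext can travel over any public feed.)

    import json
    from cryptography.fernet import Fernet

    # One key per private reading stream; share it only with followers
    # you've authorized (e.g. via direct message).
    stream_key = Fernet.generate_key()
    f = Fernet(stream_key)

    event = {"book": "urn:isbn:9780140447576", "page": 42, "action": "turn"}
    token = f.encrypt(json.dumps(event).encode())

    # The token is safe to publish anywhere; only key holders can read the event.
    print(f.decrypt(token))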

It's interesting to think about what might happen in a reading ecosystem where readers, not corporations, control the access to their reading activity streams. Publishers and authors might provide incentives to readers who share their reading-events with them. Social networks might match users reading the same page of the same book. Libraries could learn how to meet the needs of their communities. Teachers might be alerted to passages that students find to be difficult. Ironically, these public uses are enabled by a system design which puts a premium on privacy for the reader.

Dave Eggers's novel "The Circle" gave us the expression "Privacy is Theft". The novel imagines social norms that consider privacy to be a reflection of selfishness. But in the real world, it's the lack of discretion by companies building up vast private collections of personal information that's the true threat to social sharing. Too bad that theft is not a crime.

Wednesday, October 29, 2014

GITenberg: Modern Maintenance Infrastructure for Our Literary Heritage

One day back in March, the Project Gutenberg website thought I was a robot and stopped letting me download ebooks. Frustrated, I resolved to put some Project Gutenberg ebooks into GitHub, where I could let other people fix problems in the files. I decided to call this effort "Project Gitenhub". On my second or third book, I found that Seth Woodworth had had the same idea a year earlier, and had already moved about a thousand ebooks into GitHub. That project was named "GITenberg". So I joined his email list and started submitting pull requests for PG ebooks that I was improving.

Recently, we've joined forces to submit a proposal to the Knight Foundation's News Challenge, whose theme is "How might we leverage libraries as a platform to build more knowledgeable communities? ". Here are some excerpts:
Abstract 
Project Gutenberg (PG) offers 45,000 public domain ebooks, yet few libraries use this collection to serve their communities. Text quality varies greatly, metadata is all over the map, and it's difficult for users to contribute improvements. 
We propose to use workflow and software tools developed and proven for open source software development- GitHub- to open up the PG corpus to maintenance and use by libraries and librarians. 
The result- GITenberg- will include MARC records, covers, OPDS feeds and ebook files to facilitate library use. Version-controlled fork and merge workflow, combined with a change-triggered back-end build environment, will allow scalable, distributed maintenance of the greatest works of our literary heritage.
Description 
Libraries need metadata records in MARC format, but in addition they need to be able to select from the corpus those works which are most relevant to their communities. They need covers to integrate the records with their catalogs, and they need a level of quality assurance so as not to disappoint patrons. Because this sort of metadata is not readily available, most libraries do not include PG records in their catalogs, resulting in unnecessary disappointment when, for example, a patron wants to read Moby Dick from the library on their Kindle.
Progress 
43,000 books and their metadata have been moved to the git version control software; this will enable librarians to collaboratively edit and control the metadata. The GITenberg website, mailing list and software repository have been launched at https://gitenberg.github.io/ . Software for generating MARC records and OPDS feeds has already been written.
Background 
Modern software development teams use version control, continuous integration, and workflow management systems to coordinate their work. When applied to open-source software, these tools allow diverse teams from around the world to collaboratively maintain even the most sprawling projects. Anyone wanting to fix a bug or make a change first forks the software repository, makes the change, and then makes a "pull request". A best practice is to submit the pull request with a test case verifying the bug fix. A developer charged with maintaining the repository can then review the pull request and accept or reject the change. Often, there is discussion asking for clarification. Occasionally versions remain forked and diverge from each other. GitHub has become the most popular site for this type of software repository because of its well-developed workflow tools and integration hooks.
The leaders of this team recognized the possibility to use GitHub for the maintenance of ebooks, and we began the process of migrating the most important corpus of public domain ebooks, Project Gutenberg, onto GitHub, thus the name GITenberg. Project Gutenberg has grown over the years to 50,000 ebooks, audiobooks, and related media, including all the most important public domain works of English language literature. Despite the great value of this collection, few libraries have made good use of this resource to serve their communities. There are a number of reasons why. The quality of the ebooks and the metadata around the ebooks is quite varied. MARC records, which libraries use to feed their catalog systems, are available for only a subset of the PG collection. Cover images and other catalog enrichment assets are not part of PG. 
To make the entire PG corpus available via local libraries, massive collaboration among librarians and ebook developers is essential. We propose to build integration tools around GitHub that will enable this sort of collaboration to occur.
  1. Although the PG corpus has been loaded into GITenberg, we need to build a backend that automatically converts the version-controlled source text into well-structured ebooks. We expect to define a flavor of MarkDown or Asciidoc which will enable this automatic, change-triggered building of ebook files (EPUB, MOBI, PDF). (MarkDown is a human-readable plain text format used on GitHub for documentation; MarkDown for ebooks is being developed independently by several teams of developers. Asciidoc is a similar format that works nicely for ebooks.)
  2. Similarly, we will need to build a parallel backend server that will produce MARC and XML formatted records from version-controlled plain-text metadata files.
  3. We will generate covers for the ebooks using a tool recently developed by NYPL and include them in the repository.
  4. We will build a selection tool to help libraries select the records best suited to their libraries.
  5. Using a set of "cleaned up" MARC records from NYPL, and adding custom cataloguing, we will seed the metadata collection with ~1000 high quality metadata records.
  6. We will provide a browsable OPDS feed for use in tablet and smartphone ebook readers.
  7. We expect that the toolchain we develop will be reusable for creation and maintenance of a new generation of freely licensed ebooks.
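(To make item 1 above concrete, here's a rough sketch of a change-triggered build step, using pandoc to turn a Markdown source into an EPUB. The repository layout and hook wiring are assumptions; the real GITenberg toolchain is still being designed.)

    import subprocess
    from pathlib import Path

    def build_ebook(repo_dir):
        # Called from a post-receive hook or CI job whenever the
        # version-controlled source changes.
        src = Path(repo_dir) / "book.md"          # assumed: one Markdown source per repo
        out = src.with_suffix(".epub")
        subprocess.run(["pandoc", str(src), "-o", str(out)], check=True)
        print("built", out)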

The rest of the proposal is on the Knight News Challenge website. If you like the idea of GITenberg, you can "applaud" it there. The "applause" is not used in the judging of the proposals, but it makes us feel good. There are lots of other interesting and inspiring proposals to check out and applaud, so go take a look!

Wednesday, October 15, 2014

Adobe, Privacy and the Big Yellow Taxi

Here's the most important thing to understand about privacy on the Internet: Google doesn't know your password. The FBI can't march into Sergey Brin's office and threaten to put him in jail unless he tells them your password (if it thinks you're making WMD's). Because it wouldn't do them any good. If Google could produce your password, it would be a sign either of gross incompetence or the ill-considered choice of your cat's name, "mittens", as your password.

Because Google's engineers are at least moderately competent, they don't store your password anywhere.  Instead, they salt it and hash it. The next time they ask you for your password, they salt it and hash it again and see if the result is the same as the hash they've saved. It would be easier for Jimmy Dean to make a pig from a sausage than it would be to get the password from its hash. And that's how the privacy of your password is constructed.
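(The whole trick fits in a few lines. A sketch using PBKDF2 from Python's standard library; the iteration count and salt size are illustrative, not a recommendation.)

    import hashlib, hmac, os

    def store_password(password):
        salt = os.urandom(16)                      # stored alongside the hash
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200000)
        return salt, digest                        # the password itself is never stored

    def check_password(password, salt, digest):
        candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200000)
        return hmac.compare_digest(candidate, digest)

    salt, digest = store_password("mittens")       # please don't
    print(check_password("mittens", salt, digest)) # True
    print(check_password("fluffy", salt, digest))  # False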

Using similar techniques, Apple is able to build strong privacy into the latest version of iOS, and despite short-sighted espio-nostalgia from the likes of James Comey, strong privacy is both essential and achievable for many types of data. I would include reading data in that category. Comey's arguments could easily apply to ebook reading data. After all, libraries have books on explosives, radical ideologies, and civil disobedience. But that doesn't mean that our reading lists should be available to the FBI and the NSA.

Here's the real tragedy: "we take your privacy seriously" has become a punch line. Companies that take care to construct privacy using the tools of modern software engineering and strong encryption aren't taken seriously. The language of privacy has been perverted by lawyers who "take privacy seriously" by crafting privacy policies that allow their clients to do pretty much anything with your data.

(Photo CC BY bevgoodin)
Which brings me to the second most important thing to understand about privacy on the Internet. Don't it always seem to go that you don't know what you've got till it's gone? (I call this the Big Yellow Taxi principle.)

Think about it. The only way you know if a website is being careless with your password is if it gets stolen, or they send it to you in an email. If any website sends you your password by email, make sure that website has no sensitive information of yours because it's being run by incompetents. Then make sure you're not using that password anywhere else and if you are, change it.

Failing gross incompetence, it's very difficult for us to know if a website or a piece of software has carefully constructed privacy, or whether it's piping everything you do to a server in Kansas. Last week's revelations about Adobe Digital Editions (ADE4) were an example of such gross incompetence, and yes, ADE4 tries to send a message to a server in Kansas every time you turn an ebook page. Much outrage has been directed at Adobe over the fact that the messages were being sent in the clear. Somehow people are less upset at the real outrage: the complete absence of privacy engineering in the messages being sent.

The response of Adobe's PR flacks to the brouhaha is so profoundly sad. They're promising to release a software patch that will make their spying more secret.

Now I'm going to confuse you. By all accounts, Adobe's DRM infrastructure (called ACS) is actually very well engineered to protect a user's privacy. It provides for features such as anonymous activation and delegated authentication so that, for example, you can borrow an ACS-protected library ebook through Overdrive without Adobe having any possibility of knowing who you are. Because the privacy has been engineered into the system, when you borrow a library ebook, you don't have to trust that Adobe is benevolently concerned for your privacy.

Yesterday, I talked with Micah Bowers, CEO of Bluefire, a small company doing a nice (and important) niche business in the Adobe rights management ecosystem. They make the Bluefire Reader App, which they license to other companies who rebrand it and use it for their own bookstores. He is confident that the Adobe ACS infrastructure they use is not implicated at all by the recently revealed privacy breaches. I had reached out to Bowers because I wanted to confirm that ebook sync systems could be built without giving away user privacy. I had speculated that the reason Adobe Digital Editions was phoning home with user reading data was part of an unfinished ebook sync system. "Unfinished" because ADE4 doesn't do any syncing. It's also possible that reading data is being sent to enable business models similar to Amazon's "Kindle Unlimited", which pays authors when a reader has read a defined fraction of the book.

For Bluefire (and the "white label" apps based on Bluefire), ebook syncing is a feature that works NOW. If you read through chapter 5 of a book on your iPhone, the Bluefire Reader on your iPad will know. Bluefire users have to opt in to this syncing and can turn it off with a single button push, even after they've opted in. But even if they've opted in, Bluefire doesn't know what books they're reading. If the FBI wants a list of people reading a particular book, Bluefire probably doesn't have the ability to produce one. Of course, the sync data is encrypted when transmitted and stored. They've engineered their system to preserve privacy, the same way Google doesn't know your password, and Apple can't decrypt your iPhone data. Maybe the FBI and the NSA can get past their engineering, but maybe they can't, and maybe it would be too much trouble.

To some extent, you have to trust what Bluefire says, but I asked Bowers some pointed questions about ways to evade their privacy cloaking, and it was clear to me from his answers that his team had considered these attacks.  Bluefire doesn't send or receive any reading data to or from Adobe.

For now, Bluefire and other ebook reading apps that use Adobe's ACS (including Aldiko, Nook, and apps from Overdrive and 3M) are not affected by the ADE privacy breach. I'm convinced from talking to Bowers that the Bluefire sync system is engineered to keep reading private. But the Big Yellow Taxi principle applies to all of these. It's very hard for consumers to tell a well engineered system from a shoddy hack until there's been a breach, and then it's too late.

Perhaps this is where the library community needs to forcefully step in. Privacy audits and 3rd party code review should be required for any application or website that purports to "Take privacy seriously" when library records privacy laws are in play.

Or we could pave over the libraries and put up some parking lots.

Thursday, October 9, 2014

Correcting Misinformation on the Adobe Privacy Gusher

We've learned quite a lot about Adobe Digital Editions version 4 (ADE4) since Nate Hoffelder broke the story that "Adobe is Spying on Users, Collecting Data on Their eBook Libraries". Unfortunately, there's also been some bad information generated along with the furor.

One thing that's clear is that Adobe Digital Editions version 4 is not well designed. It's also clear that our worst fears about ADE4 - that it was hunting down ALL the ebooks on a user's computer on installation and reporting them to Adobe - are not true. I've been participating with Nate and some techy people in the library and ebook world (including Galen, Liza, and Andromeda) to figure out what's really going on. It's looking more like an incompetently-designed, half-finished synchronization system than a spy tool. Follow Nate's blog to get the latest.

So, some misconceptions.
  1. The data being sent by ADE4 is NOT needed for digital rights management. We know this because the DRM still works if the Adobe "log" site is blocked. Also, we know enough about the Adobe DRM to know it's not THAT stupid.
  2. The data being sent by ADE4 is NOT part of the normal functioning of a well designed ebook reader. ADE4 is sending more than it needs to send even to accomplish synchronization. Also, ADE4 isn't really cloud-synchronizing devices, the way BlueFire is doing (well!).
On the legal side:
  1. The ADE4 privacy policy is NOT a magic incantation that makes everything it does legal. For example, all 50 states have privacy laws that cover library records. When ADE4 is used for library ebooks, the fact that it broadcasts a user's reading behavior makes it legally suspect. Even if the stream were encrypted, it's not clear that it would be legal.
  2. The NJ Reader Privacy Act is NOT an issue...yet. There's been no indication that it's been signed into law. If signed into law, and upheld, and found to apply, then Adobe would owe a lot of people in NJ $500.
  3. The California Reader Privacy Act is NOT relevant (as far as I can tell) because it's designed to protect against legal discovery and there's not been any legal process. However, California has a library privacy law.
  4. Europe might have more to say.
The bottom line for now is that ADE4 does not so much spy on you as it stumbles around in your closet and sometimes tells Adobe what it finds there. In a loud voice so everyone around can hear. And that's not a nice thing to do.

Update 1PM: Galen Charlton of Equinox Software has now reproduced the scanning behavior reported by Nate Hoffelder. This is important because there was always the possibility that Nate, whose reporting on ereaders has him trying out a lot of stuff, had some strange and unique system configuration.

Thursday, October 2, 2014

The Perfect Bookstore Loses to Amazon

My book industry friends are always going on and on about "the book discovery problem". Last month, a bunch of us, convened by Chris Kubica, sat in a room in Manhattan's Meatpacking district and plotted out how to make the perfect online bookstore. "The discovery problem" occupied a big part of the discussion. Last year, Perseus Books gathered a smattering of New York's nerdiest at "the first Publishing Hackathon". The theme of the event, the "killer problem": "book discovery". Not to be outdone, HarperCollins sponsored a "BookSmash Challenge" to find "new ways of reading and discovering books".  

Here's the typical framing of "the book discovery problem". "When I go to a bookstore, I frequently leave with all sort of books I never meant to get. I see a book on the table, pick it up and start reading, and I end up buying it. But that sort of serendipitous discovery doesn't happen at Amazon. How do we recreate that experience?" Or "There are so many books published, how do we match readers to the books they'd like best?"

This "problem" has always seemed a bit bogus to me. First of all, when I'm on the internet, I'm constantly running across interesting sounding books. There are usually links pointing me at Amazon, and occasionally I'll buy the book.

As a consumer, I don't find I have a problem with book discovery. I'm not compulsive about finding new books; I'm compulsive about finishing the book I've read half of. When I finish a book, it's usually two in the morning and I really want to get to sleep. I have big stacks both real and virtual of books on my to-read list.

Finally, the "discovery problem" is a tractable one from a tech point of view. Throw a lot of data and some machine learning at the problem and a good-enough solution should emerge. (I should note here that  book "discovery" on the website I run, unglue.it, is terrible at the moment, but pretty soon it will be much better.)

So why on earth does Amazon, which throws huge amounts of money into capital investment, do such a middling job of book discovery?

Recently the obvious answer hit me in the face, as such answers are wont to do. The answer is that mediocre discovery creates a powerful business advantage for Amazon!

Consider the two most important discovery tools used on the Amazon website:
  1. People who bought X also bought Y.
  2. Top seller lists.
Both of these methods have the property that the way to make these work for your book is for your book to sell a lot on Amazon. That means that any author or publisher that wants to sell a lot of books on Amazon will try to steer as many fans as possible to Amazon. More sales means more recommendations, which means more sales, and so on. Amazon is such a dominant bookseller that a bookstore could have the dreamiest features and pay the publisher a larger share of the retail selling price and still have the publisher try to push people to Amazon.
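(Mechanically, "also bought" is nothing fancier than co-occurrence counting over order histories. A toy sketch with made-up orders:)

    from collections import Counter
    from itertools import permutations

    orders = [
        {"moby-dick", "two-years-before-the-mast"},
        {"moby-dick", "the-sea-wolf", "two-years-before-the-mast"},
        {"the-sea-wolf", "white-fang"},
    ]

    also_bought = {}
    for order in orders:
        for a, b in permutations(order, 2):
            also_bought.setdefault(a, Counter())[b] += 1

    # The only way to show up in these recommendations is to already
    # be selling on the store that keeps the order histories.
    print(also_bought["moby-dick"].most_common(2))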

What happens in this sort of positive feedback system is pretty obvious to an electrical engineer like me, but Wikipedia's example of a cattle stampede makes a better picture.

The number of cattle running is proportional to the overall level of panic, which is proportional to...the number of cattle running! "Stampede loop" by Trevithj. CC BY-SA
Result: Stampede! Yeah, OK, these are sheep. But you get the point. "Herdwick Stampede" by Andy Docker. CC BY.
Imagine what would happen if Amazon shifted from sales-based recommendations to some sort of magic that matched a customer with the perfect book. Then instead of focusing effort on steering readers to Amazon, authors and publishers would concentrate on creating the perfect book. The rich would stop getting richer, and instead, reward would find the deserving.

Ain't never gonna happen. Stampedes sell more books.

Saturday, September 27, 2014

Online Bookstores to Face Stringent Privacy Law in New Jersey

Before you read this post, be aware that this web page is sharing your usage with Google, Facebook, StatCounter.com, unglue.it and Harlequin.com. Google because this is Blogger. Facebook because there's a "Like" button. StatCounter because I use it to measure usage, and Harlequin because I embedded the cover for Rebecca Avery's Maid to Crave directly from Harlequin's website. Harlequin's web server has been sent the address of this page along with your IP address as part of the HTTP transaction that fetches the image, which, to be clear, is not a picture of me.

I'm pretty sure that having read the first paragraph, you're now able to give informed consent if I try to sell you a book (see unglue.it embed -->) and constitute myself as a book service for the purposes of a New Jersey "Reader Privacy Act", currently awaiting Governor Christie's signature. (Update Nov 22: Gov. Christie has conditionally vetoed the bill.) That act would make it unlawful to share information about your book use (borrowing, downloading, buying, reading, etc.) with a third party, in the absence of a court order to do so. That's good for your reading privacy, but a real problem for almost anyone running a commercial "book service".

Let's use Maid to Crave as an example. When you click on the link, your browser first sends a request to Harlequin.com. Using the instructions in the returned HTML, it then sends requests to a bunch of web servers to build the web page, complete with images, reviews and buy links. Here's the list of hosts contacted as my browser builds that page:

  • www.harlequin.com
  • stats.harlequin.com
  • seal.verisign.com (A security company)
  • www.goodreads.com  (The review comes from GoodReads. They're owned by Amazon.)
  • seal.websecurity.norton.com (Another security company)
  • www.google-analytics.com
  • www.googletagservices.com
  • stats.g.doubleclick.net (Doubleclick is an advertising network owned by Google)
  • partner.googleadservices.com
  • tpc.googlesyndication.com
  • cdn.gigya.com (Gigya’s Consumer Identity Management platform helps businesses identify consumers across any device, achieve a single customer view by collecting and consolidating profile and activity data, and tap into first-party data to reach customers with more personalized marketing messaging.)
  • cdn1.gigya.com
  • cdn2.gigya.com
  • cdn3.gigya.com
  • comments.us1.gigya.com
  • gscounters.us1.gigya.com
  • www.facebook.com (I'm told this is a social network)
  • connect.facebook.net
  • static.ak.facebook.com
  • s-static.ak.facebook.com
  • fbstatic-a.akamaihd.net (Akamai is here helping to distribute facebook content)
  • platform.twitter.com (yet another social network)
  • syndication.twitter.com
  • cdn.api.twitter.com
  • edge.quantserve.com (QuantCast is an "audience research and behavioural advertising company")

All of these servers are given my IP address and the URL of the Harlequin page that I'm viewing. All of these companies except Verisign, Norton and Akamai also set tracking cookies that enable them to connect my browsing of the Harlequin site with my activity all over the web. The Guardian has a nice overview of these companies that track your use of the web. Most of them exist to better target ads at you. So don't be surprised if, once you've visited Harlequin, Amazon tries to sell you romance novels.
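(You can make this kind of inventory yourself without developer tools by fetching a page and listing the external hosts it references. A rough sketch; it only catches resources named directly in the HTML, not the ones added later by scripts, so it undercounts.)

    from html.parser import HTMLParser
    from urllib.parse import urlparse
    from urllib.request import urlopen

    class HostCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.hosts = set()

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name in ("src", "href") and value and value.startswith("http"):
                    self.hosts.add(urlparse(value).netloc)

    page_url = "http://www.harlequin.com/"   # any page you want to audit
    collector = HostCollector()
    collector.feed(urlopen(page_url).read().decode("utf-8", "ignore"))
    for host in sorted(collector.hosts):
        print(host)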

Certainly Harlequin qualifies as a commercial book service under the New Jersey law. And certainly Harlequin is giving personal information (IP addresses are personal information under the law) to a bunch of private entities without a court order. And most certainly it is doing so without informed consent. So its website is doing things that will be unlawful under the New Jersey law.

But it's not alone. Almost any online bookseller uses services like those used by Harlequin. Even Amazon, which is pretty much self contained, has to send your personal information to Ingram to fulfill many of the book orders sent to it. Under the New Jersey law, it appears that Amazon will need to get your informed consent to have Ingram send you a book. And really, do I care? Does this improve my reading privacy?

The companies that can ignore this law are Apple, Target, Walmart and the like. Book services are exempt if they derive less than 2% of their US consumer revenue from books. So yay Apple.

Other internet book services will likely respond to the law with pop-up legal notices like those you occasionally see on sites trying to comply with European privacy laws. "This site uses cookies to improve your browsing experience. OK?" They constitute privacy theater, a stupid legal show that doesn't improve user privacy one iota.

Lord knows we need some basic rules about privacy of our reading behavior. But I think the New Jersey law does a lousy job of dealing with the realities of today's internet. I wonder if we'll ever start a real discussion about what and when things should be private on the web.

Wednesday, September 24, 2014

Emergency! Governor Christie Could Turn NJ Library Websites Into Law-Breakers

Nate Hoffelder over at The Digital Reader highlighted the passage of a new "Reader Privacy Act" passed by the New Jersey State Legislature. If signed by Governor Chris Christie it would take effect immediately. It was sponsored by my state senator, Nia Gill.

In light of my writing about privacy on library websites, this poorly drafted bill, though well intentioned, would turn my library's website into a law-breaker, subject to a $500 civil fine for every user. (It would also require us to make some minor changes at Unglue.it.)
  1. It defines "personal information" as "(1) any information that identifies, relates to, describes, or is associated with a particular user's use of a book service; (2) a unique identifier or Internet Protocol address, when that identifier or address is used to identify, relate to, describe, or be associated with a particular user, as related to the user’s use of a book service, or book, in whole or in partial form; (3) any information that relates to, or is capable of being associated with, a particular book service user’s access to a book service."
  2. “Provider” means any commercial entity offering a book service to the public.
  3. A provider shall only disclose the personal information of a book service user [...] to a person or private entity pursuant to a court order in a pending action brought by [...] by the person or private entity.
  4. Any book service user aggrieved by a violation of this act may recover, in a civil action, $500 per violation and the costs of the action together with reasonable attorneys’ fees.
My library, Montclair Public Library, uses a web catalog run by Polaris, a division of Innovative Interfaces, a private entity, for BCCLS, a consortium serving northern New Jersey. Whenever I browse a catalog entry in this catalog, a cookie is set by AddThis (and probably other companies) identifying me and the web page I'm looking at. In other words, personal information as defined by the act is sent to a private entity, without a court order.

And so every user of the catalog could sue Innovative for $500 each, plus legal fees.

The only out is "if the user has given his or her informed consent to the specific disclosure for the specific purpose." Having a terms of use and a privacy policy is usually not sufficient to achieve "informed consent".

Existing library privacy laws in NJ have reasonable exceptions for "proper operations of the library". This law does not have a similar exemption.

I urge Governor Christie to veto the bill and send it back to the legislature for improvements that take account of the realities of library websites and make it easier for internet bookstores and libraries to operate legally in the Garden State.

You can contact Gov. Christie's office using this form.

Update: Just talked to one of Nia Gill's staff; they're looking into it. Also updated to include the 2nd set of amendments.

Update 2: A close reading of the California law on which the NJ statute was based reveals that poor wording in section 4 is the source of the problem. In the California law, it's clear that it pertains only to the situation where a private entity is seeking discovery in a legal action, not when the private entity is somehow involved in providing the service.

Where the NJ law reads
A provider shall only disclose the personal information of a book service user to a government entity, other than a law enforcement entity, or to a person or private entity pursuant to a court order in a pending action brought by the government entity or by the person or private entity.  
it's meant to read
In a pending action brought by the government entity other than a law enforcement entity, or by a person or by a private entity, a provider shall only disclose the personal information of a book service user to such entity or person pursuant to a court order.
Update 3 Nov 22: Governor Christie has conditionally vetoed the bill.

Monday, September 22, 2014

Attribution Meets Open Access

Credits Dancer (see on YouTube)
It drives my kids crazy, but I always stay for the credits after the movie. I'm writing this while on a plane over the Atlantic, and I just watched Wes Anderson's Grand Budapest Hotel. Among the usual credits for the actors, the producers, the directors, writers, editors, composers, designers, musicians, key grips, best boys, animators, model makers and the like, Michael Taylor is credited as the painter of "Johannes von Hoytl's Boy with Apple" along with his model, Ed Munro. "The House of Waris" is credited for "Brass Knuckle-dusters and Crossed Key Pins". There's a "Drapesmaster", a Milliner and two "Key Costume Cutters". There are even "Photochrom images courtesy of The Library of Congress". To reward me for watching to the end, there's a funny Russian dancer over the balalaika chorus.

It says a lot about the movie industry that so much work has gone into the credits. They are a fitting recognition of the miracle of a myriad of talents collaborating to result in a Hollywood movie. But the maturity of the film industry is also reflected in the standardization of the form of this attribution.

The importance of attribution is similarly reflected by its presence in each of the Creative Commons licenses. But many of the digital media that have adopted Creative Commons licensing have not reached the sort of attribution maturity seen in the film industry. The book publishing industry, for example, hides the valuable contributions of copy editors, jacket designers, research assistants and others. It's standard practice to attribute a work to the author alone. If someone spends time to make an ebook work well, that generally doesn't get a credit alongside the author.

The Creative Commons licenses require attribution, but don't specify much about how the attribution is to be done, and it's taken a while for media specific conventions to emerge. It seems to be accepted practice, for example, that CC licensed blog posts require a back-link to the original blog post. People who use CC licensed photos to illustrate a slide presentation typically have a credits page with links to the sources at the end.

Signs of maturation were omnipresent at the 6th Conference for Open Access Scholarly Publishing, which I'm just returning from. Prominent in the list of achievements was the announcement of a "Shared Statement and Community Principles on Expectations of Scholarly Standards on Attribution", a set of attribution principles for open access scholarly publications, signed by all the important open access scholarly publishers.

The four agreed-upon principles are as follows:

  1. Researchers choosing Open Access and using liberal licenses do so because they wish to maximise access to and re-use of their work. We acknowledge the tradition of both freely giving knowledge to our communities and also the expectation that contributions will be respected and that full credit is given according to scholarly norms.
  2. Authors choose Creative Commons licenses in part to ensure attribution and the assignment of credit. The community expects that where a work is reprinted, collected, aggregated or otherwise re-used substantially as a whole that the original source, location and free availability of the original version will be both made explicit and emphasised.
  3. The community expects that where modifications have been made to an article that this will be made explicit and every practicable effort will be made to make the nature and scope of modifications explicit. Where a derivative is digital all practicable efforts should be made to make comparison with the original version as easy as possible for the user.
  4. The community assumes, consistent with the terms of the Creative Commons licenses, that unless noted otherwise authors have not endorsed any republication or modification of their original work. Where authors have explicitly endorsed the republication or modified version this should be made explicit in a way which is separate to the attribution.

These principles, and the implementation guidelines that will result from further consultations, are particularly needed because many scholars, while supporting the reuse enabled by CC BY licenses, are concerned about possible misuse. The principles reinforce that when a work is modified, the substance of the modifications should be made clear to the end user, and that further, there must be no implication that republication carries any endorsement by the original authors.

One thing that is likely to emerge from this process is the use of CrossRef DOIs as attribution URLs. DOIs can be resolved (via redirection) to an authoritative web address and can be maintained by the publisher so that links needn't break when content moves.
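(The resolution step is just an HTTP redirect, so an attribution link can be checked mechanically. A quick sketch; the DOI below is a placeholder, not a real article.)

    import requests

    doi = "10.1000/xyz123"   # placeholder DOI
    r = requests.head("https://doi.org/" + doi, allow_redirects=True)
    print(r.url)             # the current authoritative location maintained by the publisher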

As scholarly content gets remixed, revised and repurposed, there will increasingly be a need to track contributions every bit as elaborate as for Grand Budapest Hotel. Imagine a paper by Alice analyzing data from Bob on a sample by Carol, with later corrections by Eve. Luckily we live in the future and there's already a technology and user framework that shows how it can be done. That technology, the future of attribution (I hope), is Distributed Version Control. A subsequent post will discuss why every serious publisher needs to understand GitHub.

The emphasis on community in the "Shared Statement" is vitally important. With consultation and shared values, we'll soon all be dancing at the end of the credits.

Monday, September 15, 2014

Analysis of Privacy Leakage on a Library Catalog Webpage

My post last month about privacy on library websites, and the surrounding discussion on the Code4Lib list, prompted me to do a focused investigation, which I presented at last week's Code4Lib-NYC meeting.

I looked at a single web page from the NYPL online catalog. I used Chrome developer tools to trace all the requests my browser made in the process of building that page. The catalog page in question is for The Communist Manifesto. It's here: http://nypl.bibliocommons.com/item/show/18235020052907_communist_manifesto .

You can imagine how reading this work might have been of interest to government investigators during the early fifties when Sen. Joe McCarthy was at the peak of his power. Note that, following good search-engine-optimization practice, the URL embeds the title of the resource being looked at.

I chose the NYPL catalog as my example, not because it's better or worse than any other library catalog with respect to privacy, but because it's exemplary. The people building it are awesome, and the results are top-notch. I happen to know the organization is working on making privacy improvements. Please don't take my investigation to be a criticism of NYPL. But it was Code4Lib-NYC, after all.

As an example of how far ahead of the curve the NYPL catalog is, note that the webpage offers links to free downloads at Project Gutenberg. The Communist Manifesto is in the public domain, so any library catalog that tells you that no ebook is available is lying. The majority of library catalogs today lie about this.

So here are the results.

In building the Communist Manifesto catalog page, my browser contacts 11 different hosts from 8 different companies.
  • nypl.secure.bibliocommons.com
  • cdn.bibliocommons.com
  • api.bookish.com
  • contentcafe2.btol.com
  • www.google-analytics.com
  • www.googletagmanager.com
  • cdn.foxycart.com
  • idreambooks.com
  • ws.sharethis.com
  • wd-edge.sharethis.com
  • b.scorecardresearch.com
Each of these hosts is informed of the address of the web page that generated the request. They are told, essentially, "this user is looking at our Communist Manifesto page". Some of the hosts need this information to deliver the services they contribute. Others get the same information via the "referer" header generated as part of the HTTP protocol. If the catalog were served with the more secure protocol "HTTPS", the referer header would not be sent.

The first of these is Bibliocommons. I've written about Bibliocommons before. They host the NYPL catalog "in the cloud". I'm not particularly concerned about Bibliocommons with respect to privacy, because they contract directly with NYPL, and I'm pretty sure that contracts are in place that bind Bibliocommons to the privacy policies in place at NYPL. But since HTTP is used rather than HTTPS, every host between me and the bibliocommons server can see and capture the URL of the web page I'm looking at. At the moment, I'm using the wifi in a Paris cafe, so the hosts that can see that are in the proxad.net, aas6453.net, level3.net, firehost.com and other domains. I don't know what they do with my browsing history.

I've previously written about the NYPL's use of the Bookish recommendation engine.  The BTOL.com link is for Baker&Taylor's "Content Cafe" service that provides book covers for library catalogs. I'm guessing (but don't know for sure) that these offerings have privacy policies that are aware of the privacy expectations of library users.

Yes, Google is one of the companies that NYPL tells about my web browsing. I'm pretty sure that Google knows who I am. A careful look at the Google Analytics privacy policy suggests that they can't share my browsing history outside Google. Unless required to by law.

Foxycart is not a company I was familiar with. They provide the shopping cart technology that lets me buy a book from the NYPL website and benefit them with part of the proceeds. I've been in favor of enabling such commerce on library sites because libraries need to do it to participate fully in the modern reading ecosystem. But it's still controversial in the library world.

Foxycart's privacy policy, like all privacy policies ever written, takes your privacy very seriously. Some excerpts:
When you visit this website, some information, such as the site that referred you to us, your IP and email address, and navigational and purchase information, may be collected automatically as part of the site’s operation. This information is used to generate user profiles and to personalize the web site to your particular interests. 
The information collected online is stored indefinitely and is used for various purposes. 
Cookies offer you many conveniences. They allow FoxyCart.com LLC, and certain third party content providers, to recognize information, and so can determine what content is best suited to your needs.  
We also reserve the right to disclose your personal information if required to do so by law, or in the good faith belief that such action is reasonably necessary to comply with legal process, respond to claims, or protect the rights, property or safety of our company, employees, customers or the public.

Here I need to explain about cookies. When a website gives you a cookie, it acquires the ability to track you across all the websites that company serves. This can be a great convenience for you. When you fill out a credit card form with your name and address, Foxycart can remember it for you so you don't have to type it in again when you come back to order something else. You might find that creepy if the last order you placed was on a porn site. But while NYPL hasn't told FoxyCart anything that could identify you personally, your interaction with FoxyCart is such that you may well choose to identify yourself. And all that information is stored forever. And FoxyCart can pass that information to all the Sen. Joe McCarthys of 2020, as well as to certain third-party content providers. FoxyCart probably doesn't give away your information today, but will they even be around in 2020?
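
Here's a minimal sketch of how that recognition works on the third party's end (illustrative only, not FoxyCart's actual code; Node.js with the Express and cookie-parser packages is assumed). The first request from your browser gets a long-lived cookie containing a random ID; every later request from any page that embeds the same service sends the ID back, so the visits can be stitched into one history.

```typescript
// Minimal sketch, not FoxyCart's actual code. Assumes Node.js with the
// Express and cookie-parser packages installed.
import express from "express";
import cookieParser from "cookie-parser";
import { randomUUID } from "crypto";

const app = express();
app.use(cookieParser());

app.get("/cart-widget.js", (req, res) => {
  // Reuse the visitor's existing ID, or mint one on the first visit.
  let visitorId = req.cookies["visitor_id"];
  if (!visitorId) {
    visitorId = randomUUID();
    // A cookie that lasts ten years, scoped to the tracker's own domain.
    res.cookie("visitor_id", visitorId, { maxAge: 10 * 365 * 24 * 3600 * 1000 });
  }
  // Stable ID + referer header = a cross-site browsing history.
  console.log(`visitor ${visitorId} is on ${req.get("referer")}`);
  res.type("application/javascript").send("/* cart widget */");
});

app.listen(8080);
```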

IdreamBooks syndicates book reviews. I don't know anything about them, and their homepage doesn't seem to have a privacy policy.

ScorecardResearch "conducts research by collecting Internet web browsing data and then uses that data to help show how people use the Internet, what they like about it, and what they don’t." They probably know whether I like ScorecardResearch. Their cookie is set by the ShareThis software.

ShareThis was one of the companies I mentioned in my last post. ShareThis provides social sharing buttons for the NYPL catalog. They also take your privacy very seriously. Some more excerpts:
In addition to the sharing service offered directly to users, the technology we use to assist with user sharing also allows us to gather information from publisher Web sites that include our ShareThis Sharing Icon or use our advertising technology, and enables ShareThis and our partner publishers and advertisers to use the value of the shared content and other information gathered through our technology to facilitate the delivery of relevant, targeted advertising (the ShareThis Services). 
we also receive certain non-personally identifiable information (e.g., demographic information such as zip code) from our advertisers, ad network and publisher partners, and we may combine this information with what we have collected. We also collect information from third-party Web sites with whom you have registered, like social networks, that those third parties make publicly available. 
While using the ShareThis Services, We may place third party advertisers’ and publishers’ cookies and pixels on their behalf regarding Usage Information. 
We are not responsible for the information practices of these third parties and the cookies placed by ShareThis on behalf of those third parties.
So ShareThis turns out to be in the business of advertising. They use your browsing behavior over thousands of websites to help advertisers target advertising and content to you. That scene in Minority Report where Tom Cruise gets personalized ads on the billboards he walks by? That's what ShareThis is helping to make happen today, and the NYPL website is helping them.
Ad Mall from Minority Report
They do this by cookie-sharing. In addition to setting a sharethis.com cookie, they set cookies for other companies, so those companies also get to know what you're reading. And when they do this, they enable other companies to connect your browsing behavior at NYPL with information you've provided to social networks. The result is that it's possible for a company selling Karl Marx merch to target ads at you based on your browsing of the Communist Manifesto catalog page.
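
A minimal sketch of how this kind of cookie-sharing can work (the domains are made up and this is not ShareThis's actual implementation): one tracker serves an invisible "sync pixel" that redirects to a partner with its own visitor ID in the URL, so the partner can record that its cookie and the first tracker's cookie belong to the same person.

```typescript
// Minimal sketch of cookie syncing between two hypothetical trackers.
// Assumes Node.js with Express and cookie-parser installed.
import express from "express";
import cookieParser from "cookie-parser";

// Tracker A: serves a sync pixel from pages that embed its sharing button.
const trackerA = express();
trackerA.use(cookieParser());
trackerA.get("/sync.gif", (req, res) => {
  const idA = req.cookies["a_id"] ?? "unknown";
  // Hand the partner tracker A's visitor ID in the redirect URL.
  res.redirect(`https://tracker-b.example.com/sync?partner=a&partner_id=${idA}`);
});
trackerA.listen(8081);

// Tracker B: reads its own cookie and records the ID-to-ID mapping.
const trackerB = express();
trackerB.use(cookieParser());
trackerB.get("/sync", (req, res) => {
  const idB = req.cookies["b_id"] ?? "unknown";
  console.log(`sync: B's visitor ${idB} is A's visitor ${req.query.partner_id}`);
  res.status(204).end(); // nothing visible happens in the browser
});
trackerB.listen(8082);
```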

But it's not like ShareThis is completely promiscuous. Their privacy agreement limits their cookie sharing to an exclusive group of advertising companies. Here's the beginning of the list:
  • 33across
  • accuen
  • Adap
  • adaramedia.com
  • adblade.com
  • addthis.com
  • adroll.com
  • aggregateknowledge.com
  • appnexus.com
  • atlassolutions.com
  • AudienceScience.com
That's just the A's.

In 1972, Zoia Horn, a librarian at Bucknell University, was jailed for almost three weeks for refusing to testify at the trial of the Harrisburg 7 concerning the library usage of one of the defendants. That was a long time ago. No longer is there a need to put librarians in jail.



Wednesday, August 13, 2014

Libraries are Giving Away the User-Privacy Store

AddThis makes some really nice widgets. Here are some for sharing this blogpost:

ShareThis is another company that does pretty much the same thing. Their share buttons are down at the end of the post. AddThis is bigger. It provides "behavioral, contextual, and interest based data that spans across hundreds of content categories and topics, reaching 1.7 billion uniques a month."

The widgets help users share your content. At the same time, AddThis and ShareThis widgets help a publisher figure out who is sharing what, while distributing the content onto other websites. To do this, they track users and see what sorts of websites they like. They can also work with advertising networks to improve the relevancy of ads shown to users. The tracking works by setting cookies and embedding "web beacons" that enable users to be followed across websites. In the case of AddThis, users are also tracked using "Canvas Fingerprinting", a technique that works even when a user blocks cookie tracking. ProPublica recently wrote about this technology, calling it the "Online Tracking Device that's Nearly Impossible to Block".
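
For the curious, here's a minimal browser-side sketch of the canvas fingerprinting idea (the drawing parameters are illustrative; this is not AddThis's actual script). The same text renders with tiny pixel-level differences on different hardware, drivers, and font stacks, so a hash of the rendered pixels yields a reasonably stable identifier with no cookie at all.

```typescript
// Minimal sketch of canvas fingerprinting; illustrative only.
async function canvasFingerprint(): Promise<string> {
  const canvas = document.createElement("canvas");
  canvas.width = 240;
  canvas.height = 40;
  const ctx = canvas.getContext("2d")!;

  // Draw text and shapes whose exact pixels vary slightly by GPU,
  // graphics driver, browser, and installed fonts.
  ctx.textBaseline = "top";
  ctx.font = "16px Arial";
  ctx.fillStyle = "#f60";
  ctx.fillRect(0, 0, 240, 40);
  ctx.fillStyle = "#069";
  ctx.fillText("how quickly daft jumping zebras vex", 4, 8);

  // Hash the rendered pixels; the digest serves as the tracking ID.
  const pixels = new TextEncoder().encode(canvas.toDataURL());
  const digest = await crypto.subtle.digest("SHA-256", pixels);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

// Usage: canvasFingerprint().then((id) => console.log(id));
```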

Here's what the ShareThis Privacy Policy says:
In some cases, if you have chosen to make PII (like your name) publicly available through third party sites like social networks, we may seek your consent to use that PII in connection with services we offer in conjunction with our partners. We will not disclose your PII without your consent.
We and our publisher, advertiser and ad network partners also use this data for other related purposes (for example, to do research regarding the results of our online advertising campaigns or to better understand the interests or activities of users of the ShareThis Services).
Similarly, AddThis says:
When an End User downloads a page that contains an AddThis Button, we may deploy a cookie on our own behalf or on behalf of our data partners, to record information about how an End User uses the web, such as the web search that landed the End User on a particular page or categories of the End User's interests. We may use the Data to target advertising toward the End User or authorize others to do the same. 
Many websites are using Google Analytics to measure usage; they let Google track their users in the same way (the website I run, Unglue.it, uses Google Analytics). However, the Analytics terms of service seem not to allow Google to share the collected data as freely as AddThis and ShareThis do.

Both AddThis and ShareThis assert in their legal terms that they mustn't collect usage information from children, so if children use your site, you're not supposed to use these services. Google Analytics does not have this restriction, presumably because Google can't use Analytics data to advertise to children.

Add in "Cookie Syncing" and "Evercookies", and the cumulative effect of all this tracking is that website users can be tracked pretty comprehensively, and if need be, identified, whether they like it or not. In exchange for deploying the trackers, websites get access to the valuable pool of information about their users.
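
And a minimal sketch of the "evercookie" idea (again illustrative): the same ID is written to several browser storage mechanisms, so if the user clears cookies, any surviving copy quietly respawns the others.

```typescript
// Minimal sketch of an "evercookie"-style respawning ID; illustrative only.
// Real implementations use many more storage channels (Flash LSOs, ETags,
// cache, IndexedDB, ...); three are enough to show the idea.
function respawnTrackingId(): string {
  const fromCookie = document.cookie.match(/(?:^|; )tid=([^;]+)/)?.[1];
  const fromLocal = localStorage.getItem("tid");
  const fromSession = sessionStorage.getItem("tid");

  // Use any surviving copy; otherwise mint a fresh ID.
  const id = fromCookie ?? fromLocal ?? fromSession ?? crypto.randomUUID();

  // Re-write the ID everywhere, undoing a partial clear-out.
  document.cookie = `tid=${id}; max-age=${10 * 365 * 24 * 3600}; path=/`;
  localStorage.setItem("tid", id);
  sessionStorage.setItem("tid", id);
  return id;
}
```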

Matt Mullenweg (of WordPress) has an interesting perspective:
services like AddThis and ShareThis will always spy on and tag your audience when you use their widgets, and you should avoid them if you care about that sort of thing.
This puts libraries in somewhat of a quandary. Traditionally, libraries have been havens of privacy for their users. Librarians have famously gone to jail for refusing to turn over circulation records to law enforcement. But it seems that libraries are not doing much to protect their users from the sort of information gathering done by AddThis, ShareThis, and Google. For example, New York Public Library uses Google Analytics and ShareThis. OCLC and WorldCat use AddThis. My own public library catalog (hosted by BCCLS) sets cookies for AddThis. I suppose they don't consider that their websites could be directed at children. Even the American Library Association's webpage extolling the importance of privacy in libraries makes use of Google Analytics. (Ironically, the link to a website privacy policy is broken on that page!)

It's true that these trackers are very common; even WhiteHouse.gov has employed AddThis buttons. But it seems to me that if libraries still think user privacy is valuable in this age of social media, they need to rethink their use of web user tracking companies. What disturbs me most is that there hasn't been much public discussion about the future role of privacy in library websites, even as it's rapidly being lost.

Update (Aug 15): AddThis says they're not using canvas fingerprinting and have terminated their test of it. I don't think this really changes the cost/benefit analysis for libraries. It remains true that libraries that use AddThis or ShareThis are allowing a third party to track their patrons' catalog browsing (not just their social sharing), under terms which permit the companies to use the data for advertising purposes. Use of Google Analytics allows Google to do the same tracking, but does not appear to permit use for advertising. Either way, libraries need to make informed choices and communicate those choices to their users. Same for Facebook "Like" buttons. Commercial sites, obviously, have different priorities and responsibilities.

Update (Aug 19): There are a number of free open-source solutions available both for social sharing and for analytics. There's a very useful discussion of these issues on Hacker News.



Thursday, July 31, 2014

Don't Bother Reading "Acts of the Apostles"

Read Biodigital instead.

After reviewing John Sundman's Biodigital, I promised to report back after reading Acts of the Apostles, which shares about 60% of its text with Biodigital.

It's very unusual for a lay reader to have access to two versions of a book in this way. Biodigital is partly the result of the sort of editorial work that goes on behind the scenes of publishing, and to read Acts is to become aware of the sausage-making that is usually invisible.

The bottom line is that Biodigital is a much better book. You won't miss anything if you skip Acts. While there's a lot of tightening here and there, there are two big changes which lead me to urge you to set aside Acts.

The first is Gordon Biersch, which has been removed from the book. Gordon Biersch opened in 1988 on Emerson Street in Palo Alto, California. I remember when it opened; it was a revelation. The beer was pretty good, and the food was designed to go with the beer. Today, this sort of place has a name: "gastropub", but back in 1988, that word didn't exist, at least in the vocabulary of grad students like me. Yuppies flocked to the place, and by the time Sundman was writing Acts, it signified everything good and bad about Silicon Valley. But since then, Gordon Biersch has gone all Vegas. No really, the founders were bought out by money from Las Vegas. Today, there's a Gordon Biersch gastropub in 34 places where restaurants are allowed to brew beer, including 4 in Taiwan. It's owned by the same company that owns "Rock Bottom" brewpubs.

In Biodigital, the events that occurred at Gordon Biersch have been moved a mile or so southeast to Antonio's Nut House. Antonio's is still around. Like everything else in the area, it's changed, but it's not like Silicon Valley changed into Las Vegas. It's like Sun Microsystems changed into Google. I went and had a beer there when I was visiting earlier this month. I took pictures. Google maps has a walk-through view.


The other big change is the book's depiction of Bartlett Aubrey. Bartlett, the estranged wife of hero Nick Aubrey, is supposed to be a brilliant molecular biologist, but in Acts, she mostly has big breasts. It's not a realistic portrait at all, more of an adolescent fantasy character. In Biodigital, references to Bartlett's breasts are cut by 50%, and I swear that's not why I thought the character was a lot smarter than in Acts.

So, support your local author. Or your local beer bar. Better yet, do both at the same time.