Web Scientist: 2012

Friday, 3 August 2012

Queen Anne, Copyright and Illegal Numbers

I have spent a lot of time as a web scientist and open access advocate considering the role of copyright in the web, where by "considering" I mean "banging my head against a wall" and by "role" I mean "trump card for the intellectually bankrupt but politically powerful".

Three hundred years ago, the English crown invented a new kind of property which it declared to exist in the written expression of ideas, and which it granted directly to the individuals who created those written texts. Copyright declares that anyone who creates a new piece of written text becomes the only person who has the right to make a copy of that text. Publishing a text or CD used to require the creator to make (print or burn) as many copies as were necessary, and readers simply acquired a pre-copied item. On the Web, the creator makes available a single item (on their web server's disk) and then, to read the page, the readers have to make their own copies using their own computers' browsers.

The Web has created a new kind of user experience and raised all kinds of expectations that are not compatible with the idea of copyright; different people would like to reconcile these differences by either changing the Web or changing the notion of copyright.

Copyright controls the expression of ideas; it stops one person stealing and using another person's work, but does not stop them using that person's ideas. However, now that these ideas are expressed digitally, there is more at stake than a piece of text.

Copyright doesn't just apply to novels or articles; it has been argued that it applies to very much shorter forms of communication such as emails and tweets. All of these things are stored as documents or files on a computer; they are sequences of bytes that have to be interpreted according to some coding scheme (e.g. ASCII, Unicode, plain text, HTML or Word).

Imagine the tweet "I am a pink hotdog". It is brief, but as far as Google is concerned it is original and has never been written before in any other document, and asserting my own copyright would seem to be justified. When I save it in a file on disk, I can use the od command to show it either as a string of characters or as a very long number (in hexadecimal format). I can then translate that hexadecimal number into a more usual decimal format.

text	I am a pink hotdog
hexadecimal	4920616d20612070696e6b20686f74646f67
decimal	6,370,215,410,492,649,031,668,884,346,259,210,575,834,983
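The conversion in the table can be sketched in a few lines of Python rather than with od. This is only an illustration under stated assumptions: the function names text_to_number and number_to_text are made up for this example, the text is taken to be plain ASCII, and the bytes are read most-significant first (which is what reading the hexadecimal string from left to right amounts to).

```python
def text_to_number(text: str) -> int:
    """Read the ASCII bytes of `text` as one big base-256 integer.

    Hypothetical helper for illustration; assumes ASCII-only input.
    """
    raw = text.encode("ascii")          # the bytes as they would sit in a file
    return int.from_bytes(raw, "big")   # most-significant byte first


def number_to_text(number: int) -> str:
    """Turn the integer back into the text it encodes.

    Assumes the text did not start with a zero byte, since leading
    zero bytes vanish when bytes are read as a number.
    """
    length = (number.bit_length() + 7) // 8   # bytes needed to hold the number
    return number.to_bytes(length, "big").decode("ascii")


tweet = "I am a pink hotdog"
number = text_to_number(tweet)

print(tweet.encode("ascii").hex())   # the hexadecimal form of the bytes
print(f"{number:,}")                 # the same value in decimal, with separators
print(number_to_text(number))        # the number alone recovers the text
```

The round trip at the end is the point: nothing beyond the number itself is needed to get the words back.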

As far as the computer is concerned, the very long number and the stream of characters are equivalent data. If the law says that no-one else is allowed to reproduce these characters, it is the same as saying that no-one else is allowed to reproduce this number. If these words are not allowed to be stored in a computer system as a document, then this number is not allowed to be stored in a computer system as the output of a calculation. In other words, copyright not only assigns property in the expression of ideas, it also assigns property in the use of numbers. Effectively, once you write down an (admittedly large) number, it becomes illegal for anyone else to use it until 70 years after you die.

I doubt that Queen Anne foresaw this outcome when she created legislation "for the encouragement of learning". I'll leave it as an exercise for my students to work out how many illegal numbers there are, and how much of which range of numbers they infest. See Wikipedia for other examples of illegal numbers.

Sunday, 25 March 2012

The Sky is Falling (again)

Mat Honan's recent article The Case Against Google tells us that Google is Evil, people are abandoning the Open Web in favour of Closed Ecosystems and it's impossible to search the web without surrendering enough privacy to make a gynaecologist blush. So far, so 2012.

Here's the über-challenge that Google has set itself in delivering relevant search results:

You are about to leave San Francisco to drive to Lake Tahoe for a weekend of skiing, so you fire up your Android handset and ask it "what's the best restaurant between here and Lake Tahoe?" It's an incredibly complex and subjective query. But Google wants to be able to answer it anyway. (This was an actual example given to me by Google.) To provide one, it needs to know things about you. A lot of things. A staggering number of things.
To start with, it needs to know where you are. Then there is the question of your route—are you taking 80 up to the north side of the lake, or will you take 50 and the southern route? It needs to know what you like. So it will look to the restaurants you've frequented in the past and what you've thought of them. It may want to know who is in the car with you—your vegan roommates?—and see their dining and review history as well. It would be helpful to see what kind of restaurants you've sought out before. It may look at your Web browsing habits to see what kind of sites you frequent. It wants to know which places your wider circle of friends have recommended. But of course, similar tastes may not mean similar budgets, so it could need to take a look at your spending history. It may look to the types of instructional cooking videos you've viewed or the recipes found in your browsing history.
It wants to look at every possible signal it can find, and deliver a highly relevant answer: You want to eat at Ikeda's in Auburn, California

In looking at the corner into which this search company has painted itself in its attempts to stay relevant, I can't help but compare it with the lengths to which the Sirius Cybernetics Corporation went with its Nutrimatic Drinks Dispenser:

When the 'Drink' button is pressed it makes an instant but highly detailed examination of the subject's taste buds, a spectroscopic analysis of the subject's metabolism, and then sends tiny experimental signals down the neural pathways to the taste centres of the subject's brain to see what is likely to be well received. However, no-one knows quite why it does this because it then invariably delivers a cupful of liquid that is almost, but not quite, entirely unlike tea. (Hitchhiker's Guide to the Galaxy, Douglas Adams)

You see, I'm fairly sure that after all that effort that Google puts into finding me a restaurant, it will end up sending me to a so-so kind of establishment - the kind of restaurant I 'normally' end up in on business trips. The kind of restaurant that averages out the likes and dislikes of all my companions. And after all that invasive knowledge elicitation, I'll end up somewhere which has put in more effort on search engine optimisation than culinary optimisation.

What I really want from Google in these circumstances is an answer to the question "are there any Michelin-starred restaurants between here and there?" It's a good old-fashioned objective question, with a written-down answer. One that can be looked up on the Web, not divined from my cerebellum.

Thursday, 5 January 2012

Open vs Closed? An Explosion of Generativity

In a previous posting I mentioned Jonathan Zittrain's book The Future of the Internet and How to Stop It, in which he argues that the Internet needs to be open as a "generative system" to allow unanticipated change to emerge through unfiltered contribution from broad and varied audiences. His argument is that innovation on the Internet (i.e. the Web) needs to be open to all comers, in the same way that PC development has been unrestricted and open. No-one controls what you can do with a PC, what programs you can write or run, or what information you are allowed to process. The very processes that could control the Internet to make it a "safer" place (with regard to kiddie porn, piracy, cyber bullying, identity theft etc.) will also tend to restrict technological development and make the future of the Internet a much poorer place - both in terms of the user experience and in terms of the future economic activity that could be developed.

In developing this argument, Zittrain and others have tended to contrast PC development (open to all) with iPhone development (closed and controlled by Apple). The first edition of his book was written before the iPhone API was released and the remarkably successful App Store(TM) was launched. Subsequent editions of, and additions to, the book have finessed the argument, but by and large people still believe that a manufacturer-controlled smartphone with software development policed by the manufacturer is a bad thing for innovation and hence generativity.

Historical PC/Windows Package vs iOS Package Development per year

Is this "received wisdom" supported by the evidence? The chart compares the number of software packages contributed each year by developers on the Windows PC platform, as available from download.com (a major software portal since the early days of the Web), with the number contributed on the iOS iPhone/iPod/iPad platform, as available from Apple's App Store. It apparently shows an order of magnitude more development being supported by the closed environment.

Now PC software is available from thousands of sources, not just this single aggregator, and so the number of Windows packages here is clearly an underestimate, while the iOS figure is accurate (by the nature of a closed, single-manufacturer environment). Still, what matters is not the number of downloads, which scales with the number of distribution channels, but the number of software packages that have been created. Since download.com is such a significant source of PC software, we might expect it to carry a not-insignificant fraction of the software that is available to the general public.

So, given the arguments made about innovation and open platforms, it is interesting that there is such a difference between these figures for the two platforms in favour of the closed environment. That might suggest the amount of innovation stimulated by the iPhone is significant in comparison to the PC, that the development of the next generation of Web environments could be triggered by an iPhone-like ecosystem and not throttled by it, and that the future of the Internet is not so alarmingly threatened as some have thought.

This naive investigation and its results are an excuse for further investigation into how we theorise and predict the emergence of future web developments. The Web, after all, is not defined by the particular experience of a browser on a computer (desktop, laptop, netbook or smartphone), but by the interaction of informational and social agents.