Monday 3 February 2014

The Importance of PhDs

The Illustrated Guide to a PhD is the latest article to come across my social feed whose intent is to pour cold water on the idea of studying for a PhD. It's reproduced in Business Insider, although the original can be found at Prof Matt Might's personal blog, where he posts about CS research (static analysis) at the University of Utah. Have a look at its characterisation of a PhD - it could be summarised as "why bother, you're just an insignificant pimple on the ignorance of humanity". The relevant diagram is included below; it shows the different levels of education (school, college, undergraduate, masters, PhD) as differently coloured areas, ranging from the all-round coverage of compulsory education (every topic, shallowly) to the niche depth of higher education. A PhD touches and extends the boundary of human knowledge (the black outer circle) in one very specific area, and so it appears as a small pimple that pushes the boundary slightly outwards.


To try to get another perspective on this, I did a few rough calculations (not so much data science as back-of-the-envelope science). In fact, the circles and pimples radically underestimate the situation - in order to represent the contribution that a single PhD student makes as a single pixel, the circle would need to be about the size of a swimming pool. In other words the situation is worse (much worse) than the picture suggests, but that's not because there's anything wrong with PhD research. It's because the world is so much bigger and more complex than we imagine; there is SO MUCH knowledge out there, and there are SO MANY researchers trying to extend the boundaries of human understanding, human craft and human ingenuity. Each PhD candidate may produce a tiny pimple of knowledge, but there are more than half a million of them across the world, and you know what they say - a million contributions here and a million contributions there and pretty soon you're talking real progress.

The aggregate achievement of higher degrees is significant, but what is the personal outcome? What is the significance of one pimple? The PhD doesn't just represent the attainment of a specific piece of knowledge, it represents the acquisition of specific expertise and human capability. It attests to a human being who knows how to address problems, how to propose novel solutions rationally synthesised from a broad base of evidence and experience. A PhD isn't someone stuck with niche and irrelevant knowledge, it's someone who can apply crucial higher-order skills and expertise for the economy and for society.

So don't lose heart about your PhD just because the world, the economy and society provide enormous challenges that can dwarf us all; remember that the point of a PhD is to transform YOU into part of the solution.

TL;DR

Here are my calculations about Matt's diagrams.

The width of the "bachelors speciality" is 57 pixels of a total 597-pixel salmon circumference, i.e. some 9.5% of all the subjects of human knowledge. In fact the number of separate undergraduate degree courses obtainable from my University is 331. Of course, many of these courses will overlap (let's guess that 1/3 of them are properly unique); on the other hand there are many areas of knowledge that we don't provide courses in (let's guess we cater for about 1/3 of all the subjects). Those two factors might tend to balance out, so I'll stick with 300 as the canonical number of undergraduate degree subjects. In other words, one person's undergraduate knowledge will represent a 0.33% coverage of all potential subject areas, and the UG pimple is about 30 times too wide.
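That arithmetic can be checked in a couple of lines. This is just a sketch: the pixel measurements come from my reading of Matt's diagram, and 300 is my own guess at the number of genuinely distinct undergraduate subjects.

```python
# Back-of-the-envelope check on the "bachelors speciality" wedge.
wedge_px = 57            # width of the bachelors wedge, as measured from the diagram
circumference_px = 597   # circumference of the salmon circle, as measured
subjects = 300           # guessed number of distinct undergraduate subjects

depicted_share = wedge_px / circumference_px   # share of knowledge as drawn
actual_share = 1 / subjects                    # share one degree really covers

print(f"depicted: {depicted_share:.1%}, actual: {actual_share:.2%}")
print(f"so the wedge is about {depicted_share / actual_share:.0f}x too wide")
```

Run as-is this prints a factor of about 29, which rounds to the 30 quoted above.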

The diameter of the "whole of human knowledge" circle is 720 pixels, which makes the circumference 2262 pixels in length to represent the boundary of all human knowledge. How many subjects exist that could be studied at PhD level? The Dewey library classification system allows knowledge descriptors to be built from 3 digits upwards (e.g. 500 Science; 570 Biology; 576 Genetics and evolution; 576.8 Evolution; 576.83 Origin of life; 576.839 Extraterrestrial life). If we stick to 5 digits, that provides 100,000 identifiable subject areas for scholarship and PhD research, and to distinguish them all by a single pixel on the circle's circumference would require the circle to be about 40 times larger.
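Under the same assumptions (a 720-pixel diameter, and one boundary pixel per five-digit Dewey code), the required enlargement is easy to reproduce - again a sketch, not gospel:

```python
import math

diameter_px = 720                    # "all human knowledge" circle, as drawn
boundary_px = math.pi * diameter_px  # ~2262 pixels of circumference
dewey_subjects = 100_000             # five-digit Dewey codes, 000.00-999.99

scale = dewey_subjects / boundary_px  # enlargement for one pixel per subject
print(f"boundary: {boundary_px:.0f}px, circle needs to be ~{scale:.0f}x larger")
```

This gives a factor of about 44 - the same ballpark as the "about 40 times" above.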

Let's look at this from the point of view of how many PhD students there might be out there: in the UK there were around 67,000 postgraduate research students in 2012/13 according to HESA statistics (536,440 postgraduate students, of which around 1/8 are on PhD courses rather than postgraduate taught courses). Even if we scale this up by just a factor of 10 to represent the global population of PhD students, this is 2/3 of a million PhD students currently pressing forward in their own unique areas with their own unique pixels. That would require the circle to be nearly 300 times larger - as big as a 25m swimming pool!
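And the final step, turning student numbers into swimming pools. The HESA figure is from the statistics above; the factor-of-10 global scale-up and the ~0.1mm pixel pitch of a high-resolution display are loose assumptions of mine:

```python
import math

uk_postgrads = 536_440     # HESA postgraduate students, 2012/13
uk_phd = uk_postgrads / 8  # ~1/8 on research rather than taught courses
world_phd = uk_phd * 10    # crude factor-of-10 global scale-up

diameter_px = 720                    # the diagram's knowledge circle
boundary_px = math.pi * diameter_px
scale = world_phd / boundary_px      # one boundary pixel per PhD student

pixel_mm = 0.1                       # assumed pitch of a high-resolution display
new_diameter_m = scale * diameter_px * pixel_mm / 1000
print(f"{world_phd:,.0f} students -> ~{scale:.0f}x larger, ~{new_diameter_m:.0f}m across")
```

Run as-is this gives a factor of just under 300 and a circle over 20m across; the exact multiple depends on the assumed student numbers and pixel pitch, but either way we are in swimming-pool territory.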

Monday 29 April 2013

Web Science - Industry Partnerships

The Web Science Doctoral Training Centre has always aimed to have a close relationship with industry, as part of its mission to provide a leadership role in the UK's Digital Economy. With 63 members of our growing Industry Forum meeting regularly to discuss the future of the Web, we offer many opportunities to transfer knowledge and expertise between academia and our partners.


Switch Concepts, a trail-blazing Hampshire-based IT firm that specialises in digital advertising, claims that its partnership with the DTC contributes to its success. For the past year we have been working closely with Switch to share understanding of the world of online advertising, swapping knowledge of our world-leading digital research. Switch, and our other DTC partners, work with teams of our students on the challenges and opportunities that the Web raises for their businesses.

Business Solent, which unites business leaders to drive economic prosperity, partnered with the DTC to host an exclusive Directors’ Forum Dinner to share the developments in the study of Web Science with business leaders in the region –  going beyond the theory with an overview of the commercial opportunities being created by leading Solent businesses. At the event we shared the outcomes of the most recent set of Industry Forum Research discussions in Big Data, Social Businesses, Cybercrime, E-health and the Open Data Economy.

We look forward to increasing the size and scope of our Industry Forum, and would be delighted to hear from any businesses who would benefit from the latest research insights into the Web and Digital Economy. 

Sunday 24 February 2013

Everyone is Deeply Intertwingled

We are only a small way through creating the Web - this network of interlinked documents (Web 1.0), interlinked personal activities (Web 2.0) and interlinked facts (Web 3.0) has only just started to impact the way we run our lives. And the way that others run our lives.

The Internet, a runaway academic project designed to create a military communications system that could not be destroyed, became the perfect substructure for the Web, a knowledge sharing system that was developed in an underground nuclear research bunker and escaped through research institutions and the labs of the computer industry to embed itself in commerce, government and every aspect of society.

Both of these endeavours were shaped by the concerns of academia, or rather by the unconcerns of academia - a world in which trade and commerce were largely irrelevant, and where the notion of hi-tech criminality had yet to be invented. And so we now find Web Science grappling with some really fundamental issues (state-sponsored snooping, gossip/defamation, authority, jurisdiction, property, the nature of truth, the extent of human rights) to deal with the limitations of the Web's original design, while Web Technologists continue to innovate their way through the canon of science fiction literature (portable communicators, wrist-watch TVs, wall-sized video screens, voice recognition, computer glasses, autonomous drones and self-driving cars).

In our discussions, we tend to retreat to the fundamentals of the Web / Internet protocols - the Web is just a transfer of documentary information. Like borrowing a library book, we want those transactions to be private, unobserved and unrestricted, while still being valuable. But the Web was never just about transferring information (something that computers do) - it was from the very start about consuming the assets of the information rich, and then the services of the business savvy. And after twenty years of hard work we have created a very complex, highly interwoven network of people and events and activities and knowledge.

We are richly online individuals, with interconnected histories, making complex asynchronous engagements with other individuals, corporations and services. Our online personas are deeply informed by the needs, desires and happenings of our offline lives, so that the online recorded history of our avatars corresponds to the recorded offline history of the transactions, activities and events in which we were engaged.

We are deeply intertwingled, multi-persona individuals, and it seems remiss that our model of a Web presence is still no more than that of an invisible and inscrutable chess-player shuffling pieces (documents/bytes) across a huge board.

Once again, academia is leading in this social change. The requirements for Research Assessment (imposed by governments anxious to demonstrate value) mean that even Internet researchers need to be able to calculate the impact of their every activity, and show the evidence (often virtual) of its origin and effect. The Internet becomes our real history, and the actors whose names we see on the Web must be carefully and uniquely identified. So much for privacy, anonymity or simple abstinence.

Friday 3 August 2012

Queen Anne, Copyright and Illegal Numbers

I have spent a lot of time as a web scientist and open access advocate considering the role of copyright in the web, where by "considering" I mean "banging my head against a wall" and by "role" I mean "trump card for the intellectually bankrupt but politically powerful".

Three hundred years ago, the English crown invented a new kind of property which it declared to exist in the written expression of ideas, and which it granted directly to the individuals who created those written texts. Copyright declares that anyone who creates a new piece of written text becomes the only person who has the right to make a copy of that text. Publishing a text or CD used to require that the creator made (printed or burned) as many copies as were necessary and readers simply acquired a pre-copied item. In the Web, the creator makes available a single item (on their web server's disk) and then to read the page the readers have to make their own copies using their own computers' browsers.

The Web has created a new kind of user experience and raised all kinds of expectations that are not compatible with the idea of copyright; different people would like to reconcile these differences by either changing the Web or changing the notion of copyright.

Copyright controls the expression of ideas; it stops one person stealing and using another person's work, but it does not stop them using that person's ideas. However, now that these ideas are expressed digitally, there is more at stake than a piece of text.

Copyright doesn't just apply to novels or articles; it has been argued that it applies to very much shorter forms of communication such as emails and tweets. All of these things are stored as documents or files on a computer; they are sequences of bytes that have to be interpreted according to some coding scheme (e.g. ASCII, Unicode, plain text, HTML or Word). 

Imagine the tweet "I am a pink hotdog". It is brief, but as far as Google is concerned it is original and has never been written before in any other document, so asserting my own copyright would seem to be justified. When I save it in a file on disk, I can use the od command to show it either as a string of characters or as a very long number (in hexadecimal format). I can then translate that hexadecimal number into the more usual decimal format.

text:        I   a m   a   p i n k   h o t d o g
hexadecimal: 4920616d20612070696e6b20686f74646f67
decimal:     6,370,215,410,492,649,031,668,884,346,259,210,575,834,983
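The same translation can be reproduced without od in a few lines of Python (a sketch; any language with big-integer arithmetic will do):

```python
# Encode the tweet as ASCII bytes, view them as hex, then as one big integer.
text = "I am a pink hotdog"
as_hex = text.encode("ascii").hex()
as_int = int(as_hex, 16)

print(as_hex)         # 4920616d20612070696e6b20686f74646f67
print(f"{as_int:,}")  # the 43-digit decimal in the table above

# And back again: the "illegal number" decodes to the original tweet.
assert as_int.to_bytes(len(text), "big").decode("ascii") == text
```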


As far as the computer is concerned, the very long number and the stream of characters are equivalent data. If the law says that no-one else is allowed to reproduce these characters, it is the same as saying that no-one else is allowed to reproduce this number. If these words are not allowed to be stored in a computer system as a document, then this number is not allowed to be stored in a computer system as the output of a calculation. In other words, copyright not only assigns property in the expression of ideas, it also assigns property in the use of numbers. Effectively, once you write down an (admittedly large) number, it becomes illegal for anyone else to use it until 70 years after you die.


I doubt that Queen Anne foresaw this outcome when she created legislation "for the encouragement of learning". I'll leave it as an exercise for my students to work out how many illegal numbers there are, and how much of which range of numbers they infest. See Wikipedia for other examples of illegal numbers.

Sunday 25 March 2012

The Sky is Falling (again)

Matt Honan's recent article The Case Against Google tells us that Google is Evil, people are abandoning the Open Web in favour of Closed Ecosystems and it's impossible to search the web without surrendering enough privacy to make a gynaecologist blush. So far, so 2012.


Here's the über-challenge that Google has set itself in delivering relevant search results:
You are about to leave San Francisco to drive to Lake Tahoe for a weekend of skiing, so you fire up your Android handset and ask it "what's the best restaurant between here and Lake Tahoe?" It's an incredibly complex and subjective query. But Google wants to be able to answer it anyway. (This was an actual example given to me by Google.) To provide one, it needs to know things about you. A lot of things. A staggering number of things.
To start with, it needs to know where you are. Then there is the question of your route—are you taking 80 up to the north side of the lake, or will you take 50 and the southern route? It needs to know what you like. So it will look to the restaurants you've frequented in the past and what you've thought of them. It may want to know who is in the car with you—your vegan roommates?—and see their dining and review history as well. It would be helpful to see what kind of restaurants you've sought out before. It may look at your Web browsing habits to see what kind of sites you frequent. It wants to know which places your wider circle of friends have recommended. But of course, similar tastes may not mean similar budgets, so it could need to take a look at your spending history. It may look to the types of instructional cooking videos you've viewed or the recipes found in your browsing history.
It wants to look at every possible signal it can find, and deliver a highly relevant answer: You want to eat at Ikeda's in Auburn, California
In looking at the corner into which this search company has painted itself in its attempts to stay relevant, I can't help but compare it with the lengths to which the Sirius Cybernetics Corporation went with its Nutrimatic Drinks Dispenser:

When the 'Drink' button is pressed it makes an instant but highly detailed examination of the subject's taste buds, a spectroscopic analysis of the subject's metabolism, and then sends tiny experimental signals down the neural pathways to the taste centres of the subject's brain to see what is likely to be well received. However, no-one knows quite why it does this because it then invariably delivers a cupful of liquid that is almost, but not quite, entirely unlike tea. Hitchhiker's Guide to the Galaxy, Douglas Adams 
You see, I'm fairly sure that after all that effort that Google puts into finding me a restaurant, it will end up sending me to a so-so kind of establishment - the kind of restaurant I 'normally' end up in on business trips. The kind of restaurant that averages out the likes and dislikes of all my companions. And after all that invasive knowledge elicitation, I'll end up somewhere which has put in more effort on search engine optimisation than culinary optimisation.

What I really want from Google in these circumstances is an answer to the question "are there any Michelin starred restaurants between here and there?" It's a good old fashioned objective question, with a written down answer. One that can be looked up on the Web, not divined from my cerebellum.


Thursday 5 January 2012

Open vs Closed? An Explosion of Generativity

In a previous posting I have mentioned Jonathan Zittrain's book The Future of the Internet and How to Stop It, in which he argues that the Internet needs to be open as a "generative system" to allow unanticipated change to emerge through unfiltered contribution from broad and varied audiences. His argument is that Internet (i.e. Web) innovation needs to be open to all comers, in the same way that PC development has been unrestricted and open. No-one controls what you can do with a PC, what programs you should be able to write or run, or what information you should be allowed to process. The very processes that could control the Internet to make it a "safer" place (with regard to kiddie porn, piracy, cyber bullying, identity theft etc.) will also tend to restrict technological development and make the future of the Internet a much poorer place - both in terms of the user experience and in terms of the future economic activity that could be developed.

In developing this argument, Zittrain and others have tended to contrast PC development (open to all) with iPhone development (closed and controlled by Apple). The first edition of his book was written before the iPhone API and the remarkably successful App Store(TM) were released. Subsequent editions of the book have finessed the argument, but by and large people still believe that a manufacturer-controlled smartphone, with software development policed by the manufacturer, is a bad thing for innovation and hence for generativity.

Historical PC/Windows Package vs iOS Package Development per year
Is this "received wisdom" supported by the evidence? The chart to the right compares the annual contribution of software developers on the Windows PC platform available from download.com (a major software portal since the early days of the Web) and iOS iPhone/iPod/iPad platform available from Apple's app store, and apparently shows an order of magnitude more development being supported by the closed environment.

Now PC software is available from thousands of sources, not just this single aggregator, and so the number of Windows packages here is clearly underestimated, while the iOS figure is accurate (by the nature of a closed, single manufacturer environment). Still, it is not the number of downloads which is important, and which scales with the number of distribution channels, but the number of software packages that have been created. Since download.com is such a significant source of PC software, we might expect that it would provide a not-insignificant fraction of software that is available to the general public.

So, given the arguments made about innovation and open platforms, it is interesting that there is such a difference between these figures for the two platforms in favour of the closed environment. That might suggest the amount of innovation stimulated by the iPhone is significant in comparison to the PC, that the development of the next generation of Web environments could be triggered by an iPhone-like ecosystem and not throttled by it, and that the future of the Internet is not so alarmingly threatened as some have thought.

This naive investigation and its results are an excuse for further investigation into how we theorise and predict the emergence of future web developments. The Web, after all, is not defined by the particular experience of a browser on a computer (desktop, laptop, netbook or smartphone), but by the interaction of informational and social agents.


Monday 19 September 2011

Research Ethics and the Web's Private & Public Spaces

In a paper (Six Provocations for Big Data, section 5) related to her forthcoming keynote at the Oxford Internet Institute's "Decade in Internet Time" conference, danah boyd talks about "being in public" on the web, bringing metaphors about one's own public presence in a physical environment to bear on the accessibility of digital writings on a computer server. While we can all intuit what is meant by this (the conscious felt experience of being engaged with the web), are metaphors such as "being in public" helpful when thinking about ethical issues raised by the Web?

"Being in public" means that one's presence and actions can be seen/heard by other people, where we have no choice about who those "other people" are, nor control over what they do. Of course on the Web "we ourselves" are not in public, but the records of our words (or audio, video, photographs, artwork) are. Or may be; sites may hide their content behind user accounts and secure browsing protocols. We may debate about our social networking activities being public, but we are rarely tempted to debate about the public nature of our bank account transactions.

What is the difference between the following:

Being in a public space vs having one's statements made public
Making a statement in a public space vs making a public statement
Being in a public space vs being on a global stage
Being in a public space vs being in a particular space for a particular purpose that other people could observe now or in the future
Being in a public space vs being made aware of other people's scrutiny
Making a statement in a public space vs having one's statements publicly analysed & criticised by observers

"Being in public" on the Web means that one's activities, memberships, engagements, writings, videos can be seen/heard by other people, where we have no choice about who those "other people" are, nor control over what they do. "Being in public" on the Web is useful on occasions when we want a global audience, and also on occasions when we are pontificating to the aether.

But "being in public on the Web" is also useful when we are expecting to speak to only a few individuals because for practical reasons it would be hugely inconvenient to create a specific channel for those people only. This is how we are "in public" normally: in parks, on the street, in coffee shops. One might refer to this as an expectation of "privacy by obscurity" - people could eavesdrop, but why would they bother? And when we are in those situations we are used to social norms that preclude people gathering around and gawping at our discussions. (As we are taught as children "don't stare", "don't be nosy", "that's none of your business".)

There are two phenomena that intrude on the unconsciously public: Google and the wily researcher. Search engines exist to expose things and make them findable (more effectively public). However, those inhabiting the "self-conscious public" will often go to great SEO lengths to make sure that their public utterances are prominently positioned. Although not occupying key marketing positions on the top page of a Google search, the unconsciously public may still find that their words are more accessible than they would have liked.

However, acting in an "unconsciously public" fashion does not necessarily imply being completely oblivious to the lack of privacy. Individuals may adjust to the emerging social norms and in doing so create new norms and establish new boundaries of behaviour. You may consider it acceptable for like-minded individuals (friendly observers, benign lurkers) to search for your online presence on a discussion forum; you may be unhappy about work colleagues, reporters, government agents and university researchers actively examining your opinions.

So perhaps it is little wonder that Google reports that there are almost half a million Web pages using the following boilerplate text, threatening sociologists with legal action if they dare make use of their pages:
WARNING Any institutions or individuals using this site or any of its associated sites for studies or projects - You DO NOT have permission to use any of my profile, pictures, or other material posted on this site (including discussion thread posts and blogs) in any form or forum both current and future. If you have or do, it will be considered a violation of my privacy and will be subject to legal ramifications. It is recommended that other members post a notice similar to this or you may copy and paste this one into your profile
From a technical and legal point of view, I'm not convinced that this carries any weight (although I'm looking into it), but it certainly telegraphs a preference and intent. On the one hand we should feel a very strong pull towards respecting and honouring an individual's wishes, on the other hand we have clear social and legal boundaries precisely to curb our individual requests.

Should web mining personal information stop? Should ethics committees come down hard on this practice? Is it right to broaden the principle of "informed consent" to the Web, and to severely prune the availability of "big data"?  I don't know, but I do know that my engineer's default position of "do what you want with public web pages" has been severely challenged.