With the rise of individual online activity in chat rooms, on social networking platforms and on micro-blogging services, new data sources for social science research have become available in large quantities. The change in sample sizes from 100 participants to 100,000 is a dramatic challenge in numerous ways: technically, politically, but also ethically.
In this emerging context, because of the virtual and remote nature of the data collection, research guidelines have to be reworked to address the implications that arise and to establish fair, responsible and ethical management of such large quantities of information, much of which is potentially personal information about individuals.
Issues and concerns surrounding privacy and ethics have recently been raised around the data mining projects developed here at CASA, most prominently at the CRESC conference in Oxford, where they sparked a heated but very interesting debate.
The question is to what extent the users of online services agree to ‘their data’ being used for further research or analysis: potentially useful information which they often unknowingly generate while online. Both Survey Mapper and the New City Landscape (NCL) maps, which are generated from tweets sent with geolocation included, work with data collected remotely through the internet without direct consent from the 'user'.
With the NCL maps, for example, we are working with around 150,000 Twitter messages sent by about 45,000 individual Twitter users. The data is collected through the public Twitter API, which Twitter provides as an additional service. Through the API, Twitter packages the outgoing stream of tweets for third-party developers of Twitter applications. The data served through the API is believed to be exactly the same as that used for the main Twitter web page.
The issue in the case of Twitter, and likely with other similar services, lies in the perception of private and public. With Twitter the user can set up a personal profile and start sending 140-character messages. These messages are generally undirected statements that are sent out to the world using the Twitter platform. To get other people's messages delivered to one's personal Twitter page, one has to start 'following' other users. Likewise, for other users to have one's messages delivered to them, they have to start 'following' in turn. Each user can manage the list of followers manually.
However, while this setting creates a sense of a closed community and could, and probably does, lead one to believe that the information sent using this platform can only be read and accessed by the circle of followers (e.g. friends), this is actually not the case. Every Twitter message sent, unless deliberately sent as a private message, is public.
For example, last week the first person was sent to court (see the Guardian) because he tweeted a joke to his friend: 'To blow the Robin Hood Airport sky high'. The Twitter user was planning to fly out, but the airport was closed because of snow. How this message got him into trouble is not quite clear. The news article only states that a member of the airport staff found the message by chance using his home computer. Was he a follower of the tweeter, or was he searching for the terms 'blow' and 'Robin Hood Airport'? Either way, this sounds a bit set up. But try the search: now, after the media attention, it will bring up loads of tweets containing the terms. So this member of staff will be very busy reading all the messages, and any investigation unit filtering tweets will face some difficulties.
This is not, however, a case unique to Twitter. The issue arises in a number of fields related to user-generated data, ranging from Google to Facebook, from Microsoft to Apple and from Oyster card to Nectar card. Information is the basic material this bright new world is built of, and the more one leverages it the bigger the value (see for instance ). The data generated by users on the web is constantly being analysed and poured back into the ocean of data. To some extent this is a fundamental part of the whole web world.
How does Amazon know that I was searching for a cat flap the other week, even though I was not searching for it on Amazon? Or why does my webmail show ads for online degrees in the sidebar while I am reading an email sent from a university account?
The information the user generates on the internet leaves traces with every click and beyond. Search histories can be accessed and analysed, and snippets can be located in the past. However, this phenomenon is not limited to the past. It travels beside the user in the present, even arriving beforehand at the shores of potential service providers, almost like a rippling wave in the ocean of the web.
As described above using the example of Twitter, the issue with privacy is that it is perceived in one way and handled in another. A comparison with public space could make for an interesting case. More and more public spaces in the city are merging into corporate spaces. Shopping malls are starting to enter the domain of the space perceived as 'public'. Even though the mall is privately owned and someone is making a lot of money from you being there, it successfully camouflages itself as a public space where people happily spend their money since it is so 'convenient'. They are provided with everything they demand, including a selection of their peers through the target group of the mall, as well as a mix of additional factors such as social group, economic and location-based aspects. In this 'easy' setting one does not have to deal with the implications and sharing aspects of real public space, where conflicts of interest have to be resolved between the parties and cannot be settled by the house rules in the form of a private security guard.
It could be argued that web services are quite similar to what is described above. We are not surfing the 'public' internet as such. Even though most websites are free to use, they are actually private sites owned by someone and often offering a service. And of course the service provider will want to make some money, if not directly from the user, then probably through a third party that pays for something in return, mostly the directing of users to certain information.
In this sense the user is provided with a free service in exchange for letting himself/herself be directed to potentially interesting information and adverts.
In economic terms this is a pretty good offer and should be a win-win situation for everyone involved. But is it?
Image by Matt McKeon, via imgur / The Evolution of Privacy on Facebook: changes in default profile settings over time. The image is animated and steps through the years automatically; you have to be patient with this one.
Twitter also has a privacy page where the company attempts to explain its privacy guidelines and considerations. It states: "We collect and use your information to provide our Services and improve them over time". In this document Twitter clearly states that the concept of the service is to distribute messages publicly. It further states that the default setting is public, with the option to make it more private. This is not true, however, for location information: in this case the user has to activate the feature in order to include this information. In this sense every user whose location information is mapped on the NCL maps has chosen to share this information with the world. Nevertheless, there is an option to opt out and delete the location information of all messages sent in the past: "You may delete all location information from your past tweets. This may take up to 30 minutes".
Twitter makes clear, if not perfectly, what the implications of using the service are: "What you say on Twitter may be viewed all around the world instantly".
Image by Diaspora / The project logo as a dandelion, symbolising the distribution of seeds, the image used for the basic concept of the new social network.
Sailing on the wave of complaints over the treatment of privacy on Facebook and other social networking sites, a bottom-up project has risen: DIASPORA*, a self-proclaimed perfectly personal social networking platform developed by four guys, one of whom, funnily enough, goes by the name 'Max Salzberg'. It all reads like a spoof, as published in the NYT in May this year. But the project took off, with donations of over $10,000 within 12 days and some $24,000 within 20 days. By now they are fully funded with over $200,000 via KickStarter. That was back in May 2010, and the developer code was published on September 15, 2010. It looks cool and maybe it will bring the change, but this will probably be decided by features other than the privacy issue. Since the big hype the discussion has calmed down dramatically, but it was definitely a good kickstart for the Diaspora* project and it shows how much people care about their privacy.
The data of interest to a whole range of commercial, academic or political bodies is not confined to the actual message or information sent. Each account or profile contains a lot of additional information, such as name, age, gender, address, contact details, interests, birthday and shoe size, all of which can be extremely valuable, and not just for marketing purposes. In addition, the really big thing is the connections and networks that can be constructed from the data: who is contacting whom, when, how often and where. This is the real change brought by this personal information, known in internet law and policy circles as Personally Identifiable Information (PII). For the first time we can actually observe large-scale social interaction in dramatic detail in real time.
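As a minimal sketch of how such a contact network can be derived from nothing more than message records, the following Python snippet counts who contacts whom and how often. The record layout (sender, recipient, hour of day), the user names and the helper functions `contact_counts` and `neighbours` are all hypothetical and serve only as illustration; they do not describe how any particular service stores or analyses its data.

```python
from collections import Counter

# Hypothetical records of directed messages: (sender, recipient, hour of day).
# The names and values are made up for illustration.
messages = [
    ("alice", "bob", 9),
    ("alice", "bob", 13),
    ("bob", "carol", 13),
    ("carol", "alice", 22),
    ("alice", "bob", 23),
]


def contact_counts(records):
    """Count how often each sender contacts each recipient (directed edges)."""
    return Counter((sender, recipient) for sender, recipient, _hour in records)


def neighbours(records, user):
    """Return everyone the given user has sent a message to or received one from."""
    found = set()
    for sender, recipient, _hour in records:
        if sender == user:
            found.add(recipient)
        elif recipient == user:
            found.add(sender)
    return found


edges = contact_counts(messages)
for (sender, recipient), count in sorted(edges.items()):
    print(f"{sender} -> {recipient}: {count} message(s)")
```

Even in this toy form, a handful of records is enough to show who the most frequent contacts are, without reading a single message.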
This becomes even more of an issue now that almost all services integrate actual location data, either from the built-in GPS module when used on a smartphone or, for example, from IP or Wi-Fi access point data. Service providers know not only with whom one is connected but also where one physically is.
The biggest discussion around this was stirred up by Google at the launch of its Google Latitude service, discussed HERE earlier, and the Google Privacy Statement can be found HERE. The service would offer the option to distribute one's location to a list of friends who could follow one's movement in real time.
Concern arose over the possibility that a jealous husband could log in to the service, activate it on his wife's mobile without her knowledge and get her position delivered onto his screen in real time. This would actually be possible, but it is a ridiculous scenario. There are numerous providers on the internet that specialise in exactly this sort of service. However, the Google service is one for the masses and freely accessible to everyone with internet access. Google reacted by sending a scheduled reminder email every week once the service is activated.
The implication of detailed knowledge of private information, and especially location information, is that third parties become able to identify individuals, and this information can potentially be used to harm the individual.
This issue was brought to public attention by the online platform 'pleaserobme.com', which displayed information collected from social networking sites about people who had stated that they were not at home, implying that this would be the opportunity to burgle their house. This was made possible by the location information embedded in the messages.
One major factor in this discussion is the scale of resolution. Having the information is not the same as being able to use it. It is a question of accessing it, or making it available. There might be a degree of anonymity in the fact that the data pool is so vast that the individual's personal information is no longer visible. This is decisive when the actual output produced from the private information is a visualisation.
For example, the NCL maps are based on individual Twitter messages, but because the data has been aggregated and the resulting visualisation is a density surface generated from the tweets, the individual tweet no longer features in the output. And even if, for example, we show the location of an individual message as in the LondonTweet clip, the resolution of the clip in pixels is so low that it becomes nearly impossible to determine a definite location. The blurred pixels display more of a potential area. In addition, we are also dealing with GPS inaccuracy of between 5 and 20, maybe even 100, metres in a dense urban environment, which makes it very hard to pinpoint the exact location of an individual. Combine this with a population density such as we have here in London and identifying an individual becomes practically impossible.
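The aggregation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used to produce the NCL maps: the 500 m cell size, the coordinates (made-up positions in metres on a flat grid) and the helper names `to_cell` and `density_surface` are all hypothetical.

```python
from collections import Counter

CELL_SIZE_M = 500  # hypothetical grid resolution in metres


def to_cell(x_m, y_m, cell_size=CELL_SIZE_M):
    """Map a position in metres to the index of the grid cell that contains it."""
    return (int(x_m // cell_size), int(y_m // cell_size))


def density_surface(points, cell_size=CELL_SIZE_M):
    """Count messages per grid cell; the result holds counts only, no coordinates or user names."""
    return Counter(to_cell(x, y, cell_size) for x, y in points)


# Made-up projected coordinates in metres (not real tweet data).
tweets = [(120.0, 80.0), (130.5, 95.2), (480.0, 10.0), (620.0, 40.0), (1499.0, 999.0)]

surface = density_surface(tweets)
for cell, count in sorted(surface.items()):
    print(cell, count)
```

The cell size is the parameter that matters here: with cells far larger than the 5 to 20 metre GPS error, a position shifted by that error usually lands in the same cell, and the coarser the grid, the less any single message stands out in the resulting surface.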
Images by urbanTick / A zoom (part 1) into an animation of tweets in Google Earth, demonstrating how tricky it is to read an actual location from this, even more so if one takes the GPS accuracy into account.
In conclusion, new guidelines clearly have to be developed for the changing nature of data availability in the digital age. Both commercial companies and academic researchers have to take extra care in handling and using digital personal data. They need to be aware that just because data is accessible does not mean it can be used. However, there also has to be a change of mindset on the user side. Users cannot just make use of the services provided to them without contributing anything. If a service is based on public sharing and they want to use it, they have to buy in to this information economy. The same applies to good search results: if people want the best possible service to quickly find something relevant to them in the ocean of data, they might have to provide a little bit of information about themselves and what they are looking for. Economies, information economies no less than traditional ones, operate upon exchange.
As discussed above in relation to physical public space, people recently seem very willing to accept corporate provision. The discussion probably has to start there, with the question of how dependent we want to be on these dominant private service providers, both virtual and real, how much of our personal information in this context is actually still private, and how much we merely want to make private.
However, these aspects and links only touch on the topic, and there are many more that need to be discussed in detail. Please feel free to comment and/or contribute.
Dutton, William H. and Paul W. Jeffreys, editors. 2010. World Wide Research. Cambridge, MA: MIT Press.
Rogers, Richard. 2004. Information Politics on the Web. Cambridge, MA: MIT Press.