We sometimes underestimate the impact of digital data on archaeology because we have become so accustomed to the capture, processing, and analysis of data using our digital tools. Of course, archaeology is by no means alone in this respect. For example, Sandra Rendgren, who writes about data visualisation, infographics and interactive media, recently pointed to the creation of a new genre of journalism that has arisen from the availability of digital data and the means to analyse them (2018a). But this growth in reliance on digital data should lead to a re-consideration of what we actually mean by data. Indeed, Sandra Rendgren suggests that the term ‘data’ can be likened to a transparent fluid – “always used but never much reflected upon” – because of its ubiquity and apparent lack of ambiguity (2018b).
Although there has been a dramatic growth in the development of autonomous vehicles and consequent competition between different companies and different methodologies, and despite the complexities of the task, the number of incidents remains remarkably small though no less tragic where the death of the occupants or other road users is involved. Of course, at present autonomous cars are not literally autonomous in the sense that a human agent is still required to be available to intervene, and accidents involving such vehicles are usually a consequence of the failure of the human component of the equation not reacting as they should. A recent fatal accident involving a Tesla Model X (e.g. Hruska 2018) has resulted in some push-back by Tesla who have sought to emphasise that the blame lies with the deceased driver rather than with the technology. One of the company’s key concerns in this instance appears to be the defence of the functionality of their Autopilot system, and in relation to this, a rather startling comment on the Tesla blog recently stood out:
No one knows about the accidents that didn’t happen, only the ones that did. The consequences of the public not using Autopilot, because of an inaccurate belief that it is less safe, would be extremely severe. (Tesla 2018).
The US Department of Immigration and Customs Enforcement (ICE) is apparently seeking to employ ‘big data’ methods for automating their assessment of visa applications in pursuit of meeting Trump’s calls for ‘extreme vetting’ (e.g. Joseph 2017, Joseph and Lipp 2017, and see also). A crucial problem with the proposals has been flagged in a letter to the Acting Secretary of Homeland Security by a group of scientists, engineers and others with experience in machine learning, data mining etc.. Specifically, they point to the problem that algorithms developed to detect ‘persons of interest’ could arbitrarily select groups while at the same time appearing to be objective. We’ve already seen this stereotyping and discrimination being embedded in other applications, inadvertently for the most part, and the risk is the same in this case. The reason provided in the letter is simple:
“Inevitably, because these characteristics are difficult (if not impossible) to define and measure, any algorithm will depend on ‘proxies’ that are more easily observed and may bear little or no relationship to the characteristics of interest” (Abelson et al 2017)
I’ve borrowed the idea of ‘deep-fried data’ from the title of a presentation by Maciej Cegłowski to the Collections as Data conference at the Library of Congress last month. As an archaeologist living and working in Scotland for 26 years, the idea of deep-fried data spoke to me, not least of course because of Scotland’s culinary reputation for deep-frying anything and everything. Deep-fried Mars bars, deep-fried Crème eggs, deep-fried butter balls in Irn Bru batter, deep-fried pizza, deep-fried steak pies, and so it goes on (see some more not entirely serious examples).
Hardened arteries aside, what does deep-fried data mean, and how is this relevant to the archaeological situation? In fact, you don’t have to look too hard to see that cooking is often used as a metaphor for our relationship with and use of data.
Big Data is (are?) old hat … Big Data dropped off Gartner’s Emerging Technologies Hype Cycle altogether in 2015, having slipped into the ‘Trough of Disillusionment’ in 2014 (Gartner Inc. 2014, 2015a). The reason given for this was simply that it had evolved and had become the new normal – the high-volume, high-velocity, high-variety types of information that classically defined ‘big data’ were becoming embedded in a range of different practices (e.g. Heudecker 2015).
At the same time, some of the assumptions behind Big Data were being questioned. It was no longer quite so straightforward to claim that ‘big data’ could overcome ‘small data’ by throwing computer power at a problem, or that quantity outweighed quality such that the large size of a dataset offset any problems of errors and inaccuracies in the data (e.g. Mayer-Schönberger and Cukier 2013, 33), or that these data could be analysed in the absence of any hypotheses (Anderson 2008).
For instance, boyd and Crawford had highlighted the mythical status of ‘big data’; in particular that it somehow provided a higher order of intelligence that could create insights that were otherwise impossible, and assigned them an aura of truth, objectivity and accuracy (2012, 663). Others followed suit. For example, McFarland and McFarland (2015) have recently shown how most Big Data analyses give rise to “precisely inaccurate” results simply because the sample size is so large that they give rise to statistically highly significant results (and hence the debacle over Google Flu Trends – for example, Lazer and Kennedy 2015). Similarly, Pechenick et al (2015) showed how, counter-intuitively, results from Google’s Books Corpus could easily be distorted by a single prolific author, or by the fact that there was a marked increase in scientific articles included in the corpus after the 1960s. Indeed, Peter Sondergaard, a senior vice president at Gartner and global head of Research, underlined that data (big or otherwise) are inherently dumb without algorithms to work on them (Gartner Inc. 2015b). In this regard, one might claim Big Data have been superseded by Big Algorithms in many respects.
It was only a matter of time before a ‘big data’ company latched onto archaeology for commercial purposes. Reported in a New Scientist article last week (with an unfortunate focus on ‘treasure’), a UK data analytics start-up called Democrata is incorporating archaeological data into a system to allow engineering and construction firms to predict the likelihood of encountering archaeological remains. This, of course, is what local authority archaeologists do, along with environmental impact assessments undertaken by commercial archaeology units. But this isn’t (yet) an argument about a potential threat to archaeological jobs.
As the end of 2014 approaches, Facebook has unleashed its new “Year in Review” app, purporting to show the highlights of your year. In my case, it did little other than demonstrate a more or less complete lack of Facebook activity on my part other than some conference photos a colleague had posted to my wall; in Eric Meyer’s case, it presented him with a picture of his daughter who had died earlier in the year. In a thoughtful and thought-provoking piece, he describes this as ‘Inadvertent Algorithmic Cruelty’: it wasn’t deliberate on the part of Facebook (who have now apologised), and for many people it worked well as evidenced by the numbers who opted to include it on their timelines, but it lacked an opt-in facility and there was an absence of what Meyer calls ‘empathetic design’. Om Malik picks up on this, pointing to the way Facebook now has an ‘Empathy Team’ apparently intended to make designers understand what it is like to be a user (sorry, a person), although Facebook’s ability to highlight what people see as important is driven by crude data such as the number of ‘likes’ and comments without any understanding of the underlying meanings which are present.
One of the features of the availability of increasing amounts of archaeological data online is that it frequently arrives without an accompanying awareness of context. Far from being a problem, this is often seen as an advantage in relation to ‘big data’ – indeed, Chris Anderson has claimed that context can be established later once statistical algorithms have found correlations in large datasets that might not otherwise be revealed.