We’re becoming increasingly accustomed to talk of Big Data in archaeology and at the same time beginning to see the resurgence of Artificial Intelligence in the shape of machine learning. And we’ve spent the last 20 years or so assembling mountains of data in digital repositories which are becoming big data resources for mining in the pursuit of machine learning training data. At the same time we are increasingly aware of the restrictions that those same repositories impose upon us – the use of pre-cooked ‘what/where/when’ queries, the need to (re)structure data in order to integrate different data sources and suppliers, and their largely siloed nature which limits cross-repository connections, for example. More generally, we are accustomed to the need to organise our data in specific ways in order to fit the structures imposed by database management systems, or indeed, to fit our data into the structures predefined by archaeological recording systems, both of which shape subsequent analysis. But what if it doesn’t need to be this way?
To what extent does our use of digital devices to capture and process archaeological data affect our perceptions of what was there? Mark Altaweel (2018) has recently asked a similar question in relation to GPS technologies – how do these affect our understanding and experience of place? He suggests that they diminish our sense of place and the experiences we might otherwise have as we navigate according to their recommendations. Certainly, satnavs are notorious for taking our navigational cognitive load upon themselves and consequently leading drivers who are insufficiently aware of their surroundings into undesirable, even dangerous situations. We might think that the cognitive capacity freed up by such devices ought to be divertible into more useful, more extensive areas – we literally have the space to think about bigger and deeper things as a consequence of their application. This kind of argument frequently arises in relation to the value of automation, and can be seen, for instance, in the discussions surrounding the use of structure-from-motion photogrammetric recording on archaeological excavations. But is this supposed release of cognitive space an unalloyed good? Or is this a case of the technologies distancing us from the physicality of the archaeological material and space in front of us?
We sometimes underestimate the impact of digital data on archaeology because we have become so accustomed to capturing, processing, and analysing data with our digital tools. Of course, archaeology is by no means alone in this respect. For example, Sandra Rendgren, who writes about data visualisation, infographics and interactive media, recently pointed to the creation of a new genre of journalism that has arisen from the availability of digital data and the means to analyse them (2018a). But this growing reliance on digital data should prompt a reconsideration of what we actually mean by data. Indeed, Rendgren suggests that the term ‘data’ can be likened to a transparent fluid – “always used but never much reflected upon” – because of its ubiquity and apparent lack of ambiguity (2018b).
Although there has been dramatic growth in the development of autonomous vehicles, and consequent competition between different companies and different methodologies, and despite the complexities of the task, the number of incidents remains remarkably small, though no less tragic where the death of occupants or other road users is involved. Of course, at present autonomous cars are not literally autonomous in the sense that a human agent is still required to be available to intervene, and accidents involving such vehicles are usually a consequence of the human component of the equation failing to react as it should. A recent fatal accident involving a Tesla Model X (e.g. Hruska 2018) has resulted in some push-back from Tesla, who have sought to emphasise that the blame lies with the deceased driver rather than with the technology. One of the company’s key concerns in this instance appears to be the defence of the functionality of their Autopilot system, and in relation to this a rather startling comment on the Tesla blog recently stood out:
No one knows about the accidents that didn’t happen, only the ones that did. The consequences of the public not using Autopilot, because of an inaccurate belief that it is less safe, would be extremely severe. (Tesla 2018).
So here’s a thing. A while ago, I asked whether there was any way to quantify the extent to which archaeologists were citing their reuse of data. I used the Thomson Reuters/Clarivate Analytics Data Citation Index (DCI) as a starting point, but it didn’t go too well … Back then, the DCI indicated that 56 of the 476 data studies derived from the UK’s Archaeology Data Service repository had apparently been cited elsewhere in the Web of Science databases (the figure is currently 58 out of 515). But I also found that the citations themselves were problematic: the citation of the published paper/volume was frequently incomplete or abbreviated, many appeared to be self-citations from within interim or final reports, in some cases the citations preceded the dates of the project being referenced, and in many instances it was possible to demonstrate that the data had been cited (in some form or other) but this had not been captured in the DCI. At that point I concluded that the DCI was of little value at present. So what was going on?
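By way of illustration only, the kinds of checks involved might look something like the following Python sketch. The record fields and example data are entirely invented for the purpose and do not reflect the DCI’s actual export format; only the headline figures at the end come from the numbers quoted above.

```python
# A minimal sketch of the kinds of checks described above, using invented
# citation records – the field names do not reflect the DCI's actual schema.

records = [
    {"data_study": "Collection A", "project_year": 2009,
     "citing_year": 2007, "citing_authors": {"Smith"}, "data_creators": {"Smith"}},
    {"data_study": "Collection B", "project_year": 2012,
     "citing_year": 2015, "citing_authors": {"Jones"}, "data_creators": {"Brown"}},
]

def flag_problems(rec):
    """Flag the problems noted in the text: self-citation and impossible dates."""
    problems = []
    # Self-citation: the citing authors overlap with the data creators.
    if rec["citing_authors"] & rec["data_creators"]:
        problems.append("possible self-citation")
    # The citation apparently predates the project it references.
    if rec["citing_year"] < rec["project_year"]:
        problems.append("citation predates the project")
    return problems

for rec in records:
    print(rec["data_study"], "->", flag_problems(rec) or ["no obvious problems"])

# Headline coverage of the kind quoted above: 58 of 515 ADS data studies cited.
print(f"coverage: {58 / 515:.1%}")
```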
I’ve commented here and here about the question of data reuse (or more accurately, the lack of it) and the implications for archaeological digital repositories. It’s frequently argued that the key incentive for making data available for reuse is providing credit through citation. So how’s that going? I’ve not seen any attempt to actually quantify this, so out of curiosity I thought I’d have a go.
A logical starting point is the Thomson Reuters Data Citation Index – according to its owners (it’s a licensed rather than a public resource), this indexes the contents of a large number of the world’s leading data repositories, and, on checking, the UK’s Archaeology Data Service (ADS) appears among them. So far so good.
I’ve borrowed the idea of ‘deep-fried data’ from the title of a presentation by Maciej Cegłowski to the Collections as Data conference at the Library of Congress last month. As an archaeologist living and working in Scotland for 26 years, the idea of deep-fried data spoke to me, not least of course because of Scotland’s culinary reputation for deep-frying anything and everything. Deep-fried Mars bars, deep-fried Crème eggs, deep-fried butter balls in Irn Bru batter, deep-fried pizza, deep-fried steak pies, and so it goes on (see some more not entirely serious examples).
Hardened arteries aside, what does deep-fried data mean, and how is this relevant to the archaeological situation? In fact, you don’t have to look too hard to see that cooking is often used as a metaphor for our relationship with and use of data.
Solutions to the crisis in archaeological archives in an environment of shrinking resources often involve selection and discard of the physical material and an increased reliance on the digital. For instance, several presentations to a recent day conference on Selection, De-selection and Rationalisation organised by the Archaeological Archives Group implicitly or explicitly refer to the effective replacement of physical items with data records, where either deselected items were removed from the archive or else material was never selected for inclusion in the first place because of its perceived ‘low research potential’. Indeed, Historic England are currently tendering for research into what they call the ‘rationalisation’ of museum archaeology collections
“… which ensures that those archives that are transferred to museums contain only material that has value, mainly in the potential to inform future research.” (Historic England 2016, 2)
Historic England anticipate that these procedures may also be applied retrospectively to existing collections. It remains too early to say, but it seems more than likely that a key approach to mitigating such rationalisation will be the use of digital records. In this way, atoms are quite literally converted into bits (to borrow from Nicholas Negroponte) and the digital remains become the sole surrogate for material that, for whatever reason, was not considered worthy of physical preservation. What are the implications of the digital coming to the rescue of the physical archive in this way?
The UK is suddenly waking from the reality distortion field created by politicians on both sides and only now beginning to appreciate the consequences of Brexit – our imminent departure from the European Union. But – without forcing the metaphor – are we operating within some kind of archaeological reality distortion field in relation to digital data?
Undoubtedly one of the big successes of digital archaeology in recent years has been the development of digital data repositories and, correspondingly, increased access to archaeological information. Here in the UK we’ve been fortunate enough to have seen this develop over the past twenty years in the shape of the Archaeology Data Service (ADS), which offers search tools, access to digital back-issues of journals, monograph series and grey literature reports, and downloadable datasets from a variety of field and research projects. In the past, large-scale syntheses took years to complete: for instance, Richard Bradley’s synthesis of British and Irish prehistory took four years’ paid research leave, with three years of research assistant support, in order to travel the country seeking out grey literature reports accumulated over 20 years (Bradley 2006, 10). At this moment there are almost 38,000 such reports in the Archaeology Data Service digital library, with more added each month (a more than five-fold increase since January 2011, for example). The appearance of projects of synthesis such as the Rural Settlement of Roman Britain is starting to provide evidence of the value of access to such online digital resources. And, of course, other countries increasingly have their own equivalents of the ADS (tDAR and OpenContext in the USA, DANS in the Netherlands, and the Hungarian National Museum’s Archaeology Database, for instance).
But all is not as rosy in the archaeological digital data world as it might be.
Big Data is (are?) old hat … Big Data dropped off Gartner’s Emerging Technologies Hype Cycle altogether in 2015, having slipped into the ‘Trough of Disillusionment’ in 2014 (Gartner Inc. 2014, 2015a). The reason given for this was simply that it had evolved and had become the new normal – the high-volume, high-velocity, high-variety types of information that classically defined ‘big data’ were becoming embedded in a range of different practices (e.g. Heudecker 2015).
At the same time, some of the assumptions behind Big Data were being questioned. It was no longer quite so straightforward to claim that ‘big data’ could overcome ‘small data’ by throwing computer power at a problem, or that quantity outweighed quality such that the large size of a dataset offset any problems of errors and inaccuracies in the data (e.g. Mayer-Schönberger and Cukier 2013, 33), or that these data could be analysed in the absence of any hypotheses (Anderson 2008).
For instance, boyd and Crawford had highlighted the mythical status of ‘big data’: in particular, the belief that it somehow provides a higher order of intelligence that can create insights that were otherwise impossible, with an aura of truth, objectivity and accuracy (2012, 663). Others followed suit. For example, McFarland and McFarland (2015) have recently shown how most Big Data analyses give rise to “precisely inaccurate” results simply because the sample size is so large that almost any difference registers as statistically highly significant (hence the debacle over Google Flu Trends – for example, Lazer and Kennedy 2015). Similarly, Pechenick et al. (2015) showed how, counter-intuitively, results from Google’s Books Corpus could easily be distorted by a single prolific author, or by the marked increase in scientific articles included in the corpus after the 1960s. Indeed, Peter Sondergaard, a senior vice president at Gartner and global head of Research, underlined that data (big or otherwise) are inherently dumb without algorithms to work on them (Gartner Inc. 2015b). In this regard, one might claim that Big Data have been superseded by Big Algorithms in many respects.
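To see why very large samples can yield results that are precise but not meaningful, consider a minimal simulation – a Python sketch of my own, not McFarland and McFarland’s method, with the effect size and sample sizes invented purely for illustration. A difference far too small to be of any substantive interest becomes overwhelmingly ‘significant’ once the sample is large enough:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two populations whose means differ by a trivially small amount
# (0.01 standard deviations) – a difference of no practical interest.
effect = 0.01

for n in (1_000, 100_000, 10_000_000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(effect, 1.0, n)
    t_stat, p_value = stats.ttest_ind(a, b)
    print(f"n = {n:>10,}  p = {p_value:.3g}")

# As n grows into the millions, the p-value collapses towards zero even though
# the underlying difference remains negligible: the estimate is precise, but
# 'significance' says nothing about whether the result matters or whether the
# data are biased.
```

The particular numbers are beside the point; it is the shape of the relationship that matters – with millions of records, statistical significance becomes almost automatic, which is precisely why the warnings about objectivity and accuracy cited above carry weight.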