One of the features of the world-wide COVID-19 pandemic over the past eighteen months has been the significant role of data, and of the predictive data modelling built upon it, in governing public policy. At the same time, we have inevitably seen the spread of misinformation (false or inaccurate information that is believed to be true) and disinformation (information that is known to be false but is nevertheless spread deliberately), stimulating an infodemic alongside the pandemic. The ability to distinguish between information that can be trusted and information that can’t is key to managing the pandemic, and failure to do so lies behind many of the surges and waves we have witnessed and experienced. Distinguishing between information and mis/disinformation can be difficult. The problem is all too often fuelled by algorithmic amplification across social media and compounded by the frequent shortage of solid, reliable, comprehensive, and unambiguous data, which leaves expert opinions couched in cautious terms, dependent on probabilities and degrees of freedom, and frustratingly short on firm, absolute outcomes. Archaeological data is clearly not in the same league as pandemic health data, but it too supports conclusions drawn from often weak, always incomplete evidence, and is consequently open to challenge, misinformation, and disinformation.
Data is tricky stuff. It can appear to be self-evident but equally may be elusive. It can be material yet immaterial, tangible but ephemeral, objective yet biased, precise but inaccurate, detailed but broad-brush. It may be big or small, fast or slow, quantitative or qualitative. It may be easily misrepresented, misconceived, misunderstood, misread, misconstrued, misinterpreted. It can be distorted, altered, mangled, wrangled, and reshaped into something it never originally was according to different purposes and agendas. Data is slippery and perilous, but we often overlook its characteristics and peculiarities in the pursuit of interpretation and – hopefully! – knowledge. Looking back over this blog, I see that I’ve written a lot about data over the years. In the process, I’ve undoubtedly repeated myself, quite probably contradicted myself, and sometimes confused myself, and that’s before considering any of my more formal publications on the subject! For instance, there’s the question of missing and unknown data, data associations, data metaphors, data reuse, data proxies, big data, and quite a lot more besides on data archiving. Not only does this highlight how fundamental data are, but it perhaps underlines the value of a range of different perspectives on the character and nature of data.
Shannon Mattern has recently written about mapping nothing: from the ‘here be dragons’ on old maps marking the limits of knowledge and the promise of new discoveries, to the perception of the Amazon rainforest as an unpeopled wilderness until satellite imagery revealed pre-Columbian geoglyphs which had been largely invisible on the ground. In her wide-ranging essay, she makes the point that nothingness is always something: “A map of nothing demonstrates that an experiential nothingness depends upon a robust ecology of somethingness to enable its occurrence” (Mattern 2021). The question, of course, is what that something actually is.
Nothingness is something that has long been an issue in databases. Null is traditionally used to represent something missing. As null is not a value, it is technically and meaningfully distinct from zeros and empty strings, which are values and hence indicators of something. Although this seems straightforward, the boundaries begin to blur when some guides to SQL, for instance, define null in terms of both missing and unknown values. After all, if something is missing, then we know we are missing it; if something is unknown, then we don’t know whether or not it was ever something. Indeed, Codd, in his classic book on relational databases, argued that null should also indicate why the data is missing, distinguishing between a null that is ‘missing but applicable’ and a null that is ‘missing but inapplicable’ (Codd 1990, 173), but this was never adopted. Consequently, nulls tend to have a bad reputation because of the ways they are variously used (often in error) to represent ‘nothing’, ‘unknown’, ‘value not yet entered’, ‘default value’, and so on, in part because of messy implementations in database management systems.
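The three-valued logic that nulls introduce is easy to see in practice. Here is a minimal sketch in Python, using the standard-library sqlite3 module; the finds table and its depth values are invented purely for illustration:

```python
import sqlite3

# An invented example: excavation finds with a 'depth' measurement.
# Depth 0.0 is a value (found at the surface); NULL records that no
# measurement exists -- two very different kinds of 'nothing'.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE finds (id INTEGER PRIMARY KEY, depth REAL)")
cur.executemany("INSERT INTO finds VALUES (?, ?)",
                [(1, 0.0), (2, None), (3, 1.5)])

# NULL is not a value, so comparing with '=' yields unknown, never true:
eq_null = cur.execute(
    "SELECT COUNT(*) FROM finds WHERE depth = NULL").fetchone()[0]   # 0
is_null = cur.execute(
    "SELECT COUNT(*) FROM finds WHERE depth IS NULL").fetchone()[0]  # 1

# Aggregates silently skip NULLs: the average is (0.0 + 1.5) / 2, not / 3,
# and COUNT(depth) disagrees with COUNT(*).
avg_depth = cur.execute("SELECT AVG(depth) FROM finds").fetchone()[0]   # 0.75
n_rows = cur.execute("SELECT COUNT(*) FROM finds").fetchone()[0]        # 3
n_depths = cur.execute("SELECT COUNT(depth) FROM finds").fetchone()[0]  # 2
```

Note how the row with the null depth simply drops out of the average; and whether that null was, in Codd’s terms, ‘missing but applicable’ or ‘missing but inapplicable’ is precisely what the database cannot express.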
Michael Shanks has recently blogged about Ray Harryhausen and his stop-motion animation (Shanks 2020), sparked by an exhibition at the Scottish National Gallery of Modern Art (currently shut as a result of coronavirus restrictions). Harryhausen’s work proved inspirational to many film directors over the years, but might his technique also be inspirational for archaeological visualisation?
For example, Shanks draws a sharp distinction between Harryhausen’s stop-motion creations and computer-generated imagery: the technique of stop-motion animation never quite disappears into the background, which is part of both its charm and its effect, unlike CGI with its emphasis on photorealistic models.
In CGI the objective is often to have the imagery fabricated by the computer blend in so one doesn’t notice where the fabrication begins or ends. The rhetorical purpose of CGI is to fool, to deceive. Harryhausen’s models don’t look “real”. More precisely, they don’t look “natural”. No one need be fooled. One admires the craft in their making. (Shanks 2020)
Shawn Graham recently pointed me (and a number of colleagues!) to a new paper entitled ‘Computer vision, human senses, and language of art’ by Lev Manovich (2020) in a tweet in which he asked what we made of it … so, challenge accepted!
Lev Manovich is, of course, a professor of computer science and a prolific author, focusing on cultural analytics, artificial intelligence, and media theory, amongst other things. In this particular paper, he proposes that numbers, and their associated analytical methods, offer a new language for describing cultural artefacts. The idea that this is novel may be news to those who have been engaged in quantitative analyses across the humanities since before the introduction of the computer, but aspects of his argument go further than this. The actual context of the paper is as yet unclear since it is online first and not yet assigned to a volume. That said, a number of other open access online first papers in AI & Society seem to address similar themes, so one might imagine it to be a contribution to a collection of digital humanities-related papers concerning images and computer vision.
It’s an interesting paper, not least since – as Manovich says himself (p2) – it presents the perspective of an outside observer writing about the application of technological methods within the humanities. Consequently, it can be tempting to grump about how he “doesn’t understand” or “doesn’t appreciate” what is already done within the humanities, but it’s perhaps best to resist that temptation as far as possible.
There was a flurry of interest in the technical press during the summer with the news that GitHub had placed much of the open source code it held into an almost improbably long-term Arctic archive (e.g. Kimball 2020; Metcalf 2020; Vaughan 2020). GitHub’s timing seemed propitious: in the midst of a global pandemic, with wild fires burning out of control on the west coast of the USA and elsewhere, and with upgrades to the nearby Global Seed Vault recently finished after being flooded as a consequence of global warming.
The Arctic World Archive was set up by Piql in 2017 and is situated in a decommissioned mineshaft deep within the permafrost near Longyearbyen on the Svalbard archipelago. The data are stored on reels of piqlFilm (see Piql 2019, Piql nd), a high-resolution photosensitive film claimed to be secure for 750 years (and over 1000 years in cold, low-oxygen conditions), and hence to require no cycle of refresh and migration, unlike all other forms of digital archive. The film holds both analog (text, images, etc.) and digital information, with digital data stored as high-resolution QR codes. Explanations of how to decode and retrieve the information are included as text at the beginning of each reel, and can be read simply by holding the film up to a light source with a magnifying glass. Piql claim that only a camera or scanner and a computer of some kind will be required to restore the information in the future, which means that the archive outlives any technology used to store the data in the first place.
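The underlying principle, a self-describing encoding whose decoding instructions travel with the data in human-readable form, can be sketched in a few lines of Python. The following toy example is emphatically not Piql’s actual format (real QR codes add error correction and vastly higher density); it simply illustrates how bytes can be written as a visible grid of binary modules headed by a plain-text decoding rule:

```python
# A toy illustration of self-describing storage (not Piql's real format):
# digital data written as a visible grid of binary 'modules', preceded by
# a human-readable instruction that is itself enough to decode the grid.
HEADER = ("Each row below is one byte, most significant bit first; "
          "'#' = 1, '.' = 0. Decode the bytes as UTF-8 text.")

def encode(text: str) -> str:
    """Render text as a header line plus one row of modules per byte."""
    rows = [format(b, "08b").replace("1", "#").replace("0", ".")
            for b in text.encode("utf-8")]
    return HEADER + "\n" + "\n".join(rows)

def decode(frame: str) -> str:
    """Recover the text by following the header's stated rule."""
    rows = frame.splitlines()[1:]  # skip the human-readable header
    data = bytes(int(row.replace("#", "1").replace(".", "0"), 2)
                 for row in rows)
    return data.decode("utf-8")

frame = encode("Svalbard")
assert decode(frame) == "Svalbard"
```

The point of the sketch is that the header alone is enough to bootstrap recovery: any future reader with a magnified image of the frame and some means of computation could reimplement the decoder from the stated rule, with no dependence on today’s file formats or software.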
We’ve all experienced that rush of recollection when we uncover some long-hidden or long-lost object from our past in the bottom of a drawer or box, triggering memories of encounters, activities, people, and places. We’re accustomed to the idea that we use evocative things as stored memories, deliberately or inadvertently, and as distributed extensions of our embodied memory (e.g. Heersmink 2018). Is it the same with digital objects? For example, van Dijck asks:
Are analog and digital objects interchangeable in the making, storing, and recalling of memories? Do digital objects change our inscription and remembrance of lived experience, and do they affect the memory process in our brains? (2007, xii).
Perhaps it’s a neurosis brought on by the contemplation of my excavation backlog, but I think there is a difference: that not all analog objects are equally interchangeable with digital equivalents in terms of their functioning as distributed memories, and that this difference is significant when we consider the archaeological narratives we are able to construct from our digital records. It may be that this perspective is coloured by the physical nature of my backlog from the 1980s and 1990s which for various reasons sits on the cusp of analog/digital recording. Although Ruth Tringham recalls how in the 1980s the digital recording of hitherto paper records was distrusted (Tringham 2010, 87), not least due to concerns about the fragility of the hardware and impermanence of the product, in my case it was rather more prosaic: as someone working with computers full-time in my day job I had no desire to turn my excavation experience into a busman’s holiday as the on-site computer technician. The downside was that I subsequently gave myself the monumental task of manually entering the record sheets into the database and scanning/digitising the plans and sections in the off-season. In retrospect, however, this provides the opportunity to consider the different affordances of the two sets of analog and digital records, a perception that is reinforced by the pre-pandemic experience of packing my office which incorporated two days of sorting and moving the physical archive and about five minutes transferring the digital files.
There are quite a few metaphors associated with archaeological data, many of which relate to its apparent mystery. For example, Gavin Lucas has described the archaeological record as being “haunted by absences” created by decay and destruction (Lucas 2012, 178). In a similar vein, Alison Wylie has described archaeological data as “shadowy”, arguing that archaeology is defined “by the challenges of working with gaps and absences in its primary data” (Wylie 2017, 204). In a special issue of the Science, Technology, & Human Values journal on ‘Data Shadows’, Leonelli et al. describe data in terms of its presence, but also in terms of its unavailability, inaccessibility, or absence, defining absence as a descriptor of how “data are missing, incomplete, unreliable, ignored, unwanted, or untagged” (Leonelli et al. 2017, 192). As Christopher Chippindale described it,
Archaeology is plagued in many an instance with poorly defined variables (usually thought of as ‘data’) drawn from ill-understood populations, and with uncertain articulations between the entities whose logical relations we seek to understand. (2000, 611)
Bill Caraher has recently been considering the nature of ‘legacy data’ in archaeology (Caraher 2019) (with a commentary by Andrew Reinhard). Amongst other things, he suggests there has been a shift from paper-based archives designed with an emphasis on the future to digital archives which often seem more concerned with present utility. Coincidentally, Bill’s post landed just as I was pondering the nature of the relationship between digital archives and our use of data.
So do digital archives represent a paradigm shift from traditional archives and archival practice, or are they simply a technological development of them? Digital archives are commonly understood to be a means of storing, organising, maintaining, and making data accessible in digital format. Relative to traditional archives, they are not limited by physical space or its associated costs, and so can make much more information available more easily, cheaply, and widely. But a consequence of this can be a kind of ‘storage mania’, in which data become easier to accumulate than to delete because of digitalisation, and in which data are released from the limitations of time and space through their dematerialisation (Sluis 2017, 28). This is akin to the “infinite archives” of David Berry (2017, 107), who suggests that “One way of thinking about computational archives and new forms of abstraction they produce is the specific ways in which they manage the ‘derangement’ of knowledge through distance” (Berry 2017, 119). At the same time, digital archives represent new technological material structures built on the performativity of the software which delivers large-scale processing of these apparently dematerialised data (Sluis 2017, 28).
Yesterday was World Digital Preservation Day and saw the publication of the Digital Preservation Coalition’s Bitlist – their global list of Digitally Endangered Species. Interestingly, under their ‘Practically Extinct’ category (“when the few known examples are inaccessible by most practical means and methods”) sits Unpublished Research Data, which they define as
“research data which has not been shared or published by any means and is thus in contravention of the ‘FAIR’ principles which require data to be Findable, Accessible, Interoperable and Reusable”.
Although the DPC jury hopes that this is a small group, I rather suspect that there is an unseen mountain of unpublished research data in archaeology (and in the interest of full disclosure: reader, I have some).
This crossed my screen at the same time as a paper published in the Harvard Data Science Review by Stephen Stigler: ‘Data Have a Limited Shelf Life’, in which he argues that data, unlike wines, do not improve with age. He suggests that old data are “Often … no more than decoration; sometimes they may be misleading in ways that cannot easily be discovered”, while emphasising that this is not the same as saying they have no value. Using three examples of old statistical data, he shows how misleading and incomplete such data can be if their full background is not known. In each case, the data were selected from a prior source, not always accurately referenced if at all. In some instances, uncovering the original data flagged problems with the sample that had been taken; in others, it revealed a greater breadth and depth of information which had gone unused because the particular research question had stripped it away.