We’re becoming increasingly accustomed to talk of Big Data in archaeology and at the same time beginning to see the resurgence of Artificial Intelligence in the shape of machine learning. And we’ve spent the last 20 years or so assembling mountains of data in digital repositories which are becoming big data resources for mining in the pursuit of machine learning training data. At the same time we are increasingly aware of the restrictions that those same repositories impose upon us – the use of pre-cooked ‘what/where/when’ queries, the need to (re)structure data in order to integrate different data sources and suppliers, and their largely siloed nature which limits cross-repository connections, for example. More generally, we are accustomed to the need to organise our data in specific ways in order to fit the structures imposed by database management systems, or indeed, to fit our data into the structures predefined by archaeological recording systems, both of which shape subsequent analysis. But what if it doesn’t need to be this way?
To what extent does our use of digital devices to capture and process archaeological data affect our perceptions of what was there? Mark Altaweel (2018) has recently asked a similar question in relation to GPS technologies – how do these affect our understanding and experience of place? He suggests that they diminish our sense of place and the experiences we might otherwise have as we navigate according to their recommendations. Certainly, satnavs are notorious for taking our navigational cognitive load upon themselves and consequently leading drivers who are insufficiently aware of their surroundings into undesirable, even dangerous situations. We might think that the cognitive capacity freed up by such devices ought to be divertible into more useful, more extensive areas – we literally have the space to think about bigger and deeper things as a consequence of their application. This kind of argument frequently arises in relation to the value of automation, and can be seen, for instance, in the discussions surrounding the use of structure-from-motion photogrammetric recording on archaeological excavations. But is this supposed release of cognitive space an unalloyed good? Or is this a case of the technologies distancing us from the physicality of the archaeological material and space in front of us?
We sometimes underestimate the impact of digital data on archaeology because we have become so accustomed to capturing, processing, and analysing data with our digital tools. Of course, archaeology is by no means alone in this respect. For example, Sandra Rendgren, who writes about data visualisation, infographics and interactive media, recently pointed to the creation of a new genre of journalism that has arisen from the availability of digital data and the means to analyse them (2018a). But this growing reliance on digital data should prompt a reconsideration of what we actually mean by data. Indeed, Rendgren suggests that the term ‘data’ can be likened to a transparent fluid – “always used but never much reflected upon” – because of its ubiquity and apparent lack of ambiguity (2018b).
Although there has been dramatic growth in the development of autonomous vehicles, and consequent competition between different companies and different methodologies, and despite the complexities of the task, the number of incidents remains remarkably small, though no less tragic where the death of occupants or other road users is involved. Of course, at present autonomous cars are not literally autonomous in the sense that a human agent is still required to be available to intervene, and accidents involving such vehicles are usually a consequence of the human component of the equation failing to react as it should. A recent fatal accident involving a Tesla Model X (e.g. Hruska 2018) has resulted in some push-back from Tesla, who have sought to emphasise that the blame lies with the deceased driver rather than with the technology. One of the company’s key concerns in this instance appears to be the defence of the functionality of their Autopilot system, and in relation to this a rather startling comment on the Tesla blog recently stood out:
No one knows about the accidents that didn’t happen, only the ones that did. The consequences of the public not using Autopilot, because of an inaccurate belief that it is less safe, would be extremely severe. (Tesla 2018).
So here’s a thing. A while ago, I asked whether there was any way to quantify the extent to which archaeologists were citing their reuse of data. I used the Thomson Reuters/Clarivate Analytics Data Citation Index (DCI) as a starting point, but it didn’t go too well … Back then, the DCI indicated that 56 of the 476 data studies derived from the UK’s Archaeology Data Service repository had apparently been cited elsewhere in the Web of Science databases (the figure is currently 58 out of 515). But I also found that the citations themselves were problematic: the citation of the published paper/volume was frequently incomplete or abbreviated, many appeared to be self-citations from within interim or final reports, in some cases the citations preceded the dates of the project being referenced, and in many instances it was possible to demonstrate that the data had been cited (in some form or other) but this had not been captured in the DCI. At that point I concluded that the DCI was of little value at present. So what was going on?
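By way of illustration only, the kinds of checks involved might look something like the following Python sketch. The record fields and example data are entirely invented for the purpose and do not reflect the DCI’s actual export format; only the headline figures at the end come from the numbers quoted above.

```python
# A minimal sketch of the kinds of checks described above, using invented
# citation records – the field names do not reflect the DCI's actual schema.

records = [
    {"data_study": "Collection A", "project_year": 2009,
     "citing_year": 2007, "citing_authors": {"Smith"}, "data_creators": {"Smith"}},
    {"data_study": "Collection B", "project_year": 2012,
     "citing_year": 2015, "citing_authors": {"Jones"}, "data_creators": {"Brown"}},
]

def flag_problems(rec):
    """Flag the problems noted in the text: self-citation and impossible dates."""
    problems = []
    # Self-citation: the citing authors overlap with the data creators.
    if rec["citing_authors"] & rec["data_creators"]:
        problems.append("possible self-citation")
    # The citation apparently predates the project it references.
    if rec["citing_year"] < rec["project_year"]:
        problems.append("citation predates the project")
    return problems

for rec in records:
    print(rec["data_study"], "->", flag_problems(rec) or ["no obvious problems"])

# Headline coverage of the kind quoted above: 58 of 515 ADS data studies cited.
print(f"coverage: {58 / 515:.1%}")
```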
I’ve commented here and here about the question of data reuse (or more accurately, the lack of it) and the implications for archaeological digital repositories. It’s frequently argued that the key incentive for making data available for reuse is providing credit through citation. So how’s that going? I’ve not seen any attempt to actually quantify this, so out of curiosity I thought I’d have a go.
A logical starting point is the Thomson Reuters Data Citation Index – according to its owners (it’s a licensed rather than a public resource), this indexes the contents of a large number of the world’s leading data repositories, and, on checking, the UK’s Archaeology Data Service (ADS) appears among them. So far so good.
I’ve borrowed the idea of ‘deep-fried data’ from the title of a presentation by Maciej Cegłowski to the Collections as Data conference at the Library of Congress last month. As an archaeologist living and working in Scotland for 26 years, the idea of deep-fried data spoke to me, not least of course because of Scotland’s culinary reputation for deep-frying anything and everything. Deep-fried Mars bars, deep-fried Crème eggs, deep-fried butter balls in Irn Bru batter, deep-fried pizza, deep-fried steak pies, and so it goes on (see some more not entirely serious examples).
Hardened arteries aside, what does deep-fried data mean, and how is this relevant to the archaeological situation? In fact, you don’t have to look too hard to see that cooking is often used as a metaphor for our relationship with and use of data.
Solutions to the crisis in archaeological archives in an environment of shrinking resources often involve selection and discard of the physical material and an increased reliance on the digital. For instance, several presentations to a recent day conference on Selection, De-selection and Rationalisation organised by the Archaeological Archives Group implicitly or explicitly refer to the effective replacement of physical items with data records, where either deselected items were removed from the archive or else material was never selected for inclusion in the first place because of its perceived ‘low research potential’. Indeed, Historic England are currently tendering for research into what they call the ‘rationalisation’ of museum archaeology collections
“… which ensures that those archives that are transferred to museums contain only material that has value, mainly in the potential to inform future research.” (Historic England 2016, 2)
Historic England anticipate that these procedures may also be applied retrospectively to existing collections. It remains too early to say, but it seems more than likely that a key approach to mitigating such rationalisation will be the use of digital records. In this way, atoms are quite literally converted into bits (to borrow from Nicholas Negroponte) and the digital remains become the sole surrogate for material that, for whatever reason, was not considered worthy of physical preservation. What are the implications of the digital coming to the rescue of the physical archive in this way?
The UK is suddenly waking from the reality distortion field created by politicians on both sides and only now beginning to appreciate the consequences of Brexit – our imminent departure from the European Union. But – without forcing the metaphor – are we operating within some kind of archaeological reality distortion field in relation to digital data?
Undoubtedly one of the big successes of digital archaeology in recent years has been the development of digital data repositories and, correspondingly, increased access to archaeological information. Here in the UK we’ve been fortunate enough to have seen this develop over the past twenty years in the shape of the Archaeology Data Service (ADS), which offers search tools, access to digital back-issues of journals, monograph series and grey literature reports, and downloadable datasets from a variety of field and research projects. In the past, large-scale syntheses took years to complete: for instance, Richard Bradley’s synthesis of British and Irish prehistory took four years’ paid research leave, with three years of research assistant support, in order to travel the country seeking out grey literature reports accumulated over 20 years (Bradley 2006, 10). At this moment there are almost 38,000 such reports in the Archaeology Data Service digital library, with more added each month (a more than five-fold increase since January 2011, for example). The appearance of projects of synthesis such as the Rural Settlement of Roman Britain is starting to provide evidence of the value of access to such online digital resources. And, of course, other countries increasingly have their own equivalents of the ADS (tDAR and OpenContext in the USA, DANS in the Netherlands, and the Hungarian National Museum’s Archaeology Database, for instance).
But all is not as rosy in the archaeological digital data world as it might be.
Big Data is (are?) old hat … Big Data dropped off Gartner’s Emerging Technologies Hype Cycle altogether in 2015, having slipped into the ‘Trough of Disillusionment’ in 2014 (Gartner Inc. 2014, 2015a). The reason given for this was simply that it had evolved and had become the new normal – the high-volume, high-velocity, high-variety types of information that classically defined ‘big data’ were becoming embedded in a range of different practices (e.g. Heudecker 2015).
At the same time, some of the assumptions behind Big Data were being questioned. It was no longer quite so straightforward to claim that ‘big data’ could overcome ‘small data’ by throwing computer power at a problem, or that quantity outweighed quality such that the large size of a dataset offset any problems of errors and inaccuracies in the data (e.g. Mayer-Schönberger and Cukier 2013, 33), or that these data could be analysed in the absence of any hypotheses (Anderson 2008).
For instance, boyd and Crawford had highlighted the mythical status of ‘big data’: in particular, the belief that it somehow provides a higher order of intelligence that can create insights that were otherwise impossible, with an aura of truth, objectivity and accuracy (2012, 663). Others followed suit. For example, McFarland and McFarland (2015) have recently shown how most Big Data analyses give rise to “precisely inaccurate” results simply because the sample size is so large that almost any difference registers as statistically highly significant (hence the debacle over Google Flu Trends – for example, Lazer and Kennedy 2015). Similarly, Pechenick et al. (2015) showed how, counter-intuitively, results from Google’s Books Corpus could easily be distorted by a single prolific author, or by the marked increase in scientific articles included in the corpus after the 1960s. Indeed, Peter Sondergaard, a senior vice president at Gartner and global head of Research, underlined that data (big or otherwise) are inherently dumb without algorithms to work on them (Gartner Inc. 2015b). In this regard, one might claim that Big Data have been superseded by Big Algorithms in many respects.
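To see why very large samples can yield results that are precise but not meaningful, consider a minimal simulation – a Python sketch of my own, not McFarland and McFarland’s method, with the effect size and sample sizes invented purely for illustration. A difference far too small to be of any substantive interest becomes overwhelmingly ‘significant’ once the sample is large enough:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two populations whose means differ by a trivially small amount
# (0.01 standard deviations) – a difference of no practical interest.
effect = 0.01

for n in (1_000, 100_000, 10_000_000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(effect, 1.0, n)
    t_stat, p_value = stats.ttest_ind(a, b)
    print(f"n = {n:>10,}  p = {p_value:.3g}")

# As n grows into the millions, the p-value collapses towards zero even though
# the underlying difference remains negligible: the estimate is precise, but
# 'significance' says nothing about whether the result matters or whether the
# data are biased.
```

The particular numbers are beside the point; it is the shape of the relationship that matters – with millions of records, statistical significance becomes almost automatic, which is precisely why the warnings about objectivity and accuracy cited above carry weight.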