Dipping in Data Lakes

We’re becoming increasingly accustomed to talk of Big Data in archaeology, and at the same time beginning to see the resurgence of Artificial Intelligence in the shape of machine learning. We’ve spent the last 20 years or so assembling mountains of data in digital repositories, and these are now being mined as big data resources and as sources of machine learning training data. Yet we are also increasingly aware of the restrictions that those same repositories impose upon us – the use of pre-cooked ‘what/where/when’ queries, the need to (re)structure data in order to integrate different data sources and suppliers, and their largely siloed nature, which limits cross-repository connections, for example. More generally, we are accustomed to organising our data in specific ways to fit the structures imposed by database management systems, or indeed to fit the structures predefined by archaeological recording systems, both of which shape subsequent analysis. But what if it doesn’t need to be this way?
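
To make the contrast concrete, here is a minimal and purely illustrative sketch of the ‘schema-on-read’ idea that underpins data lakes: records of different shapes are deposited as they arrive and only acquire structure when a question is asked of them, rather than being forced in advance into a single database schema. The records and field names below are invented for the purpose of the example.

```python
# A hypothetical 'data lake' of heterogeneous archaeological records.
# Nothing is restructured on deposit; each record keeps its own fields.
lake = [
    {"type": "find", "material": "ceramic", "period": "Roman",
     "easting": 258430, "northing": 665320},
    {"type": "survey", "instrument": "magnetometer", "site": "Site A",
     "readings": 14200},
    {"type": "excavation", "site": "Site A", "contexts": 87,
     "period": "Iron Age"},
]

# Structure is imposed only by the question being asked (schema-on-read),
# not by the store in advance (schema-on-write).
dated_records = [r for r in lake if "period" in r]
for record in dated_records:
    print(record["type"], "-", record.get("site", "unlocated"), "-", record["period"])
```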

Continue reading

Towards a digital ethics of agential devices

Image by Rawpixel CC0 1.0 via Creative Commons

Discussion of digital ethics is very much on trend: for example, the Proceedings of the IEEE has just published a special issue on ‘Ethical Considerations in the Design of Autonomous Systems’ (Volume 107, Issue 3), and the Philosophical Transactions of the Royal Society A published a special issue on ‘Governing Artificial Intelligence – ethical, legal and technical opportunities and challenges’ late in 2018. In the latter, Corinne Cath (2018, 3) draws attention to the growing body of literature surrounding AI and ethical frameworks, and to debates across the world over laws governing AI and robotics, and points to an explosion of activity in 2018, with a dozen national strategies published and billions in government grants allocated. She also notes that many of the leaders in both the debates and the technologies are based in the USA, which itself presents an ethical issue in terms of the extent to which AI systems mirror US culture rather than socio-cultural systems elsewhere in the world (Cath 2018, 4).

Agential devices, whether software or hardware, essentially extend the human mind by scaffolding or supporting our cognition. This broad definition therefore runs the gamut of digital tools and technologies: from digital cameras to survey devices (e.g. Huggett 2017), through software supporting data-driven meta-analyses and their incorporation in machine-learning tools, to remotely controlled terrestrial and aerial drones, remotely operated vehicles, autonomous surface and underwater vehicles, lab-based robotic devices, and semi-autonomous bio-mimetic or anthropomorphic robots. Many of these devices augment archaeological practice, reducing routinised and repetitive work in the office and in the field. Others augment work through data-driven methods which represent, store, and manipulate information in order to undertake tasks previously thought to be uncomputable or incapable of being automated. In the process, each raises ethical issues of various kinds. Whether agency can be associated with such devices may be questioned on the basis that they have no intent, responsibility or liability, but I would simply suggest that anything we ascribe agency to acquires agency, especially bearing in mind the human tendency to anthropomorphise our tools and devices. What I am not suggesting, however, is that these systems have a mind or consciousness of their own, which raises a whole different set of ethical questions.

Continue reading

Digital Data Relations

Data is the new oil
(adapted from original by Gerd Leonhard, CC-BY-SA 2.0)

We sometimes underestimate the impact of digital data on archaeology because we have become so accustomed to the capture, processing, and analysis of data using our digital tools. Of course, archaeology is by no means alone in this respect. For example, Sandra Rendgren, who writes about data visualisation, infographics and interactive media, recently pointed to a new genre of journalism that has arisen from the availability of digital data and the means to analyse them (2018a). But this growing reliance on digital data should prompt a reconsideration of what we actually mean by data. Indeed, Rendgren suggests that the term ‘data’ can be likened to a transparent fluid – “always used but never much reflected upon” – because of its ubiquity and apparent lack of ambiguity (2018b).

Continue reading

Is there a digital File Drawer problem?

by Sailko via Wikimedia Commons CC BY-SA 3.0

Although there has been dramatic growth in the development of autonomous vehicles, with competition between different companies and different methodologies, and despite the complexity of the task, the number of incidents remains remarkably small – though no less tragic where the death of occupants or other road users is involved. Of course, at present autonomous cars are not literally autonomous: a human agent is still required to be available to intervene, and accidents involving such vehicles are usually a consequence of the human component of the equation failing to react as they should. A recent fatal accident involving a Tesla Model X (e.g. Hruska 2018) has resulted in some push-back from Tesla, who have sought to emphasise that the blame lies with the deceased driver rather than with the technology. One of the company’s key concerns in this instance appears to be defending the functionality of their Autopilot system, and in relation to this, a rather startling comment on the Tesla blog recently stood out:

No one knows about the accidents that didn’t happen, only the ones that did. The consequences of the public not using Autopilot, because of an inaccurate belief that it is less safe, would be extremely severe. (Tesla 2018).

Continue reading

Digital proxies

The US Immigration and Customs Enforcement agency (ICE) is apparently seeking to employ ‘big data’ methods to automate its assessment of visa applications in pursuit of Trump’s calls for ‘extreme vetting’ (e.g. Joseph 2017; Joseph and Lipp 2017). A crucial problem with the proposals has been flagged in a letter to the Acting Secretary of Homeland Security by a group of scientists, engineers and others with experience in machine learning, data mining and related fields. Specifically, they point to the problem that algorithms developed to detect ‘persons of interest’ could arbitrarily flag particular groups while at the same time appearing to be objective. We’ve already seen this kind of stereotyping and discrimination embedded in other applications, inadvertently for the most part, and the risk is the same in this case. The reason provided in the letter is simple:

“Inevitably, because these characteristics are difficult (if not impossible) to define and measure, any algorithm will depend on ‘proxies’ that are more easily observed and may bear little or no relationship to the characteristics of interest” (Abelson et al. 2017).
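
The letter’s point is easier to see with a toy simulation. The sketch below is hypothetical and drawn neither from the letter nor from any real system: the characteristic of interest is unmeasurable, so the ‘algorithm’ scores an observable proxy that happens to track group membership rather than the trait itself, with the result that its flags look systematic while bearing no relationship to what they claim to measure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

group = rng.integers(0, 2, size=n)        # membership of two invented groups
trait = rng.normal(size=n)                # the real characteristic of interest (unobservable)
proxy = 2.0 * group + rng.normal(size=n)  # an observable proxy: tracks group, not trait

# The simplest possible 'vetting algorithm': flag the top 10% by proxy score.
flagged = proxy > np.quantile(proxy, 0.9)

print("proportion of flagged cases from group 1:",
      round(float(group[flagged].mean()), 2))            # close to 1.0
print("correlation of the proxy with the real trait:",
      round(float(np.corrcoef(proxy, trait)[0, 1]), 2))  # close to 0.0
```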

Continue reading

Deep-fried archaeological data

Deep fried Mars bar

I’ve borrowed the idea of ‘deep-fried data’ from the title of a presentation by Maciej Cegłowski to the Collections as Data conference at the Library of Congress last month. As an archaeologist living and working in Scotland for 26 years, the idea of deep-fried data spoke to me, not least of course because of Scotland’s culinary reputation for deep-frying anything and everything. Deep-fried Mars bars, deep-fried Crème eggs, deep-fried butter balls in Irn Bru batter, deep-fried pizza, deep-fried steak pies, and so it goes on (see some more not entirely serious examples).

Hardened arteries aside, what does deep-fried data mean, and how is this relevant to the archaeological situation? In fact, you don’t have to look too hard to see that cooking is often used as a metaphor for our relationship with and use of data.

Continue reading

Biggish Data

Big Data 😉

Big Data is (are?) old hat… Big Data dropped off Gartner’s Emerging Technologies Hype Cycle altogether in 2015, having slipped into the ‘Trough of Disillusionment’ in 2014 (Gartner Inc. 2014, 2015a). The reason given was simply that it had evolved and become the new normal – the high-volume, high-velocity, high-variety types of information that classically defined ‘big data’ were becoming embedded in a range of different practices (e.g. Heudecker 2015).

At the same time, some of the assumptions behind Big Data were being questioned. It was no longer quite so straightforward to claim that ‘big data’ could overcome ‘small data’ by throwing computer power at a problem, or that quantity outweighed quality such that the large size of a dataset offset any problems of errors and inaccuracies in the data (e.g. Mayer-Schönberger and Cukier 2013, 33), or that these data could be analysed in the absence of any hypotheses (Anderson 2008).

For instance, boyd and Crawford had highlighted the mythical status of ‘big data’: in particular, the belief that it somehow provided a higher order of intelligence capable of creating insights that were otherwise impossible, carrying with it an aura of truth, objectivity and accuracy (2012, 663). Others followed suit. For example, McFarland and McFarland (2015) have shown how most Big Data analyses produce “precisely inaccurate” results simply because the sample sizes are so large that even trivial or spurious effects become statistically highly significant (hence the debacle over Google Flu Trends – see, for example, Lazer and Kennedy 2015). Similarly, Pechenick et al. (2015) showed how, counter-intuitively, results from Google’s Books Corpus could easily be distorted by a single prolific author, or by the marked increase in scientific articles included in the corpus after the 1960s. Indeed, Peter Sondergaard, a senior vice president at Gartner and global head of Research, underlined that data (big or otherwise) are inherently dumb without algorithms to work on them (Gartner Inc. 2015b). In this regard, one might claim that Big Data have in many respects been superseded by Big Algorithms.
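
The ‘precisely inaccurate’ point is easy to demonstrate with simulated data. The sketch below is mine rather than McFarland and McFarland’s, and the numbers are invented purely for illustration: two groups whose true difference is a negligible one-hundredth of a standard deviation only approach ‘statistical significance’ once the sample becomes enormous.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

for n in (100, 10_000, 1_000_000):
    # Two simulated groups whose means differ by a trivial 0.01 standard deviations.
    a = rng.normal(loc=0.00, scale=1.0, size=n)
    b = rng.normal(loc=0.01, scale=1.0, size=n)
    t, p = stats.ttest_ind(a, b)
    print(f"n = {n:>9,}   p = {p:.4f}")

# As n grows, even this substantively meaningless difference tends towards
# 'statistical significance' - precise, but telling us nothing of interest.
```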

Continue reading

Big Data Analytics

It was only a matter of time before a ‘big data’ company latched onto archaeology for commercial purposes. As reported in a New Scientist article last week (with an unfortunate focus on ‘treasure’), a UK data analytics start-up called Democrata is incorporating archaeological data into a system that allows engineering and construction firms to predict the likelihood of encountering archaeological remains. This, of course, is what local authority archaeologists already do, alongside the environmental impact assessments undertaken by commercial archaeology units. But this isn’t (yet) an argument about a potential threat to archaeological jobs.

Continue reading

Inadvertent Algorithms

As the end of 2014 approaches, Facebook has unleashed its new “Year in Review” app, purporting to show the highlights of your year. In my case, it did little more than demonstrate an almost complete absence of Facebook activity on my part, apart from some conference photos a colleague had posted to my wall; in Eric Meyer’s case, it presented him with a picture of his daughter, who had died earlier in the year. In a thoughtful and thought-provoking piece, he describes this as ‘Inadvertent Algorithmic Cruelty’: it wasn’t deliberate on the part of Facebook (who have now apologised), and for many people it worked well, as evidenced by the numbers who opted to include it on their timelines, but it lacked an opt-in facility and there was an absence of what Meyer calls ‘empathetic design’. Om Malik picks up on this, pointing to the way Facebook now has an ‘Empathy Team’ apparently intended to make designers understand what it is like to be a user (sorry, a person), even though Facebook’s ability to highlight what people see as important is driven by crude data such as the number of ‘likes’ and comments, without any understanding of the meanings that lie behind them.
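
The kind of ‘crude’ selection that Meyer and Malik describe can be caricatured in a few lines. The sketch below is entirely hypothetical and is not Facebook’s actual algorithm: highlights are chosen purely by engagement counts, so an outpouring of condolence is indistinguishable from an outpouring of congratulation.

```python
# Invented posts and counts, purely for illustration.
posts = [
    {"caption": "Conference photos", "likes": 12, "comments": 3},
    {"caption": "New job announcement", "likes": 85, "comments": 20},
    {"caption": "In memoriam", "likes": 210, "comments": 94},
]

def engagement(post):
    # Likes and comments are simply summed; the score has no sense of what they mean.
    return post["likes"] + post["comments"]

year_in_review = sorted(posts, key=engagement, reverse=True)
print("Your top highlight:", year_in_review[0]["caption"])  # the most painful post 'wins'
```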

Continue reading

Big Data and Distance

One feature of the increasing amounts of archaeological data available online is that they frequently arrive without an accompanying awareness of context. Far from being treated as a problem, this is often seen as an advantage in relation to ‘big data’ – indeed, Chris Anderson (2008) has claimed that context can be established later, once statistical algorithms have found correlations in large datasets that might not otherwise be revealed.
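
One reason to be wary of that claim is that, given enough variables, correlation-mining will always find something. The sketch below is a simulation of my own, drawn neither from Anderson nor from any archaeological dataset: a hundred entirely unrelated variables still throw up a scatter of apparently ‘strong’ pairwise correlations by chance alone.

```python
import numpy as np

rng = np.random.default_rng(1)
n_records, n_variables = 200, 100

# Simulated data with no real relationships in it at all.
data = rng.normal(size=(n_records, n_variables))

corr = np.corrcoef(data, rowvar=False)
upper = corr[np.triu_indices(n_variables, k=1)]   # each distinct pair of variables once

print("variable pairs examined:", upper.size)
print("pairs with |r| > 0.2 despite there being no real structure:",
      int((np.abs(upper) > 0.2).sum()))
```
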
Continue reading