Counting and Mining Research Data with Unix

Loading...
Thumbnail Image

Date

2014-09-20

Authors

Baker, James
Milligan, Ian

Advisor

Journal Title

Journal ISSN

Volume Title

Publisher

The Editorial Board of the Programming Historian

Abstract

This lesson will look at how research data, when organised in a clear and predictable manner, can be counted and mined using the Unix shell. The lesson builds on the lessons “Preserving Your Research Data: Documenting and Structuring Data” and “Introduction to the Bash Command Line”. Depending on your confidence with the Unix shell, it can also be used as a standalone lesson or refresher. Having accumulated research data for one project, a historian might ask different questions of that same data when returning to it during a subsequent project. If this data is spread across multiple files - a series of tabulated data, a set of transcribed text, a collection of images - it can be counted and mined using simple Unix commands. The Unix shell gives you access to a range of powerful commands that can transform how you count and mine research data. This lesson will introduce you to a series of commands that use counting and mining of tabulated data, though they only scratch the surface of what the Unix shell can do. By learning just a few simple commands you will be able to undertake tasks that are impossible in Libre Office Calc, Microsoft Excel, or other similar spreadsheet programs. These commands can be easily extended for use with non-tabulated data. This lesson will also demonstrate that the options for manipulating, counting and mining data available to you will often depend on the amount of metadata, or descriptive text, contained in the filenames of the data you are using as much as the range of Unix commands you have learnt to use. Thus, even if it is not a prerequisite of working with the Unix shell, taking the time to structure your research data and filenaming conventions in a consistent and predictable manner is certainly a significant step towards getting the most out of Unix commands and being able to count and mine your research data. For the value of taking the time to make your data consistent and predictable beyond matters of preservation, see “Preserving Your Research Data: Documenting and Structuring Data”.

Description

This article Published by the Editorial Board of the Programming Historian is made available under a Creative Commons Attribution 2.0 Generic License. Available at: http://programminghistorian.org/lessons/research-data-with-unix

Keywords

Guides and tutorials, Research data, Data mining, Unix

LC Subject Headings

Citation