Counting and Mining Research Data with Unix

Baker, James; Milligan, Ian

Files

Counting and mining research data with Unix _ Programming Historian.pdf (42.48 KB)

Date

2014-09-20

Authors

Baker, James

Milligan, Ian

Publisher

The Editorial Board of the Programming Historian

Abstract

This lesson looks at how research data, when organised in a clear and predictable manner, can be counted and mined using the Unix shell. It builds on the lessons “Preserving Your Research Data: Documenting and Structuring Data” and “Introduction to the Bash Command Line”. Depending on your confidence with the Unix shell, it can also be used as a standalone lesson or refresher.
Having accumulated research data for one project, a historian might ask different questions of that same data when returning to it during a subsequent project. If this data is spread across multiple files (a series of tabulated data, a set of transcribed text, a collection of images), it can be counted and mined using simple Unix commands. The Unix shell gives you access to a range of powerful commands that can transform how you count and mine research data. This lesson introduces a series of commands for counting and mining tabulated data, though they only scratch the surface of what the Unix shell can do. By learning just a few simple commands you will be able to undertake tasks that are impossible in LibreOffice Calc, Microsoft Excel, or other similar spreadsheet programs. These commands can be easily extended for use with non-tabulated data.
This lesson also demonstrates that the options available to you for manipulating, counting and mining data will often depend as much on the amount of metadata, or descriptive text, contained in the filenames of your data as on the range of Unix commands you have learnt to use. Thus, even though it is not a prerequisite for working with the Unix shell, taking the time to structure your research data and file-naming conventions in a consistent and predictable manner is a significant step towards getting the most out of Unix commands and being able to count and mine your research data.
For the value of making your data consistent and predictable beyond matters of preservation, see “Preserving Your Research Data: Documenting and Structuring Data”.

Description

This article, published by the Editorial Board of the Programming Historian, is made available under a Creative Commons Attribution 2.0 Generic License. Available at: http://programminghistorian.org/lessons/research-data-with-unix

Keywords

Guides and tutorials, Research data, Data mining, Unix

URI

http://programminghistorian.org/lessons/research-data-with-unix
http://hdl.handle.net/10012/11750

Collections

Waterloo Research
History

Creative Commons license

Except where otherwise noted, this item's license is described as Attribution 2.0 Generic
