3.3 Data Directory

3.3.1 Data Directory

All data are stored in subfolders of the Data folder on the Team Drive (<UNK>/data). To document these data, we use the Data Directory, which includes key, standardized information from each readme metadata file. Every data file in the <UNK>/data folder has a record (row) in the Data Directory. The Data Directory file contains two sheets: (1) Data directory (the record and standardized documentation for each data file); (2) Metadata (information needed to populate the Data Directory, i.e. the meta-metadata).

In the case of placeholder metadata (as described in the Metadata section), only the following columns should be filled out: folder, filename, contact, and summary. This (mostly blank) row serves two purposes: (1) it keeps the dataset at least partly searchable and (2) it serves as a visual reminder that the dataset still needs more robust metadata.

Column	Description
Domain	Climate/Energy; Land; Ocean; General; Other [drop down menu]
Description	A few word description (e.g. SST US 2017); max 5 words
Folder	Name of folder containing data
Filename	Name of data file
Year	Year of publication
Version	Sub category of year; NA if not applicable
Project	Project name that used these data (can have multiple listings) or ‘General’ if widely used (e.g. FAO data), hyperlinked to OneDrive/Box folder
Code	Link to Github repo or wherever code is stored
Data Stage	‘raw’ if raw data; ‘final input’ for the input data used for the analysis; ‘output’ for what was used for the project and/or published [drop down menu]
Filetype	File extension (e.g. csv; tif; rds); note: do not include ‘.’
Citation	Hyperlinked reference to publication or online resource or contact for individual/group data author
URL	Link to original data source
Extent	global; regional; national; local [drop down menu]
Resolution	Resolution of spatial data (in degrees)
Permissions	open = open source/open access; restricted = need author permission; secure = confidential data and likely involves a DUA or NDA [drop down menu]
Start year	Data set start year; numeric
End year	Data set end year; numeric
Source	e.g. FAO; Rare
Contact	Name and email of contact person who used/stored the data
Reference	Hyperlinked reference to publication using data (can be NA)
Keywords	e.g. fisheries; fire; utilities; property value; VDS; MPA; oceanography; temperature; habitat; biodiversity (up to 5 per entry, separated by semi-colons)
Summary	Brief description of the data (1-2 sentences). Include years for timeseries; location/spatial extent for spatial data; key variables; resolution; sampling frequency; species; etc.
Notes	Other relevant information about the data (e.g. whether it was processed, such as being subset from a larger dataset, and what specifically was done; any suspicious data points; any known issues). Initial your entry.
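
The column rules above can be checked automatically before a row is added. The following is a minimal sketch in Python, not an existing team tool: the function name check_row and the idea of passing a row as a dictionary keyed by column name are assumptions made for illustration. It requires only the four placeholder columns and checks the remaining rules (drop-down values, 5-word description limit, no ‘.’ in the file type, up to 5 keywords, numeric years) only for cells that are filled in.

```python
# Allowed values copied from the [drop down menu] columns in the table above.
DROPDOWNS = {
    "Domain": {"Climate/Energy", "Land", "Ocean", "General", "Other"},
    "Data Stage": {"raw", "final input", "output"},
    "Extent": {"global", "regional", "national", "local"},
    "Permissions": {"open", "restricted", "secure"},
}

# Columns that must be filled out even for placeholder metadata.
PLACEHOLDER_COLUMNS = ["Folder", "Filename", "Contact", "Summary"]


def check_row(row):
    """Return a list of problems found in one Data Directory row (a dict).

    Hypothetical helper: the name and the dict-based input are assumptions.
    """
    problems = []
    # Treat None and whitespace-only cells as blank.
    filled = {
        col: str(value).strip()
        for col, value in row.items()
        if value is not None and str(value).strip()
    }

    for col in PLACEHOLDER_COLUMNS:
        if col not in filled:
            problems.append(f"{col}: required, even for placeholder rows")

    for col, allowed in DROPDOWNS.items():
        if col in filled and filled[col] not in allowed:
            problems.append(f"{col}: '{filled[col]}' is not a drop-down option")

    if len(filled.get("Description", "").split()) > 5:
        problems.append("Description: more than 5 words")

    if "." in filled.get("Filetype", ""):
        problems.append("Filetype: do not include '.'")

    keywords = [k for k in filled.get("Keywords", "").split(";") if k.strip()]
    if len(keywords) > 5:
        problems.append("Keywords: more than 5 entries")

    for col in ("Start year", "End year"):
        if col in filled and not filled[col].isdigit():
            problems.append(f"{col}: must be numeric")

    return problems
```

An empty list means the row passed every check in the sketch; a placeholder row with only Folder, Filename, Contact, and Summary filled in also passes.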

Any time you add a new dataset to the shared data folder and directory, please message the #data-streamlining Slack channel so that others on the team know about the new dataset.

3.3.2 Project-level Data Directory

We highly recommend that research teams create a data_overview spreadsheet for keeping track of project-related data (i.e. a separate spreadsheet stored in the project’s Google Shared Drive data folder). This centralized document can be used to record project-relevant information and to show team members which datasets have already been saved. It can then guide and simplify data migration to the Data Directory once the project is complete. Suggested attributes include:

File name
Folder name
Source of data
Link where data was downloaded
Description of data
Name of the researcher who downloaded the data
Data directory entry (complete, in progress, not started, etc.)
Metadata sheet (complete, in progress, not started, etc.)
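
For teams that prefer to start the data_overview file from a script, the attributes above could be set up as follows. This is only a sketch: the snake_case column names, the CSV format, and the function names write_overview and pending_migration are assumptions for illustration, not a prescribed layout. The second function lists the files that still need a Data Directory entry or metadata sheet, which is the information needed when migrating data at the end of a project.

```python
import csv

# One column per suggested attribute; the header names are assumptions.
COLUMNS = [
    "file_name",
    "folder_name",
    "source",
    "download_link",
    "description",
    "downloaded_by",
    "data_directory_entry",  # complete, in progress, not started, etc.
    "metadata_sheet",        # complete, in progress, not started, etc.
]


def write_overview(path, rows):
    """Write a data_overview CSV with the header and the given rows (dicts)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        for row in rows:
            writer.writerow(row)  # missing columns are left blank


def pending_migration(path):
    """Return file names whose directory entry or metadata sheet is not complete."""
    with open(path, newline="", encoding="utf-8") as f:
        return [
            row["file_name"]
            for row in csv.DictReader(f)
            if row["data_directory_entry"] != "complete"
            or row["metadata_sheet"] != "complete"
        ]
```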