AAAARRRRRGH!! Metadata herding remains the most frustrating aspect of genomic metaanalyses. Here are some lessons learned from several large ongoing natural variation genomics (meta)analyses I am involved in.

(NB: for now this post is formatted as dot-points as I originally gave this as a talk)

Oh my, the metadata

"… the missing heritability might then in fact be in the metadata." N. Warthmann

  • The main hurdle to data reuse
  • Metadata always a pain
  • No standardisation or enforcement
  • Often an afterthought

Metadata: in a perfect universe

  • Consistent: names, formats
  • Unique: no accidental duplicates
  • Machine readable: excel, csv, NOT PDF
  • Validated: check the above (more later)
  • Computers are idiots: make it easy for them!

Some concise guidelines for healthy metadata

Do this: Not this:
Metadata in a tidy CSV/XLSX Metadata as a Word table in PDF
One individual, one row Grouped entries, hanging rows, etc.
Use native spelling (Tü) Inconsistent Anglicisation (Tu/Tue/Tü)
Names like PA1003, for Pathodopsis Arabaidopsis num. 1003 Names like 1003
Names without exotic punctuation Names like 321-(1)%2
Names excel won’t think are dates/numbers Names like Mar-1, 13E12
Keep dates as text, in ISO8601 Diverse date formats, or excel date datatype (days since XXX)
Keep track of values not parsed/coerced to NA Blindly trust as.numeric() or friends
Use decimal degrees with sign for S/W Use DMS notation for lat/long (37°23"13.72’W)
Use consistent GPS datum (WGS84) Trust your GPS is set right
Record everything in a column, use conditional formatting to colour Encode values only as colour
Trim extra spaces from strings Values like " Arabidopsis thaliana"

Crappy metadata: how to manage it

Extracting from pdf

  • Tabula LaTeX/Word PDF -> CSV
  • ExtractTable.com (PAID) anything -> CSV
  • Will often require some manual post-editing (save the orig docs!)

Consolidating metadata

  • If consolidating multiple forms of metadata, ideally use a script
    • Records exactly what was done
    • Trivial to repeat if you find an error
    • No need for versioning: keep orig & script
  • Make corrections with code if possible
  • Export to a sane format (ideally TSV)
  • Worked example below!!

Validating metadata

  • Use schema to describe “good” values
  • Catches errors early, hard to find by eye
  • Many different tools
    • Gdocs/Excel!
    • R’s library(validate)
    • Python’s import schema
  • Worked example below!!

APPENDIX: Some concrete metadata DOs and DON’Ts

Sample names

  • Make them unique!!!
  • Avoid spaces or exotic punctuation (should match [A-Za-z0-9_-]+)
  • Consistently Anglicise non-ASCII letters
    • e.g. pick ONE of , Tu, or Tue. You might read them the same, but R doesn’t. Ideally keep original, i.e. Tü
  • Avoid naming schemes that stupid programs will interpret as dates or numbers
    • e.g. Mar-1, 12-23, or 11E06
  • Avoid simple numbers (e.g. 1032, a la ecotype ids).
    • 1032 could be anything, but PA1032 is a Pathodopsis Arabidopsis number 1032
  • If joining multiple datasets, add a new ID
  • Zero-pad IDs to at least 4 places (i.e. KM0001 not KM1)
  • If you need to update names, keep a column for the old/external name
    • E.g. translating field IDs (KM0001) to sensible names not knowable before sampling (EmelNSW0005)

Dates

  • Use a consistent format, ideally ISO8601
    • yyyy-mm-dd and yyyy-mm-ddThh:mm:ss+tz
  • Don’t record times if they’re not accurate (e.g. always midnight)
  • When entering in excel/gdocs, force the column and values to be text

Numbers

  • Beware decimal separators (, vs .)
  • In R, use as.numeric.verbose() to show which values are non-numeric while converting.

Missing values

  • Use either NA or blank values to indicate missing values, not other spellings of the same.

Latitude and longitude

  • Always either record the GPS datum, or convert to WGS84
    • CHECK YOUR GPS!
  • Use decimal degrees notation (-32.945, 144.7213)
  • Beware the sign and N/S/E/W
    • Especially in Southern and Western hemispheres

Species names

  • Wherever possible, use names that exactly match the NCBI’s taxonomy database
    • ENA/SRA should help with this!
  • Avoid abbreviations, especially non-standard ones.

Formatting/colour

  • Formatting and colour in excel sheets is fine
  • BUT, never encode data ONLY as colour
  • Use conditional formatting and a column to encode the actual meaning

All string/character columns

  • Check for and trim leading and trailing spaces, or double spaces/tabs within strings
    • "Arabidopsis thaliana" != " Arabidopsis thaliana", which will drive you insane
  • Normalise any funky punctuation, e.g.:
    • em-dashes (–) instead of normal dash/minus (-)
    • different quotes («») and quotes used as apostrophes or vice versa

PDFs

  • Avoid at any cost putting metadata ONLY in a pdf
  • If you must, stick an excel sheet on figshare and link to it in the caption
  • If you can’t do that, at least make sure your PDF can be decoded by e.g. tabula
  • Absolutely do not provide metadata as an image of a table