AAAARRRRRGH!! Metadata herding remains the most frustrating aspect of genomic metaanalyses. Here are some lessons learned from several large ongoing natural variation genomics (meta)analyses I am involved in.
(NB: for now this post is formatted as dot-points as I originally gave this as a talk)
Oh my, the metadata
"… the missing heritability might then in fact be in the metadata." N. Warthmann
- The main hurdle to data reuse
- Metadata always a pain
- No standardisation or enforcement
- Often an afterthought
Metadata: in a perfect universe
- Consistent: names, formats
- Unique: no accidental duplicates
- Machine readable: excel, csv, NOT PDF
- Validated: check the above (more later)
- Computers are idiots: make it easy for them!
Some concise guidelines for healthy metadata
|Do this:||Not this:|
|Metadata in a tidy CSV/XLSX||Metadata as a Word table in PDF|
|One individual, one row||Grouped entries, hanging rows, etc.|
|Use native spelling (Tü)||Inconsistent Anglicisation (Tu/Tue/Tü)|
|Names like PA1003, for Pathodopsis Arabaidopsis num. 1003||Names like 1003|
|Names without exotic punctuation||Names like 321-(1)%2|
|Names excel won’t think are dates/numbers||Names like Mar-1, 13E12|
|Keep dates as text, in ISO8601||Diverse date formats, or excel date datatype (days since XXX)|
|Keep track of values not parsed/coerced to NA||Blindly trust as.numeric() or friends|
|Use decimal degrees with sign for S/W||Use DMS notation for lat/long (37°23"13.72’W)|
|Use consistent GPS datum (WGS84)||Trust your GPS is set right|
|Record everything in a column, use conditional formatting to colour||Encode values only as colour|
|Trim extra spaces from strings||Values like
Crappy metadata: how to manage it
Extracting from pdf
- Tabula LaTeX/Word PDF -> CSV
- ExtractTable.com (PAID) anything -> CSV
- Will often require some manual post-editing (save the orig docs!)
- If consolidating multiple forms of metadata, ideally use a script
- Records exactly what was done
- Trivial to repeat if you find an error
- No need for versioning: keep orig & script
- Make corrections with code if possible
- Export to a sane format (ideally TSV)
- Worked example below!!
- Use schema to describe “good” values
- Catches errors early, hard to find by eye
- Many different tools
- Worked example below!!
APPENDIX: Some concrete metadata DOs and DON’Ts
- Make them unique!!!
- Avoid spaces or exotic punctuation (should match
- Consistently Anglicise non-ASCII letters
- e.g. pick ONE of
Tue. You might read them the same, but R doesn’t. Ideally keep original, i.e. Tü
- e.g. pick ONE of
- Avoid naming schemes that stupid programs will interpret as dates or numbers
- Avoid simple numbers (e.g.
1032, a la ecotype ids).
1032could be anything, but
Arabidopsis number 1032
- If joining multiple datasets, add a new ID
- Zero-pad IDs to at least 4 places (i.e. KM0001 not KM1)
- If you need to update names, keep a column for the old/external name
- E.g. translating field IDs (KM0001) to sensible names not knowable before sampling (EmelNSW0005)
- Use a consistent format, ideally ISO8601
- Don’t record times if they’re not accurate (e.g. always midnight)
- When entering in excel/gdocs, force the column and values to be text
- Beware decimal separators (
- In R, use
as.numeric.verbose()to show which values are non-numeric while converting.
- Use either
NAor blank values to indicate missing values, not other spellings of the same.
Latitude and longitude
- Always either record the GPS datum, or convert to WGS84
- CHECK YOUR GPS!
- Use decimal degrees notation (-32.945, 144.7213)
- Beware the sign and N/S/E/W
- Especially in Southern and Western hemispheres
- Wherever possible, use names that exactly match the NCBI’s taxonomy database
- ENA/SRA should help with this!
- Avoid abbreviations, especially non-standard ones.
- Formatting and colour in excel sheets is fine
- BUT, never encode data ONLY as colour
- Use conditional formatting and a column to encode the actual meaning
All string/character columns
- Check for and trim leading and trailing spaces, or double spaces/tabs within strings
"Arabidopsis thaliana" != " Arabidopsis thaliana", which will drive you insane
- Normalise any funky punctuation, e.g.:
- em-dashes (–) instead of normal dash/minus (-)
- different quotes («») and quotes used as apostrophes or vice versa
- Avoid at any cost putting metadata ONLY in a pdf
- If you must, stick an excel sheet on figshare and link to it in the caption
- If you can’t do that, at least make sure your PDF can be decoded by e.g. tabula
- Absolutely do not provide metadata as an image of a table