AAAARRRRRGH!! Metadata herding remains the most frustrating aspect of genomic metaanalyses. Here are some lessons learned from several large ongoing natural variation genomics (meta)analyses I am involved in.
(NB: for now this post is formatted as dot-points as I originally gave this as a talk)
Oh my, the metadata
"… the missing heritability might then in fact be in the metadata." N. Warthmann
- The main hurdle to data reuse
- Metadata always a pain
- No standardisation or enforcement
- Often an afterthought
Metadata: in a perfect universe
- Consistent: names, formats
- Unique: no accidental duplicates
- Machine readable: excel, csv, NOT PDF
- Validated: check the above (more later)
- Computers are idiots: make it easy for them!
Some concise guidelines for healthy metadata
Do this: | Not this: |
---|---|
Metadata in a tidy CSV/XLSX | Metadata as a Word table in PDF |
One individual, one row | Grouped entries, hanging rows, etc. |
Use native spelling (Tü) | Inconsistent Anglicisation (Tu/Tue/Tü) |
Names like PA1003, for Pathodopsis Arabaidopsis num. 1003 | Names like 1003 |
Names without exotic punctuation | Names like 321-(1)%2 |
Names excel won’t think are dates/numbers | Names like Mar-1, 13E12 |
Keep dates as text, in ISO8601 | Diverse date formats, or excel date datatype (days since XXX) |
Keep track of values not parsed/coerced to NA | Blindly trust as.numeric() or friends |
Use decimal degrees with sign for S/W | Use DMS notation for lat/long (37°23"13.72’W) |
Use consistent GPS datum (WGS84) | Trust your GPS is set right |
Record everything in a column, use conditional formatting to colour | Encode values only as colour |
Trim extra spaces from strings | Values like " Arabidopsis thaliana" |
Crappy metadata: how to manage it
Extracting from pdf
- Tabula LaTeX/Word PDF -> CSV
- ExtractTable.com (PAID) anything -> CSV
- Will often require some manual post-editing (save the orig docs!)
Consolidating metadata
- If consolidating multiple forms of metadata, ideally use a script
- Records exactly what was done
- Trivial to repeat if you find an error
- No need for versioning: keep orig & script
- Make corrections with code if possible
- Export to a sane format (ideally TSV)
- Worked example below!!
Validating metadata
- Use schema to describe “good” values
- Catches errors early, hard to find by eye
- Many different tools
- Gdocs/Excel!
- R’s
library(validate)
- Python’s
import schema
- Worked example below!!
APPENDIX: Some concrete metadata DOs and DON’Ts
Sample names
- Make them unique!!!
- Avoid spaces or exotic punctuation (should match
[A-Za-z0-9_-]+
) - Consistently Anglicise non-ASCII letters
- e.g. pick ONE of
Tü
,Tu
, orTue
. You might read them the same, but R doesn’t. Ideally keep original, i.e. Tü
- e.g. pick ONE of
- Avoid naming schemes that stupid programs will interpret as dates or numbers
- e.g.
Mar-1
,12-23
, or11E06
- e.g.
- Avoid simple numbers (e.g.
1032
, a la ecotype ids).1032
could be anything, butPA1032
is aP
athodopsisA
rabidopsis number 1032
- If joining multiple datasets, add a new ID
- Zero-pad IDs to at least 4 places (i.e. KM0001 not KM1)
- If you need to update names, keep a column for the old/external name
- E.g. translating field IDs (KM0001) to sensible names not knowable before sampling (EmelNSW0005)
Dates
- Use a consistent format, ideally ISO8601
yyyy-mm-dd
andyyyy-mm-ddThh:mm:ss+tz
- Don’t record times if they’re not accurate (e.g. always midnight)
- When entering in excel/gdocs, force the column and values to be text
Numbers
- Beware decimal separators (
,
vs.
) - In R, use
as.numeric.verbose()
to show which values are non-numeric while converting.
Missing values
- Use either
NA
or blank values to indicate missing values, not other spellings of the same.
Latitude and longitude
- Always either record the GPS datum, or convert to WGS84
- CHECK YOUR GPS!
- Use decimal degrees notation (-32.945, 144.7213)
- Beware the sign and N/S/E/W
- Especially in Southern and Western hemispheres
Species names
- Wherever possible, use names that exactly match the NCBI’s taxonomy database
- ENA/SRA should help with this!
- Avoid abbreviations, especially non-standard ones.
Formatting/colour
- Formatting and colour in excel sheets is fine
- BUT, never encode data ONLY as colour
- Use conditional formatting and a column to encode the actual meaning
All string/character columns
- Check for and trim leading and trailing spaces, or double spaces/tabs within strings
"Arabidopsis thaliana" != " Arabidopsis thaliana"
, which will drive you insane
- Normalise any funky punctuation, e.g.:
- em-dashes (–) instead of normal dash/minus (-)
- different quotes («») and quotes used as apostrophes or vice versa
PDFs
- Avoid at any cost putting metadata ONLY in a pdf
- If you must, stick an excel sheet on figshare and link to it in the caption
- If you can’t do that, at least make sure your PDF can be decoded by e.g. tabula
- Absolutely do not provide metadata as an image of a table