Kraken2
Input
Download an example
We can use a kraken report from bracken's sample data
curl https://raw.githubusercontent.com/jenniferlu717/Bracken/master/sample_data/sample_output_bracken.report -o kreport.txt
Format report
We can use taxpasta to convert kraken reports (and others) to a format compatible with mess
Sort and filter taxids
Sort by descending read count, filter zero counts and unclassified taxids with csvtk
cat sample.tsv \
    | csvtk filter2 -t -f '$taxonomy_id > 0 && $count > 0' \
    | csvtk rename -t -f 1,2 -n taxon,reads \
    | csvtk sort -t -k reads:nr > filtered.tsv
Update taxids
Update taxids and add their lineage with taxonkit
csvtk cut -t -U -f taxon filtered.tsv \
    | taxonkit lineage --show-status-code 2> tax.log \
    | csvtk -t add-header -n old_taxid,updated_taxid,lineage 1> lineage.tsv
$ cat tax.log
17:07:59.845 [WARN] taxid 86662 was merged into 1405
17:07:59.845 [WARN] taxid 62928 was merged into 418699
17:07:59.845 [WARN] taxid 1172 was merged into 264691
Join updated taxids with read counts
csvtk join -t filtered.tsv lineage.tsv -f 1 \
    | csvtk cut -t -f updated_taxid,reads \
    | csvtk rename -t -f 1,2 -n taxon,reads > updated.tsv
Replace unclassified taxids
Failure
Some taxids have no genome data available !
For example, 60481, 164757 and 189918 correspond to unclassified Shewanella sp. MR-7, Mycobacterium sp. JLS and Mycobacterium sp. KMS, repsectively.
We can replace unclassified species with other species from the same genus.
We also add the nb column to tell mess to download one genome per taxon.
cat updated.tsv \
    | csvtk replace -t -f taxon -p "60481" -r "211586" \
    | csvtk replace -t -f taxon -p "164757" -r "83332" \
    | csvtk replace -t -f taxon -p "189918" -r "557599" \
    > clean.tsv
Subsample
For a quick run (~ 5 min with 10 threads), we can subset the top 10 most abundant taxa for our simulation.
subsample.tsv
| taxon | reads | 
|---|---|
| 562 | 1449813 | 
| 1396 | 1288824 | 
| 1280 | 1212241 | 
| 1076 | 824296 | 
| 83332 | 699935 | 
| 148447 | 656706 | 
| 132919 | 622314 | 
| 272131 | 590648 | 
| 303 | 515978 | 
| 32046 | 466493 | 
Simulate
Now that we cleaned up and subsampled bracken's taxonomic profile we can use it for mess run.
By default, mess downloads 1 reference genome per taxon and excludes atypical genomes (see download guide. ) For read simulation, the default parameters generate paired 150 bp illumina reads using art_illumina's HiSeq 2500 error profile.
Using the --bam flag yields an alignment file and a ground truth taxonomic profile. 
Output
fastq
$ seqkit stats mess_out/fastq/*.fq.gz --all
processed files:  2 / 2 [======================================] ETA: 0s. done
file                               format  type   num_seqs        sum_len  min_len  avg_len  max_len   Q1   Q2   Q3  sum_gap  N50  Q20(%)  Q30(%)  AvgQual  GC(%)
mess_out/fastq/subsample_R1.fq.gz  FASTQ   DNA   8,326,992  1,249,048,800      150      150      150  150  150  150        0  150   98.01   91.67    27.79  50.76
mess_out/fastq/subsample_R2.fq.gz  FASTQ   DNA   8,326,992  1,249,048,800      150      150      150  150  150  150        0  150   97.31   89.64    26.51  50.76
- Total read count correspond to the one requested in the input (256 reads difference)
 
Taxonomic profile
Kraken2 composition
| taxon | reads | seq_abundance | 
|---|---|---|
| 562 | 1449813 | 0.17410469821482438 | 
| 1396 | 1288824 | 0.15477190063271803 | 
| 1280 | 1212241 | 0.14557522485219607 | 
| 1076 | 824296 | 0.09898780485461704 | 
| 83332 | 699935 | 0.08405357928573762 | 
| 148447 | 656706 | 0.07886230841209485 | 
| 132919 | 622314 | 0.07473225248005103 | 
| 272131 | 590648 | 0.07092955559868037 | 
| 303 | 515978 | 0.06196260757455525 | 
| 32046 | 466493 | 0.056020068094525345 | 
Simulated composition
I calculated sequence abundances from true read counts in logs/coverage/subsample.tsv, and added each genomes taxid at species level (with taxonkit reformat --show-lineage-taxids).
| taxon | reads | seq_abundance | 
|---|---|---|
| 562 | 1449799 | 0.1741083695048584 | 
| 1396 | 1288786 | 0.1547720953736955 | 
| 1280 | 1212160 | 0.1455699729265982 | 
| 1076 | 824275 | 0.0989883261566721 | 
| 1773 | 699930 | 0.0840555629211604 | 
| 148447 | 656686 | 0.0788623310794582 | 
| 132919 | 622291 | 0.0747317879013213 | 
| 272131 | 590622 | 0.0709286138379861 | 
| 303 | 515970 | 0.0619635517843658 | 
| 32046 | 466473 | 0.0560193885138835 | 
I used a wilcoxon test to compare abundances
>>> from scipy.stats import wilcoxon
>>> delta = abs(real_df["seq_abundance"] - sim_df["seq_abundance"])
>>> wilcoxon(delta)
WilcoxonResult(statistic=24.0, pvalue=0.76953125)
- Non significant difference in sequence abundance between the simulated and real sample (mean delta = \({1.47}^{-06}\), \(p=0.77\))