Kraken2

Input

Download an example

We can use a kraken report from bracken's sample data

curl https://raw.githubusercontent.com/jenniferlu717/Bracken/master/sample_data/sample_output_bracken.report -o kreport.txt

Format report

We can use taxpasta to convert kraken reports (and others) to a format compatible with mess

taxpasta standardise --profiler kraken2 --output sample.tsv kreport.txt

Sort and filter taxids

Sort by descending read count, filter zero counts and unclassified taxids with csvtk

cat sample.tsv \
    | csvtk filter2 -t -f '$taxonomy_id > 0 && $count > 0' \
    | csvtk rename -t -f 1,2 -n taxon,reads \
    | csvtk sort -t -k reads:nr > filtered.tsv

Update taxids

Update taxids and add their lineage with taxonkit

csvtk cut -t -U -f taxon filtered.tsv \
    | taxonkit lineage --show-status-code 2> tax.log \
    | csvtk -t add-header -n old_taxid,updated_taxid,lineage 1> lineage.tsv

Checking the tax.log (as of may 2024)

$ cat tax.log
17:07:59.845 [WARN] taxid 86662 was merged into 1405
17:07:59.845 [WARN] taxid 62928 was merged into 418699
17:07:59.845 [WARN] taxid 1172 was merged into 264691

Join updated taxids with read counts

csvtk join -t filtered.tsv lineage.tsv -f 1 \
    | csvtk cut -t -f updated_taxid,reads \
    | csvtk rename -t -f 1,2 -n taxon,reads > updated.tsv

Replace unclassified taxids

Failure

Some taxids have no genome data available !

For example, 60481, 164757 and 189918 correspond to unclassified Shewanella sp. MR-7, Mycobacterium sp. JLS and Mycobacterium sp. KMS, repsectively.

We can replace unclassified species with other species from the same genus.

We also add the nb column to tell mess to download one genome per taxon.

cat updated.tsv \
    | csvtk replace -t -f taxon -p "60481" -r "211586" \
    | csvtk replace -t -f taxon -p "164757" -r "83332" \
    | csvtk replace -t -f taxon -p "189918" -r "557599" \
    > clean.tsv

Subsample

For a quick run (~ 5 min with 10 threads), we can subset the top 10 most abundant taxa for our simulation.

csvtk head -t -n 10 clean.tsv > subsample.tsv

subsample.tsv

taxon	reads
562	1449813
1396	1288824
1280	1212241
1076	824296
83332	699935
148447	656706
132919	622314
272131	590648
303	515978
32046	466493

Simulate

Now that we cleaned up and subsampled bracken's taxonomic profile we can use it for mess run.

By default, mess downloads 1 reference genome per taxon and excludes atypical genomes (see download guide. ) For read simulation, the default parameters generate paired 150 bp illumina reads using art_illumina's HiSeq 2500 error profile.

Using the --bam flag yields an alignment file and a ground truth taxonomic profile.

mess run -i subsample.tsv --bam --threads 10

Output

fastq

$ seqkit stats mess_out/fastq/*.fq.gz --all
processed files:  2 / 2 [======================================] ETA: 0s. done
file                               format  type   num_seqs        sum_len  min_len  avg_len  max_len   Q1   Q2   Q3  sum_gap  N50  Q20(%)  Q30(%)  AvgQual  GC(%)
mess_out/fastq/subsample_R1.fq.gz  FASTQ   DNA   8,326,992  1,249,048,800      150      150      150  150  150  150        0  150   98.01   91.67    27.79  50.76
mess_out/fastq/subsample_R2.fq.gz  FASTQ   DNA   8,326,992  1,249,048,800      150      150      150  150  150  150        0  150   97.31   89.64    26.51  50.76

Total read count correspond to the one requested in the input (256 reads difference)

Taxonomic profile

Kraken2 composition

taxon	reads	seq_abundance
562	1449813	0.17410469821482438
1396	1288824	0.15477190063271803
1280	1212241	0.14557522485219607
1076	824296	0.09898780485461704
83332	699935	0.08405357928573762
148447	656706	0.07886230841209485
132919	622314	0.07473225248005103
272131	590648	0.07092955559868037
303	515978	0.06196260757455525
32046	466493	0.056020068094525345

Simulated composition

I calculated sequence abundances from true read counts in logs/coverage/subsample.tsv, and added each genomes taxid at species level (with taxonkit reformat --show-lineage-taxids).

taxon	reads	seq_abundance
562	1449799	0.1741083695048584
1396	1288786	0.1547720953736955
1280	1212160	0.1455699729265982
1076	824275	0.0989883261566721
1773	699930	0.0840555629211604
148447	656686	0.0788623310794582
132919	622291	0.0747317879013213
272131	590622	0.0709286138379861
303	515970	0.0619635517843658
32046	466473	0.0560193885138835

I used a wilcoxon test to compare abundances

>>> from scipy.stats import wilcoxon
>>> delta = abs(real_df["seq_abundance"] - sim_df["seq_abundance"])
>>> wilcoxon(delta)
WilcoxonResult(statistic=24.0, pvalue=0.76953125)

Non significant difference in sequence abundance between the simulated and real sample (mean delta = \({1.47}^{-06}\), \(p=0.77\))