a) Fasta processing
Before generating reads, mess
pre-processes fastas for read simulators as shown below:
graph LR
A[genome_asmV3.fna.gz] --> B{seqkit
seq};
B--> C[genome.fna];
C --> D{split_contigs.py};
D --> E[genome_contig1.fna];
D --> F[genome_contig2.fna];
D --> G[genome_contig3.fna];
seqkit seq
File renaming
Files are renamed according to the fasta
or accession
column in the input file or assembly_summary.tsv file.
Done to avoid Snakemake filename matching erros.
Decompressing
Done because read simulators take decompressed fasta as input
Removing long fasta headers
Done to ensure correct BAM formatting for samtools
split_contigs.py
Correct amount of sequences
Art_illumina generates the same amount of reads per contig (Milhaven and Pfeifer, 2023). This means that, for fragmented assemblies, read sampling will not be uniform across contigs.
In fact, when we use art_illumina --fcov 10
on this slightly fragmented E.coli genome, we expect 51534530 simulated bases (10x genome size). However we obtain, 51251100, which less than requested. Thus, to uniformly sample reads acros contigs, we split the genome and apply the same art_illumina command for each contig.
Compatibilty with pbsim3
Since pbsim3 already splits fasta, inputting split fasta is faster.