Step-By-Step Guide¶
This workflow will walk you through the steps needed to go from getting your data back from a sequencing facility to having a profiled list of genomes. Once the profiles are preprocessed, you will be able to choose whether to run them against the prokaryote database, eukaryote database, substrate database, or all three and then combine the results.
As an example, we can use reads from a sample of pikliz, a Haitian ferment with cabbage, carrots, bell peppers and Scotch bonnet peppers, produced in Montana, USA.
Pre-processing¶
Expected input: Raw paired end reads (ex. ending in R1.fastq.gz and R2.fastq.gz)
Expected output: Trimmed paired end reads (ex. ending in _1.trim.fastq.gz and _2.trim.fastq.gz)
1. Download the raw reads from Zenodo or directly through these links:
2. Perform FastQC
Use FastQC to perform quality control checks on raw sequence data. We recommend setting up a conda environment:
$ conda create -n preprocess_env python=3.8
$ conda activate preprocess_env
$ conda install fastqc
Finally, create your output_directory and run fastqc on your reads:
$ fastqc EBC_087_S160_L003_R1.fastq.gz EBC_087_S160_L003_R2.fastq.gz -o EBC_087_output_directory
this process should take ~5-7min/sample
The output for this step includes .html files with QC information for each read, and .fastqc.zip files with basic sequencing statistics.
For more FastQC information, visit their website.
3. Perform BBTools
Use BBTools to trim any sequencing adapters so that you are left with just reads from your original sample and remove any potential contaminating human genomes (this is less of a problem with fermented foods, but a huge deal when collecting human stool samples packed with the donors DNA) by using:
bbduk: for the trimming and filtering of adapters and contaminants in your reads
repair: when starting with pair-end reads, to fix paired read files that became disordered
bbmap: removed human reads
First, install BBtools following instructions on their installation guide. Make sure to make bbmap executable by adding the path to your ~/.bashrc.
Next, run:
$ bbduk.sh in1=EBC_087_S160_L003_R1.fastq.gz in2=EBC_087_S160_L003_R2.fastq.gz out1=EBC_087_bbduk_1.fastq.gz out2=EBC_087_bbduk_2.fastq.gz ref=$ADAPTERS ktrim=r k=23 mink=11 hdist=1 tpe tbo &> EBC_087.bbduk.log
$ repair.sh in=EBC_087_bbduk_1.fastq.gz in2=EBC_087_bbduk_2.fastq.gz out=EBC_087_repair_1.fastq.gz out2=EBC_087_repair_2.fastq.gz
Finally, prepare the human reference genome, and then run bbmap:
$ bbmap.sh ref=hg38.fa
$ bbmap.sh in=EBC_087_repair_1.fastq.gz in2=EBC_087_repair_2.fastq.gz out=EBC_087_trim_1.fastq.gz out2=EBC_087_trim_2.fastq.gz ref=hg38.fa nodisk
4. Save output to processed_reads directory
Profiling¶
Expected input: Trimmed pair-end reads (in your processed_reads directory. Should end in R1.fastq.gz and R2.fastq.gz)
Expected output: .IS file output for each sample with genome profiling results
1. Set up your environment
$ mamba create -n instrain_env -c bioconda instrain
$ conda activate instrain_env
Install instrain
$ pip install instrain
Install samtools to create your .bam file from the reference .fasta
$ conda install -c bioconda bowtie2 samtools
For more information on installation, visit inStrain or Bowtie2 https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
2. Download the reference databases
For each database (prokaryote, eukaryote, or substrate), download the .fasta and .stb file.
for prokaryote, make sure to also download the .genes file
3. Make your .bam file
You only have to do this once for each database version. Make sure to always use the .bam file made from the same version of the database .fasta file.
$ bowtie2-build MiFoDB_beta_v2_prok.fasta MiFoDB_prok_v2_index
$ bowtie2 -x MiFoDB_prok_v2_index -1 EBC_087_S160_L003_R1.fastq.gz -2 EBC_087_S160_L003_R2.fastq.gz -S EBC_087_aligned_reads.sam
Finally, you will need to convert the SAM file to a BAM file and index the sorted BAM
$ samtools view -Sb EBC_087_aligned_reads.sam > EBC_087_aligned_reads.bam
$ samtools sort EBC_087_aligned_reads.bam -o EBC_087_sort_aligned_reads.bam
$ samtools index EBC_087_sort_aligned_reads.bam
$ bowtie2 -p 10 -x MiFoDB_prok_v2_index -1 EBC_087_S160_L003_R1.fastq.gz -2 EBC_087_S160_L003_R2.fastq.gz -S EBC_087_aligned_reads.sam) 2>bowtie2.EBC_087.log
3. Run inStrain
Now that you have your .bam, .fasta, .stb files and inStrain installed, you can run inStrain profile
$ inStrain profile EBC_087_sort_aligned_reads.bam MiFoDB_beta_v2_prok.fasta -o EBC_087.IS -p 6 -g genesfile.fasta --stb_file MiFoDB_beta_v2_prok.stb --genes_file MiFoDB_beta_v2_prok.genes.fna --instrain_profile_args --database_mode
The output will be a .IS file, with a number of .tsv file. We will be most interested in genome_info.tsv (example below), which includes all mapping information. For interpretation and analysis, see example output.
genome |
coverage |
breadth |
nucl_diversity |
length |
true_scaffolds |
detected_scaffolds |
coverage_median |
coverage_std |
coverage_SEM |
breadth_minCov |
breadth_expected |
nucl_diversity_rarefied |
conANI_reference |
popANI_reference |
iRep |
iRep_GC_corrected |
linked_SNV_count |
SNV_distance_mean |
r2_mean |
d_prime_mean |
consensus_divergent_sites |
population_divergent_sites |
SNS_count |
SNV_count |
filtered_read_pair_count |
reads_unfiltered_pairs |
reads_mean_PID |
reads_unfiltered_reads |
divergent_site_count |
|||||
C-03.Ssa-BR.fna |
1.686020547 |
0.049164091 |
0.004595774 |
1896140 |
182 |
86 |
0 |
69.19478668 |
0.050739639 |
0.011300326 |
0.774346839 |
0.000140703 |
0.986372334 |
0.988145797 |
0.981642137 |
36199 |
417 |
FALSE |
242 |
39.69008264 |
0.951699521 |
0.999845137 |
292 |
254 |
252 |
165 |
15171 |
15417 |
0.981642137 |
36199 |
417 |
292 |
254 |
|
EBC_086.5.fna |
1.596317454 |
0.049848898 |
0.006035971 |
2377866 |
79 |
52 |
0 |
19.94120243 |
0.012974942 |
0.028909535 |
0.755746415 |
0.002048653 |
0.979081506 |
0.984682077 |
0.969968582 |
48221 |
1865 |
FALSE |
1337 |
56.69334331 |
0.637899 |
652 |
0.9941014 |
1438 |
36199 |
417 |
292 |
254 |
||||||
FS03_2016_noduplicates_bin.6.fna |
1.191514863 |
0.041940437 |
0.004574618 |
2543035 |
344 |
186 |
0 |
21.96261861 |
0.013962518 |
0.008234649 |
0.650799011 |
0.001974379 |
0.966286233 |
0.96981997 |
FALSE |
393 |
68.18320611 |
0.596979301 |
0.989440015 |
706 |
632 |
628 |
185 |
14188 |
15687 |
0.965486302 |
39649 |
813 |