Step-By-Step Guide

This workflow will walk you through the steps needed to go from getting your data back from a sequencing facility to having a profiled list of genomes. Once the profiles are preprocessed, you will be able to choose whether to run them against the prokaryote database, eukaryote database, substrate database, or all three and then combine the results.

As an example, we can use reads from a sample of pikliz, a Haitian ferment with cabbage, carrots, bell peppers and Scotch bonnet peppers, produced in Montana, USA.

Pre-processing

Expected input: Raw paired end reads (ex. ending in R1.fastq.gz and R2.fastq.gz)

Expected output: Trimmed paired end reads (ex. ending in _1.trim.fastq.gz and _2.trim.fastq.gz)

1. Download the raw reads from Zenodo or directly through these links:

EBC_087_S160_L003_R1

EBC_087_S160_L003_R2

2. Perform FastQC

Use FastQC to perform quality control checks on raw sequence data. We recommend setting up a conda environment:

$  conda create -n preprocess_env python=3.8
$  conda activate preprocess_env

$  conda install fastqc

Finally, create your output_directory and run fastqc on your reads:

$  fastqc EBC_087_S160_L003_R1.fastq.gz EBC_087_S160_L003_R2.fastq.gz -o EBC_087_output_directory

this process should take ~5-7min/sample

The output for this step includes .html files with QC information for each read, and .fastqc.zip files with basic sequencing statistics.

For more FastQC information, visit their website.

3. Perform BBTools

Use BBTools to trim any sequencing adapters so that you are left with just reads from your original sample and remove any potential contaminating human genomes (this is less of a problem with fermented foods, but a huge deal when collecting human stool samples packed with the donors DNA) by using:

  • bbduk: for the trimming and filtering of adapters and contaminants in your reads

  • repair: when starting with pair-end reads, to fix paired read files that became disordered

  • bbmap: removed human reads

First, install BBtools following instructions on their installation guide. Make sure to make bbmap executable by adding the path to your ~/.bashrc.

Next, run:

$  bbduk.sh in1=EBC_087_S160_L003_R1.fastq.gz in2=EBC_087_S160_L003_R2.fastq.gz out1=EBC_087_bbduk_1.fastq.gz out2=EBC_087_bbduk_2.fastq.gz ref=$ADAPTERS ktrim=r k=23 mink=11 hdist=1 tpe tbo &> EBC_087.bbduk.log

$  repair.sh in=EBC_087_bbduk_1.fastq.gz in2=EBC_087_bbduk_2.fastq.gz out=EBC_087_repair_1.fastq.gz out2=EBC_087_repair_2.fastq.gz

Finally, prepare the human reference genome, and then run bbmap:

$  bbmap.sh ref=hg38.fa
$  bbmap.sh in=EBC_087_repair_1.fastq.gz in2=EBC_087_repair_2.fastq.gz out=EBC_087_trim_1.fastq.gz out2=EBC_087_trim_2.fastq.gz ref=hg38.fa nodisk

4. Save output to processed_reads directory

Profiling

Expected input: Trimmed pair-end reads (in your processed_reads directory. Should end in R1.fastq.gz and R2.fastq.gz)

Expected output: .IS file output for each sample with genome profiling results

1. Set up your environment

$  mamba create -n instrain_env -c bioconda instrain
$  conda activate instrain_env

Install instrain

$  pip install instrain

Install samtools to create your .bam file from the reference .fasta

$  conda install -c bioconda bowtie2 samtools

For more information on installation, visit inStrain or Bowtie2 https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml

2. Download the reference databases

For each database (prokaryote, eukaryote, or substrate), download the .fasta and .stb file.

for prokaryote, make sure to also download the .genes file

3. Make your .bam file

You only have to do this once for each database version. Make sure to always use the .bam file made from the same version of the database .fasta file.

$  bowtie2-build MiFoDB_beta_v2_prok.fasta MiFoDB_prok_v2_index
$  bowtie2 -x MiFoDB_prok_v2_index -1 EBC_087_S160_L003_R1.fastq.gz -2 EBC_087_S160_L003_R2.fastq.gz -S EBC_087_aligned_reads.sam

Finally, you will need to convert the SAM file to a BAM file and index the sorted BAM

$  samtools view -Sb EBC_087_aligned_reads.sam > EBC_087_aligned_reads.bam
$  samtools sort EBC_087_aligned_reads.bam -o EBC_087_sort_aligned_reads.bam
$  samtools index EBC_087_sort_aligned_reads.bam
$  bowtie2 -p 10 -x MiFoDB_prok_v2_index -1 EBC_087_S160_L003_R1.fastq.gz -2 EBC_087_S160_L003_R2.fastq.gz -S EBC_087_aligned_reads.sam)  2>bowtie2.EBC_087.log

3. Run inStrain

Now that you have your .bam, .fasta, .stb files and inStrain installed, you can run inStrain profile

$  inStrain profile EBC_087_sort_aligned_reads.bam MiFoDB_beta_v2_prok.fasta -o EBC_087.IS -p 6 -g genesfile.fasta --stb_file MiFoDB_beta_v2_prok.stb --genes_file MiFoDB_beta_v2_prok.genes.fna --instrain_profile_args --database_mode

The output will be a .IS file, with a number of .tsv file. We will be most interested in genome_info.tsv (example below), which includes all mapping information. For interpretation and analysis, see example output.

genome_info.tsv

genome

coverage

breadth

nucl_diversity

length

true_scaffolds

detected_scaffolds

coverage_median

coverage_std

coverage_SEM

breadth_minCov

breadth_expected

nucl_diversity_rarefied

conANI_reference

popANI_reference

iRep

iRep_GC_corrected

linked_SNV_count

SNV_distance_mean

r2_mean

d_prime_mean

consensus_divergent_sites

population_divergent_sites

SNS_count

SNV_count

filtered_read_pair_count

reads_unfiltered_pairs

reads_mean_PID

reads_unfiltered_reads

divergent_site_count

C-03.Ssa-BR.fna

1.686020547

0.049164091

0.004595774

1896140

182

86

0

69.19478668

0.050739639

0.011300326

0.774346839

0.000140703

0.986372334

0.988145797

0.981642137

36199

417

FALSE

242

39.69008264

0.951699521

0.999845137

292

254

252

165

15171

15417

0.981642137

36199

417

292

254

EBC_086.5.fna

1.596317454

0.049848898

0.006035971

2377866

79

52

0

19.94120243

0.012974942

0.028909535

0.755746415

0.002048653

0.979081506

0.984682077

0.969968582

48221

1865

FALSE

1337

56.69334331

0.637899

652

0.9941014

1438

36199

417

292

254

FS03_2016_noduplicates_bin.6.fna

1.191514863

0.041940437

0.004574618

2543035

344

186

0

21.96261861

0.013962518

0.008234649

0.650799011

0.001974379

0.966286233

0.96981997

FALSE

393

68.18320611

0.596979301

0.989440015

706

632

628

185

14188

15687

0.965486302

39649

813