Quick Start

The simplest way to run MiFoDB is with nextflow, a workflow management system. Matt Olm has developed a nextflow pipeline for genome-resolved metagenomics that includes all steps outlined in the step-by-step guide.

To run the pipeline, first install nextflow by following the outlined instructions. Instructions are also available here.

We can use reads from a sample of pikliz, a Haitian ferment with cabbage, carrots, bell peppers and Scotch bonnet peppers, produced in Montana, USA.

Pre-processing with nextflow

1. Download your reads

2. Prepare your input file: this will include 4 columns called sample, fastq_1, fastq_2, and single_end. In this case, single_end is left blank.

basicInfo_v1.csv

sample,fastq_1,fastq_2,single_end
EBC_087,/location/of/your/file/raw_reads/EBC_087_S160_L003_R1.fq.gz,/location/of/your/file/raw_reads/EBC_087_S160_L003_R2.fq.gz,

If you are pre-processing multiple samples at once, add one sample per line.
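With many samples, writing the sheet by hand is error-prone. The following is a minimal sketch of generating it from a directory of paired-end reads. The function name `make_sample_sheet` is hypothetical, and the sketch assumes files are named `<prefix>_R1<suffix>` and `<prefix>_R2<suffix>` and that the sample name is the prefix with an Illumina-style `_S<number>_L<number>` tail removed, as in the EBC_087 example. Adjust both assumptions to your own naming scheme.

```shell
#!/usr/bin/env bash
# Sketch: build a four-column sample sheet from paired-end reads in one directory.
# Assumptions (not from the pipeline itself): mates are named <prefix>_R1<suffix>
# and <prefix>_R2<suffix>; the sample name is <prefix> without a trailing _S<n>_L<n>.
make_sample_sheet() {
    local reads_dir="$1"
    local suffix="${2:-.fq.gz}"
    local r1 r2 tail prefix sample
    tail="_R1$suffix"
    echo "sample,fastq_1,fastq_2,single_end"
    for r1 in "$reads_dir"/*"$tail"; do
        [ -e "$r1" ] || continue              # nothing matched the pattern
        r2="${r1%"$tail"}_R2$suffix"          # expected path of the mate file
        if [ ! -e "$r2" ]; then
            echo "warning: no R2 mate for $r1, skipping" >&2
            continue
        fi
        prefix="$(basename "$r1" "$tail")"
        sample="$(printf '%s\n' "$prefix" | sed -E 's/_S[0-9]+_L[0-9]+$//')"
        echo "$sample,$r1,$r2,"               # single_end is left blank
    done
}

# Example call (pass an absolute directory so the sheet holds absolute paths):
# make_sample_sheet /location/of/your/file/raw_reads > basicInfo_v1.csv
```

The optional second argument sets the file suffix, so the same function could be pointed at trimmed reads by passing `.trim.fastq.gz`.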

3. Run preprocessing: once nextflow is installed and basicInfo_v1.csv points to the correct location of your reads, run:

$ nextflow run https://github.com/MrOlm/nf-genomeresolvedmetagenomics -entry PREPROCESSREADS --input basicInfo_v1.csv -with-report v1 --outdir results_v1/

The trimmed reads will be written to results_v1/fastp/*.trim.fastq.gz.

Read metrics will be written to results_v1/basicinfo/basic_info_final.csv.

Tip

Common errors include pointing to the wrong file location and accidentally leaving empty lines in your .csv file.

Additionally, depending on how the .csv file was created, a byte order mark (BOM) might need to be removed before running. To do that, run the following in your terminal from the directory containing basicInfo_v1.csv:

$ vi -c ":set nobomb" -c ":wq" basicInfo_v1.csv
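The three problems named in this tip (a BOM, empty lines, and wrong file locations) can be checked before launching nextflow. The sketch below is illustrative only: `check_sample_sheet` is a hypothetical helper, not part of the pipeline, and it assumes a plain comma-separated sheet with the read paths in columns 2 and 3 and no commas inside paths.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a sample sheet. Returns 0 if no problem was found, 1 otherwise.
check_sample_sheet() {
    local sheet="$1"
    local status=0
    local missing

    # A UTF-8 byte order mark is the three bytes EF BB BF at the start of the file.
    if [ "$(head -c 3 "$sheet" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]; then
        echo "BOM found at the start of $sheet" >&2
        status=1
    fi

    # Empty (or whitespace-only) lines.
    if grep -qE '^[[:space:]]*$' "$sheet"; then
        echo "empty line(s) found in $sheet" >&2
        status=1
    fi

    # Every non-blank fastq_1 / fastq_2 entry below the header should exist on disk.
    missing="$(tail -n +2 "$sheet" | tr -d '\r' |
        while IFS=, read -r sample fq1 fq2 rest || [ -n "$sample" ]; do
            for f in "$fq1" "$fq2"; do
                [ -z "$f" ] || [ -e "$f" ] || echo "$f"
            done
        done)"
    if [ -n "$missing" ]; then
        echo "file(s) listed in $sheet but not found:" >&2
        echo "$missing" >&2
        status=1
    fi

    return "$status"
}

# Example call:
# check_sample_sheet basicInfo_v1.csv && echo "sheet looks fine"
```

The check only reports problems; removing a BOM is still done with the vi command above.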

`More on runtime info <https://mifodb.readthedocs.io/en/latest/overview.html>`_

Profile with nextflow

1. Prepare your input file: make a new input file that points to the processed reads you just generated. It has the same 4 columns: sample, fastq_1, fastq_2, and single_end. In this case, single_end is left blank.

preprocessingInfo_v1.csv

sample,fastq_1,fastq_2,single_end
EBC_087,/location/of/your/trimmed_file/EBC_087_S160_L003_R1.trim.fastq.gz,/location/of/your/trimmed_file/EBC_087_S160_L003_R2.trim.fastq.gz,

As before, you may need to remove a byte order mark:

$ vi -c ":set nobomb" -c ":wq" preprocessingInfo_v1.csv

2. Run profile: MiFoDB_beta_v3_prok.fasta, MiFoDB_beta_v3_prok.stb and MiFoDB_beta_v3_prok.genes.fna can all be found on Zenodo. Alternatively, you can use a custom database.

$ nextflow run https://github.com/MrOlm/nf-genomeresolvedmetagenomics -entry PROFILE --input preprocessingInfo_v1.csv -with-report report.html --outdir results_prok_v1 --fasta path/to/MiFoDB_beta_v3_prok.fasta --stb_file path/to/MiFoDB_beta_v3_prok.stb --genes_file path/to/MiFoDB_beta_v3_prok.genes.fna --instrain_profile_args " --database_mode --skip_plot_generation"

3. Run additional profiling databases: you can also map against the eukaryotic microbe and substrate databases. The commands differ slightly for each, but all use the same input sheet.

$ nextflow run https://github.com/MrOlm/nf-genomeresolvedmetagenomics -entry PROFILE --input preprocessingInfo_v1.csv -with-report report_euk.html --outdir results_euk_v1 --fasta path/to/MiFoDB_beta_vhm_v3_euk.fasta --stb_file path/to/MiFoDB_beta_vhm_v3_euk.stb --instrain_profile_args " --database_mode --skip_plot_generation"

Finally, run with the substrate database:

$ nextflow run https://github.com/MrOlm/nf-genomeresolvedmetagenomics -entry PROFILE --input preprocessingInfo_v1.csv -with-report report_sub_v1.html --outdir results_sub_v1 --fasta path/to/substrate_genomes.fasta --stb_file path/to/substrate_genomes.stb --instrain_profile_args " --database_mode --skip_plot_generation" --coverm
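The three profiling commands share most of their arguments and differ only in the database files and one extra option each (--genes_file for the prokaryote run, --coverm for the substrate run). As a rough sketch, a small helper can assemble them so the shared part is written once. `profile_cmd` and the `SHEET` variable are hypothetical names, the report and output directory names it generates are an illustrative convention rather than the ones used above, and the helper only prints each command for review instead of running it. It assumes paths without spaces.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) the PROFILE command for one database.
# Usage: profile_cmd <label> <fasta> <stb> [extra pipeline options...]
profile_cmd() {
    local label="$1" fasta="$2" stb="$3"
    shift 3
    local extra=""
    if [ "$#" -gt 0 ]; then
        extra=" $*"        # database-specific options, e.g. --genes_file or --coverm
    fi
    printf '%s\n' "nextflow run https://github.com/MrOlm/nf-genomeresolvedmetagenomics -entry PROFILE --input ${SHEET:-preprocessingInfo_v1.csv} -with-report report_${label}_v1.html --outdir results_${label}_v1 --fasta $fasta --stb_file $stb --instrain_profile_args \" --database_mode --skip_plot_generation\"$extra"
}

# Prokaryote, eukaryote, and substrate runs, using the file names from the steps above.
profile_cmd prok path/to/MiFoDB_beta_v3_prok.fasta path/to/MiFoDB_beta_v3_prok.stb \
    --genes_file path/to/MiFoDB_beta_v3_prok.genes.fna
profile_cmd euk path/to/MiFoDB_beta_vhm_v3_euk.fasta path/to/MiFoDB_beta_vhm_v3_euk.stb
profile_cmd sub path/to/substrate_genomes.fasta path/to/substrate_genomes.stb --coverm
```

Printing first makes it easy to check paths before committing to a long run; each printed line can then be pasted into the terminal.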

Tip

If your substrate is not in the default MiFoDB_sub database, or you will be using it frequently, we recommend making a custom substrate database. Substrate genomes, such as Brassica oleracea and Bos taurus, are much larger and more complex than prokaryotic genomes (and most eukaryotic microbial genomes), which increases the rate of false-positive mapping. We therefore recommend limiting substrate genomes to those relevant to the project.
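A custom substrate database needs the same two inputs as the commands above: a combined .fasta and a matching .stb file. The sketch below assumes the .stb file is a two-column, tab-separated scaffold-to-genome table (scaffold name, then the genome file it came from), that the input genomes are uncompressed FASTA files, and that scaffold names are unique across genomes. `build_substrate_db` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: concatenate genomes into <prefix>.fasta and write <prefix>.stb.
# Usage: build_substrate_db <out_prefix> <genome1.fasta> [genome2.fasta ...]
build_substrate_db() {
    local out_prefix="$1"
    shift
    local genome
    : > "$out_prefix.fasta"      # start from empty output files
    : > "$out_prefix.stb"
    for genome in "$@"; do
        cat "$genome" >> "$out_prefix.fasta"
        # Scaffold name = header text up to the first whitespace, without '>'.
        awk -v bin="$(basename "$genome")" \
            '/^>/ { sub(/^>/, "", $1); print $1 "\t" bin }' "$genome" >> "$out_prefix.stb"
    done
}

# Example call with two hypothetical genome files:
# build_substrate_db custom_sub Brassica_oleracea.fasta Bos_taurus.fasta
```

The resulting custom_sub.fasta and custom_sub.stb would then take the place of substrate_genomes.fasta and substrate_genomes.stb in the substrate command.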