Chimera_finder

BASH scripts to extract chimeric pairs and chimeric reads from NGS data mixing 2 DNA references


Welcome to Chimera_finder Page

Motivations

Chimera_finder consists of 2 simple Bash pipelines for paired-end Illumina NGS data, originally dedicated to the retrieval of viral insertion sites in a host cell genome (chimeric read pairs and chimeric reads). The pipelines were initially designed to find adeno-associated vector genomes integrated in host cell genomic DNA, but they may also be suitable for other viral integration sites after minor code modifications.

Principle

Data provided to both pipelines must have been generated by paired-end sequencing on a recent Illumina platform (MiSeq or HiSeq). The scripts use the same basic strategy, but chimera_finder_single.sh merges the forward and reverse fastq read files at the very beginning of its execution, and mapping is then performed in single-end mode. In contrast, chimera_finder_pair.sh keeps pairs separated throughout the process. chimera_finder_single.sh extracts reads that overlap host DNA and viral DNA, while chimera_finder_pair.sh identifies read pairs in which one read maps to one reference and its mate maps to the other.

The basic steps of the algorithms are detailed below.

Preprocessing

Low-quality read extremities are trimmed with Sickle, in "pe" mode for chimera_finder_pair.sh and in "se" mode for chimera_finder_single.sh.
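The trimming step can be sketched as follows. This is only an illustration, not the code of the pipelines: the function names and output file names are hypothetical, and the quality encoding (-t sanger, i.e. Phred+33) is an assumption that should be checked against the sequencing run.

```shell
# Hypothetical helpers illustrating the two Sickle modes.
# Assumption: qualities are Phred+33 encoded, hence "-t sanger".

# Paired-end mode: R1 and R2 stay in sync, reads that lose their mate go to -s
trim_pair() {
    local r1=$1 r2=$2 prefix=$3
    sickle pe -t sanger -f "$r1" -r "$r2" \
        -o "${prefix}_R1.trim.fastq" -p "${prefix}_R2.trim.fastq" \
        -s "${prefix}_orphans.trim.fastq"
}

# Single-end mode: one input file, one trimmed output file
trim_single() {
    local reads=$1 prefix=$2
    sickle se -t sanger -f "$reads" -o "${prefix}.trim.fastq"
}
```

For example, trim_pair R1.fastq R2.fastq sample would write sample_R1.trim.fastq, sample_R2.trim.fastq and sample_orphans.trim.fastq.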

Subtraction of non-chimeric reads/pairs

To facilitate further analyses, we first remove from the dataset the reads (for chimera_finder_single.sh) or read pairs (for chimera_finder_pair.sh) that map unambiguously to only one of the references. To do so, the fastq files are mapped against the viral genome with the fast read aligner Bowtie2, using highly stringent end-to-end mapping conditions. Reads/pairs with a MAPQ > 30 are then removed using Samtools, and a fastq dataset is regenerated with Bedtools bamtofastq. A second round of mapping, filtering and fastq regeneration is then performed on this smaller dataset against the host cell genome.
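One round of subtraction in single-end mode might look like the sketch below. It is written under several assumptions: the function names are hypothetical, the Bowtie2 preset (--very-sensitive) merely stands in for the stringent settings actually used, and the MAPQ filter is done here with awk on the SAM stream rather than with Samtools itself, so that the threshold logic is explicit.

```shell
# Keep SAM header lines and alignments whose MAPQ (column 5) does not
# exceed the threshold, i.e. drop reads confidently mapped to the reference.
drop_confident_hits() {
    awk -v max="${1:-30}" 'BEGIN { FS = OFS = "\t" } /^@/ || $5 <= max'
}

# Hypothetical single round: map, filter, regenerate a smaller fastq.
# Assumption: --end-to-end --very-sensitive approximates the stringent mode.
subtract_reference() {
    local index=$1 reads=$2 out=$3
    bowtie2 --end-to-end --very-sensitive -x "$index" -U "$reads" \
        | drop_confident_hits 30 \
        | samtools view -b -o "$out.bam" -
    bedtools bamtofastq -i "$out.bam" -fq "$out.fastq"
}
```

Calling subtract_reference first with the viral index and then with the host index, feeding the second call the fastq produced by the first, reproduces the two rounds described above.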

Extraction of chimeras

chimera_finder_single.sh performs a local Bowtie2 alignment against a reference index mixing both viral and genomic DNA. We use the Bowtie2 option -k 5, which reports up to 5 alignments per read instead of only one. The SAM files resulting from the alignment are parsed with standard UNIX tools (grep, sort, awk, comm and join) to select reads that map to both viral and genomic DNA.
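The selection of chimeric reads from such a SAM file can be illustrated with awk, sort and comm. This is a simplified sketch and not the parsing code of the script: the function name is hypothetical, and it assumes the viral genome is a single reference sequence whose name (for instance AAV) is passed as the second argument.

```shell
# Print the names of reads that have at least one alignment on the viral
# reference and at least one alignment on another (host) reference.
chimeric_read_names() {
    local sam=$1 viral=$2 tmp
    tmp=$(mktemp -d) || return 1
    # Reads with an alignment on the viral reference (column 3 is RNAME)
    awk -F'\t' -v v="$viral" '!/^@/ && $3 == v { print $1 }' "$sam" \
        | sort -u > "$tmp/viral"
    # Reads with an alignment on any other reference ("*" means unmapped)
    awk -F'\t' -v v="$viral" '!/^@/ && $3 != v && $3 != "*" { print $1 }' "$sam" \
        | sort -u > "$tmp/host"
    # comm -12 keeps only the names present in both sorted lists
    comm -12 "$tmp/viral" "$tmp/host"
    rm -rf "$tmp"
}
```

For example, chimeric_read_names sample.sam AAV lists one read name per line.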

For chimera_finder_pair.sh, the principle is slightly different. A standard local alignment is performed against the mixed index, and the SAM file is parsed to retrieve pairs with one read aligned to viral DNA and its mate aligned to host cell DNA.
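In a paired-end SAM file, each record carries the reference of the read (column 3, RNAME) and the reference of its mate (column 7, RNEXT, written "=" when both are identical), so chimeric pairs can be spotted from single records. The sketch below is an illustration under assumptions, not the script's own parser: the function name is hypothetical and the viral genome is assumed to be a single reference sequence whose name is given as the second argument.

```shell
# Print the names of pairs in which one read maps to the viral reference
# and its mate maps to a different reference.
chimeric_pair_names() {
    awk -F'\t' -v v="$2" '
        /^@/ { next }                                  # skip header lines
        $3 == "*" || $7 == "*" { next }                # read or mate unmapped
        $7 == "=" || $7 == $3 { next }                 # both mates on the same reference
        $3 == v || $7 == v { print $1 }                # one side viral, the other not
    ' "$1" | sort -u
}
```

For example, chimeric_pair_names sample.sam AAV prints each chimeric pair name once, even though both mates of the pair match the filter.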

Postprocessing

During this last step, a BAM file, a BAM index (.bai) and a bedGraph file are produced for visualization (with the Integrative Genomics Viewer, for example). Finally, a report containing the parameters used for the run and BAM flag statistics is generated (sample_report.txt).
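These outputs could be produced along the following lines. The sketch assumes Samtools 1.x and Bedtools are in PATH; the function name, the report layout and the use of samtools flagstat for the flag statistics are assumptions, not a description of the script itself.

```shell
# Hypothetical postprocessing: sorted BAM, index, coverage track and report.
postprocess() {
    local sam=$1 prefix=$2
    samtools sort -o "$prefix.bam" "$sam"      # coordinate-sorted BAM
    samtools index "$prefix.bam"               # writes $prefix.bam.bai
    # Per-base coverage in bedGraph format, viewable as a track
    bedtools genomecov -ibam "$prefix.bam" -bg > "$prefix.bedgraph"
    {
        echo "Sample: $prefix"
        echo "Input: $sam"
        samtools flagstat "$prefix.bam"        # counts of reads per flag category
    } > "${prefix}_report.txt"
}
```

For example, postprocess chimeras.sam sample would write sample.bam, sample.bam.bai, sample.bedgraph and sample_report.txt.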

Get Chimera_finder

From github repository

$ git clone https://github.com/a-slide/Chimera_Finder my_folder/

Archive download

In both cases, the scripts must be made executable using chmod.

$ chmod u+x chimera_finder_pair.sh chimera_finder_single.sh

Usage

Before using the pipelines, the following 3 reference genomes must be indexed for Bowtie2 (see the bowtie2-build indexer section of the Bowtie2 manual):

  • Viral genome alone
  • Host cell genome alone
  • Viral and host cell genome together
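
The three indexes can be built with bowtie2-build, which accepts a comma-separated list of FASTA files for the mixed index. In this sketch the function name, the FASTA file names and the index basenames (virus, host, virus_host) are hypothetical placeholders.

```shell
# Hypothetical helper building the three Bowtie2 indexes into one folder.
build_indexes() {
    local virus_fa=$1 host_fa=$2 outdir=$3
    mkdir -p "$outdir"
    bowtie2-build "$virus_fa" "$outdir/virus"                 # viral genome alone
    bowtie2-build "$host_fa" "$outdir/host"                   # host genome alone
    bowtie2-build "$virus_fa,$host_fa" "$outdir/virus_host"   # both together
}
```

For example, build_indexes my_virus.fa my_host.fa indexes would create the index files indexes/virus.*.bt2, indexes/host.*.bt2 and indexes/virus_host.*.bt2.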

$ # For chimeric reads extraction (chimera_finder_pair.sh handles chimeric pairs)
$ ./chimera_finder_single.sh
$   [ R1.fastq(.gz) ]
$   [ R2.fastq(.gz) ]
$   [ output name ]
$   [ viral genome index]
$   [ host genome index]
$   [ viral + host genome index]

Fastq files containing forward reads (R1.fastq) and reverse reads (R2.fastq) can be provided either uncompressed or gzip-compressed (.gz).

The output name will be used as a prefix for all files that will be generated during the analysis.

Each index must be given as the basename of its index files, up to but not including the final .1.bt2. For example, if the viral genome index files in the folder my_virus_index are named my_virus.1.bt2, my_virus.2.bt2, my_virus.3.bt2, my_virus.4.bt2, my_virus.rev.1.bt2 and my_virus.rev.2.bt2, the viral genome index argument is path/my_virus_index/my_virus.
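
A quick way to verify that a basename is correct before launching a run is to test for the first index file. The helper below is a hypothetical convenience and not part of the pipelines; it also accepts the .bt2l extension that Bowtie2 uses for large indexes.

```shell
# Return success if a Bowtie2 index exists for the given basename.
check_index() {
    local base=$1
    if [ -e "$base.1.bt2" ] || [ -e "$base.1.bt2l" ]; then
        return 0
    fi
    echo "No Bowtie2 index found for basename: $base" >&2
    return 1
}
```

With the example above, check_index path/my_virus_index/my_virus succeeds, while check_index path/my_virus_index/my_virus.1.bt2 fails because the suffix must not be part of the basename.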

Dependencies:

The scripts were developed under Linux Mint 16 "Petra" but are compatible with other Debian-based Linux distributions. The following dependencies are required by the pipelines. Users should verify that all of these programs are in their PATH and match the indicated version or a later one.

Authors and Contact

Adrien Leger @a-slide

adrien.leger@inserm.fr