Find_overlap_reads
Motivation
Find_overlap_reads is an object-oriented Python 3 script developed to parse genomic alignment files (BAM/SAM). It counts and extracts the reads or read pairs that overlap genomic intervals provided in a tab-separated text file.
Principle
- The interval file is parsed and Interval objects are created for each valid line within it.
- Sam/Bam (CRAM?) files are parsed iteratively with pysam.
- For each read, the program verifies whether the read itself, or the template covered by the read and its mate, overlaps one of the genomic intervals.
- A report containing the total number of reads, the number of mapped reads and the number of reads overlapping each interval is created in the current directory.
- A subset BAM file containing only the reads overlapping intervals is generated, sorted and indexed.
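The overlap test described in the steps above can be sketched in plain Python. This is only an illustration and not the code of Find_overlap_reads.py: the `Interval` class layout, the `covered_span` and `count_overlaps` helpers, the closed (inclusive) coordinates and the rule of counting a read for the first matching interval only are all assumptions. In the real program the read coordinates come from pysam; here they are passed as simple tuples.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """Hypothetical interval record (assumed closed coordinates)."""
    ref: str
    start: int
    end: int
    name: str = ""
    count: int = 0

    def overlaps(self, ref, start, end):
        # Same reference and the two closed ranges share at least one base
        return ref == self.ref and start <= self.end and end >= self.start


def covered_span(read_start, read_end, mate_start=None, template_len=0,
                 max_template_len=1000):
    """Return the span to test: the whole template if the pair looks
    concordant (template length within the limit), else the read alone."""
    if mate_start is not None and 0 < abs(template_len) <= max_template_len:
        start = min(read_start, mate_start)
        return start, start + abs(template_len) - 1
    return read_start, read_end


def count_overlaps(intervals, reads, max_template_len=1000):
    """Increment the counter of the first interval overlapped by each read.

    reads: iterable of (ref, read_start, read_end, mate_start, template_len)
    """
    for ref, r_start, r_end, mate_start, tlen in reads:
        start, end = covered_span(r_start, r_end, mate_start, tlen,
                                  max_template_len)
        for interval in intervals:
            if interval.overlaps(ref, start, end):
                interval.count += 1
                break
    return intervals
```

With an interval `Interval("chr1", 100, 200)`, a read spanning 50-80 does not overlap on its own, but it is counted if its mate starts at 220 and the template length is 200, because the template then spans 50-249.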
Dependencies
- pysam 0.8.1+ (based on htslib and samtools versions 1.1)
If you have pip already installed, enter the following line to install pysam:
sudo pip install pysam
Get Find_overlap_reads
- Clone the repository with the --recursive option to also pull the submodule
$ git clone --recursive https://github.com/a-slide/Find_overlap_reads/ my_folder/
- Enter the root of the program folder and make the main script executable
$ sudo chmod u+x Find_overlap_reads.py
- Add Find_overlap_reads.py to your PATH
Usage
Usage: find_overlap_read -f genomic_interval.csv [-b/-r] f1.bam(sam),[f2.bam(sam)...fn.bam(sam)]
Parse a BAM/SAM file(s) and extract reads overlapping given genomic coordinates
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-f INTERVAL_FILE Path of the tab separated file containing genomic interval (mandatory)
-t TEMPLATE_LEN Maximum length of the template between 2 paired reads to be considered as a concordant pair (default = 1000)
-b, --no_bam Don't output bam file(s) (default = True)
-r, --no_report Don't output report file(s) (default = True)
Genomic interval file
The file containing genomic intervals has to be formatted as tab-separated text, as follows:
Name_of_the_reference start_coordinate(INT) end_coordinate(INT) Name_of_the_interval (optional)
Lines with invalid integer values or fewer than 3 fields are skipped. An error is raised if no valid interval is found. Examples of valid and invalid files are provided in the demo/ folder. A standard Bed file would be considered as a
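The parsing rules for the interval file can be illustrated with a minimal sketch. The function name `parse_intervals`, the tuple layout, the empty default name and the use of `ValueError` are assumptions made for this example; the actual script builds Interval objects and may report errors differently.

```python
def parse_intervals(lines):
    """Parse tab-separated interval lines into (ref, start, end, name) tuples.

    Lines with fewer than 3 fields or with non-integer coordinates are
    skipped. Raises ValueError if no valid interval is found.
    """
    intervals = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not enough fields
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            continue  # coordinates are not valid integers
        # The interval name is optional (4th field)
        name = fields[3] if len(fields) > 3 else ""
        intervals.append((fields[0], start, end, name))
    if not intervals:
        raise ValueError("No valid interval found")
    return intervals
```

The function accepts any iterable of lines, so an open file handle can be passed directly, for example `with open(path) as fh: parse_intervals(fh)`, where `path` points to your own interval file.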
Alignment files (sam/bam)
A list of bam, sam and or cram files containing
Development notebook
2 possibilities:
- Use ipython notebook with Dev_notebook.ipynb
- Consult directly online through nbviewer : Notebook
Authors and Contact
Adrien Leger - 2014