Alignment API Usage
Import package
from pyBioTools import Alignment
from pyBioTools.common import jhelp, head
Reads_index
jhelp(Alignment.Reads_index)
Reads_index (input_fn, skip_unmapped, skip_secondary, skip_supplementary, verbose, quiet, progress, kwargs)
Index reads found in a coordinate-sorted bam file by read_id. The resulting index file can be used to randomly access the alignment file per read_id.
- input_fn (required) [str]
Path to the bam file to index
- skip_unmapped (default: False) [bool]
Do not include unmapped reads in index
- skip_secondary (default: False) [bool]
Do not include secondary alignment in index
- skip_supplementary (default: False) [bool]
Do not include supplementary alignment in index
- verbose (default: False) [bool]
- quiet (default: False) [bool]
- progress (default: False) [bool]
- kwargs
Basic usage
Alignment.Reads_index("./data/sample_1.bam", verbose=True)
head("./data/sample_1.bam.idx.gz")
Excluding reads from index
Alignment.Reads_index("./data/sample_1.bam", verbose=True, skip_secondary=True, skip_supplementary=True, skip_unmapped=True)
head("./data/sample_1.bam.idx.gz")
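The idea behind a read_id index can be sketched without any BAM library: record where each record starts, then seek straight to it on lookup. The example below is only an illustration on a toy tab-separated file. `build_read_index` and `fetch_read` are hypothetical helpers, not part of pyBioTools, and the real index file format may well differ.

```python
import os
import tempfile

def build_read_index(path):
    """Map each read_id (first tab-separated field) to the byte offsets of its lines."""
    index = {}
    with open(path, "rb") as fh:
        while True:
            offset = fh.tell()
            line = fh.readline()
            if not line:
                break
            read_id = line.split(b"\t", 1)[0].decode()
            index.setdefault(read_id, []).append(offset)
    return index

def fetch_read(path, index, read_id):
    """Return every line stored for read_id by seeking directly to its offsets."""
    lines = []
    with open(path, "rb") as fh:
        for offset in index.get(read_id, []):
            fh.seek(offset)
            lines.append(fh.readline().decode().rstrip("\n"))
    return lines

# Toy "alignment" file (hypothetical layout): read_id, reference, position
records = ["read_a\tchr-I\t100", "read_b\tchr-I\t250", "read_a\tchr-II\t40"]
toy_dir = tempfile.mkdtemp()
toy_path = os.path.join(toy_dir, "toy_alignments.tsv")
with open(toy_path, "wb") as fh:
    fh.write(("\n".join(records) + "\n").encode())

toy_index = build_read_index(toy_path)
read_a_lines = fetch_read(toy_path, toy_index, "read_a")
print(read_a_lines)
```

A read with several alignments (primary plus secondary or supplementary) simply gets several offsets, which is why skipping those alignment types shrinks the index.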
Reads_sample
jhelp(Alignment.Reads_sample)
Reads_sample (input_fn, output_folder, output_prefix, n_reads, n_samples, rand_seed, verbose, quiet, progress, kwargs)
Randomly sample n_reads reads from a bam file and write downsampled files in n_samples bam files. If the input bam file is not indexed by read_id, index_reads is automatically called.
- input_fn (required) [str]
Path to the indexed bam file
- output_folder (default: ./) [str]
Path to a folder where to write sample files
- output_prefix (default: out) [str]
Prefix to use for the names of the sample files
- n_reads (default: 1000) [int]
Number of randomly selected reads in each sample
- n_samples (default: 1) [int]
Number of samples to generate files for
- rand_seed (default: 42) [int]
Seed to use for the pseudo-random generator. For non-deterministic behaviour set to 0
- verbose (default: False) [bool]
- quiet (default: False) [bool]
- progress (default: False) [bool]
- kwargs
Basic usage
Alignment.Reads_sample("./data/sample_1.bam", "./output/sample_reads/", output_prefix="1K", n_reads=1000, n_samples=3)
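The effect of n_reads, n_samples and rand_seed can be illustrated with the standard library alone. This is a minimal sketch, not the pyBioTools implementation: `sample_read_ids` is a hypothetical helper, and treating a seed of 0 as "unseeded" is an assumption based on the parameter description above.

```python
import random

def sample_read_ids(read_ids, n_reads=1000, n_samples=1, rand_seed=42):
    """Draw n_samples independent random subsets of n_reads read ids."""
    # Assumption: a seed of 0 means "do not fix the seed" (non-deterministic)
    rng = random.Random(rand_seed if rand_seed else None)
    # Never ask for more reads than are available
    n = min(n_reads, len(read_ids))
    return [sorted(rng.sample(read_ids, n)) for _ in range(n_samples)]

all_ids = [f"read_{i}" for i in range(50)]
run_1 = sample_read_ids(all_ids, n_reads=5, n_samples=3, rand_seed=42)
run_2 = sample_read_ids(all_ids, n_reads=5, n_samples=3, rand_seed=42)
print(run_1 == run_2)
```

With a fixed non-zero seed the two runs select the same reads, which is what makes downsampled files reproducible across sessions.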
References_sample
jhelp(Alignment.References_sample)
References_sample (input_fn, output_fn, selected_reads_fn, frac_reads, min_reads_ref, rand_seed, sorting_threads, verbose, quiet, progress, kwargs)
Randomly sample reads per reference according to a fraction of the reads mapped to this reference, for one or several files, and write the selected reads to a new bam file
- input_fn (required) [list(str)]
Path to a bam file, a directory containing bam files, a list of files, or a regex or list of regex. All files need to be sorted and aligned to the same reference file.
- output_fn (default: out.bam) [str]
Path to the output bam file (sorted and indexed)
- selected_reads_fn (default: select_ref.txt) [str]
Path to the output text file containing all the read id selected
- frac_reads (default: 0.5) [float]
Fraction of reads mapped to sample for each reference
- min_reads_ref (default: 30) [int]
Minimal read coverage per file and reference before sampling
- rand_seed (default: 42) [int]
Seed to use for the pseudo-random generator. For non-deterministic behaviour set to None
- sorting_threads (default: 4) [int]
Number of threads to use for bam file sorting
- verbose (default: False) [bool]
- quiet (default: False) [bool]
- progress (default: False) [bool]
- kwargs
Basic usage
Alignment.References_sample (
input_fn = "./data/sample_*.bam",
output_fn = "./output/sample_References_sample.bam",
selected_reads_fn = "./output/sample_References_sample_refid.txt",
frac_reads = 0.25,
min_reads_ref = 100,
progress = True)
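How frac_reads and min_reads_ref interact can be sketched on plain (read_id, reference) pairs. This is an illustration under stated assumptions rather than the library's code: `sample_per_reference` is a hypothetical helper, and it assumes that references with fewer than min_reads_ref reads are skipped entirely and that the sampled count is rounded down.

```python
import random
from collections import defaultdict

def sample_per_reference(alignments, frac_reads=0.5, min_reads_ref=30, rand_seed=42):
    """alignments: iterable of (read_id, reference); returns {reference: [read_id, ...]}."""
    rng = random.Random(rand_seed)  # rand_seed=None gives non-deterministic sampling
    by_ref = defaultdict(list)
    for read_id, ref in alignments:
        by_ref[ref].append(read_id)

    selected = {}
    for ref, ids in sorted(by_ref.items()):
        # Assumption: references below the minimum coverage are not sampled at all
        if len(ids) < min_reads_ref:
            continue
        n = int(len(ids) * frac_reads)  # assumption: round down
        selected[ref] = sorted(rng.sample(ids, n))
    return selected

toy = [(f"r{i}", "chr-I") for i in range(40)] + [(f"s{i}", "chr-II") for i in range(5)]
picked = sample_per_reference(toy, frac_reads=0.25, min_reads_ref=10)
print({ref: len(ids) for ref, ids in picked.items()})
```

In this toy input chr-I has 40 reads and yields 10 sampled reads, while chr-II has only 5 reads and is left out.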
Filter
jhelp(Alignment.Filter)
Filter (input_fn, output_fn, selected_reads_fn, skip_unmapped, skip_secondary, skip_supplementary, index_reads, orientation, min_read_len, min_align_len, min_mapq, min_freq_identity, select_ref_fn, exclude_ref_fn, verbose, quiet, progress, kwargs)
- input_fn (required) [str]
Path to the bam file to filter
- output_fn (required) [str]
Path to write the filtered bam file
- selected_reads_fn (default: None) [str]
Optional file where to write ids of selected reads
- skip_unmapped (default: False) [bool]
Filter out unmapped reads
- skip_secondary (default: False) [bool]
Filter out secondary alignment
- skip_supplementary (default: False) [bool]
Filter out supplementary alignment
- index_reads (default: False) [bool]
Index bam file with both pysam and pybiotools reads_index
- orientation (default: .) [str]
Orientation of alignment on reference genome {"+", "-", "."}
- min_read_len (default: 0) [int]
Minimal query read length (basecalled length)
- min_align_len (default: 0) [int]
Minimal query alignment length on reference
- min_mapq (default: 0) [int]
Minimal mapping quality score (mapq)
- min_freq_identity (default: 0) [float]
Minimal frequency of alignment identity [0 to 1]
- select_ref_fn (default: None) [str]
File containing a list of references on which the reads have to be mapped.
- exclude_ref_fn (default: None) [str]
File containing a list of references on which the reads should not be mapped.
- verbose (default: False) [bool]
- quiet (default: False) [bool]
- progress (default: False) [bool]
- kwargs
Basic usage
Filter out unmapped reads and all non-primary alignments
Alignment.Filter(
"./data/sample_1.bam",
"./output/sample_1_filter.bam",
skip_unmapped = True,
skip_supplementary = True,
skip_secondary = True,
progress=True,
verbose=True)
head("./output/sample_1_filter.bam")
Multi criteria filtering
Remove unmapped, short reads and alignments, reads mapped on the minus strand, low mapq and low identity
Alignment.Filter(
"./data/sample_1.bam",
"./output/sample_1_filter.bam",
skip_unmapped = True,
min_read_len=300,
min_align_len=300,
orientation = "+",
min_mapq = 10,
min_freq_identity=0.8,
progress=True,
verbose=True)
head("./output/sample_1_filter.bam")
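The combined criteria above amount to a single predicate applied to each alignment. The sketch below shows one way such a predicate could look on a toy record type. `ToyAlignment` and `passes_filters` are hypothetical names, and treating "." as "keep both strands" is an assumption. The real function works on BAM records.

```python
from dataclasses import dataclass

@dataclass
class ToyAlignment:
    read_id: str
    is_unmapped: bool
    is_reverse: bool
    read_len: int
    align_len: int
    mapq: int
    freq_identity: float

def passes_filters(aln, skip_unmapped=False, orientation=".", min_read_len=0,
                   min_align_len=0, min_mapq=0, min_freq_identity=0.0):
    """Return True if the alignment satisfies every requested criterion."""
    if skip_unmapped and aln.is_unmapped:
        return False
    # Assumption: "." keeps both strands, "+" forward only, "-" reverse only
    if orientation == "+" and aln.is_reverse:
        return False
    if orientation == "-" and not aln.is_reverse:
        return False
    return (aln.read_len >= min_read_len
            and aln.align_len >= min_align_len
            and aln.mapq >= min_mapq
            and aln.freq_identity >= min_freq_identity)

toy_alignments = [
    ToyAlignment("good", False, False, 500, 450, 60, 0.95),
    ToyAlignment("minus_strand", False, True, 500, 450, 60, 0.95),
    ToyAlignment("short", False, False, 200, 180, 60, 0.95),
    ToyAlignment("low_mapq", False, False, 500, 450, 3, 0.95),
    ToyAlignment("low_identity", False, False, 500, 450, 60, 0.70),
    ToyAlignment("unmapped", True, False, 500, 0, 0, 0.0),
]
kept_ids = [a.read_id for a in toy_alignments
            if passes_filters(a, skip_unmapped=True, orientation="+", min_read_len=300,
                              min_align_len=300, min_mapq=10, min_freq_identity=0.8)]
print(kept_ids)
```

Because every criterion must hold, each additional option can only reduce the number of alignments written to the output file.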
Select specific reference
with open("data/select_ref.txt", "w") as fp:
    for ref in ['chr-I', 'chr-II', 'chr-III', 'chr-IV', 'chr-V', 'chr-VI']:
        fp.write(f"{ref}\n")
Alignment.Filter(
input_fn="./data/sample_1.bam",
output_fn="./output/sample_1_filter.bam",
select_ref_fn="data/select_ref.txt",
index_reads=True,
progress=True,
verbose=True)
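Selecting or excluding references comes down to reading a one-name-per-line file and testing set membership. The following is a minimal sketch of that logic. `load_ref_list` and `reference_allowed` are hypothetical helpers, and the one-name-per-line layout is assumed from the file written in the example above.

```python
import os
import tempfile

def load_ref_list(path):
    """Read one reference name per line, ignoring blank lines."""
    with open(path) as fp:
        return {line.strip() for line in fp if line.strip()}

def reference_allowed(ref, select_refs=None, exclude_refs=None):
    """Apply an optional allow-list and an optional deny-list to a reference name."""
    if select_refs is not None and ref not in select_refs:
        return False
    if exclude_refs is not None and ref in exclude_refs:
        return False
    return True

# Write a small reference list to a temporary folder and read it back
tmp_dir = tempfile.mkdtemp()
ref_path = os.path.join(tmp_dir, "select_ref.txt")
with open(ref_path, "w") as fp:
    for ref in ["chr-I", "chr-II"]:
        fp.write(f"{ref}\n")

select_refs = load_ref_list(ref_path)
kept_refs = [r for r in ["chr-I", "chr-II", "chr-M"]
             if reference_allowed(r, select_refs=select_refs)]
print(kept_refs)
```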
To_fastq
jhelp(Alignment.To_fastq)
To_fastq (input_fn, output_r1_fn, output_r2_fn, ignore_paired_end, verbose, quiet, progress, kwargs)
Dump reads from one or several alignment files to a fastq file or a pair of fastq files. Only primary alignments are kept and paired_end reads are assumed to be interleaved. Compatible with unmapped or unaligned alignment files as well as files without header.
- input_fn (required) [list(str)]
Path (or list of paths) to input BAM/CRAM/SAM file(s)
- output_r1_fn (required) [str]
Path to an output fastq file (for Read1 in paired_end mode if output_r2_fn is provided). Automatically gzipped if the .gz extension is found
- output_r2_fn (default: None) [str]
Optional path to an output fastq file for Read2 in paired_end mode. Automatically gzipped if the .gz extension is found
- ignore_paired_end (default: False) [bool]
Ignore paired_end information and output everything in a single file.
- verbose (default: False) [bool]
- quiet (default: False) [bool]
- progress (default: False) [bool]
- kwargs
Single end read usage from bam files
Alignment.To_fastq(
input_fn=["./data/sample_1.bam", "./data/sample_2.bam"],
output_r1_fn="./output/sample_1-2_SE_from_bam.fastq.gz",
verbose=True,
progress=True)
Paired-end reads usage from unaligned CRAM files
Alignment.To_fastq(
input_fn=["./data/sample_1_20k.cram", "./data/sample_2_20k.cram"],
output_r1_fn="./output/sample_1-2_PE_from_CRAM_1.fastq.gz",
output_r2_fn="./output/sample_1-2_PE_from_CRAM_2.fastq.gz",
verbose=True,
progress=True)
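Two behaviours described above, gzip compression chosen from the file extension and de-interleaving of paired reads, can be sketched with the standard library. This is an illustration only: `open_fastq` and `write_deinterleaved` are hypothetical helpers, and the sketch assumes mates sit at consecutive positions in the input, Read1 first.

```python
import gzip
import os
import tempfile

def open_fastq(path):
    """Open a text handle for writing, gzip-compressed when the name ends in .gz."""
    return gzip.open(path, "wt") if path.endswith(".gz") else open(path, "w")

def write_deinterleaved(reads, r1_path, r2_path):
    """reads: list of (name, seq, qual) with mates stored consecutively (R1, R2, R1, ...)."""
    with open_fastq(r1_path) as r1, open_fastq(r2_path) as r2:
        for i, (name, seq, qual) in enumerate(reads):
            out = r1 if i % 2 == 0 else r2
            out.write(f"@{name}\n{seq}\n+\n{qual}\n")

toy_reads = [
    ("p1/1", "ACGT", "IIII"), ("p1/2", "TGCA", "IIII"),
    ("p2/1", "TTGA", "IIII"), ("p2/2", "CCAT", "IIII"),
]
out_dir = tempfile.mkdtemp()
r1_path = os.path.join(out_dir, "toy_1.fastq.gz")
r2_path = os.path.join(out_dir, "toy_2.fastq")
write_deinterleaved(toy_reads, r1_path, r2_path)

# Read both files back: the first is gzipped, the second is plain text
with gzip.open(r1_path, "rt") as fh:
    r1_text = fh.read()
with open(r2_path) as fh:
    r2_text = fh.read()
print(r1_text.count("@"), r2_text.count("@"))
```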
Split
jhelp(Alignment.Split)
Split (input_fn, output_dir, n_files, output_fn_list, index, verbose, quiet, progress, kwargs)
Split reads in a bam file into N files. The input bam file has to be sorted by coordinates and indexed. The last file can contain a few extra reads.
- input_fn (required) [str]
Path to the bam file to split
- output_dir (default: "") [str]
Path to the directory where to write split bam files. Files generated have the same basename as the source file and are suffixed with numbers starting from 0
- n_files (default: 10) [int]
Number of files to split the original file into
- output_fn_list (default: []) [list(str)]
As an alternative to output_dir and n_files one can instead give a list of output files. Reads will be automatically split between the files in the same order as given
- index (default: False) [bool]
Index output BAM files
- verbose (default: False) [bool]
- quiet (default: False) [bool]
- progress (default: False) [bool]
- kwargs
Usage with number of output files to generate
Alignment.Split(
input_fn="./data/sample_1.bam",
output_dir="./output/split_bam",
n_files= 4,
verbose=True)
!ls -lh "./output/split_bam"
Usage with a predefined list of output files
Alignment.Split(
input_fn="./data/sample_2.bam",
output_fn_list=["./output/split_bam_2/f1.bam", "./output/split_bam_2/f4.bam", "./output/split_bam_2/f3.bam"],
index=True,
verbose=True)
!ls -lh "./output/split_bam_2"
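The remark that the last file can contain a few extra reads follows from integer division. The sketch below shows one plausible way to distribute reads, assuming every file gets the same base count and the remainder goes to the last one. `split_counts` and `split_reads` are hypothetical helpers, and the library may distribute reads differently.

```python
def split_counts(n_reads, n_files):
    """Number of reads per output file, with the remainder added to the last file."""
    base = n_reads // n_files
    counts = [base] * n_files
    counts[-1] += n_reads % n_files
    return counts

def split_reads(reads, n_files):
    """Cut an ordered list of reads into n_files consecutive chunks."""
    chunks, start = [], 0
    for count in split_counts(len(reads), n_files):
        chunks.append(reads[start:start + count])
        start += count
    return chunks

toy_reads = [f"read_{i}" for i in range(10)]
chunks = split_reads(toy_reads, 4)
print([len(c) for c in chunks])  # [2, 2, 2, 4]
```

Keeping the chunks consecutive preserves the coordinate order of the input, so each output file remains sorted.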