Alignment CLI Usage

Activate virtual environment

# Using Conda here, but this can also be done with virtualenvwrapper
conda activate pyBioTools
(pyBioTools) 

Reads_index

Get help

pyBioTools Alignment Reads_index -h
usage: pyBioTools Alignment Reads_index [-h] -i INPUT_FN [-u] [-s] [-p] [-v]
                                        [-q] [--progress]

Index reads found in a coordinated sorted bam file by read_id. The created
index file can be used to randon access the alignment file per read_id

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FN, --input_fn INPUT_FN
                        Path to the bam file to index (required) [str]
  -u, --skip_unmapped   Filter out unmapped reads (default: False) [None]
  -s, --skip_secondary  Filter out secondary alignment (default: False) [None]
  -p, --skip_supplementary
                        Filter out supplementary alignment (default: False)
                        [None]
  -v, --verbose         Increase verbosity (default: False)
  -q, --quiet           Reduce verbosity (default: False)
  --progress            Display a progress bar
(pyBioTools) 

Basic usage

pyBioTools Alignment Reads_index -i ./data/sample_1.bam
## Running Alignment Reads_index ##
    Checking Bam file
    Parsing reads
    Read counts summary
     Reads retained
      total: 13,684
      primary: 10,584
      secondary: 1,496
      unmapped: 1,416
      supplementary: 188
(pyBioTools) 

Excluding reads from index

pyBioTools Alignment Reads_index -i ./data/sample_1.bam --verbose --skip_secondary --skip_unmapped
## Running Alignment Reads_index ##
    Checking Bam file
    Parsing reads
    Read counts summary
     Reads retained
      total: 10,772
      primary: 10,584
      supplementary: 188
     Reads discarded
      total: 2,912
      secondary: 1,496
      unmapped: 1,416
(pyBioTools) 
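
The categories in the read counts summary correspond to bits of the SAM FLAG field. The sketch below is not the pyBioTools implementation: the `classify` and `split_counts` helpers are hypothetical, and the precedence between categories is an assumption. It only illustrates how the retained and discarded totals above add up when secondary and unmapped reads are skipped.

```python
# SAM FLAG bits, as defined in the SAM format specification
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def classify(flag):
    """Return a read category for a SAM flag (check order is an assumption)."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & SECONDARY:
        return "secondary"
    if flag & SUPPLEMENTARY:
        return "supplementary"
    return "primary"


def split_counts(counts, skipped):
    """Split per-category counts into retained and discarded groups."""
    retained = {k: v for k, v in counts.items() if k not in skipped}
    discarded = {k: v for k, v in counts.items() if k in skipped}
    return retained, discarded


# Counts taken from the basic usage output for sample_1.bam
counts = {"primary": 10584, "secondary": 1496, "unmapped": 1416, "supplementary": 188}
retained, discarded = split_counts(counts, skipped={"secondary", "unmapped"})
print(sum(retained.values()), sum(discarded.values()))  # 10772 2912
```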

References_sample

Get help

pyBioTools Alignment References_sample -h
usage: pyBioTools Alignment References_sample [-h] -i
                                              [INPUT_FN [INPUT_FN ...]]
                                              [-o OUTPUT_FN]
                                              [-s SELECTED_READS_FN]
                                              [-f FRAC_READS]
                                              [-r MIN_READS_REF]
                                              [-t SORTING_THREADS]
                                              [--rand_seed RAND_SEED] [-v]
                                              [-q] [--progress]

Randomly sample reads per references according to a fraction od the reads
mapped to this reference for a one or several files and write selected reads
in a new bam file

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT_FN [INPUT_FN ...]], --input_fn [INPUT_FN [INPUT_FN ...]]
                        Bam file path or directory containing bam files or
                        list of files, or regex or list of regex. It is quite
                        flexible. All files need to be sorted and aligned to
                        the same reference file. (required) [str]
  -o OUTPUT_FN, --output_fn OUTPUT_FN
                        Path to the output bam file (sorted and indexed)
                        (default: out.bam) [str]
  -s SELECTED_READS_FN, --selected_reads_fn SELECTED_READS_FN
                        Path to the output text file containing all the read
                        id selected (default: select_ref.txt) [str]
  -f FRAC_READS, --frac_reads FRAC_READS
                        Fraction of reads mapped to sample for each reference
                        (default: 0.5) [int]
  -r MIN_READS_REF, --min_reads_ref MIN_READS_REF
                        Minimal read coverage per file and reference before
                        sampling (default: 30) [int]
  -t SORTING_THREADS, --sorting_threads SORTING_THREADS
                        Number of threads to use for bam file sorting
                        (default: 4) [int]
  --rand_seed RAND_SEED
                        Seed to use for the pseudo randon generator. For non
                        deterministic behaviour set to None (default: 42)
                        [int]
  -v, --verbose         Increase verbosity (default: False)
  -q, --quiet           Reduce verbosity (default: False)
  --progress            Display a progress bar
(pyBioTools) 

Basic usage

pyBioTools Alignment References_sample \
    --input_fn "./data/sample_*.bam" \
    --output_fn "./output/sample_References_sample.bam" \
    --selected_reads_fn "./output/sample_References_sample_refid.txt" \
    --frac_reads 0.25 \
    --min_reads_ref 100 \
    --progress
## Running Alignment Ref_sample ##
## Index files ##
    Indexing alignment file ./data/sample_2.bam
    Reading : 13678 Reads [00:00, 19324.68 Reads/s]
    Indexing alignment file ./data/sample_1.bam
    Reading : 13684 Reads [00:00, 23822.68 Reads/s]
    Raw read counts summary
     primary reads: 21,185
     secondary reads: 2,966
     unmapped reads: 2,815
     supplementary reads: 396
## Randomly pick reads per references ##
## Sample reads and write to output file ##
    Writing selected reads for bam file ./data/sample_2.bam
    Writing : 100%|██████████████████████| 2656/2656 [00:02<00:00, 1210.73 Reads/s]
    Writing selected reads for bam file ./data/sample_1.bam
    Writing : 100%|██████████████████████| 2653/2653 [00:02<00:00, 1247.70 Reads/s]
    Sort BAM File
    Index sorted BAM File
    Selected read counts summary
     valid reads: 21,185
     valid sampled reads: 5,309
     valid references: 17
(pyBioTools) 
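
The per-reference sampling described in the help text can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the tool's code: `sample_per_reference` and the `reads_by_ref` mapping are hypothetical names, rounding down the number of sampled reads is a guess, and the real command also handles several files, sorting and indexing.

```python
import random


def sample_per_reference(reads_by_ref, frac_reads=0.5, min_reads_ref=30, rand_seed=42):
    """Pick a fraction of the read ids of each sufficiently covered reference."""
    # A seed of None makes random.Random non-deterministic
    rng = random.Random(rand_seed)
    selected = {}
    for ref in sorted(reads_by_ref):
        read_ids = reads_by_ref[ref]
        # References below the coverage threshold are left out entirely
        if len(read_ids) < min_reads_ref:
            continue
        n_pick = int(len(read_ids) * frac_reads)  # rounding down is an assumption
        selected[ref] = sorted(rng.sample(read_ids, n_pick))
    return selected


# Toy data: one well covered reference and one below the threshold
reads_by_ref = {
    "ref_A": [f"read_A{i}" for i in range(200)],
    "ref_B": [f"read_B{i}" for i in range(50)],
}
selected = sample_per_reference(reads_by_ref, frac_reads=0.25, min_reads_ref=100)
print({ref: len(ids) for ref, ids in selected.items()})  # {'ref_A': 50}
```

With a fixed seed the same reads are selected on every run, which is what makes the list written to the selected reads file reproducible.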

Reads_sample

Get help

pyBioTools Alignment Reads_sample -h
usage: pyBioTools Alignment Reads_sample [-h] -i INPUT_FN [-o OUTPUT_FOLDER]
                                         [-p OUTPUT_PREFIX] [-r N_READS]
                                         [-s N_SAMPLES]
                                         [--rand_seed RAND_SEED] [-v] [-q]
                                         [--progress]

Randomly sample `n_reads` reads from a bam file and write downsampled files in
`n_samples` bam files. If the input bam file is not indexed by read_id
`index_reads` is automatically called.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FN, --input_fn INPUT_FN
                        Path to the indexed bam file (required) [str]
  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        Path to a folder where to write sample files (default:
                        ./) [str]
  -p OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                        Path to a folder where to write sample files (default:
                        out) [str]
  -r N_READS, --n_reads N_READS
                        Number of randomly selected reads in each sample
                        (default: 1000) [int]
  -s N_SAMPLES, --n_samples N_SAMPLES
                        Number of samples to generate files for (default: 1)
                        [int]
  --rand_seed RAND_SEED
                        Seed to use for the pseudo randon generator. For non
                        deterministic behaviour set to 0 (default: 42) [int]
  -v, --verbose         Increase verbosity (default: False)
  -q, --quiet           Reduce verbosity (default: False)
  --progress            Display a progress bar
(pyBioTools) 

Basic usage

pyBioTools Alignment Reads_sample -i ./data/sample_1.bam -o ./output/sample_reads -p 1K -r 1000 -s 3 --progress --verbose
## Running Alignment Reads_sample ##
    Checking Bam and index file
    Load index
    Index: 10772it [00:00, 528519.79it/s]
    Write sample reads
    Sample 1: 100%|██████████████████████| 1000/1000 [00:00<00:00, 1225.20 Reads/s]
    Indexing output bam file
    Sample 2: 100%|██████████████████████| 1000/1000 [00:00<00:00, 1255.67 Reads/s]
    Indexing output bam file
    Sample 3: 100%|██████████████████████| 1000/1000 [00:00<00:00, 1225.41 Reads/s]
    Indexing output bam file
(pyBioTools) 
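
As a rough illustration of drawing `n_samples` samples of `n_reads` reads each from the read_id index, here is a minimal sketch. The `draw_samples` function is hypothetical, and drawing each sample independently (so that a read can appear in more than one sample) is an assumption, not something the help text states. Treating a seed of 0 as non-deterministic follows the `--rand_seed` description above.

```python
import random


def draw_samples(read_ids, n_reads=1000, n_samples=1, rand_seed=42):
    """Return n_samples lists of n_reads read ids drawn without replacement."""
    # Seed 0 is mapped to None, which makes the generator non-deterministic
    rng = random.Random(rand_seed or None)
    return [rng.sample(read_ids, n_reads) for _ in range(n_samples)]


# Toy index with as many entries as the index loaded in the example above
read_ids = [f"read_{i}" for i in range(10772)]
samples = draw_samples(read_ids, n_reads=1000, n_samples=3)
print([len(s) for s in samples])  # [1000, 1000, 1000]
```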

Filter

Get help

pyBioTools Alignment Filter -h
usage: pyBioTools Alignment Filter [-h] -i INPUT_FN -o OUTPUT_FN [-u] [-s]
                                   [-p] [-t ORIENTATION] [-r MIN_READ_LEN]
                                   [-a MIN_ALIGN_LEN] [-m MIN_MAPQ]
                                   [-f MIN_FREQ_IDENTITY]
                                   [--select_ref [SELECT_REF [SELECT_REF ...]]]
                                   [--exclude_ref [EXCLUDE_REF [EXCLUDE_REF ...]]]
                                   [-v] [-q] [--progress]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FN, --input_fn INPUT_FN
                        Path to the bam file to filter (required) [str]
  -o OUTPUT_FN, --output_fn OUTPUT_FN
                        Path to the write filtered bam file (required) [str]
  -u, --skip_unmapped   Filter out unmapped reads (default: False) [None]
  -s, --skip_secondary  Filter out secondary alignment (default: False) [None]
  -p, --skip_supplementary
                        Filter out supplementary alignment (default: False)
                        [None]
  -t ORIENTATION, --orientation ORIENTATION
                        Orientation of alignment on reference genome {"+","-"
                        ,"."} (default: .) [str]
  -r MIN_READ_LEN, --min_read_len MIN_READ_LEN
                        Minimal query read length (basecalled length)
                        (default: 0) [int]
  -a MIN_ALIGN_LEN, --min_align_len MIN_ALIGN_LEN
                        Minimal query alignment length on reference (default:
                        0) [int]
  -m MIN_MAPQ, --min_mapq MIN_MAPQ
                        Minimal mapping quality score (mapq) (default: 0)
                        [int]
  -f MIN_FREQ_IDENTITY, --min_freq_identity MIN_FREQ_IDENTITY
                        Minimal frequency of alignment identity [0 to 1]
                        (default: 0) [float]
  --select_ref [SELECT_REF [SELECT_REF ...]]
                        List of references on which the reads have to be
                        mapped. (default: None) [str]
  --exclude_ref [EXCLUDE_REF [EXCLUDE_REF ...]]
                        List of references on which the reads should not be
                        mapped. (default: None) [str]
  -v, --verbose         Increase verbosity (default: False)
  -q, --quiet           Reduce verbosity (default: False)
  --progress            Display a progress bar
(pyBioTools) 

Basic usage

pyBioTools Alignment Filter \
    -i "./data/sample_1.bam" \
    -o "./output/sample_1_filter.bam" \
    --skip_unmapped \
    --skip_supplementary \
    --skip_secondary \
    --min_read_len 300 \
    --min_align_len 300 \
    --orientation "+" \
    --min_mapq 10 \
    --min_freq_identity 0.8 \
    --verbose
## Running Alignment Filter ##
    Checking input bam file
    Parsing reads
    Indexing output bam file
    Read counts summary
     Reads discarded
      total: 9,262
      wrong_orientation: 5,291
      secondary: 1,496
      unmapped: 1,416
      low_identity: 510
      low_mapping_quality: 283
      supplementary: 188
      short_alignment: 67
      short_read: 11
     Reads retained
      primary: 4,422
      total: 4,422
(pyBioTools) 
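
In the summary above each discarded read is counted once, which suggests that a read is assigned the first criterion it fails. The sketch below illustrates that idea only: the `Read` record, its field names and the order of the checks are assumptions, and it always drops unmapped, secondary and supplementary reads, as in the example command.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Read:
    """Hypothetical, simplified alignment record."""
    unmapped: bool = False
    secondary: bool = False
    supplementary: bool = False
    strand: str = "+"
    read_len: int = 0
    align_len: int = 0
    mapq: int = 0
    identity: float = 0.0


def discard_reason(read, orientation=".", min_read_len=0, min_align_len=0,
                   min_mapq=0, min_freq_identity=0.0):
    """Return the first failed criterion, or None if the read is retained."""
    if read.unmapped:
        return "unmapped"
    if read.secondary:
        return "secondary"
    if read.supplementary:
        return "supplementary"
    if orientation != "." and read.strand != orientation:
        return "wrong_orientation"
    if read.read_len < min_read_len:
        return "short_read"
    if read.align_len < min_align_len:
        return "short_alignment"
    if read.mapq < min_mapq:
        return "low_mapping_quality"
    if read.identity < min_freq_identity:
        return "low_identity"
    return None


# Same thresholds as the example command
params = dict(orientation="+", min_read_len=300, min_align_len=300,
              min_mapq=10, min_freq_identity=0.8)
reads = [
    Read(strand="+", read_len=500, align_len=450, mapq=60, identity=0.9),
    Read(strand="-", read_len=500, align_len=450, mapq=60, identity=0.9),
    Read(strand="+", read_len=500, align_len=450, mapq=5, identity=0.9),
    Read(unmapped=True),
]
summary = Counter(discard_reason(r, **params) or "retained" for r in reads)
print(dict(summary))
```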

To_fastq

Get help

pyBioTools Alignment To_fastq -h
usage: pyBioTools Alignment To_fastq [-h] -i [INPUT_FN [INPUT_FN ...]] -1
                                     OUTPUT_R1_FN [-2 OUTPUT_R2_FN] [-s] [-v]
                                     [-q] [--progress]

Dump reads from an alignment file or set of alignment file(s) to a fastq or
pair of fastq file(s). Only the primary alignment are kept and paired_end
reads are assumed to be interleaved. Compatible with unmapped or unaligned
alignment files as well as files without header.

optional arguments:
  -h, --help            show this help message and exit
  -i [INPUT_FN [INPUT_FN ...]], --input_fn [INPUT_FN [INPUT_FN ...]]
                        Path (or list of paths) to input BAM/CRAM/SAM file(s)
                        (required) [str]
  -1 OUTPUT_R1_FN, --output_r1_fn OUTPUT_R1_FN
                        Path to an output fastq file (for Read1 in paired_end
                        mode of output_r2_fn is provided). Automatically
                        gzipped if the .gz extension is found (required) [str]
  -2 OUTPUT_R2_FN, --output_r2_fn OUTPUT_R2_FN
                        Optional Path to an output fastq file. Automatically
                        gzipped if the .gz extension is found (default: None)
                        [str]
  -s, --ignore_paired_end
                        Ignore paired_end information and output everything in
                        a single file. (default: False) [None]
  -v, --verbose         Increase verbosity (default: False)
  -q, --quiet           Reduce verbosity (default: False)
  --progress            Display a progress bar
(pyBioTools) 

Single end read usage from bam files

pyBioTools Alignment To_fastq \
    -i ./data/sample_1.bam ./data/sample_2.bam \
    -1 ./output/sample_1-2_SE_from_bam.fastq.gz \
    --verbose \
    --progress
## Running Alignment To_fastq ##
    [DEBUG]: Opening file ./output/sample_1-2_SE_from_bam.fastq.gz in writing mode
    Parsing reads
    Reading input file ./data/sample_1.bam
    Reading: 12000 Reads [00:15, 753.11 Reads/s] 
    [DEBUG]: Reached end of input file ./data/sample_1.bam
    Reading input file ./data/sample_2.bam
    Reading: 12000 Reads [00:18, 664.44 Reads/s] 
    [DEBUG]: Reached end of input file ./data/sample_2.bam
    [DEBUG]: Closing file:./output/sample_1-2_SE_from_bam.fastq.gz
    [DEBUG]: Sequences writen: 24000
(pyBioTools) 

Paired-end reads usage from unaligned CRAM files

pyBioTools Alignment To_fastq \
    -i ./data/sample_1_20k.cram ./data/sample_2_20k.cram \
    -1 ./output/sample_1-2_PE_from_CRAM_1.fastq.gz \
    -2 ./output/sample_1-2_PE_from_CRAM_2.fastq.gz \
    --verbose \
    --progress
## Running Alignment To_fastq ##
    [DEBUG]: Opening file ./output/sample_1-2_PE_from_CRAM_1.fastq.gz in writing mode
    [DEBUG]: Opening file ./output/sample_1-2_PE_from_CRAM_2.fastq.gz in writing mode
    Parsing reads
    Reading input file ./data/sample_1_20k.cram
[E::cram_index_load] Could not retrieve index file for './data/sample_1_20k.cram'
    Reading: 12000 Reads [00:03, 3594.10 Reads/s]
    [DEBUG]: Reached end of input file ./data/sample_1_20k.cram
    Reading input file ./data/sample_2_20k.cram
[E::cram_index_load] Could not retrieve index file for './data/sample_2_20k.cram'
    Reading: 12000 Reads [00:03, 3628.22 Reads/s]
    [DEBUG]: Reached end of input file ./data/sample_2_20k.cram
    [DEBUG]: Closing file:./output/sample_1-2_PE_from_CRAM_1.fastq.gz
    [DEBUG]: Sequences writen: 24000
    [DEBUG]: Closing file:./output/sample_1-2_PE_from_CRAM_2.fastq.gz
    [DEBUG]: Sequences writen: 24000
(pyBioTools) 
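
The help text mentions two behaviours that are easy to picture in code: output files are gzipped when the name ends in `.gz`, and paired-end reads are assumed to be interleaved. The following is a minimal sketch of both, not the pyBioTools implementation. The helper names and the toy reads are hypothetical, and sending even-indexed reads to Read1 and odd-indexed reads to Read2 is an assumption about what "interleaved" means here.

```python
import gzip
import os
import tempfile


def open_fastq(path):
    """Open a FASTQ file for writing, gzip-compressed if the name ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "wt")
    return open(path, "w")


def fastq_record(name, seq, qual):
    """Format one 4-line FASTQ record."""
    return f"@{name}\n{seq}\n+\n{qual}\n"


def write_interleaved(reads, r1_path, r2_path):
    """Write alternating reads to the R1 and R2 files and return both counts."""
    n_r1 = n_r2 = 0
    with open_fastq(r1_path) as r1, open_fastq(r2_path) as r2:
        for i, (name, seq, qual) in enumerate(reads):
            if i % 2 == 0:
                r1.write(fastq_record(name, seq, qual))
                n_r1 += 1
            else:
                r2.write(fastq_record(name, seq, qual))
                n_r2 += 1
    return n_r1, n_r2


# Toy interleaved input: mate 1 then mate 2 of each pair
reads = [
    ("pair1", "ACGT", "IIII"), ("pair1", "TTGA", "IIII"),
    ("pair2", "GGCC", "IIII"), ("pair2", "AATT", "IIII"),
]
with tempfile.TemporaryDirectory() as tmp:
    r1_path = os.path.join(tmp, "demo_1.fastq.gz")
    r2_path = os.path.join(tmp, "demo_2.fastq.gz")
    n_r1, n_r2 = write_interleaved(reads, r1_path, r2_path)
    with gzip.open(r1_path, "rt") as fh:
        r1_text = fh.read()
print(n_r1, n_r2)  # 2 2
```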

Split

Get help

pyBioTools Alignment Split -h
usage: pyBioTools Alignment Split [-h] -i INPUT_FN [-o OUTPUT_DIR]
                                  [-n N_FILES]
                                  [-l [OUTPUT_FN_LIST [OUTPUT_FN_LIST ...]]]
                                  [-x] [-v] [-q] [--progress]

Split reads in a bam file in N files. The input bam file has to be sorted by
coordinates and indexed. The last file can contain a few extra reads.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FN, --input_fn INPUT_FN
                        Path to the bam file to filter (required) [str]
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Path to the directory where to write split bam files.
                        Files generated have the same basename as the source
                        file and are suffixed with numbers starting from 0
                        (default: None) [str]
  -n N_FILES, --n_files N_FILES
                        Number of file to split the original file into
                        (default: 10) [int]
  -l [OUTPUT_FN_LIST [OUTPUT_FN_LIST ...]], --output_fn_list [OUTPUT_FN_LIST [OUTPUT_FN_LIST ...]]
                        As an alternative to output_dir and n_files one can
                        instead give a list of output files. Reads will be
                        automatically split between the files in the same
                        order as given (default: None) [str]
  -x, --index           Index output BAM files (default: False) [None]
  -v, --verbose         Increase verbosity (default: False)
  -q, --quiet           Reduce verbosity (default: False)
  --progress            Display a progress bar
(pyBioTools) 

Basic usage with an output folder

pyBioTools Alignment Split \
    -i "./data/sample_1.bam" \
    -o "./output/split_bam" \
    -n 5 \
    --verbose

ll "./output/split_bam"
## Running Alignment Split ##
    Checking input bam file
    [DEBUG]: List of output files to generate:
    [DEBUG]: * ./output/split_bam/sample_1_0.bam
    [DEBUG]: * ./output/split_bam/sample_1_1.bam
    [DEBUG]: * ./output/split_bam/sample_1_2.bam
    [DEBUG]: * ./output/split_bam/sample_1_3.bam
    [DEBUG]: * ./output/split_bam/sample_1_4.bam
    Parsing reads
    [DEBUG]: Counting reads
    [DEBUG]: Open ouput file './output/split_bam/sample_1_0.bam'
    [DEBUG]: Close output file './output/split_bam/sample_1_0.bam'
    [DEBUG]: Reads written: 2,736
    [DEBUG]: Open ouput file './output/split_bam/sample_1_1.bam'
    [DEBUG]: Close output file './output/split_bam/sample_1_1.bam'
    [DEBUG]: Reads written: 2,736
    [DEBUG]: Open ouput file './output/split_bam/sample_1_2.bam'
    [DEBUG]: Close output file './output/split_bam/sample_1_2.bam'
    [DEBUG]: Reads written: 2,736
    [DEBUG]: Open ouput file './output/split_bam/sample_1_3.bam'
    [DEBUG]: Close output file './output/split_bam/sample_1_3.bam'
    [DEBUG]: Reads written: 2,736
    [DEBUG]: Open ouput file './output/split_bam/sample_1_4.bam'
    [DEBUG]: Reached end of input file
    [DEBUG]: Close output file './output/split_bam/sample_1_4.bam'
    [DEBUG]: Reads written: 2,740
    Read counts summary
     Reads from index: 13,684
     Reads writen: 13,684
     Reads per file: 2,736
total 38M
-rw-rw-r-- 1 aleg aleg 8.4M Jan 19 14:57 sample_1_0.bam
-rw-rw-r-- 1 aleg aleg 8.4M Jan 19 14:57 sample_1_1.bam
-rw-rw-r-- 1 aleg aleg 8.5M Jan 19 14:57 sample_1_2.bam
-rw-rw-r-- 1 aleg aleg 8.5M Jan 19 14:57 sample_1_3.bam
-rw-rw-r-- 1 aleg aleg 3.5M Jan 19 14:57 sample_1_4.bam
(pyBioTools) 

Basic usage with named output files

pyBioTools Alignment Split \
    -i "./data/sample_1.bam" \
    -l "./output/split_bam_2/f1.bam" "./output/split_bam_2/f2.bam" "./output/split_bam_2/f3.bam" "./output/split_bam_2/f4.bam" \
    --verbose \
    --index

ll "./output/split_bam_2"
## Running Alignment Split ##
    Checking input bam file
    [DEBUG]: List of output files to generate:
    [DEBUG]: * ./output/split_bam_2/f1.bam
    [DEBUG]: * ./output/split_bam_2/f2.bam
    [DEBUG]: * ./output/split_bam_2/f3.bam
    [DEBUG]: * ./output/split_bam_2/f4.bam
    Parsing reads
    [DEBUG]: Counting reads
    [DEBUG]: Open ouput file './output/split_bam_2/f1.bam'
    [DEBUG]: Close output file './output/split_bam_2/f1.bam'
    [DEBUG]: Reads written: 3,421
    [DEBUG]: index output file './output/split_bam_2/f1.bam'
    [DEBUG]: Open ouput file './output/split_bam_2/f2.bam'
    [DEBUG]: Close output file './output/split_bam_2/f2.bam'
    [DEBUG]: Reads written: 3,421
    [DEBUG]: index output file './output/split_bam_2/f2.bam'
    [DEBUG]: Open ouput file './output/split_bam_2/f3.bam'
    [DEBUG]: Close output file './output/split_bam_2/f3.bam'
    [DEBUG]: Reads written: 3,421
    [DEBUG]: index output file './output/split_bam_2/f3.bam'
    [DEBUG]: Open ouput file './output/split_bam_2/f4.bam'
    [DEBUG]: Reached end of input file
    [DEBUG]: Close output file './output/split_bam_2/f4.bam'
    [DEBUG]: Reads written: 3,421
    [DEBUG]: index output file './output/split_bam_2/f4.bam'
    Read counts summary
     Reads from index: 13,684
     Reads writen: 13,684
     Reads per file: 3,421
total 38M
-rw-rw-r-- 1 aleg aleg  11M Jan 19 14:57 f1.bam
-rw-rw-r-- 1 aleg aleg 6.2K Jan 19 14:57 f1.bam.bai
-rw-rw-r-- 1 aleg aleg  12M Jan 19 14:57 f2.bam
-rw-rw-r-- 1 aleg aleg 7.4K Jan 19 14:57 f2.bam.bai
-rw-rw-r-- 1 aleg aleg 9.4M Jan 19 14:57 f3.bam
-rw-rw-r-- 1 aleg aleg 5.0K Jan 19 14:57 f3.bam.bai
-rw-rw-r-- 1 aleg aleg 5.6M Jan 19 14:57 f4.bam
-rw-rw-r-- 1 aleg aleg 2.5K Jan 19 14:57 f4.bam.bai
(pyBioTools)
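
The read counts in the two Split runs are consistent with integer division of the total number of reads by the number of files, with the remainder going to the last file. The `reads_per_file` helper below is hypothetical and is only a worked check of that arithmetic against the logs, not the tool's own code.

```python
def reads_per_file(n_reads, n_files):
    """Split n_reads evenly, giving the remainder to the last file."""
    base = n_reads // n_files
    sizes = [base] * n_files
    sizes[-1] += n_reads % n_files
    return sizes


print(reads_per_file(13684, 5))  # [2736, 2736, 2736, 2736, 2740]
print(reads_per_file(13684, 4))  # [3421, 3421, 3421, 3421]
```

This matches the "Reads written" lines above: 2,740 reads in the last of 5 files, and exactly 3,421 reads in each of 4 files.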