NanoCount command line usage

Activate virtual environment

conda activate nanocount


Running NanoCount

NanoCount --help

usage: NanoCount [-h] [--version] -i ALIGNMENT_FILE [-o COUNT_FILE]
                 [-b FILTER_BAM_OUT] [-l MIN_ALIGNMENT_LENGTH]
                 [-f MIN_QUERY_FRACTION_ALIGNED] [-s SEC_SCORING_VALUE]
                 [-t SEC_SCORING_THRESHOLD] [-c CONVERGENCE_TARGET]
                 [-e MAX_EM_ROUNDS] [-x] [-p PRIMARY_SCORE] [-a]
                 [-d MAX_DIST_3_PRIME] [-u MAX_DIST_5_PRIME] [-v] [-q]

NanoCount estimates transcripts abundance from Oxford Nanopore *direct-RNA
sequencing* datasets, using an expectation-maximization approach like RSEM,
Kallisto, salmon, etc to handle the uncertainty of multi-mapping reads

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Input/Output options:
  -i ALIGNMENT_FILE, --alignment_file ALIGNMENT_FILE
                        Sorted and indexed BAM or SAM file containing aligned
                        ONT dRNA-Seq reads including secondary alignments
                        (required) [str]
  -o COUNT_FILE, --count_file COUNT_FILE
                        Output file path where to write estimated counts (TSV
                        format) (default: None) [str]
  -b FILTER_BAM_OUT, --filter_bam_out FILTER_BAM_OUT
                        Optional output file path where to write filtered
                        reads selected by NanoCount to perform quantification
                        estimation (BAM format) (default: None) [str]

Misc options:
  -l MIN_ALIGNMENT_LENGTH, --min_alignment_length MIN_ALIGNMENT_LENGTH
                        Minimal length of the alignment to be considered valid
                        (default: 50) [int]
  -f MIN_QUERY_FRACTION_ALIGNED, --min_query_fraction_aligned MIN_QUERY_FRACTION_ALIGNED
                        Minimal fraction of the primary alignment query
                        aligned to consider the read valid (default: 0.5)
                        [float]
  -s SEC_SCORING_VALUE, --sec_scoring_value SEC_SCORING_VALUE
                        Value to use for score thresholding of secondary
                        alignments either "alignment_score" or
                        "alignment_length" (default: alignment_score) [str]
  -t SEC_SCORING_THRESHOLD, --sec_scoring_threshold SEC_SCORING_THRESHOLD
                        Fraction of the alignment score or the alignment
                        length of secondary alignments compared to the primary
                        alignment to be considered valid alignments (default:
                        0.95) [float]
  -c CONVERGENCE_TARGET, --convergence_target CONVERGENCE_TARGET
                        Convergence target value of the cummulative difference
                        between abundance values of successive EM round to
                        trigger the end of the EM loop. (default: 0.005)
                        [float]
  -e MAX_EM_ROUNDS, --max_em_rounds MAX_EM_ROUNDS
                        Maximum number of EM rounds before triggering stop
                        (default: 100) [int]
  -x, --extra_tx_info   Add transcripts length and zero coverage transcripts
                        to the output file (required valid bam/sam header)
                        (default: False) [boolean]
  -p PRIMARY_SCORE, --primary_score PRIMARY_SCORE
                        Method to pick the best alignment for each read. By
                        default ("alignment_score") uses the best alignment
                        score (AS optional field), but it can be changed to
                        use either the primary alignment defined by the
                        aligner ("primary") or the longest alignment
                        ("alignment_length"). choices = [primary,
                        alignment_score, alignment_length] (default:
                        alignment_score) [str]
  -a, --keep_suplementary
                        Retain any supplementary alignments and considered
                        them like secondary alignments. Discarded by default.
                        (default: False) [boolean]
  -d MAX_DIST_3_PRIME, --max_dist_3_prime MAX_DIST_3_PRIME
                        Maximum distance of alignment end to 3 prime of
                        transcript. In ONT dRNA-Seq reads are assumed to start
                        from the polyA tail (-1 to deactivate) (default: 50)
                        [int]
  -u MAX_DIST_5_PRIME, --max_dist_5_prime MAX_DIST_5_PRIME
                        Maximum distance of alignment start to 5 prime of
                        transcript. In conjunction with max_dist_3_prime it
                        can be used to select near full transcript reads only
                        (-1 to deactivate). (default: -1) [int]

Verbosity options:
  -v, --verbose         Increase verbosity for QC and debugging (default:
                        False) [boolean]
  -q, --quiet           Reduce verbosity (default: False) [boolean]
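The -i input must be a sorted and indexed BAM (or SAM) file that still contains secondary alignments. One possible way to prepare such a file is sketched below, assuming minimap2 and samtools are installed. The function name, the file names, and the particular -p 0 -N 10 values are illustrative assumptions, not requirements stated in the help text.

```shell
# Hypothetical helper: align dRNA-Seq reads to a transcriptome and produce
# a sorted, indexed BAM that keeps secondary alignments.
# Assumes minimap2 and samtools are on PATH; all file names are placeholders.
align_for_nanocount() {
    transcriptome=$1   # transcriptome FASTA (not the genome)
    reads=$2           # basecalled reads in FASTQ
    out_bam=$3         # destination BAM

    # -ax map-ont : Nanopore preset with SAM output
    # -p 0        : no minimum secondary-to-primary score ratio
    # -N 10       : retain up to 10 secondary alignments per read
    minimap2 -ax map-ont -p 0 -N 10 "$transcriptome" "$reads" \
        | samtools sort -o "$out_bam" - \
        && samtools index "$out_bam"
}
```

It could then be called as: align_for_nanocount transcriptome.fa reads.fastq ./data/aligned_reads_sorted.bam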

Basic command

NanoCount -i ./data/aligned_reads_sorted.bam -o ./output/tx_counts.tsv
head ./output/tx_counts.tsv

## Checking options and input files ##
## Initialise Nanocount ##
    Parse Bam file and filter low quality alignments
    Summary of alignments parsed in input bam file
        Valid alignments: 150,517
        Discarded unmapped alignments: 9,545
        Discarded alignment with invalid 3 prime end: 6,133
        Discarded negative strand alignments: 4,515
        Discarded supplementary alignments: 334
    Summary of reads filtered
        Reads with valid best alignment: 85,908
        Invalid secondary alignments: 60,120
        Valid secondary alignments: 2,622
        Reads with low query fraction aligned: 1,628
    Generate initial read/transcript compatibility index
## Start EM abundance estimate ##
    Progress: 2.00 rounds [00:00, 7.41 rounds/s]
    Exit EM loop after 2 rounds
    Convergence value: 0.0019361726963877538
## Summarize data ##
    Convert results to dataframe
    Compute estimated counts and TPM
    Write file
transcript_name  raw                   est_count           tpm
YHR174W_mRNA     0.5881056948454584    50522.984032783635  588105.6948454584
YGR192C_mRNA     0.02083282680839274   1789.7064854554035  20832.82680839274
YLR110C_mRNA     0.009591656190343158  824.0               9591.656190343158
YOL086C_mRNA     0.008299576290915864  713.0               8299.576290915864
YKL060C_mRNA     0.006518601294407972  560.0               6518.601294407972
YCR012W_mRNA     0.005412767146249476  464.99999999999994  5412.767146249475
YPR080W_mRNA     0.005255622293616427  451.5               5255.622293616427
YBR118W_mRNA     0.005255622293616427  451.5               5255.622293616427
YKL152C_mRNA     0.005226521394980677  449.0               5226.5213949806775
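In the table above, the columns appear to be related as follows: est_count equals raw multiplied by the number of reads with a valid best alignment (85,908 in this run), and tpm equals raw multiplied by one million. This relationship is inferred from the numbers shown rather than stated in the help text, so the sketch below is a consistency check, not a description of NanoCount internals. The function name is hypothetical.

```shell
# Hypothetical helper: recompute est_count and tpm from the raw abundance.
# Assumption (inferred from the output above, not from the help text):
#   est_count = raw * valid_reads   and   tpm = raw * 1,000,000
raw_to_counts() {
    # $1 = raw abundance, $2 = number of reads with a valid best alignment
    awk -v raw="$1" -v n="$2" 'BEGIN { printf "%.1f\t%.1f\n", raw * n, raw * 1e6 }'
}

# YLR110C_mRNA from the run above: raw = 0.009591656190343158, 85,908 valid reads
raw_to_counts 0.009591656190343158 85908
```

After rounding to one decimal, the two printed values should match the est_count and tpm columns for that transcript (824.0 and 9591.7).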

Changing the default filters on distance to transcript ends

NanoCount -i ./data/aligned_reads_sorted.bam -o ./output/tx_counts.tsv --max_dist_3_prime 10 --max_dist_5_prime 10
head ./output/tx_counts.tsv

## Checking options and input files ##
## Initialise Nanocount ##
    Parse Bam file and filter low quality alignments
    Summary of alignments parsed in input bam file
        Valid alignments: 73,329
        Discarded alignment with invalid 5 prime end: 44,897
        Discarded alignment with invalid 3 prime end: 38,424
        Discarded unmapped alignments: 9,545
        Discarded negative strand alignments: 4,515
        Discarded supplementary alignments: 334
    Summary of reads filtered
        Reads with valid best alignment: 46,241
        Invalid secondary alignments: 25,688
        Reads with low query fraction aligned: 687
        Valid secondary alignments: 606
    Generate initial read/transcript compatibility index
## Start EM abundance estimate ##
    Progress: 2.00 rounds [00:00, 13.8 rounds/s]
    Exit EM loop after 2 rounds
    Convergence value: 0.000702479043822885
## Summarize data ##
    Convert results to dataframe
    Compute estimated counts and TPM
    Write file
transcript_name  raw                   est_count           tpm
YHR174W_mRNA     0.6314525433905865    29198.997058924113  631452.5433905865
YGR192C_mRNA     0.02019852511840142   934.0               20198.52511840142
YLR110C_mRNA     0.011461689842347701  530.0               11461.689842347701
YOL086C_mRNA     0.008217815358664388  379.99999999999994  8217.815358664388
YKL152C_mRNA     0.005428083302696741  251.0               5428.083302696741
YKL060C_mRNA     0.005384831642914297  249.0               5384.831642914297
YDL081C_mRNA     0.005125321684219632  237.0               5125.321684219632
YOR369C_mRNA     0.004433295127700526  205.0               4433.295127700526
YDL130W_mRNA     0.004152159339114638  191.99999999999997  4152.159339114638
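The two distance filters can be read as a simple predicate on each alignment. The sketch below follows the help text (distance of the alignment end to the transcript 3 prime end, distance of the alignment start to the 5 prime end, -1 to deactivate). The function name, the coordinate convention, and the inclusive comparison are assumptions made for illustration, not NanoCount's actual implementation.

```shell
# Hypothetical predicate mirroring --max_dist_3_prime / --max_dist_5_prime.
# Args: alignment start, alignment end, transcript length,
#       max_dist_3_prime (default 50), max_dist_5_prime (default -1)
# Returns 0 (success) if the alignment passes both filters.
passes_end_filters() {
    aln_start=$1; aln_end=$2; tx_len=$3
    max3=${4:-50}     # default taken from the help text
    max5=${5:--1}     # -1 deactivates the filter

    dist3=$((tx_len - aln_end))   # distance from alignment end to the 3 prime end
    dist5=$aln_start              # distance from the 5 prime end to alignment start

    if [ "$max3" -ge 0 ] && [ "$dist3" -gt "$max3" ]; then return 1; fi
    if [ "$max5" -ge 0 ] && [ "$dist5" -gt "$max5" ]; then return 1; fi
    return 0
}

# An alignment covering positions 30..1300 of a 1314-nt transcript:
passes_end_filters 30 1300 1314 && echo "kept with defaults"
passes_end_filters 30 1300 1314 10 10 || echo "discarded with 10/10"
```

With the defaults only the 3 prime distance (14) is checked, so the alignment is kept. With both limits set to 10 it is discarded, which matches the drop in valid alignments seen in the log above.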

Adding extra transcript information

The extra_tx_info option adds a column with the transcript lengths and also includes all zero-coverage transcripts in the results.

NanoCount -i ./data/aligned_reads_sorted.bam -o ./output/tx_counts.tsv --extra_tx_info
head ./output/tx_counts.tsv

## Checking options and input files ##
## Initialise Nanocount ##
    Parse Bam file and filter low quality alignments
    Summary of alignments parsed in input bam file
        Valid alignments: 150,517
        Discarded unmapped alignments: 9,545
        Discarded alignment with invalid 3 prime end: 6,133
        Discarded negative strand alignments: 4,515
        Discarded supplementary alignments: 334
    Summary of reads filtered
        Reads with valid best alignment: 85,908
        Invalid secondary alignments: 60,120
        Valid secondary alignments: 2,622
        Reads with low query fraction aligned: 1,628
    Generate initial read/transcript compatibility index
## Start EM abundance estimate ##
    Progress: 2.00 rounds [00:00, 8.77 rounds/s]
    Exit EM loop after 2 rounds
    Convergence value: 0.0019361726963877538
## Summarize data ##
    Convert results to dataframe
    Compute estimated counts and TPM
    Write file
transcript_name  raw                   est_count           tpm                 transcript_length
YHR174W_mRNA     0.5881056948454584    50522.984032783635  588105.6948454584   1314
YGR192C_mRNA     0.02083282680839274   1789.7064854554035  20832.82680839274   999
YLR110C_mRNA     0.009591656190343158  824.0               9591.656190343158   402
YOL086C_mRNA     0.008299576290915864  713.0               8299.576290915864   1047
YKL060C_mRNA     0.006518601294407972  560.0               6518.601294407972   1080
YCR012W_mRNA     0.005412767146249476  464.99999999999994  5412.767146249475   1251
YBR118W_mRNA     0.005255622293616427  451.5               5255.622293616427   1377
YPR080W_mRNA     0.005255622293616427  451.5               5255.622293616427   1377
YKL152C_mRNA     0.005226521394980677  449.0               5226.5213949806775  744
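With extra_tx_info enabled, the output also lists transcripts without coverage, so downstream steps may want to separate them. Below is a minimal sketch with awk, assuming the tab-separated column order shown above (est_count in column 3) and that zero-coverage transcripts are reported with an est_count of 0. The function name, the sample file, and the transcript named TX_ZERO_mRNA are made up for illustration.

```shell
# Hypothetical helper: print the header plus transcripts with non-zero estimated counts.
# Assumes a TSV with columns: transcript_name, raw, est_count, tpm, transcript_length
drop_zero_coverage() {
    awk -F '\t' 'NR == 1 || $3 > 0' "$1"
}

# Small stand-in for ./output/tx_counts.tsv (the last row is invented)
sample=$(mktemp)
printf 'transcript_name\traw\test_count\ttpm\ttranscript_length\n' > "$sample"
printf 'YLR110C_mRNA\t0.009591656190343158\t824.0\t9591.656190343158\t402\n' >> "$sample"
printf 'TX_ZERO_mRNA\t0.0\t0.0\t0.0\t1200\n' >> "$sample"

drop_zero_coverage "$sample"
```

On a real run the same function could be pointed at ./output/tx_counts.tsv instead of the temporary sample file.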

Writing selected alignments to a BAM file

NanoCount -i ./data/aligned_reads_sorted.bam -o ./output/tx_counts.tsv -b ./output/aligned_reads_selected.bam --extra_tx_info
head ./output/tx_counts.tsv

## Checking options and input files ##
## Initialise Nanocount ##
    Parse Bam file and filter low quality alignments
    Summary of alignments parsed in input bam file
        Valid alignments: 150,517
        Discarded unmapped alignments: 9,545
        Discarded alignment with invalid 3 prime end: 6,133
        Discarded negative strand alignments: 4,515
        Discarded supplementary alignments: 334
    Summary of reads filtered
        Reads with valid best alignment: 85,908
        Invalid secondary alignments: 60,120
        Valid secondary alignments: 2,622
        Reads with low query fraction aligned: 1,628
    Write selected alignments to BAM file
    Summary of alignments written to bam
        Alignments to select: 88,530
        Alignments written: 88,530
        Alignments skipped: 82,514
    Generate initial read/transcript compatibility index
## Start EM abundance estimate ##
    Progress: 2.00 rounds [00:00, 7.98 rounds/s]
    Exit EM loop after 2 rounds
    Convergence value: 0.0019361726963877538
## Summarize data ##
    Convert results to dataframe
    Compute estimated counts and TPM
    Write file
transcript_name  raw                   est_count           tpm                 transcript_length
YHR174W_mRNA     0.5881056948454584    50522.984032783635  588105.6948454584   1314
YGR192C_mRNA     0.02083282680839274   1789.7064854554035  20832.82680839274   999
YLR110C_mRNA     0.009591656190343158  824.0               9591.656190343158   402
YOL086C_mRNA     0.008299576290915864  713.0               8299.576290915864   1047
YKL060C_mRNA     0.006518601294407972  560.0               6518.601294407972   1080
YCR012W_mRNA     0.005412767146249476  464.99999999999994  5412.767146249475   1251
YBR118W_mRNA     0.005255622293616427  451.5               5255.622293616427   1377
YPR080W_mRNA     0.005255622293616427  451.5               5255.622293616427   1377
YKL152C_mRNA     0.005226521394980677  449.0               5226.5213949806775  744
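The counts in the log above can be cross-checked with simple arithmetic. In this run, the number of alignments to select equals the reads with a valid best alignment plus the valid secondary alignments, and the written plus skipped alignments add up to the sum of all alignment categories parsed from the input. The sketch below only re-adds numbers copied from the log; the variable names are arbitrary.

```shell
# Numbers copied from the log above (thousands separators removed)
valid_best=85908        # Reads with valid best alignment
valid_secondary=2622    # Valid secondary alignments
written=88530           # Alignments written
skipped=82514           # Alignments skipped

# Sum of the categories under "Summary of alignments parsed in input bam file":
# valid + unmapped + invalid 3 prime end + negative strand + supplementary
parsed_total=$((150517 + 9545 + 6133 + 4515 + 334))

echo "selected: $((valid_best + valid_secondary))"   # compare with "Alignments to select"
echo "written + skipped: $((written + skipped))"
echo "parsed total: $parsed_total"
```

If samtools is available, running samtools view -c on ./output/aligned_reads_selected.bam is another way to confirm the number of alignments written.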

Relaxing the secondary alignment scoring threshold

The default value is 0.95 (95% of the alignment score of the primary alignment), but it can be lowered so that more secondary alignments are included in the uncertainty calculation. Lowering the value below 0.75 is unlikely to be useful and will considerably increase the computation time.

NanoCount -i ./data/aligned_reads_sorted.bam -o ./output/tx_counts.tsv --sec_scoring_threshold 0.8
head ./output/tx_counts.tsv

## Checking options and input files ##
## Initialise Nanocount ##
    Parse Bam file and filter low quality alignments
    Summary of alignments parsed in input bam file
        Valid alignments: 150,517
        Discarded unmapped alignments: 9,545
        Discarded alignment with invalid 3 prime end: 6,133
        Discarded negative strand alignments: 4,515
        Discarded supplementary alignments: 334
    Summary of reads filtered
        Reads with valid best alignment: 85,908
        Valid secondary alignments: 49,092
        Invalid secondary alignments: 13,650
        Reads with low query fraction aligned: 1,628
    Generate initial read/transcript compatibility index
## Start EM abundance estimate ##
    Progress: 17.0 rounds [00:02, 7.01 rounds/s]
    Exit EM loop after 17 rounds
    Convergence value: 0.004795139982321842
## Summarize data ##
    Convert results to dataframe
    Compute estimated counts and TPM
    Write file
transcript_name  raw                    est_count           tpm
YHR174W_mRNA     0.5770419415271139     49572.5191127113    577041.9415271139
YGR192C_mRNA     0.014985653368924351   1287.3875096175532  14985.653368924352
YGR254W_mRNA     0.012367659441483866   1062.480887298996   12367.659441483866
YLR110C_mRNA     0.009591656190343162   824.0000000000003   9591.656190343161
YJR009C_mRNA     0.00941808679575318    809.0890004495642   9418.08679575318
YOL086C_mRNA     0.008299576290915867   713.0000000000003   8299.576290915868
YKL060C_mRNA     0.006518601294407974   560.0000000000002   6518.601294407974
YCR012W_mRNA     0.005412767146249479   465.0000000000003   5412.767146249479
YPR080W_mRNA     0.0052556222936164295  451.5000000000002   5255.6222936164295
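The threshold can be thought of as a ratio test between each secondary alignment and the primary alignment of the same read. The sketch below illustrates that test under the assumption that a secondary alignment is kept when its score is at least the threshold times the primary score. The function name, the example scores, and the inclusive comparison are illustrative and not taken from NanoCount's code.

```shell
# Hypothetical predicate: is a secondary alignment valid for a given threshold?
# Args: secondary score, primary score, threshold (default 0.95 as in the help text)
# Returns 0 (success) when secondary >= threshold * primary.
secondary_is_valid() {
    awk -v sec="$1" -v prim="$2" -v thr="${3:-0.95}" \
        'BEGIN { exit !(sec >= thr * prim) }'
}

# A secondary alignment scoring 850 against a primary scoring 1000 (ratio 0.85):
secondary_is_valid 850 1000      || echo "rejected at 0.95"
secondary_is_valid 850 1000 0.8  && echo "accepted at 0.8"
```

Alignments in this intermediate range are the ones that move from "Invalid" to "Valid secondary alignments" in the log above when the threshold is lowered to 0.8.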

Verbose mode

The verbose option prints additional information for QC and debugging.

NanoCount -i ./data/aligned_reads_sorted.bam -o ./output/tx_counts.tsv --sec_scoring_threshold 0.8 --verbose

## Checking options and input files ##
    [DEBUG]: Options summary
    [DEBUG]:    Package name: NanoCount
    [DEBUG]:    Package version: 0.3.0.dev2
    [DEBUG]:    Timestamp: 2021-09-08 22:54:12.755159
    [DEBUG]:    alignment_file: ./data/aligned_reads_sorted.bam
    [DEBUG]:    count_file: ./output/tx_counts.tsv
    [DEBUG]:    filter_bam_out: 
    [DEBUG]:    min_alignment_length: 50
    [DEBUG]:    keep_suplementary: False
    [DEBUG]:    min_query_fraction_aligned: 0.5
    [DEBUG]:    sec_scoring_threshold: 0.8
    [DEBUG]:    sec_scoring_value: alignment_score
    [DEBUG]:    convergence_target: 0.005
    [DEBUG]:    max_em_rounds: 100
    [DEBUG]:    extra_tx_info: False
    [DEBUG]:    primary_score: alignment_score
    [DEBUG]:    max_dist_3_prime: 50
    [DEBUG]:    max_dist_5_prime: -1
    [DEBUG]:    verbose: True
    [DEBUG]:    quiet: False
## Initialise Nanocount ##
    Parse Bam file and filter low quality alignments
    Summary of alignments parsed in input bam file
        Valid alignments: 150,517
        Discarded unmapped alignments: 9,545
        Discarded alignment with invalid 3 prime end: 6,133
        Discarded negative strand alignments: 4,515
        Discarded supplementary alignments: 334
    Summary of reads filtered
        Reads with valid best alignment: 85,908
        Valid secondary alignments: 49,092
        Invalid secondary alignments: 13,650
        Reads with low query fraction aligned: 1,628
    Generate initial read/transcript compatibility index
## Start EM abundance estimate ##
    [DEBUG]: EM Round: 1 / Convergence value: 1
    [DEBUG]: EM Round: 2 / Convergence value: 0.08982516174030376
    [DEBUG]: EM Round: 3 / Convergence value: 0.07275793447585568
    [DEBUG]: EM Round: 4 / Convergence value: 0.05953041461618004
    [DEBUG]: EM Round: 5 / Convergence value: 0.04879243854714777
    [DEBUG]: EM Round: 6 / Convergence value: 0.040022962888262556
    [DEBUG]: EM Round: 7 / Convergence value: 0.03285040500110691
    [DEBUG]: EM Round: 8 / Convergence value: 0.026980252318091508
    [DEBUG]: EM Round: 9 / Convergence value: 0.022174110853707095
    [DEBUG]: EM Round: 10 / Convergence value: 0.01823785737980107
    [DEBUG]: EM Round: 11 / Convergence value: 0.015013106051349104
    [DEBUG]: EM Round: 12 / Convergence value: 0.012370502416389305
    [DEBUG]: EM Round: 13 / Convergence value: 0.010204386062917101
    [DEBUG]: EM Round: 14 / Convergence value: 0.008428311617153536
    [DEBUG]: EM Round: 15 / Convergence value: 0.0069715401043749445
    [DEBUG]: EM Round: 16 / Convergence value: 0.005776253476233076
    [DEBUG]: EM Round: 17 / Convergence value: 0.004795139982321842
    Exit EM loop after 17 rounds
    Convergence value: 0.004795139982321842
## Summarize data ##
    Convert results to dataframe
    Compute estimated counts and TPM
    Write file
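The DEBUG lines make it easy to follow convergence: in the run above the EM loop exits at the first round whose convergence value falls below convergence_target (0.005 by default), and max_em_rounds caps the number of rounds otherwise. Below is a small sketch for pulling that round out of a saved verbose log. The function name and the idea of saving the log to a file are assumptions; the sample lines are copied from the output above.

```shell
# Hypothetical helper: report the first EM round whose convergence value
# is below the target. $1 = log file, $2 = target (default 0.005).
first_round_below() {
    awk -v target="${2:-0.005}" '
        /EM Round:/ {
            round = $4          # fields: [DEBUG]: EM Round: N / Convergence value: V
            value = $NF
            if (value + 0 < target + 0) { print round; exit }
        }' "$1"
}

# Stand-in for a saved verbose log
log=$(mktemp)
cat > "$log" <<'EOF'
    [DEBUG]: EM Round: 15 / Convergence value: 0.0069715401043749445
    [DEBUG]: EM Round: 16 / Convergence value: 0.005776253476233076
    [DEBUG]: EM Round: 17 / Convergence value: 0.004795139982321842
EOF

first_round_below "$log"
```

For the sample lines this reports round 17, the same round at which the log shows the EM loop exiting.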