NanoCount python API

Activate virtual environment

from NanoCount.NanoCount import NanoCount
from NanoCount.common import jhelp, head

Running NanoCount

jhelp(NanoCount)

NanoCount (alignment_file, count_file, filter_bam_out, min_alignment_length, keep_suplementary, min_query_fraction_aligned, sec_scoring_threshold, sec_scoring_value, convergence_target, max_em_rounds, extra_tx_info, primary_score, max_dist_3_prime, max_dist_5_prime, verbose, quiet)

Estimate abundance of transcripts using an EM

alignment_file (required) [str]

Sorted and indexed BAM or SAM file containing aligned ONT dRNA-Seq reads including secondary alignments

count_file (default: "") [str]

Output file path where to write estimated counts (TSV format)

filter_bam_out (default: "") [str]

Optional output file path where to write filtered reads selected by NanoCount to perform quantification estimation (BAM format)

min_alignment_length (default: 50) [int]

Minimal length of the alignment to be considered valid

keep_suplementary (default: False) [bool]

Retain any supplementary alignments and considered them like secondary alignments. Discarded by default.

min_query_fraction_aligned (default: 0.5) [float]

Minimal fraction of the primary alignment query aligned to consider the read valid

sec_scoring_threshold (default: 0.95) [float]

Fraction of the alignment score or the alignment length of secondary alignments compared to the primary alignment to be considered valid alignments

sec_scoring_value (default: alignment_score) [str]

Value to use for score thresholding of secondary alignments either "alignment_score" or "alignment_length"

convergence_target (default: 0.005) [float]

Convergence target value of the cummulative difference between abundance values of successive EM round to trigger the end of the EM loop.

max_em_rounds (default: 100) [int]

Maximum number of EM rounds before triggering stop

extra_tx_info (default: False) [bool]

Add transcripts length and zero coverage transcripts to the output file (required valid bam/sam header)

primary_score (default: alignment_score) [str]

Method to pick the best alignment for each read. By default ("alignment_score") uses the best alignment score (AS optional field), but it can be changed to use either the primary alignment defined by the aligner ("primary") or the longest alignment ("alignment_length"). choices = [primary, alignment_score, alignment_length]

max_dist_3_prime (default: 50) [int]

Maximum distance of alignment end to 3 prime of transcript. In ONT dRNA-Seq reads are assumed to start from the polyA tail (-1 to deactivate)

max_dist_5_prime (default: -1) [int]

Maximum distance of alignment start to 5 prime of transcript. In conjunction with max_dist_3_prime it can be used to select near full transcript reads only (-1 to deactivate).

verbose (default: False) [bool]

Increase verbosity for QC and debugging

quiet (default: False) [bool]

Reduce verbosity

Basic command

NanoCount (alignment_file="./data/aligned_reads_sorted.bam", count_file="./output/tx_counts.tsv")
head("./output/tx_counts.tsv")

## Checking options and input files ##
## Initialise Nanocount ##
    Parse Bam file and filter low quality alignments
    Summary of alignments parsed in input bam file
        Valid alignments: 150,517
        Discarded unmapped alignments: 9,545
        Discarded alignment with invalid 3 prime end: 6,133
        Discarded negative strand alignments: 4,515
        Discarded supplementary alignments: 334
    Summary of reads filtered
        Reads with valid best alignment: 85,908
        Invalid secondary alignments: 60,120
        Valid secondary alignments: 2,622
        Reads with low query fraction aligned: 1,628
    Generate initial read/transcript compatibility index
## Start EM abundance estimate ##
    Progress: 2.00 rounds [00:00, 7.55 rounds/s]
    Exit EM loop after 2 rounds
    Convergence value: 0.0019361726963877538
## Summarize data ##
    Convert results to dataframe
    Compute estimated counts and TPM
    Write file

transcript_name raw                  est_count          tpm                
YHR174W_mRNA    0.5881056948454584   50522.984032783635 588105.6948454584  
YGR192C_mRNA    0.02083282680839274  1789.7064854554035 20832.82680839274  
YLR110C_mRNA    0.009591656190343158 824.0              9591.656190343158  
YOL086C_mRNA    0.008299576290915864 713.0              8299.576290915864  
YKL060C_mRNA    0.006518601294407972 560.0              6518.601294407972  
YCR012W_mRNA    0.005412767146249476 464.99999999999994 5412.767146249475  
YPR080W_mRNA    0.005255622293616427 451.5              5255.622293616427  
YBR118W_mRNA    0.005255622293616427 451.5              5255.622293616427  
YKL152C_mRNA    0.005226521394980677 449.0              5226.5213949806775

Using Best Alignment score rather than Primary reads as best reads

NanoCount (alignment_file="./data/aligned_reads_sorted.bam", count_file="./output/tx_counts.tsv", primary_score="align_score")
head("./output/tx_counts.tsv")

## Checking options and input files ##
## Initialise Nanocount ##
    Parse Bam file and filter low quality alignments
    Summary of alignments parsed in input bam file
        Valid alignments: 150,517
        Discarded unmapped alignments: 9,545
        Discarded alignment with invalid 3 prime end: 6,133
        Discarded negative strand alignments: 4,515
        Discarded supplementary alignments: 334
    Summary of reads filtered
        Reads with valid best alignment: 85,805
        Valid secondary alignments: 54,985
        Invalid secondary alignments: 7,651
        Reads with low query fraction aligned: 1,731
    Generate initial read/transcript compatibility index
## Start EM abundance estimate ##
    Progress: 37.0 rounds [00:05, 6.40 rounds/s]
    Exit EM loop after 37 rounds
    Convergence value: 0.004936880753967912
## Summarize data ##
    Convert results to dataframe
    Compute estimated counts and TPM
    Write file

transcript_name raw                  est_count          tpm               
YHR174W_mRNA    0.49362604198536425  42355.58253255418  493626.0419853642 
YGR254W_mRNA    0.09619972574387689  8254.417467453357  96199.72574387689 
YGR192C_mRNA    0.02461891539540753  2112.4260355029433 24618.91539540753 
YLR110C_mRNA    0.009603169978439472 823.9999999999989  9603.169978439471 
YOL086C_mRNA    0.008309539071149688 712.999999999999   8309.539071149688 
YKL060C_mRNA    0.006526426198939447 559.9999999999992  6526.426198939446 
YCR012W_mRNA    0.005419264611619362 464.9999999999994  5419.264611619362 
YBR118W_mRNA    0.005261931122894929 451.4999999999994  5261.931122894929 
YPR080W_mRNA    0.005261931122894929 451.4999999999994  5261.931122894929

Write selected alignment to BAM file

NanoCount (
    alignment_file="./data/aligned_reads_sorted.bam",
    count_file="./output/tx_counts.tsv",
    filter_bam_out = "./output/aligned_reads_selected.bam",
    primary_score="align_score")

head("./output/tx_counts.tsv")

## Checking options and input files ##
## Initialise Nanocount ##
    Parse Bam file and filter low quality alignments
    Summary of alignments parsed in input bam file
        Valid alignments: 150,517
        Discarded unmapped alignments: 9,545
        Discarded alignment with invalid 3 prime end: 6,133
        Discarded negative strand alignments: 4,515
        Discarded supplementary alignments: 334
    Summary of reads filtered
        Reads with valid best alignment: 85,805
        Valid secondary alignments: 54,985
        Invalid secondary alignments: 7,651
        Reads with low query fraction aligned: 1,731
    Write selected alignments to BAM file
    Summary of alignments written to bam
        Alignments to select: 140,790
        Alignments written: 140,790
        Alignments skipped: 30,254
    Generate initial read/transcript compatibility index
## Start EM abundance estimate ##
    Progress: 37.0 rounds [00:05, 6.45 rounds/s]
    Exit EM loop after 37 rounds
    Convergence value: 0.004936880753967912
## Summarize data ##
    Convert results to dataframe
    Compute estimated counts and TPM
    Write file

transcript_name raw                  est_count          tpm               
YHR174W_mRNA    0.49362604198536425  42355.58253255418  493626.0419853642 
YGR254W_mRNA    0.09619972574387689  8254.417467453357  96199.72574387689 
YGR192C_mRNA    0.02461891539540753  2112.4260355029433 24618.91539540753 
YLR110C_mRNA    0.009603169978439472 823.9999999999989  9603.169978439471 
YOL086C_mRNA    0.008309539071149688 712.999999999999   8309.539071149688 
YKL060C_mRNA    0.006526426198939447 559.9999999999992  6526.426198939446 
YCR012W_mRNA    0.005419264611619362 464.9999999999994  5419.264611619362 
YBR118W_mRNA    0.005261931122894929 451.4999999999994  5261.931122894929 
YPR080W_mRNA    0.005261931122894929 451.4999999999994  5261.931122894929

Basic command without file writing and Dataframe output

In interactive mode it is also possible not to write the results out but instead to access the data directly as a pandas DataFrame

nc = NanoCount (alignment_file="./data/aligned_reads_sorted.bam")
display(nc.count_df)

## Checking options and input files ##
## Initialise Nanocount ##
    Parse Bam file and filter low quality alignments
    Summary of alignments parsed in input bam file
        Valid alignments: 150,517
        Discarded unmapped alignments: 9,545
        Discarded alignment with invalid 3 prime end: 6,133
        Discarded negative strand alignments: 4,515
        Discarded supplementary alignments: 334
    Summary of reads filtered
        Reads with valid best alignment: 85,908
        Invalid secondary alignments: 60,120
        Valid secondary alignments: 2,622
        Reads with low query fraction aligned: 1,628
    Generate initial read/transcript compatibility index
## Start EM abundance estimate ##
    Progress: 2.00 rounds [00:00, 8.28 rounds/s]
    Exit EM loop after 2 rounds
    Convergence value: 0.0019361726963877538
## Summarize data ##
    Convert results to dataframe
    Compute estimated counts and TPM

	raw	est_count	tpm
transcript_name
YHR174W_mRNA	5.881057e-01	50522.984033	588105.694845
YGR192C_mRNA	2.083283e-02	1789.706485	20832.826808
YLR110C_mRNA	9.591656e-03	824.000000	9591.656190
YOL086C_mRNA	8.299576e-03	713.000000	8299.576291
YKL060C_mRNA	6.518601e-03	560.000000	6518.601294
...	...	...	...
YDR433W_mRNA	2.328072e-06	0.200000	2.328072
YHL050C_mRNA	1.670483e-06	0.143508	1.670483
YPR204W_mRNA	1.670483e-06	0.143508	1.670483
YEL077C_mRNA	3.423635e-07	0.029412	0.342364
YOR248W_mRNA	1.818806e-07	0.015625	0.181881

3056 rows × 3 columns

Adding extra transcripts information

The extra_tx_info option adds a columns with the transcript lengths and also includes all the zero-coverage transcripts in the results

nc = NanoCount (alignment_file="./data/aligned_reads_sorted.bam", extra_tx_info=True)
display(nc.count_df)

## Checking options and input files ##
## Initialise Nanocount ##
    Parse Bam file and filter low quality alignments
    Summary of alignments parsed in input bam file
        Valid alignments: 150,517
        Discarded unmapped alignments: 9,545
        Discarded alignment with invalid 3 prime end: 6,133
        Discarded negative strand alignments: 4,515
        Discarded supplementary alignments: 334
    Summary of reads filtered
        Reads with valid best alignment: 85,908
        Invalid secondary alignments: 60,120
        Valid secondary alignments: 2,622
        Reads with low query fraction aligned: 1,628
    Generate initial read/transcript compatibility index
## Start EM abundance estimate ##
    Progress: 2.00 rounds [00:00, 7.69 rounds/s]
    Exit EM loop after 2 rounds
    Convergence value: 0.0019361726963877538
## Summarize data ##
    Convert results to dataframe
    Compute estimated counts and TPM

	raw	est_count	tpm	transcript_length
transcript_name
YHR174W_mRNA	0.588106	50522.984033	588105.694845	1314
YGR192C_mRNA	0.020833	1789.706485	20832.826808	999
YLR110C_mRNA	0.009592	824.000000	9591.656190	402
YOL086C_mRNA	0.008300	713.000000	8299.576291	1047
YKL060C_mRNA	0.006519	560.000000	6518.601294	1080
...	...	...	...	...
YPR200C_mRNA	0.000000	0.000000	0.000000	393
YPR201W_mRNA	0.000000	0.000000	0.000000	1215
YPR202W_mRNA	0.000000	0.000000	0.000000	717
YPR203W_mRNA	0.000000	0.000000	0.000000	309
YPR204C-A_mRNA	0.000000	0.000000	0.000000	483

6612 rows × 4 columns

Relaxing the secondary alignment scoring threshold

The default value is 0.95 (95% of the alignment score of the primary alignment) but this value could be lowered to allow more secondary alignments to be included in the uncertainty calculation. Lowering the value bellow 0.75 might not be relevant and will considerably increase the computation time.

NanoCount (alignment_file="./data/aligned_reads_sorted.bam", count_file="./output/tx_counts.tsv", sec_scoring_threshold=0.8, extra_tx_info=True)
head("./output/tx_counts.tsv")

## Checking options and input files ##
## Initialise Nanocount ##
    Parse Bam file and filter low quality alignments
    Summary of alignments parsed in input bam file
        Valid alignments: 150,517
        Discarded unmapped alignments: 9,545
        Discarded alignment with invalid 3 prime end: 6,133
        Discarded negative strand alignments: 4,515
        Discarded supplementary alignments: 334
    Summary of reads filtered
        Reads with valid best alignment: 85,908
        Valid secondary alignments: 49,092
        Invalid secondary alignments: 13,650
        Reads with low query fraction aligned: 1,628
    Generate initial read/transcript compatibility index
## Start EM abundance estimate ##
    Progress: 17.0 rounds [00:02, 6.81 rounds/s]
    Exit EM loop after 17 rounds
    Convergence value: 0.004795139982321842
## Summarize data ##
    Convert results to dataframe
    Compute estimated counts and TPM
    Write file

transcript_name raw                   est_count          tpm                transcript_length 
YHR174W_mRNA    0.5770419415271139    49572.5191127113   577041.9415271139  1314              
YGR192C_mRNA    0.014985653368924351  1287.3875096175532 14985.653368924352 999               
YGR254W_mRNA    0.012367659441483866  1062.480887298996  12367.659441483866 1314              
YLR110C_mRNA    0.009591656190343162  824.0000000000003  9591.656190343161  402               
YJR009C_mRNA    0.00941808679575318   809.0890004495642  9418.08679575318   999               
YOL086C_mRNA    0.008299576290915867  713.0000000000003  8299.576290915868  1047              
YKL060C_mRNA    0.006518601294407974  560.0000000000002  6518.601294407974  1080              
YCR012W_mRNA    0.005412767146249479  465.0000000000003  5412.767146249479  1251              
YBR118W_mRNA    0.0052556222936164295 451.5000000000002  5255.6222936164295 1377