CpG_Aggregate API usage

Import module

# Import main module 
from pycoMeth.CpG_Aggregate import CpG_Aggregate

# Optionally inport jupyter helper functions
from pycoMeth.common import head, jhelp, stdout_print

Getting help

jhelp(CpG_Aggregate)

CpG_Aggregate (nanopolish_fn, ref_fasta_fn, output_bed_fn, output_tsv_fn, min_depth, sample_id, min_llr, verbose, quiet, progress, kwargs)

Calculate methylation frequency at genomic CpG sites from the output of nanopolish call-methylation

nanopolish_fn (required) [list(str)]

Path to a nanopolish call_methylation tsv output file or a list of files or a regex matching several files (can be gzipped)

ref_fasta_fn (required) [str]

Reference file used for alignment in Fasta format (ideally already indexed with samtools faidx)

output_bed_fn (default: "") [str]

Path to write a summary result file in BED format (At least 1 output file is required) (can be gzipped)

output_tsv_fn (default: "") [str]

Path to write a more extensive result report in TSV format (At least 1 output file is required) (can be gzipped)

min_depth (default: 10) [int]

Minimal number of reads covering a site to be reported

sample_id (default: "") [str]

Sample ID to be used for the BED track header

min_llr (default: 2) [float]

Minimal log likelyhood ratio to consider a site significantly methylated or unmethylated in output BED file

verbose (default: False) [bool]
quiet (default: False) [bool]
progress (default: False) [bool]
kwargs

Example usage

Basic usage

ff = CpG_Aggregate (
    nanopolish_fn="./data/nanopolish_sample_1.tsv",
    ref_fasta_fn="./data/ref.fa",
    output_bed_fn="./results/CpG_Aggregate_sample_1.bed",
    output_tsv_fn="./results/CpG_Aggregate_sample_1.tsv.gz",
    sample_id="sample_1",
    progress=True)

head("./results/CpG_Aggregate_sample_1.tsv.gz")
head("./results/CpG_Aggregate_sample_1.bed")

## Checking options and input files ##
## Parsing methylation_calls file ##
    Starting to parse file Nanopolish methylation call file
    Progress: 100%|██████████| 51.9M/51.9M [00:06<00:00, 8.65M bytes/s]
    Parsing summary
        Lines Parsed: 543,135
        Line successfully parsed: 543,135
        Input files: 1
    Filtering out low coverage sites
    Sorting each chromosome by coordinates
    Sites summary
        Total Valid Lines: 543,135
        Initial Sites: 218,353
        Low Count Sites: 218,114
        Valid Sites Found: 239
## Processing valid sites found and write to file ##
    Progress: 100%|██████████| 239/239 [00:00<00:00, 2.07k sites/s]
    Results summary
        Total Sites Writen: 239
        Unmethylated sites: 162
        Ambiguous sites: 77

chromosome start  end    sequence       num_motifs median_llr llr_list                                           
VIII       138415 138416 GGTCTCGCTTT    1          -2.355     [-9.42,-5.49,-5.18,-5.11,-2.43,-1.1,0.46,-0.68,...
VIII       138429 138430 AGCTTCGAGGA    1          -4.525     [-3.62,-5.58,1.12,-2.5,-10.4,-2.39,-8.33,-7.29,...
VIII       212351 212352 TGGGGCGACAT    1          -2.77      [-2.95,-11.55,-9.31,-0.07,-11.21,-4.14,0.66,-2....
VIII       212392 212393 ATTAACGTATA    1          -2.51      [-6.76,3.04,0.11,-2.51,0.32,-3.7,-2.92,-2.01,-3...
VIII       212457 212461 AGAATCGTCGATTA 2          -6.08      [-6.08,-13.01,-3.52,-1.3,-8.11,-8.88,-1.47,-4.7...
VIII       212530 212531 CTATTCGTTTC    1          -1.27      [-5.33,-1.27,1.12,-3.72,0.48,-4.4,-0.48,-1.02,-...
VIII       212581 212582 GTTACCGCAGG    1          0.075      [1.19,-0.11,-0.02,-5.77,2.08,0.17,0.84,2.46,-4....
VIII       212596 212600 TTTGTCGTCGCTGT 2          -4.86      [-13.76,-4.43,-1.37,-8.36,-6.67,-6.3,1.13,-4.67...
VIII       212612 212613 CACCCCGTTGG    1          -2.91      [-7.45,1.01,-2.76,-0.81,-3.06,-2.63,-3.66,-3.11...

track name=sample_1_CpG itemRgb=On
VIII    138415  138416  .   -2.355  .   138415  138416  52,168,194
VIII    138429  138430  .   -4.525  .   138429  138430  33,102,171
VIII    212351  212352  .   -2.77   .   212351  212352  52,168,194
VIII    212392  212393  .   -2.51   .   212392  212393  52,168,194
VIII    212457  212461  .   -6.08   .   212457  212461  28,45,131
VIII    212530  212531  .   -1.27   .   212530  212531  230,230,230
VIII    212581  212582  .   0.075   .   212581  212582  230,230,230
VIII    212596  212600  .   -4.86   .   212596  212600  33,102,171
VIII    212612  212613  .   -2.91   .   212612  212613  52,168,194

Example usage using a regex and with a lower depth threshold

ff = CpG_Aggregate (
    nanopolish_fn="./data/nanopolish_sample_*.tsv",
    ref_fasta_fn="./data/ref.fa",
    output_bed_fn="./results/CpG_Aggregate_sample_all.bed",
    output_tsv_fn="./results/CpG_Aggregate_sample_all.tsv",
    min_depth=5,
    sample_id="sample_all",
    progress=True)

head("./results/CpG_Aggregate_sample_all.tsv")
head("./results/CpG_Aggregate_sample_all.bed")

## Checking options and input files ##
## Parsing methylation_calls file ##
    Starting to parse file Nanopolish methylation call file
    Progress: 100%|██████████| 209M/209M [00:18<00:00, 11.2M bytes/s] 
    Parsing summary
        Lines Parsed: 2,180,231
        Line successfully parsed: 2,180,231
        Input files: 4
    Filtering out low coverage sites
    Sorting each chromosome by coordinates
    Sites summary
        Total Valid Lines: 2,180,231
        Initial Sites: 251,674
        Valid Sites Found: 228,163
        Low Count Sites: 23,511
## Processing valid sites found and write to file ##
    Progress: 100%|██████████| 228k/228k [00:31<00:00, 7.19k sites/s] 
    Results summary
        Total Sites Writen: 228,163
        Unmethylated sites: 168,018
        Ambiguous sites: 60,129
        Methylated sites: 16

chromosome start end sequence              num_motifs median_llr llr_list                                  
I          144   145 CCACTCGTTAC           1          -2.2       [-0.7,2.77,-3.01,-2.2,-8.42]              
I          175   176 CACTCCGAACC           1          -1.35      [-1.35,-8.02,-1.07,1.94,-2.01]            
I          216   217 CCCACCGTTAC           1          -2.16      [-6.62,-2.16,-2.85,-0.27,-0.41]           
I          325   326 TGAAACGCTAA           1          -2.66      [-0.41,0.01,-5.79,-4.93,-2.66]            
I          339   340 ATGATCGTAAA           1          -1.21      [-0.02,-2.85,-4.49,-1.21,-1.08]           
I          354   355 ACACACGTGCT           1          -1.39      [-1.11,-4.6,-1.63,-1.39,-1.2]             
I          422   433 TTTTACGTACGCACACGGATG 3          -10.52     [-2.49,-7.21,-10.79,-13.29,-10.52]        
I          542   543 ATGCACGGCAC           1          -0.78      [-2.03,-3.57,0.47,-3.81,2.14,2.59]        
I          557   558 CTCAGCGGTCT           1          -2.3       [-5.5,-1.85,-4.84,-2.3,-4.34,-1.14,-1.11] 

track name=sample_all_CpG itemRgb=On
I   144 145 .   -2.2    .   144 145 52,168,194
I   175 176 .   -1.35   .   175 176 230,230,230
I   216 217 .   -2.16   .   216 217 52,168,194
I   325 326 .   -2.66   .   325 326 52,168,194
I   339 340 .   -1.21   .   339 340 230,230,230
I   354 355 .   -1.39   .   354 355 230,230,230
I   422 433 .   -10.52  .   422 433 28,45,131
I   542 543 .   -0.78   .   542 543 230,230,230
I   557 558 .   -2.3    .   557 558 52,168,194

Example with multiple files

for i in range (1, 5):
    stdout_print(f"##### SAMPLE {i} #####")
    CpG_Aggregate (
        nanopolish_fn=f"./data/nanopolish_sample_{i}.tsv",
        ref_fasta_fn="./data/ref.fa",
        output_bed_fn=f"./results/CpG_Aggregate_sample_{i}.bed",
        output_tsv_fn=f"./results/CpG_Aggregate_sample_{i}.tsv",
        sample_id=f"sample_{i}",
        min_depth=3,
        min_llr=1,
        quiet=True)

##### SAMPLE 1 #####

## Checking options and input files ##
## Parsing methylation_calls file ##
## Processing valid sites found and write to file ##

##### SAMPLE 2 #####

## Checking options and input files ##
## Parsing methylation_calls file ##
## Processing valid sites found and write to file ##

##### SAMPLE 3 #####

## Checking options and input files ##
## Parsing methylation_calls file ##
## Processing valid sites found and write to file ##

##### SAMPLE 4 #####

## Checking options and input files ##
## Parsing methylation_calls file ##
## Processing valid sites found and write to file ##