Skip to content

Meth_Comp API usage

Import module

# Import main module 
from pycoMeth.Comp_Report import Comp_Report, cpg_heatmap, cpg_ridgeplot, category_barplot, chr_ideogram_plot, tss_dist_plot

# optionally inport jupyter helper functions
from pycoMeth.common import head, jhelp, Kaleido

Getting help

jhelp(Comp_Report)

Comp_Report (methcomp_fn, gff3_fn, ref_fasta_fn, outdir, n_top, max_tss_distance, pvalue_threshold, min_diff_llr, n_len_bin, api_mode, export_static_plots, report_non_significant, verbose, quiet, progress, kwargs)

Generate an HTML report of significantly differentially methylated CpG intervals from Meth_Comp text output. Significant intervals are annotated with their closest transcript TSS.


  • methcomp_fn (required) [str]

Input tsv file generated by Meth_comp (can be gzipped). At the moment only data binned by intervals with Interval_Aggregate are supported.

  • gff3_fn (required) [str]

Path to an ensembl GFF3 file containing genomic annotations. Only the transcripts details are extracted.

  • ref_fasta_fn (required) [str]

Reference file used for alignment in Fasta format (ideally already indexed with samtools faidx)

  • outdir (default: ./) [str]

Directory where to output HTML reports, By default current directory

  • n_top (default: 100) [int]

Number of top interval candidates for which to generate an interval report. If there are not enough significant candidates this is automatically scaled down.

  • max_tss_distance (default: 100000) [int]

Maximal distance to transcription stat site to find transcripts close to interval candidates

  • pvalue_threshold (default: 0.01) [float]

pValue cutoff for top interval candidates

  • min_diff_llr (default: 1) [float]

Minimal llr boundary for negative and positive median llr. 1 is recommanded for vizualization purposes.

  • n_len_bin (default: 500) [int]

Number of genomic intervals for the longest chromosome of the ideogram figure

  • api_mode (default: False) [bool]

Don't generate reports or tables, just parse data and return a tuple containing an overall median CpG dataframe and a dictionary of CpG dataframes for the top candidates found. These dataframes can then be used to with the plotting functions containned in this module

  • export_static_plots (default: False) [bool]

Export all the plots from the reports in SVG format.

  • report_non_significant (default: False) [bool]

Report all valid CpG islands, significant or not in the text report. This option also adds a non-significant track to the TSS_distance plot

  • verbose (default: False) [bool]

  • quiet (default: False) [bool]

  • progress (default: False) [bool]

  • kwargs

Example usage in interactive API mode

Parse data

In api_mode Comp_report doesn't generate any HTML reports but returns a tuple with 3 elements: * A Dataframe containing summary information about each CpG islands including the closest transcript * A dataframe of median methylation values for all CpG Islands * A list of CpG Island level methylation values ranked by ascending pvalues.

report_non_significant significantly increases the execution time but reports the closest transcript for all sites and generate additional output

all_summary_df, all_cpg_df, top_cpg_df_d  = Comp_Report (
    methcomp_fn = "./data/Medaka_CGI_meth_comp.tsv.gz",
    gff3_fn = "./data/medaka.gff3",
    ref_fasta_fn="./data/medaka.fa",
    report_non_significant=True,
    n_top=50,
    progress=True,
    api_mode=True)

display(all_summary_df.head())
display(all_cpg_df.head())
display(top_cpg_df_d[1].head())
## Checking options and input files ##
## Loading and preparing data ##
    Loading Methcomp data from TSV file
    Loading transcripts info from GFF file
    Loading chromosome info from reference FASTA file
    Number of significant intervals found (adjusted pvalue<0.01): 3532
    Finding top candidates
## Parsing methcomp data ##
    Iterating over significant intervals
    Progress: 100%|██████████| 10.8k/10.8k [01:21<00:00, 133 intervals/s] 

pvalue chromosome start end Number of nearby TSS closest tx id closest tx name closest tx biotype distance to tss
6600 1.081882e-60 15 13014693 13015794 26 ensorlt00000004176 <NA> protein_coding 0
7189 7.391533e-58 16 26220047 26221325 10 ensorlt00000019438 sh3bp5b-201 protein_coding -24958
7341 1.164849e-56 17 316390 318610 5 ensorlt00000002667 <NA> protein_coding -155
5132 5.370468e-55 11 27686928 27688355 3 ensorlt00000017935 <NA> protein_coding -21358
1459 1.197991e-53 4 7874048 7875618 12 ensorlt00000006287 onecut3a-201 protein_coding -11370
1-308880-309181 1-503417-503927 1-514965-516003 1-570414-576486 1-623809-625652 1-727090-729031 1-729239-730182 1-1077484-1077924 1-1162762-1163132 1-1314335-1314688 ... 24-21407316-21407672 24-22058632-22059037 24-22331849-22332483 24-22382403-22382773 24-22452449-22452848 24-22584245-22584601 24-22760710-22761123 24-22892617-22893113 24-23134697-23135071 24-23191101-23191433
Sample 11-1_A3 0.655 2.598 -2.230 NaN -0.968 1.255 -2.085 4.352 2.855 -2.058 ... 0.090 -0.195 3.020 -0.055 2.935 -0.058 0.220 1.760 -1.502 1.797
Sample 117-2_C4 -1.875 1.795 -3.225 0.660 -0.265 0.810 -3.300 1.758 3.080 2.130 ... -0.010 -1.902 3.555 0.770 -3.710 -2.742 2.210 1.220 0.880 1.655
Sample 131-1_F4 0.570 1.522 1.200 1.628 0.095 1.995 -1.792 3.035 0.340 1.985 ... 0.550 -1.185 3.108 1.440 2.270 NaN 0.900 NaN 0.965 2.100
Sample 134-1_H4 -1.152 -1.142 1.265 1.518 -1.925 1.760 NaN 0.872 0.590 -3.310 ... 2.705 -1.322 3.278 0.730 3.915 2.805 1.205 1.030 1.545 2.418
Sample 134-2_A5 2.145 -5.395 3.050 1.638 3.125 1.705 -3.133 -3.960 -2.475 1.635 ... -0.205 -0.692 3.980 0.150 3.400 -2.365 1.625 1.785 0.615 2.260

5 rows × 3532 columns

15-13,014,749 15-13,014,782 15-13,014,816 15-13,014,843 15-13,014,876 15-13,014,892 15-13,014,924 15-13,014,942 15-13,014,970 15-13,015,026 ... 15-13,015,535 15-13,015,568 15-13,015,587 15-13,015,618 15-13,015,629 15-13,015,651 15-13,015,671 15-13,015,693 15-13,015,739 15-13,015,768
Sample 11-1_A3 -1.510 -2.730 -5.46 -0.320 -3.330 -7.600 -1.490 -2.210 -1.690 -1.290 ... -10.740 -3.065 -7.785 -1.520 -2.795 -1.630 -2.330 -3.350 -1.90 -2.18
Sample 117-2_C4 -1.775 -1.010 -3.16 0.030 1.120 4.520 2.840 3.990 6.885 0.290 ... 5.800 1.040 2.290 0.880 0.430 2.140 -0.290 -0.830 -0.13 0.60
Sample 131-1_F4 -1.210 -3.400 -4.31 -0.810 -0.995 -13.525 0.300 -1.740 -4.870 -2.110 ... -13.535 -2.915 -9.220 -1.485 -4.485 -1.825 -2.885 -3.995 -2.57 -4.33
Sample 134-1_H4 -2.875 -4.845 -4.02 -0.845 -4.675 -7.845 -0.065 -2.525 -3.125 -0.555 ... -7.200 -2.500 -5.920 0.140 -2.160 -0.450 -1.220 -1.830 -0.54 -1.98
Sample 134-2_A5 -0.500 3.230 -3.22 0.000 3.620 7.900 2.180 4.380 3.160 1.570 ... 8.830 1.700 3.860 2.405 1.475 2.860 1.765 1.600 1.69 1.82

5 rows × 39 columns

Plotting functions

To display static version in jupyter (easier to export that dynamic d3js plots), one can use the Kaleido wrapper included in pycoMeth

kaleido = Kaleido()

cpg_heatmap

Can be used at both genome scale (all_cpg_df) or individual CpG island (top_cpg_df_d)

fig = cpg_heatmap(top_cpg_df_d[1], lim_llr=7, column_widths=[0.9, 0.10])
display(kaleido.render_plotly_svg(fig, width=1200))
Sample 69-1_F3Sample 4-2_B2Sample 80-1_H3Sample 4-1_B2Sample 134-2_A5Sample 117-2_C4Sample 79-2_G3Sample 7-2_F2Sample 11-1_A3Sample 134-1_H4Sample 7-1_E2Sample 131-1_F4−6−4−20246Median LLRCpG positions
fig = cpg_heatmap(all_cpg_df, lim_llr=3, column_widths=[0.80, 0.20])
display(kaleido.render_plotly_svg(fig, width=1200))
Sample 80-1_H3Sample 79-2_G3Sample 4-2_B2Sample 4-1_B2Sample 69-1_F3Sample 11-1_A3Sample 7-2_F2Sample 7-1_E2Sample 131-1_F4Sample 117-2_C4Sample 134-2_A5Sample 134-1_H4−3−2−10123Median LLRCpG positions

cpg_ridgeplot

Can be used at both genome scale (all_cpg_df) or individual CpG island (top_cpg_df_d)

fig = cpg_ridgeplot(all_cpg_df, box=True, scatter=False)
display(kaleido.render_plotly_svg(fig, width=1200))
−10−50510Sample 11-1_A3Sample 4-2_B2Sample 80-1_H3Sample 7-2_F2Sample 79-2_G3Sample 134-2_A5Sample 69-1_F3Sample 117-2_C4Sample 131-1_F4Sample 7-1_E2Sample 134-1_H4Sample 4-1_B2CpG median log likelihood ratioAmbiguousUnmethylatedMethylated
fig = cpg_ridgeplot(top_cpg_df_d[1])
display(kaleido.render_plotly_svg(fig, width=1200))
−15−10−50510Sample 131-1_F4Sample 7-1_E2Sample 7-2_F2Sample 11-1_A3Sample 134-1_H4Sample 79-2_G3Sample 117-2_C4Sample 69-1_F3Sample 4-2_B2Sample 134-2_A5Sample 80-1_H3Sample 4-1_B2CpG median log likelihood ratioAmbiguousUnmethylatedMethylated

category_barplot

Can be used at both genome scale (all_cpg_df) or individual CpG island (top_cpg_df_d)

fig = category_barplot(all_cpg_df)
display(kaleido.render_plotly_svg(fig, width=1200))
Sample 11-1_A3Sample 117-2_C4Sample 131-1_F4Sample 134-1_H4Sample 134-2_A5Sample 4-1_B2Sample 4-2_B2Sample 69-1_F3Sample 7-1_E2Sample 7-2_F2Sample 79-2_G3Sample 80-1_H30500100015002000250030003500No dataAmbiguousMethylatedUnmethylatedCounts per category
fig = category_barplot(top_cpg_df_d[1])
display(kaleido.render_plotly_svg(fig, width=1200))
Sample 11-1_A3Sample 117-2_C4Sample 131-1_F4Sample 134-1_H4Sample 134-2_A5Sample 4-1_B2Sample 4-2_B2Sample 69-1_F3Sample 7-1_E2Sample 7-2_F2Sample 79-2_G3Sample 80-1_H30510152025303540No dataAmbiguousMethylatedUnmethylatedCounts per category

chr_ideogram_plot

Can only be used at genome scale and requires to provide the reference fasta file

fig = chr_ideogram_plot(all_cpg_df, ref_fasta_fn="./data/medaka.fa", n_len_bin=250)
display(kaleido.render_plotly_svg(fig, width=1200, height=700))
05M10M15M20M25M30M35Mchr 24chr 23chr 22chr 21chr 20chr 19chr 18chr 17chr 16chr 15chr 14chr 13chr 12chr 11chr 10chr 9chr 8chr 7chr 6chr 5chr 4chr 3chr 2chr 1012345Genomic coordinatesReference sequences

tss_dist_plot

Can only be used from CpG island summary file

fig = tss_dist_plot(all_summary_df)
display(kaleido.render_plotly_svg(fig, width=1200))
−80k−60k−40k−20k020k40k60k80k00.0050.010.0150.020.025Significantn=3455Non-significantn=6995Distance to closest TSSCpG Islands density

Example usage in report mode

Example with a single significant result

Comp_Report (
    methcomp_fn = "./data/Yeast_CGI_meth_comp.tsv.gz",
    gff3_fn = "./data/yeast.gff3",
    ref_fasta_fn="./data/yeast.fa",
    outdir = "yeast_html",
    pvalue_threshold = 0.05,
    verbose=True)
## Checking options and input files ##
    [DEBUG]: Options summary
    [DEBUG]:    Package name: pycoMeth
    [DEBUG]:    Package version: 0.4.14
    [DEBUG]:    Timestamp: 2020-07-16 18:23:30.034053
    [DEBUG]:    methcomp_fn: ./data/Yeast_CGI_meth_comp.tsv.gz
    [DEBUG]:    gff3_fn: ./data/yeast.gff3
    [DEBUG]:    ref_fasta_fn: ./data/yeast.fa
    [DEBUG]:    outdir: yeast_html
    [DEBUG]:    n_top: 100
    [DEBUG]:    max_tss_distance: 100000
    [DEBUG]:    pvalue_threshold: 0.05
    [DEBUG]:    min_diff_llr: 1
    [DEBUG]:    n_len_bin: 500
    [DEBUG]:    api_mode: False
    [DEBUG]:    export_static_plots: False
    [DEBUG]:    report_non_significant: False
    [DEBUG]:    verbose: True
    [DEBUG]:    quiet: False
    [DEBUG]:    progress: False
    [DEBUG]:    kwargs
## Loading and preparing data ##
    Loading Methcomp data from TSV file
    Loading transcripts info from GFF file
    Loading chromosome info from reference FASTA file
    Number of significant intervals found (adjusted pvalue<0.05): 1
ERROR: Low number of significant sites. The summary report will likely contain errors
ERROR: Number of significant intervals lower than number of top candidates to plot
    Finding top candidates
    Creating output directory structure
    Computing source md5
## Parsing methcomp data ##
    Iterating over significant intervals
    [DEBUG]: Ploting top candidates rank: #1
    Generating summary report

Usage with large dataset, including static plot export

Comp_Report (
    methcomp_fn = "./data/Medaka_CGI_meth_comp.tsv.gz",
    gff3_fn = "./data/medaka.gff3",
    ref_fasta_fn="./data/medaka.fa",
    outdir = "medaka_html",
    export_static_plots=True,
    report_non_significant=True,
    n_top=50,
    progress=True)
## Checking options and input files ##
## Loading and preparing data ##
    Loading Methcomp data from TSV file
    Loading transcripts info from GFF file
    Loading chromosome info from reference FASTA file
    Number of significant intervals found (adjusted pvalue<0.01): 3532
    Finding top candidates
    Creating output directory structure
    Computing source md5
## Parsing methcomp data ##
    Iterating over significant intervals
    Progress: 100%|██████████| 10.8k/10.8k [01:45<00:00, 102 intervals/s] 
    Generating summary report