Meth_Comp API usage

Import module

# Import main module 
from pycoMeth.Comp_Report import Comp_Report, cpg_heatmap, cpg_ridgeplot, category_barplot, chr_ideogram_plot, tss_dist_plot

# optionally inport jupyter helper functions
from pycoMeth.common import head, jhelp, Kaleido

Getting help

jhelp(Comp_Report)

Comp_Report (methcomp_fn, gff3_fn, ref_fasta_fn, outdir, n_top, max_tss_distance, pvalue_threshold, min_diff_llr, n_len_bin, api_mode, export_static_plots, report_non_significant, verbose, quiet, progress, kwargs)

Generate an HTML report of significantly differentially methylated CpG intervals from Meth_Comp text output. Significant intervals are annotated with their closest transcript TSS.

methcomp_fn (required) [str]

Input tsv file generated by Meth_comp (can be gzipped). At the moment only data binned by intervals with Interval_Aggregate are supported.

gff3_fn (required) [str]

Path to an ensembl GFF3 file containing genomic annotations. Only the transcripts details are extracted.

ref_fasta_fn (required) [str]

Reference file used for alignment in Fasta format (ideally already indexed with samtools faidx)

outdir (default: ./) [str]

Directory where to output HTML reports, By default current directory

n_top (default: 100) [int]

Number of top interval candidates for which to generate an interval report. If there are not enough significant candidates this is automatically scaled down.

max_tss_distance (default: 100000) [int]

Maximal distance to transcription stat site to find transcripts close to interval candidates

pvalue_threshold (default: 0.01) [float]

pValue cutoff for top interval candidates

min_diff_llr (default: 1) [float]

Minimal llr boundary for negative and positive median llr. 1 is recommanded for vizualization purposes.

n_len_bin (default: 500) [int]

Number of genomic intervals for the longest chromosome of the ideogram figure

api_mode (default: False) [bool]

Don't generate reports or tables, just parse data and return a tuple containing an overall median CpG dataframe and a dictionary of CpG dataframes for the top candidates found. These dataframes can then be used to with the plotting functions containned in this module

export_static_plots (default: False) [bool]

Export all the plots from the reports in SVG format.

report_non_significant (default: False) [bool]

Report all valid CpG islands, significant or not in the text report. This option also adds a non-significant track to the TSS_distance plot

verbose (default: False) [bool]
quiet (default: False) [bool]
progress (default: False) [bool]
kwargs

Example usage in interactive API mode

Parse data

In api_mode Comp_report doesn't generate any HTML reports but returns a tuple with 3 elements: * A Dataframe containing summary information about each CpG islands including the closest transcript * A dataframe of median methylation values for all CpG Islands * A list of CpG Island level methylation values ranked by ascending pvalues.

report_non_significant significantly increases the execution time but reports the closest transcript for all sites and generate additional output

all_summary_df, all_cpg_df, top_cpg_df_d  = Comp_Report (
    methcomp_fn = "./data/Medaka_CGI_meth_comp.tsv.gz",
    gff3_fn = "./data/medaka.gff3",
    ref_fasta_fn="./data/medaka.fa",
    report_non_significant=True,
    n_top=50,
    progress=True,
    api_mode=True)

display(all_summary_df.head())
display(all_cpg_df.head())
display(top_cpg_df_d[1].head())

## Checking options and input files ##
## Loading and preparing data ##
    Loading Methcomp data from TSV file
    Loading transcripts info from GFF file
    Loading chromosome info from reference FASTA file
    Number of significant intervals found (adjusted pvalue<0.01): 3532
    Finding top candidates
## Parsing methcomp data ##
    Iterating over significant intervals
    Progress: 100%|██████████| 10.8k/10.8k [01:21<00:00, 133 intervals/s]

	pvalue	chromosome	start	end	Number of nearby TSS	closest tx id	closest tx name	closest tx biotype	distance to tss
6600	1.081882e-60	15	13014693	13015794	26	ensorlt00000004176	<NA>	protein_coding	0
7189	7.391533e-58	16	26220047	26221325	10	ensorlt00000019438	sh3bp5b-201	protein_coding	-24958
7341	1.164849e-56	17	316390	318610	5	ensorlt00000002667	<NA>	protein_coding	-155
5132	5.370468e-55	11	27686928	27688355	3	ensorlt00000017935	<NA>	protein_coding	-21358
1459	1.197991e-53	4	7874048	7875618	12	ensorlt00000006287	onecut3a-201	protein_coding	-11370

	1-308880-309181	1-503417-503927	1-514965-516003	1-570414-576486	1-623809-625652	1-727090-729031	1-729239-730182	1-1077484-1077924	1-1162762-1163132	1-1314335-1314688	...	24-21407316-21407672	24-22058632-22059037	24-22331849-22332483	24-22382403-22382773	24-22452449-22452848	24-22584245-22584601	24-22760710-22761123	24-22892617-22893113	24-23134697-23135071	24-23191101-23191433
Sample 11-1_A3	0.655	2.598	-2.230	NaN	-0.968	1.255	-2.085	4.352	2.855	-2.058	...	0.090	-0.195	3.020	-0.055	2.935	-0.058	0.220	1.760	-1.502	1.797
Sample 117-2_C4	-1.875	1.795	-3.225	0.660	-0.265	0.810	-3.300	1.758	3.080	2.130	...	-0.010	-1.902	3.555	0.770	-3.710	-2.742	2.210	1.220	0.880	1.655
Sample 131-1_F4	0.570	1.522	1.200	1.628	0.095	1.995	-1.792	3.035	0.340	1.985	...	0.550	-1.185	3.108	1.440	2.270	NaN	0.900	NaN	0.965	2.100
Sample 134-1_H4	-1.152	-1.142	1.265	1.518	-1.925	1.760	NaN	0.872	0.590	-3.310	...	2.705	-1.322	3.278	0.730	3.915	2.805	1.205	1.030	1.545	2.418
Sample 134-2_A5	2.145	-5.395	3.050	1.638	3.125	1.705	-3.133	-3.960	-2.475	1.635	...	-0.205	-0.692	3.980	0.150	3.400	-2.365	1.625	1.785	0.615	2.260

5 rows × 3532 columns

	15-13,014,749	15-13,014,782	15-13,014,816	15-13,014,843	15-13,014,876	15-13,014,892	15-13,014,924	15-13,014,942	15-13,014,970	15-13,015,026	...	15-13,015,535	15-13,015,568	15-13,015,587	15-13,015,618	15-13,015,629	15-13,015,651	15-13,015,671	15-13,015,693	15-13,015,739	15-13,015,768
Sample 11-1_A3	-1.510	-2.730	-5.46	-0.320	-3.330	-7.600	-1.490	-2.210	-1.690	-1.290	...	-10.740	-3.065	-7.785	-1.520	-2.795	-1.630	-2.330	-3.350	-1.90	-2.18
Sample 117-2_C4	-1.775	-1.010	-3.16	0.030	1.120	4.520	2.840	3.990	6.885	0.290	...	5.800	1.040	2.290	0.880	0.430	2.140	-0.290	-0.830	-0.13	0.60
Sample 131-1_F4	-1.210	-3.400	-4.31	-0.810	-0.995	-13.525	0.300	-1.740	-4.870	-2.110	...	-13.535	-2.915	-9.220	-1.485	-4.485	-1.825	-2.885	-3.995	-2.57	-4.33
Sample 134-1_H4	-2.875	-4.845	-4.02	-0.845	-4.675	-7.845	-0.065	-2.525	-3.125	-0.555	...	-7.200	-2.500	-5.920	0.140	-2.160	-0.450	-1.220	-1.830	-0.54	-1.98
Sample 134-2_A5	-0.500	3.230	-3.22	0.000	3.620	7.900	2.180	4.380	3.160	1.570	...	8.830	1.700	3.860	2.405	1.475	2.860	1.765	1.600	1.69	1.82

5 rows × 39 columns

Plotting functions

To display static version in jupyter (easier to export that dynamic d3js plots), one can use the Kaleido wrapper included in pycoMeth

kaleido = Kaleido()

cpg_heatmap

Can be used at both genome scale (all_cpg_df) or individual CpG island (top_cpg_df_d)

fig = cpg_heatmap(top_cpg_df_d[1], lim_llr=7, column_widths=[0.9, 0.10])
display(kaleido.render_plotly_svg(fig, width=1200))

fig = cpg_heatmap(all_cpg_df, lim_llr=3, column_widths=[0.80, 0.20])
display(kaleido.render_plotly_svg(fig, width=1200))

cpg_ridgeplot

Can be used at both genome scale (all_cpg_df) or individual CpG island (top_cpg_df_d)

fig = cpg_ridgeplot(all_cpg_df, box=True, scatter=False)
display(kaleido.render_plotly_svg(fig, width=1200))

fig = cpg_ridgeplot(top_cpg_df_d[1])
display(kaleido.render_plotly_svg(fig, width=1200))

category_barplot

Can be used at both genome scale (all_cpg_df) or individual CpG island (top_cpg_df_d)

fig = category_barplot(all_cpg_df)
display(kaleido.render_plotly_svg(fig, width=1200))

fig = category_barplot(top_cpg_df_d[1])
display(kaleido.render_plotly_svg(fig, width=1200))

chr_ideogram_plot

Can only be used at genome scale and requires to provide the reference fasta file

fig = chr_ideogram_plot(all_cpg_df, ref_fasta_fn="./data/medaka.fa", n_len_bin=250)
display(kaleido.render_plotly_svg(fig, width=1200, height=700))

tss_dist_plot

Can only be used from CpG island summary file

fig = tss_dist_plot(all_summary_df)
display(kaleido.render_plotly_svg(fig, width=1200))

Example usage in report mode

Example with a single significant result

Comp_Report (
    methcomp_fn = "./data/Yeast_CGI_meth_comp.tsv.gz",
    gff3_fn = "./data/yeast.gff3",
    ref_fasta_fn="./data/yeast.fa",
    outdir = "yeast_html",
    pvalue_threshold = 0.05,
    verbose=True)

## Checking options and input files ##
    [DEBUG]: Options summary
    [DEBUG]:    Package name: pycoMeth
    [DEBUG]:    Package version: 0.4.14
    [DEBUG]:    Timestamp: 2020-07-16 18:23:30.034053
    [DEBUG]:    methcomp_fn: ./data/Yeast_CGI_meth_comp.tsv.gz
    [DEBUG]:    gff3_fn: ./data/yeast.gff3
    [DEBUG]:    ref_fasta_fn: ./data/yeast.fa
    [DEBUG]:    outdir: yeast_html
    [DEBUG]:    n_top: 100
    [DEBUG]:    max_tss_distance: 100000
    [DEBUG]:    pvalue_threshold: 0.05
    [DEBUG]:    min_diff_llr: 1
    [DEBUG]:    n_len_bin: 500
    [DEBUG]:    api_mode: False
    [DEBUG]:    export_static_plots: False
    [DEBUG]:    report_non_significant: False
    [DEBUG]:    verbose: True
    [DEBUG]:    quiet: False
    [DEBUG]:    progress: False
    [DEBUG]:    kwargs
## Loading and preparing data ##
    Loading Methcomp data from TSV file
    Loading transcripts info from GFF file
    Loading chromosome info from reference FASTA file
    Number of significant intervals found (adjusted pvalue<0.05): 1
ERROR: Low number of significant sites. The summary report will likely contain errors
ERROR: Number of significant intervals lower than number of top candidates to plot
    Finding top candidates
    Creating output directory structure
    Computing source md5
## Parsing methcomp data ##
    Iterating over significant intervals
    [DEBUG]: Ploting top candidates rank: #1
    Generating summary report

Usage with large dataset, including static plot export

Comp_Report (
    methcomp_fn = "./data/Medaka_CGI_meth_comp.tsv.gz",
    gff3_fn = "./data/medaka.gff3",
    ref_fasta_fn="./data/medaka.fa",
    outdir = "medaka_html",
    export_static_plots=True,
    report_non_significant=True,
    n_top=50,
    progress=True)

## Checking options and input files ##
## Loading and preparing data ##
    Loading Methcomp data from TSV file
    Loading transcripts info from GFF file
    Loading chromosome info from reference FASTA file
    Number of significant intervals found (adjusted pvalue<0.01): 3532
    Finding top candidates
    Creating output directory structure
    Computing source md5
## Parsing methcomp data ##
    Iterating over significant intervals
    Progress: 100%|██████████| 10.8k/10.8k [01:45<00:00, 102 intervals/s] 
    Generating summary report