CGI_Finder CLI usage
Activate virtual environment
# virtualenvwrapper is used here; a Conda environment works just as well
workon pycoMeth
Getting help
pycoMeth CGI_Finder --help
usage: pycoMeth CGI_Finder [-h] -f REF_FASTA_FN [-b OUTPUT_BED_FN]
[-t OUTPUT_TSV_FN] [-m MERGE_GAP] [-w MIN_WIN_LEN]
[-c MIN_CG_FREQ] [-r MIN_OBS_CG_RATIO] [-v] [-q]
[-p]
Simple method to find putative CpG islands in DNA sequences by using a sliding
window and merging overlapping windows satisfying the CpG island definition.
Results can be saved in bed and tsv format
optional arguments:
-h, --help show this help message and exit
Input/Output options:
-f REF_FASTA_FN, --ref_fasta_fn REF_FASTA_FN
Reference file used for alignment in Fasta format
(ideally already indexed with samtools faidx)
(required) [str]
-b OUTPUT_BED_FN, --output_bed_fn OUTPUT_BED_FN
Path to write a summary result file in BED format (At
least 1 output file is required) (default: None) [str]
-t OUTPUT_TSV_FN, --output_tsv_fn OUTPUT_TSV_FN
Path to write a more extensive result report in TSV
format (At least 1 output file is required) (default:
None) [str]
Misc options:
-m MERGE_GAP, --merge_gap MERGE_GAP
Merge close CpG island within a given distance in
bases (default: 0) [int]
-w MIN_WIN_LEN, --min_win_len MIN_WIN_LEN
Length of the minimal window containing CpG. Used as
the sliding window length (default: 200) [int]
-c MIN_CG_FREQ, --min_CG_freq MIN_CG_FREQ
Minimal C+G frequency in a window to be counted as a
valid CpG island (default: 0.5) [float]
-r MIN_OBS_CG_RATIO, --min_obs_CG_ratio MIN_OBS_CG_RATIO
Minimal Observed CG dinucleotide frequency over
expected distribution in a window to be counted as a
valid CpG island (default: 0.6) [float]
Verbosity options:
-v, --verbose Increase verbosity
-q, --quiet Reduce verbosity
-p, --progress Display a progress bar
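The two thresholds described in the help text (`--min_CG_freq` and `--min_obs_CG_ratio`) can be illustrated with a small Python sketch. This is not pycoMeth's own code: the function names `window_stats` and `is_valid_window` are hypothetical, the observed/expected ratio uses the commonly cited formula `num_CpG * length / (num_C * num_G)`, and the use of `>=` for the comparisons is an assumption. The defaults mirror the values shown in the help output.

```python
def window_stats(seq):
    """Return (CG_freq, obs_exp_ratio) for one window of DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    n_c = seq.count("C")
    n_g = seq.count("G")
    # "CG" cannot overlap with itself, so str.count gives the exact CpG count
    n_cpg = seq.count("CG")
    # C+G frequency: fraction of bases in the window that are C or G
    cg_freq = (n_c + n_g) / n
    # Assumed observed/expected formula: num_CpG * length / (num_C * num_G)
    obs_exp = (n_cpg * n) / (n_c * n_g) if n_c and n_g else 0.0
    return cg_freq, obs_exp


def is_valid_window(seq, min_cg_freq=0.5, min_obs_cg_ratio=0.6):
    """Check a window against both thresholds (>= is an assumption)."""
    cg_freq, obs_exp = window_stats(seq)
    return cg_freq >= min_cg_freq and obs_exp >= min_obs_cg_ratio
```

In a sliding-window scan, a check like `is_valid_window` would be applied to each window of `--min_win_len` bases along every reference sequence.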
Example usage
Basic usage with the yeast genome
pycoMeth CGI_Finder \
-f ./data/yeast.fa \
-b ./results/yeast_CGI.bed \
-t ./results/yeast_CGI.tsv \
--progress
head ./results/yeast_CGI.bed
head ./results/yeast_CGI.tsv
## Checking options and input files ##
## Parsing reference fasta file ##
Parsing Reference sequence: I
Progress: 100%|█████████████████████████| 230k/230k [00:00<00:00, 838k bases/s]
Parsing Reference sequence: II
Progress: 100%|█████████████████████████| 813k/813k [00:00<00:00, 917k bases/s]
Parsing Reference sequence: III
Progress: 100%|█████████████████████████| 316k/316k [00:00<00:00, 854k bases/s]
Parsing Reference sequence: IV
Progress: 100%|███████████████████████| 1.53M/1.53M [00:01<00:00, 978k bases/s]
Parsing Reference sequence: V
Progress: 100%|█████████████████████████| 577k/577k [00:00<00:00, 878k bases/s]
Parsing Reference sequence: VI
Progress: 100%|█████████████████████████| 270k/270k [00:00<00:00, 888k bases/s]
Parsing Reference sequence: VII
Progress: 100%|███████████████████████| 1.09M/1.09M [00:01<00:00, 989k bases/s]
Parsing Reference sequence: VIII
Progress: 100%|█████████████████████████| 562k/562k [00:00<00:00, 925k bases/s]
Parsing Reference sequence: IX
Progress: 100%|█████████████████████████| 440k/440k [00:00<00:00, 937k bases/s]
Parsing Reference sequence: X
Progress: 100%|█████████████████████████| 746k/746k [00:00<00:00, 966k bases/s]
Parsing Reference sequence: XI
Progress: 100%|█████████████████████████| 667k/667k [00:00<00:00, 924k bases/s]
Parsing Reference sequence: XII
Progress: 100%|███████████████████████| 1.08M/1.08M [00:01<00:00, 906k bases/s]
Parsing Reference sequence: XIII
Progress: 100%|█████████████████████████| 924k/924k [00:00<00:00, 971k bases/s]
Parsing Reference sequence: XIV
Progress: 100%|█████████████████████████| 784k/784k [00:00<00:00, 923k bases/s]
Parsing Reference sequence: XV
Progress: 100%|███████████████████████| 1.09M/1.09M [00:01<00:00, 963k bases/s]
Parsing Reference sequence: XVI
Progress: 100%|█████████████████████████| 948k/948k [00:00<00:00, 960k bases/s]
Parsing Reference sequence: Mito
Progress: 100%|███████████████████████| 85.6k/85.6k [00:00<00:00, 863k bases/s]
Results summary
Valid minimal size windows: 216,083
Valid merged windows: 2,041
Number of reference sequences: 17
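The summary above shows many valid minimal-size windows collapsing into far fewer merged windows. A minimal sketch of such a merge step follows, assuming windows are half-open `(start, end)` intervals on a single sequence and that windows which overlap, touch, or lie within `merge_gap` bases of each other are joined. The function name `merge_windows` is hypothetical, and pycoMeth's actual merging rules may differ in detail.

```python
def merge_windows(windows, merge_gap=0):
    """Merge (start, end) intervals that overlap or lie within merge_gap bases."""
    merged = []
    for start, end in sorted(windows):
        # Distance from the end of the previous merged interval to this start
        if merged and start - merged[-1][1] <= merge_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


# Three overlapping 200-base windows plus one distant window
windows = [(0, 200), (1, 201), (2, 202), (500, 700)]
print(merge_windows(windows))                 # two separate islands
print(merge_windows(windows, merge_gap=300))  # the gap of 298 bases is bridged
```

Raising `merge_gap` trades resolution for fewer, longer intervals, which is why the default of 0 only joins windows that overlap or touch under this sketch.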
track name=CpG_islands
I 17 333
I 1804 2170
I 25527 25912
I 31835 32949
I 33497 34371
I 38163 38471
I 44294 44565
I 44730 44988
I 45308 45526
chromosome start end length num_CpG CG_freq obs_exp_freq
I 17 333 316 4 0.509 0.614
I 1804 2170 366 14 0.495 0.650
I 25527 25912 385 16 0.488 0.776
I 31835 32949 1114 59 0.497 0.876
I 33497 34371 874 39 0.506 0.715
I 38163 38471 308 13 0.487 0.715
I 44294 44565 271 12 0.487 0.747
I 44730 44988 258 9 0.481 0.608
I 45308 45526 218 12 0.495 0.908
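The TSV report can be loaded for downstream filtering with the Python standard library. The sketch below assumes the file is tab-delimited with exactly the header columns shown above; the helper name `read_cgi_tsv` is hypothetical, and the sample rows are copied from the output above so that the example runs without the result file.

```python
import csv
import io


def read_cgi_tsv(handle):
    """Parse a CGI_Finder-style TSV report into a list of typed dicts."""
    rows = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "chromosome": rec["chromosome"],
            "start": int(rec["start"]),
            "end": int(rec["end"]),
            "length": int(rec["length"]),
            "num_CpG": int(rec["num_CpG"]),
            "CG_freq": float(rec["CG_freq"]),
            "obs_exp_freq": float(rec["obs_exp_freq"]),
        })
    return rows


# In-memory sample (assumed tab-delimited) taken from the rows shown above
sample = (
    "chromosome\tstart\tend\tlength\tnum_CpG\tCG_freq\tobs_exp_freq\n"
    "I\t17\t333\t316\t4\t0.509\t0.614\n"
    "I\t1804\t2170\t366\t14\t0.495\t0.650\n"
    "I\t31835\t32949\t1114\t59\t0.497\t0.876\n"
)
islands = read_cgi_tsv(io.StringIO(sample))

# Example filter: keep only islands of at least 500 bases
long_islands = [r for r in islands if r["length"] >= 500]
print(long_islands)
```

For a real run, the `io.StringIO(sample)` handle would be replaced with something like `open("./results/yeast_CGI.tsv", newline="")`.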