Eventalign_collapse usage
This program collapses the raw file generated by nanopolish eventalign
by kmers rather than by event.
Options
- input_fn
Path to a nanopolish eventalign
tsv output file (read access required). In command line mode it is also possible to pipe the output of the nanopolish eventalign
directly into Eventalign_collapse
.
- outdir
Path to an existing directory where to write all the output files generated by Eventalign_collapse
(write access required). If the directory does not exist an error is raised.
- outprefix
Prefix for all the files generated by the program
- write_samples
Write the concatenated sample data values corresponding to each kmer in the output data file. This options make only sense if nanopolish eventalign
was ran with --samples
option.
- max_reads
Controls the maximum number of read to parse before stopping to read the input file. This could be usefull for testing or downsampling.
- stat_fields
Specify the list of statistical analyses to perform on a kmer basis and add in the output data file. This will only be performed if nanopolish eventalign
was ran with --samples
option.
Valid statistics fields
- mean (mean of signal intensity)
- median (mean of signal intensity)
- std (standard deviation of signal intensity))
- mad (median absolute deviation of signal intensity))
- num_signals (Number of raw signal data point)
- threads
Eventalign_collapse
is multi threaded to speed up the data processing and keep pace with Nanopolish if using the direct piping strategy. Take advantage of many threads if you have access to a large compute cluster
Output files format
Contrary to nanopolish eventalign
output text file, in Eventalign_collapse
the reads are separated by a hashtag headers containing the read_id and ref_id. This reduces the redundancy and makes it easier to find the start and end of a read.
Example : #7ef1d7b9-5824-4382-b23b-78d82c07ebbd YHR055C.
The main data file contains the following fields:
- ref_pos: Reference sequence ID (contig).
- ref_kmer: Sequence of the reference kmers.
- num_events: Number of events for this kmer before collapsing.
- dwell_time: dwell time for this kmer in seconds
- NNNNN_dwell_time: dwell time of events for this kmers with a model sequence "NNNNN" (events ignored by nanopolish HMM).
- mismatch_dwell_time: dwell time of events for this kmers with a model sequence different from the reference kmer
- start_idx: Only if nanopolish eventalign called with --signal_idx. Start coordinate on original raw signal in fast5 file
- end_idx: Only if nanopolish eventalign called with --signal_idx. End coordinate on original raw signal in fast5 file
- mean: Only if nanopolish eventalign called with --samples. Mean of the normalised signal values provided by Nanopolish eventalign
- median: Only if nanopolish eventalign called with --samples. Median of the normalised signal values provided by Nanopolish eventalign
- std: Only if nanopolish eventalign called with --samples. Standard deviation of the normalised signal values provided by Nanopolish eventalign
- mad: Only if nanopolish eventalign called with --samples. Median absolute deviation of the normalised signal values provided by Nanopolish eventalign
- num_signals: Only if nanopolish eventalign called with --samples. Number of raw signal points.
- samples: Only if nanopolish eventalign called with --samples and Eventalign_collapse called with --write_samples. List of normalised signal intensity values for this kmer
In addition Eventalign_collapse
also generates an useful index file containing reads level information. It contains the following fields:
- read_id: Name or index of the read
- ref_id: Name of the reference sequence the read was aligned on (contig)
- ref_start: Start coordinate of the alignment on the reference sequence
- ref_end: End coordinate of the alignment on the reference sequence
- dwell_time: Cumulative dwell time in seconds for the entire resquiggled sequence
- kmers: Overall number of resquiggled kmers
- NNNNN_kmers: Number of resquiggled kmers containing at least 1 event for which the model sequence was "NNNNN"
- mismatching_kmers: Number of resquiggled kmers containing at least 1 event for which the model sequence diverged from the reference sequence
- missing_kmers: Number of skipped/missing reference positions in nanopolish output
- byte_offset: Number of characters before the start of the sequence in the main output file. This can be used in conjunction with file.seek() to directly access the start of a read. An example is provided in the Usage notebook.
- byte_len: Length of characters after byte_offset to the end of the read, excluding the last newline. This can be used in conjunction with read() to read all the text chunk corresponding to the read.
Bash command line usage
Command line help
%%bash # Load local bashrc and activate virtual environment source ~/.bashrc workon Nanopolish_0.11.1 NanopolishComp Eventalign_collapse --help
usage: NanopolishComp Eventalign_collapse [-h] [-i INPUT_FN] [-o OUTDIR] [-p OUTPREFIX] [-s] [-r MAX_READS] [-f STAT_FIELDS [STAT_FIELDS ...]] [-t THREADS] [-v | -q] Collapse the nanopolish eventalign output at kmers level and compute kmer level statistics optional arguments: -h, --help show this help message and exit -v, --verbose Increase verbosity (default: False) -q, --quiet Reduce verbosity (default: False) Input/Output options: -i INPUT_FN, --input_fn INPUT_FN Path to a nanopolish eventalign tsv output file. If '0' read from std input (default: 0) -o OUTDIR, --outdir OUTDIR Path to the output folder (will be created if it does exist yet) (default: ./) -p OUTPREFIX, --outprefix OUTPREFIX text outprefix for all the files generated (default: out) Run parameters options: -s, --write_samples If given, will write the raw sample if nanopolish eventalign was ran with --samples option (default: False) -r MAX_READS, --max_reads MAX_READS Maximum number of read to parse. 0 to deactivate (default: 0) -f STAT_FIELDS [STAT_FIELDS ...], --stat_fields STAT_FIELDS [STAT_FIELDS ...] List of statistical fields to compute if nanopolish eventalign was ran with --sample option. Valid values = mean, std, median, mad, num_signals (default: ['mean', 'median', 'num_signals']) Other options: -t THREADS, --threads THREADS Total number of threads. 1 thread is used for the reader and 1 for the writer (default: 4)
Example usage
From an existing nanopolish eventalign output to a file
%%bash # Load local bashrc and activate virtual environment source ~/.bashrc workon Nanopolish_0.11.1 NanopolishComp Eventalign_collapse -i ./data/eventalign_collapse//nanopolish_reads.tsv -o ./output/eventalign_collapse/ -v head ./output/eventalign_collapse/out_eventalign_collapse.tsv head ./output/eventalign_collapse/out_eventalign_collapse.tsv.idx
#0 YGR240C ref_pos ref_kmer num_events dwell_time NNNNN_dwell_time mismatch_dwell_time 1656 GAAAA 1 0.00266 0.0 0.0 1657 AAAAC 1 0.00764 0.0 0.0 1658 AAACA 1 0.00398 0.0 0.0 1659 AACAA 1 0.00432 0.0 0.0 1660 ACAAA 1 0.00498 0.0 0.0 1661 CAAAG 1 0.00564 0.0 0.0 1662 AAAGA 1 0.00963 0.0 0.0 1663 AAGAT 1 0.00299 0.0 0.0 ref_id ref_start ref_end read_id kmers dwell_time NNNNN_kmers mismatch_kmers missing_kmers byte_offset byte_len YGR240C 1656 2960 0 1250 13.788570000000009 35 0 54 0 38028 YCR030C 1578 2576 1 971 11.487010000000005 23 0 27 38029 29704 YHR174W 0 839 2 825 9.659210000000002 15 0 14 67734 24231 YHR174W 218 1309 3 1028 11.06325000000001 36 0 63 91966 30801 YHR174W 462 1309 4 818 10.73776000000001 18 0 29 122768 24589 YLR441C 173 764 5 554 5.556939999999999 20 0 37 147358 16437 YGR192C 1 989 6 927 11.731470000000003 37 0 61 163796 27388 YDR500C 9 252 8 231 2.9179100000000027 7 0 14 191185 6751 YGR192C 3 995 7 946 12.30464000000001 31 0 50 197937 28123 Options summary package_name: NanopolishComp package_version: 0.6.1 timestamp: 2019-05-08 11:59:11.239171 quiet: False verbose: True threads: 4 stat_fields: ['mean', 'median', 'num_signals'] write_samples: False max_reads: None outprefix: out outdir: ./output/eventalign_collapse/ input_fn: ./data/eventalign_collapse//nanopolish_reads.tsv Checking arguments Testing input file readability Creating output folder Checking number of threads Checking if stat_fields names are valid Starting to process files [split_reads] Start reading input file/stream [process_read 1] Starting processing reads [process_read 2] Starting processing reads [write_output] Start rwriting output 14 reads [00:00, 74.94 reads/s] [split_reads] Done [process_read 1] Done [process_read 2] Done 21 reads [00:00, 88.36 reads/s] [write_output] Done [Eventalign_collapse] total reads: 21 [86.45 reads/s]
From standard input to a file
%%bash # Load local bashrc and activate virtual environment source ~/.bashrc workon Nanopolish_0.11.1 cat ./data/eventalign_collapse//nanopolish_reads_index.tsv | NanopolishComp Eventalign_collapse -o ./output/eventalign_collapse/ --verbose head ./output/eventalign_collapse/out_eventalign_collapse.tsv head ./output/eventalign_collapse/out_eventalign_collapse.tsv.idx
#0 YGR240C ref_pos ref_kmer num_events dwell_time NNNNN_dwell_time mismatch_dwell_time 1656 GAAAA 1 0.00266 0.0 0.0 1657 AAAAC 1 0.00764 0.0 0.0 1658 AAACA 1 0.00398 0.0 0.0 1659 AACAA 1 0.00432 0.0 0.0 1660 ACAAA 1 0.00498 0.0 0.0 1661 CAAAG 1 0.00564 0.0 0.0 1662 AAAGA 1 0.00963 0.0 0.0 1663 AAGAT 1 0.00299 0.0 0.0 ref_id ref_start ref_end read_id kmers dwell_time NNNNN_kmers mismatch_kmers missing_kmers byte_offset byte_len YGR240C 1656 2960 0 1250 13.788570000000009 35 0 54 0 38028 YCR030C 1578 2576 1 971 11.487010000000005 23 0 27 38029 29704 YHR174W 0 839 2 825 9.659210000000002 15 0 14 67734 24231 YHR174W 218 1309 3 1028 11.06325000000001 36 0 63 91966 30801 YHR174W 462 1309 4 818 10.73776000000001 18 0 29 122768 24589 YLR441C 173 764 5 554 5.556939999999999 20 0 37 147358 16437 YGR192C 1 989 6 927 11.731470000000003 37 0 61 163796 27388 YDR500C 9 252 8 231 2.9179100000000027 7 0 14 191185 6751 YGR192C 3 995 7 946 12.30464000000001 31 0 50 197937 28123 Options summary package_name: NanopolishComp package_version: 0.6.1 timestamp: 2019-05-08 11:59:12.535254 quiet: False verbose: True threads: 4 stat_fields: ['mean', 'median', 'num_signals'] write_samples: False max_reads: None outprefix: out outdir: ./output input_fn: 0 Checking arguments Testing input file readability Creating output folder Checking number of threads Checking if stat_fields names are valid Starting to process files [split_reads] Start reading input file/stream [process_read 1] Starting processing reads [process_read 2] Starting processing reads [write_output] Start rwriting output 19 reads [00:00, 74.77 reads/s] [split_reads] Done [process_read 1] Done [process_read 2] Done 21 reads [00:00, 90.18 reads/s] [write_output] Done [Eventalign_collapse] total reads: 21 [89.3 reads/s]
On the fly, from nanopolish eventalign to a file
%%bash # Load local bashrc and activate virtual environment source ~/.bashrc workon Nanopolish_0.11.1 nanopolish eventalign -t 4 --samples --scale-events --print-read-name \ --reads ./data/eventalign_collapse/reads.fastq \ --bam ./data/eventalign_collapse/aligned_reads.bam \ --genome ./data/eventalign_collapse/reference.fa | \ NanopolishComp Eventalign_collapse -o ./output head ./output/eventalign_collapse/out_eventalign_collapse.tsv head ./output/eventalign_collapse/out_eventalign_collapse.tsv.idx
#0 YGR240C ref_pos ref_kmer num_events dwell_time NNNNN_dwell_time mismatch_dwell_time 1656 GAAAA 1 0.00266 0.0 0.0 1657 AAAAC 1 0.00764 0.0 0.0 1658 AAACA 1 0.00398 0.0 0.0 1659 AACAA 1 0.00432 0.0 0.0 1660 ACAAA 1 0.00498 0.0 0.0 1661 CAAAG 1 0.00564 0.0 0.0 1662 AAAGA 1 0.00963 0.0 0.0 1663 AAGAT 1 0.00299 0.0 0.0 ref_id ref_start ref_end read_id kmers dwell_time NNNNN_kmers mismatch_kmers missing_kmers byte_offset byte_len YGR240C 1656 2960 0 1250 13.788570000000009 35 0 54 0 38028 YCR030C 1578 2576 1 971 11.487010000000005 23 0 27 38029 29704 YHR174W 0 839 2 825 9.659210000000002 15 0 14 67734 24231 YHR174W 218 1309 3 1028 11.06325000000001 36 0 63 91966 30801 YHR174W 462 1309 4 818 10.73776000000001 18 0 29 122768 24589 YLR441C 173 764 5 554 5.556939999999999 20 0 37 147358 16437 YGR192C 1 989 6 927 11.731470000000003 37 0 61 163796 27388 YDR500C 9 252 8 231 2.9179100000000027 7 0 14 191185 6751 YGR192C 3 995 7 946 12.30464000000001 31 0 50 197937 28123 Checking arguments Starting to process files 18 reads [00:01, 7.25 reads/s][post-run summary] total reads: 21, unparseable: 0, qc fail: 0, could not calibrate: 0, no alignment: 0, bad fast5: 0 21 reads [00:01, 7.84 reads/s] [Eventalign_collapse] total reads: 21 [14.06 reads/s]
Python API usage
Import the package
# Import main program from NanopolishComp.Eventalign_collapse import Eventalign_collapse # Import helper functions from NanopolishComp.common import jhelp, head
python API help
jhelp(Eventalign_collapse)
NanopolishComp.Eventalign_collapse.init
Collapse the nanopolish eventalign output by kmers rather that by events. kmer level statistics (mean, median, std, mad) are only computed if nanopolish is run with --samples option
- input_fn : str (required)
Path to a nanopolish eventalign tsv output file.
- outdir : str (default = ./)
Path to the output folder (will be created if it does exist yet)
- outprefix : str (default = out)
text outprefix for all the files generated
- max_reads : int (default = None)
Maximum number of read to parse. 0 to deactivate (default = 0)
- write_samples : bool (default = False)
If given, will write the raw sample if nanopolish eventalign was ran with --samples option
- stat_fields : list of str (default = ['mean', 'median', 'num_signals'])
List of statistical fields to compute if nanopolish eventalign was ran with --samples option. Valid values = "mean", "std", "median", "mad", "num_signals"
- threads : int (default = 4)
Total number of threads. 1 thread is used for the reader and 1 for the writer (default = 4)
- verbose : bool (default = False)
Increase verbosity
- quiet : bool (default = False)
Reduce verbosity
Example usage
Example with minimal file
input_fn = "./data/eventalign_collapse//nanopolish_reads.tsv" outdir = "./output/eventalign_collapse" outprefix = "basic" Eventalign_collapse (input_fn=input_fn, outdir=outdir, outprefix=outprefix, threads=6) head("./output/eventalign_collapse/basic_eventalign_collapse.tsv") head("./output/eventalign_collapse/basic_eventalign_collapse.tsv.idx")
Checking arguments Starting to process files 21 reads [00:00, 99.56 reads/s] [Eventalign_collapse] total reads: 21 [90.56 reads/s] #0 YGR240C ref_pos ref_kmer num_events dwell_time NNNNN_dwell_time mismatch_dwell_time 1656 GAAAA 1 0.00266 0.0 0.0 1657 AAAAC 1 0.00764 0.0 0.0 1658 AAACA 1 0.00398 0.0 0.0 1659 AACAA 1 0.00432 0.0 0.0 1660 ACAAA 1 0.00498 0.0 0.0 1661 CAAAG 1 0.00564 0.0 0.0 1662 AAAGA 1 0.00963 0.0 0.0 1663 AAGAT 1 0.00299 0.0 0.0 ref_id ref_start ref_end read_id kmers dwell_time NNNNN_kmers mismatch_kmers missing_kmers byte_offset byte_len YGR240C 1656 2960 0 1250 13.788570000000009 35 0 54 0 38028 YCR030C 1578 2576 1 971 11.487010000000005 23 0 27 38029 29704 YHR174W 0 839 2 825 9.659210000000002 15 0 14 67734 24231 YHR174W 218 1309 3 1028 11.06325000000001 36 0 63 91966 30801 YHR174W 462 1309 4 818 10.73776000000001 18 0 29 122768 24589 YLR441C 173 764 5 554 5.556939999999999 20 0 37 147358 16437 YGR192C 1 989 6 927 11.731470000000003 37 0 61 163796 27388 YDR500C 9 252 8 231 2.9179100000000027 7 0 14 191185 6751 YGR192C 3 995 7 946 12.30464000000001 31 0 50 197937 28123
Example with indexes
input_fn = "./data/eventalign_collapse//nanopolish_reads_index.tsv" outdir = "./output/eventalign_collapse" outprefix = "index" Eventalign_collapse (input_fn=input_fn, outdir=outdir, outprefix=outprefix, threads=6) head("./output/eventalign_collapse/index_eventalign_collapse.tsv") head("./output/eventalign_collapse/index_eventalign_collapse.tsv.idx")
Checking arguments Starting to process files 21 reads [00:00, 94.30 reads/s] [Eventalign_collapse] total reads: 21 [82.25 reads/s] #0 YGR240C ref_pos ref_kmer num_events dwell_time NNNNN_dwell_time mismatch_dwell_time start_idx end_idx 1656 GAAAA 1 0.00266 0.0 0.0 63446 63454 1657 AAAAC 1 0.00764 0.0 0.0 63423 63446 1658 AAACA 1 0.00398 0.0 0.0 63411 63423 1659 AACAA 1 0.00432 0.0 0.0 63398 63411 1660 ACAAA 1 0.00498 0.0 0.0 63383 63398 1661 CAAAG 1 0.00564 0.0 0.0 63366 63383 1662 AAAGA 1 0.00963 0.0 0.0 63337 63366 1663 AAGAT 1 0.00299 0.0 0.0 63328 63337 ref_id ref_start ref_end read_id kmers dwell_time NNNNN_kmers mismatch_kmers missing_kmers byte_offset byte_len YGR240C 1656 2960 0 1250 13.788570000000009 35 0 54 0 53046 YCR030C 1578 2576 1 971 11.487010000000005 23 0 27 53047 41374 YHR174W 0 839 2 825 9.659210000000002 15 0 14 94422 33595 YHR174W 218 1309 3 1028 11.06325000000001 36 0 63 128018 43155 YHR174W 462 1309 4 818 10.73776000000001 18 0 29 171174 34423 YLR441C 173 764 5 554 5.556939999999999 20 0 37 205598 23103 YGR192C 1 989 6 927 11.731470000000003 37 0 61 228702 38529 YDR500C 9 252 8 231 2.9179100000000027 7 0 14 267232 9541 YGR192C 3 995 7 946 12.30464000000001 31 0 50 276774 39493
Example including samples
input_fn = "./data/eventalign_collapse//nanopolish_reads_index.tsv" outdir = "./output/eventalign_collapse" outprefix = "stats" Eventalign_collapse (input_fn=input_fn, outdir=outdir, outprefix=outprefix, threads=6, quiet=True, stat_fields=["mean", "std", "median", "mad", "num_signals"]) head("./output/eventalign_collapse/stats_eventalign_collapse.tsv") head("./output/eventalign_collapse/stats_eventalign_collapse.tsv.idx")
[Eventalign_collapse] total reads: 21 [86.98 reads/s] #0 YGR240C ref_pos ref_kmer num_events dwell_time NNNNN_dwell_time mismatch_dwell_time start_idx end_idx 1656 GAAAA 1 0.00266 0.0 0.0 63446 63454 1657 AAAAC 1 0.00764 0.0 0.0 63423 63446 1658 AAACA 1 0.00398 0.0 0.0 63411 63423 1659 AACAA 1 0.00432 0.0 0.0 63398 63411 1660 ACAAA 1 0.00498 0.0 0.0 63383 63398 1661 CAAAG 1 0.00564 0.0 0.0 63366 63383 1662 AAAGA 1 0.00963 0.0 0.0 63337 63366 1663 AAGAT 1 0.00299 0.0 0.0 63328 63337 ref_id ref_start ref_end read_id kmers dwell_time NNNNN_kmers mismatch_kmers missing_kmers byte_offset byte_len YGR240C 1656 2960 0 1250 13.788570000000009 35 0 54 0 53046 YCR030C 1578 2576 1 971 11.487010000000005 23 0 27 53047 41374 YHR174W 0 839 2 825 9.659210000000002 15 0 14 94422 33595 YHR174W 218 1309 3 1028 11.06325000000001 36 0 63 128018 43155 YHR174W 462 1309 4 818 10.73776000000001 18 0 29 171174 34423 YLR441C 173 764 5 554 5.556939999999999 20 0 37 205598 23103 YGR192C 1 989 6 927 11.731470000000003 37 0 61 228702 38529 YDR500C 9 252 8 231 2.9179100000000027 7 0 14 267232 9541 YGR192C 3 995 7 946 12.30464000000001 31 0 50 276774 39493
Example including samples and writing samples values in ouput file
input_fn = "./data/eventalign_collapse//nanopolish_reads_samples.tsv" outdir = "./output/eventalign_collapse" outprefix = "samples" Eventalign_collapse (input_fn=input_fn, outdir=outdir, outprefix=outprefix, threads=6, quiet=True, stat_fields=["mean", "std", "median", "mad", "num_signals"], write_samples=True) head("./output/eventalign_collapse/samples_eventalign_collapse.tsv") head("./output/eventalign_collapse/samples_eventalign_collapse.tsv.idx")
[Eventalign_collapse] total reads: 21 [12.19 reads/s] #2 YHR174W ref_pos ref_kmer num_events dwell_time NNNNN_dwell_time mismatch_dwell_time mean std median mad num_signals samples 0 ATGGC 1 0.01096 0.0 0.0 86.3893051147461 3.8192403316497803 85.67970275878906 2.618194580078125 33 96.4434,93.5342,94.5524,87.1342,85.6797,86.6978,85.0978,91.0615,89.316,84.0797,82.916,82.3342,89.316,84.3706,84.3706,85.8251,81.4615,83.0615,85.9706,85.0978,80.7342,82.4796,93.0979,88.0069,82.916,85.3887,84.0797,86.5524,84.3706,82.3342,88.2979,88.2979,85.9706 1 TGGCT 4 0.03253 0.0 0.0 104.78181457519531 6.572873592376709 106.18900299072266 3.9269943237304688 98 104.007,96.1524,106.189,99.0615,108.371,102.407,107.498,100.662,113.462,106.189,111.28,111.28,112.298,102.989,101.971,109.68,106.916,109.243,106.916,103.28,101.825,110.553,112.443,105.898,101.098,99.0615,105.462,102.843,100.516,107.643,109.098,107.789,113.316,109.098,110.116,108.807,106.916,112.443,114.916,110.843,109.098,111.716,103.862,110.262,109.098,105.171,109.68,108.807,112.589,111.425,123.498,110.116,108.371,110.843,111.134,110.843,96.007,94.5524,97.8979,90.6251,94.9888,97.0252,93.9706,104.589,109.971,106.48,108.807,101.389,103.716,96.8797,89.7524,97.4615,98.0434,86.8433,104.443,103.716,102.407,107.207,102.989,93.6797,103.571,104.007,100.08,103.862,109.68,106.625,101.534,97.3161,103.571,97.607,88.7342,109.534,106.771,113.025,108.807,93.2433,108.952,91.207 2 GGCTG 1 0.00465 0.0 0.0 96.55765533447266 6.874111652374268 97.75244903564453 5.382049560546875 14 105.752,98.9161,98.7706,96.7343,95.5706,100.08,105.607,96.4434,88.5888,101.68,85.6797,104.589,88.1524,85.2433 3 GCTGT 5 0.0385 0.0 0.0 81.69342803955078 5.14208984375 80.29779815673828 1.5273017883300781 116 94.1161,90.6251,93.2433,81.316,105.316,87.716,95.7161,90.3342,98.0434,86.116,87.716,107.789,96.2979,83.4978,84.2251,83.3524,79.5705,80.8796,79.1342,78.116,80.4433,80.5887,79.4251,79.2796,81.6069,77.9705,81.6069,81.7524,81.4615,80.4433,86.9888,79.5705,81.316,77.5342,82.6251,77.9705,82.1887,77.6796,80.7342,83.6433,77.9705,78.4069,80.0069,80.1524,80.4433,80.7342,80.2978,84.9524,78.5524,81.316,80.2978,82.1887,81.6069,77.6796,81.0251,78.4069,79.4251,80.4433,79.8615,82.1887,79.8615,80.0069,78.6978,80.7342,80.5887,78.5524,78.8433,79.4251,82.6251,80.0069,77.5342,80.0069,77.5342,83.7887,84.0797,83.0615,82.1887,81.6069,80.8796,81.316,80.0069,78.9887,82.7706,84.516,81.4615,80.5887,79.2796,78.9887,78.4069,80.5887,80.2978,83.7887,78.4069,78.116,79.5705,78.5524,79.4251,79.2796,81.0251,78.8433,78.9887,78.5524,80.1524,77.9705,78.2614,80.8796,78.4069,79.2796,78.2614,79.716,77.6796,79.4251,77.6796,78.6978,77.3887,76.9523 4 CTGTC 1 0.00232 0.0 0.0 102.4487075805664 1.5290673971176147 103.57099914550781 0.2910003662109375 7 103.571,103.716,103.862,101.243,100.371,103.862,100.516 5 TGTCT 4 0.019249999999999996 0.0 0.0 107.04158020019531 5.813553810119629 107.64350128173828 2.5454978942871094 58 113.462,109.534,107.789,112.298,110.116,112.153,109.243,111.862,107.352,112.88,103.716,102.262,103.571,101.971,104.007,104.152,99.6434,104.88,109.971,109.825,114.625,105.898,107.062,88.7342,105.462,112.589,108.516,114.771,106.916,111.716,110.116,106.334,108.08,107.207,110.262,110.116,105.898,109.825,108.08,106.771,107.207,108.08,107.498,114.48,116.807,107.062,109.68,105.607,110.407,110.553,106.189,105.025,109.389,104.88,86.6978,103.425,88.2979,97.4615 6 GTCTC 2 0.00797 0.0 0.0 76.83718872070312 2.1963260173797607 76.87960052490234 1.1636505126953125 24 78.5524,75.6432,75.7887,79.4251,82.3342,77.3887,79.4251,77.8251,76.9523,78.4069,76.9523,77.8251,76.6614,78.9887,76.6614,76.8069,75.7887,71.8614,74.7705,74.6251,77.3887,76.8069,74.3341,72.8796 7 TCTCT 1 0.00232 0.0 0.0 80.0484619140625 1.0306347608566284 80.00689697265625 0.727294921875 7 78.6978,79.2796,79.2796,80.0069,81.4615,81.6069,80.0069 ref_id ref_start ref_end read_id kmers dwell_time NNNNN_kmers mismatch_kmers missing_kmers byte_offset byte_len YHR174W 0 839 2 825 9.659210000000002 15 0 14 0 317027 YCR030C 1578 2576 1 971 11.487010000000005 23 0 27 317028 376326 YGR240C 1656 2960 0 1250 13.788570000000009 35 0 54 693355 461011 YHR174W 218 1309 3 1028 11.06325000000001 36 0 63 1154367 372126 YLR441C 173 764 5 554 5.556939999999999 20 0 37 1526494 190670 YDR500C 9 252 8 231 2.9179100000000027 7 0 14 1717165 93821 YHR174W 462 1309 4 818 10.73776000000001 18 0 29 1810987 342528 YGR192C 3 995 7 946 12.30464000000001 31 0 50 2153516 392222 YDR224C 0 389 9 376 4.2840300000000004 8 0 13 2545739 141262
Using the index to random access a specific entry in the file
Random access with standard library
output_fn = "./output/eventalign_collapse/stats_eventalign_collapse.tsv" index_fn = "./output/eventalign_collapse/stats_eventalign_collapse.tsv.idx" # Imports import csv from tabulate import tabulate from random import sample from itertools import islice # read index file and select random lines index_list = [] with open (index_fn) as fp: for l in csv.DictReader(fp, delimiter='\t'): index_list.append(l) random_index = sample(index_list, k=5) print ("RANDOM INDEX LINES") print (tabulate(random_index, headers="keys")) # Open the collapsed event align file with open (output_fn) as fp: for index in random_index: # Access the header corresponding to the randomly selected index line using seek fp.seek(0) # Return to file start fp.seek(int(index["byte_offset"])) # Move to the offset indicated in the index file print ("\n" + fp.readline().rstrip()) # Print read header # Get 5 first lines and print then data_list = [] for l in islice(csv.DictReader(fp, delimiter='\t'), 5): data_list.append(l) print (tabulate(data_list, headers="keys"))
RANDOM INDEX LINES ref_id ref_start ref_end read_id kmers dwell_time NNNNN_kmers mismatch_kmers missing_kmers byte_offset byte_len -------- ----------- --------- --------- ------- ------------ ------------- ---------------- --------------- ------------- ---------- YMR116C 180 950 16 740 9.30597 20 0 30 482785 30690 YLR110C 14 395 15 364 4.61444 9 0 17 450178 15023 YGR240C 1656 2960 0 1250 13.7886 35 0 54 0 53046 YGR192C 3 995 7 946 12.3046 31 0 50 276774 39493 YHR174W 462 1309 4 818 10.7378 18 0 29 171174 34423 #16 YMR116C ref_pos ref_kmer num_events dwell_time NNNNN_dwell_time mismatch_dwell_time start_idx end_idx --------- ---------- ------------ ------------ ------------------ --------------------- ----------- --------- 180 TTCAA 3 0.06408 0 0 41362 41555 181 TCAAG 1 0.00598 0 0 41344 41362 182 CAAGG 2 0.00963 0 0 41315 41344 183 AAGGG 1 0.00697 0 0 41294 41315 184 AGGGT 1 0.00199 0 0 41288 41294 #15 YLR110C ref_pos ref_kmer num_events dwell_time NNNNN_dwell_time mismatch_dwell_time start_idx end_idx --------- ---------- ------------ ------------ ------------------ --------------------- ----------- --------- 14 TGTCG 2 0.00598 0 0 23697 23715 15 GTCGC 1 0.0073 0 0 23675 23697 16 TCGCT 1 0.00564 0 0 23658 23675 17 CGCTT 1 0.00398 0 0 23646 23658 18 GCTTC 2 0.01328 0 0 23606 23646 #0 YGR240C ref_pos ref_kmer num_events dwell_time NNNNN_dwell_time mismatch_dwell_time start_idx end_idx --------- ---------- ------------ ------------ ------------------ --------------------- ----------- --------- 1656 GAAAA 1 0.00266 0 0 63446 63454 1657 AAAAC 1 0.00764 0 0 63423 63446 1658 AAACA 1 0.00398 0 0 63411 63423 1659 AACAA 1 0.00432 0 0 63398 63411 1660 ACAAA 1 0.00498 0 0 63383 63398 #7 YGR192C ref_pos ref_kmer num_events dwell_time NNNNN_dwell_time mismatch_dwell_time start_idx end_idx --------- ---------- ------------ ------------ ------------------ --------------------- ----------- --------- 3 GTTAG 1 0.00564 0 0 55070 55087 4 TTAGA 1 0.0073 0 0 55048 55070 5 TAGAG 1 0.01162 0 0 55013 55048 6 AGAGT 1 0.00465 0 0 54999 55013 7 GAGTT 3 0.01128 0 0 54965 54999 #4 YHR174W ref_pos ref_kmer num_events dwell_time NNNNN_dwell_time mismatch_dwell_time start_idx end_idx --------- ---------- ------------ ------------ ------------------ --------------------- ----------- --------- 462 TTGAA 2 0.00797 0 0 42640 42664 463 TGAAC 1 0.00531 0 0 42624 42640 464 GAACG 3 0.01726 0 0 42572 42624 465 AACGG 5 0.02423 0 0 42499 42572 466 ACGGT 1 0.00764 0 0 42476 42499
Random access with pandas
output_fn = "./output/eventalign_collapse/stats_eventalign_collapse.tsv" index_fn = "./output/eventalign_collapse/stats_eventalign_collapse.tsv.idx" # Imports import pandas as pd pd.set_option('display.max_columns', 6) # read index file and select random lines index_df = pd.read_csv (index_fn, sep="\t") random_lines = index_df.sample(5) print ("Random index lines") display (random_lines) # Open the collapsed event align file with open (output_fn) as fp: for id, read in random_lines.iterrows(): # Access the header corresponding to the randomly selected index line using seek fp.seek(0) # Return to file start fp.seek(read.byte_offset) # Move to the offset indicated in the index file print (fp.readline().rstrip()) # Print read header df = pd.read_csv (fp, nrows=5, sep="\t") # Read lines corresponding to the read display(df)
Random index lines
ref_id | ref_start | ref_end | ... | missing_kmers | byte_offset | byte_len | |
---|---|---|---|---|---|---|---|
14 | YDR382W | 5 | 329 | ... | 16 | 437025 | 13152 |
6 | YGR192C | 1 | 989 | ... | 61 | 228702 | 38529 |
1 | YCR030C | 1578 | 2576 | ... | 27 | 53047 | 41374 |
5 | YLR441C | 173 | 764 | ... | 37 | 205598 | 23103 |
12 | YLR293C | 1 | 654 | ... | 43 | 382914 | 26166 |
5 rows × 11 columns
#14 YDR382W
ref_pos | ref_kmer | num_events | ... | mismatch_dwell_time | start_idx | end_idx | |
---|---|---|---|---|---|---|---|
0 | 5 | ATACT | 2 | ... | 0.0 | 48813 | 48858 |
1 | 6 | TACTT | 2 | ... | 0.0 | 48794 | 48813 |
2 | 7 | ACTTA | 4 | ... | 0.0 | 48741 | 48794 |
3 | 8 | CTTAG | 1 | ... | 0.0 | 48709 | 48741 |
4 | 9 | TTAGC | 1 | ... | 0.0 | 48681 | 48709 |
5 rows × 8 columns
#6 YGR192C
ref_pos | ref_kmer | num_events | ... | mismatch_dwell_time | start_idx | end_idx | |
---|---|---|---|---|---|---|---|
0 | 1 | TGGTT | 1 | ... | 0.0 | 45267 | 45311 |
1 | 2 | GGTTA | 1 | ... | 0.0 | 45248 | 45267 |
2 | 3 | GTTAG | 1 | ... | 0.0 | 45223 | 45248 |
3 | 4 | TTAGA | 4 | ... | 0.0 | 45159 | 45223 |
4 | 5 | TAGAG | 2 | ... | 0.0 | 45137 | 45159 |
5 rows × 8 columns
#1 YCR030C
ref_pos | ref_kmer | num_events | ... | mismatch_dwell_time | start_idx | end_idx | |
---|---|---|---|---|---|---|---|
0 | 1578 | CCACC | 1 | ... | 0.0 | 47517 | 47528 |
1 | 1579 | CACCC | 15 | ... | 0.0 | 47223 | 47517 |
2 | 1580 | ACCCT | 1 | ... | 0.0 | 47216 | 47223 |
3 | 1581 | CCCTC | 1 | ... | 0.0 | 47209 | 47216 |
4 | 1582 | CCTCA | 5 | ... | 0.0 | 47162 | 47209 |
5 rows × 8 columns
#5 YLR441C
ref_pos | ref_kmer | num_events | ... | mismatch_dwell_time | start_idx | end_idx | |
---|---|---|---|---|---|---|---|
0 | 173 | AGATG | 4 | ... | 0.0 | 27161 | 27241 |
1 | 174 | GATGC | 1 | ... | 0.0 | 27150 | 27161 |
2 | 175 | ATGCT | 1 | ... | 0.0 | 27105 | 27150 |
3 | 178 | CTTTG | 1 | ... | 0.0 | 27078 | 27105 |
4 | 179 | TTTGA | 1 | ... | 0.0 | 27061 | 27078 |
5 rows × 8 columns
#12 YLR293C
ref_pos | ref_kmer | num_events | ... | mismatch_dwell_time | start_idx | end_idx | |
---|---|---|---|---|---|---|---|
0 | 1 | TGTCT | 1 | ... | 0.0 | 81520 | 81535 |
1 | 2 | GTCTG | 2 | ... | 0.0 | 81436 | 81520 |
2 | 3 | TCTGC | 2 | ... | 0.0 | 81371 | 81436 |
3 | 4 | CTGCC | 1 | ... | 0.0 | 81361 | 81371 |
4 | 5 | TGCCC | 1 | ... | 0.0 | 81351 | 81361 |
5 rows × 8 columns