

Eventalign_collapse usage

This program collapses the raw file generated by nanopolish eventalign by kmers rather than by event.

Options

input_fn

Path to a nanopolish eventalign tsv output file (read access required). In command line mode it is also possible to pipe the output of the nanopolish eventalign directly into Eventalign_collapse.

outdir

Path to an existing directory where to write all the output files generated by Eventalign_collapse(write access required). If the directory does not exist an error is raised.

outprefix

Prefix for all the files generated by the program

write_samples

Write the concatenated sample data values corresponding to each kmer in the output data file. This options make only sense if nanopolish eventalign was ran with --samples option.

max_reads

Controls the maximum number of read to parse before stopping to read the input file. This could be usefull for testing or downsampling.

stat_fields

Specify the list of statistical analyses to perform on a kmer basis and add in the output data file. This will only be performed if nanopolish eventalign was ran with --samples option.

Valid statistics fields

mean (mean of signal intensity)
median (mean of signal intensity)
std (standard deviation of signal intensity))
mad (median absolute deviation of signal intensity))
num_signals (Number of raw signal data point)

threads

Eventalign_collapse is multi threaded to speed up the data processing and keep pace with Nanopolish if using the direct piping strategy. Take advantage of many threads if you have access to a large compute cluster

Output files format

Contrary to nanopolish eventalign output text file, in Eventalign_collapse the reads are separated by a hashtag headers containing the read_id and ref_id. This reduces the redundancy and makes it easier to find the start and end of a read.

Example : #7ef1d7b9-5824-4382-b23b-78d82c07ebbd YHR055C.

The main data file contains the following fields:

ref_pos: Reference sequence ID (contig).
ref_kmer: Sequence of the reference kmers.
num_events: Number of events for this kmer before collapsing.
dwell_time: dwell time for this kmer in seconds
NNNNN_dwell_time: dwell time of events for this kmers with a model sequence "NNNNN" (events ignored by nanopolish HMM).
mismatch_dwell_time: dwell time of events for this kmers with a model sequence different from the reference kmer
start_idx: Only if nanopolish eventalign called with --signal_idx. Start coordinate on original raw signal in fast5 file
end_idx: Only if nanopolish eventalign called with --signal_idx. End coordinate on original raw signal in fast5 file
mean: Only if nanopolish eventalign called with --samples. Mean of the normalised signal values provided by Nanopolish eventalign
median: Only if nanopolish eventalign called with --samples. Median of the normalised signal values provided by Nanopolish eventalign
std: Only if nanopolish eventalign called with --samples. Standard deviation of the normalised signal values provided by Nanopolish eventalign
mad: Only if nanopolish eventalign called with --samples. Median absolute deviation of the normalised signal values provided by Nanopolish eventalign
num_signals: Only if nanopolish eventalign called with --samples. Number of raw signal points.
samples: Only if nanopolish eventalign called with --samples and Eventalign_collapse called with --write_samples. List of normalised signal intensity values for this kmer

In addition Eventalign_collapse also generates an useful index file containing reads level information. It contains the following fields:

read_id: Name or index of the read
ref_id: Name of the reference sequence the read was aligned on (contig)
ref_start: Start coordinate of the alignment on the reference sequence
ref_end: End coordinate of the alignment on the reference sequence
dwell_time: Cumulative dwell time in seconds for the entire resquiggled sequence
kmers: Overall number of resquiggled kmers
NNNNN_kmers: Number of resquiggled kmers containing at least 1 event for which the model sequence was "NNNNN"
mismatching_kmers: Number of resquiggled kmers containing at least 1 event for which the model sequence diverged from the reference sequence
missing_kmers: Number of skipped/missing reference positions in nanopolish output
byte_offset: Number of characters before the start of the sequence in the main output file. This can be used in conjunction with file.seek() to directly access the start of a read. An example is provided in the Usage notebook.
byte_len: Length of characters after byte_offset to the end of the read, excluding the last newline. This can be used in conjunction with read() to read all the text chunk corresponding to the read.

Bash command line usage

Command line help

%%bash

# Load local bashrc and activate virtual environment
source ~/.bashrc
workon Nanopolish_0.11.1

NanopolishComp Eventalign_collapse --help

usage: NanopolishComp Eventalign_collapse [-h] [-i INPUT_FN] [-o OUTDIR]
                                          [-p OUTPREFIX] [-s] [-r MAX_READS]
                                          [-f STAT_FIELDS [STAT_FIELDS ...]]
                                          [-t THREADS] [-v | -q]

Collapse the nanopolish eventalign output at kmers level and compute kmer
level statistics

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Increase verbosity (default: False)
  -q, --quiet           Reduce verbosity (default: False)

Input/Output options:
  -i INPUT_FN, --input_fn INPUT_FN
                        Path to a nanopolish eventalign tsv output file. If
                        '0' read from std input (default: 0)
  -o OUTDIR, --outdir OUTDIR
                        Path to the output folder (will be created if it does
                        exist yet) (default: ./)
  -p OUTPREFIX, --outprefix OUTPREFIX
                        text outprefix for all the files generated (default:
                        out)

Run parameters options:
  -s, --write_samples   If given, will write the raw sample if nanopolish
                        eventalign was ran with --samples option (default:
                        False)
  -r MAX_READS, --max_reads MAX_READS
                        Maximum number of read to parse. 0 to deactivate
                        (default: 0)
  -f STAT_FIELDS [STAT_FIELDS ...], --stat_fields STAT_FIELDS [STAT_FIELDS ...]
                        List of statistical fields to compute if nanopolish
                        eventalign was ran with --sample option. Valid values
                        = mean, std, median, mad, num_signals (default:
                        ['mean', 'median', 'num_signals'])

Other options:
  -t THREADS, --threads THREADS
                        Total number of threads. 1 thread is used for the
                        reader and 1 for the writer (default: 4)

Example usage

From an existing nanopolish eventalign output to a file

%%bash

# Load local bashrc and activate virtual environment
source ~/.bashrc
workon Nanopolish_0.11.1

NanopolishComp Eventalign_collapse -i ./data/eventalign_collapse//nanopolish_reads.tsv -o ./output/eventalign_collapse/ -v
head ./output/eventalign_collapse/out_eventalign_collapse.tsv
head ./output/eventalign_collapse/out_eventalign_collapse.tsv.idx

#0  YGR240C
ref_pos ref_kmer    num_events  dwell_time  NNNNN_dwell_time    mismatch_dwell_time
1656    GAAAA   1   0.00266 0.0 0.0
1657    AAAAC   1   0.00764 0.0 0.0
1658    AAACA   1   0.00398 0.0 0.0
1659    AACAA   1   0.00432 0.0 0.0
1660    ACAAA   1   0.00498 0.0 0.0
1661    CAAAG   1   0.00564 0.0 0.0
1662    AAAGA   1   0.00963 0.0 0.0
1663    AAGAT   1   0.00299 0.0 0.0
ref_id  ref_start   ref_end read_id kmers   dwell_time  NNNNN_kmers mismatch_kmers  missing_kmers   byte_offset byte_len
YGR240C 1656    2960    0   1250    13.788570000000009  35  0   54  0   38028
YCR030C 1578    2576    1   971 11.487010000000005  23  0   27  38029   29704
YHR174W 0   839 2   825 9.659210000000002   15  0   14  67734   24231
YHR174W 218 1309    3   1028    11.06325000000001   36  0   63  91966   30801
YHR174W 462 1309    4   818 10.73776000000001   18  0   29  122768  24589
YLR441C 173 764 5   554 5.556939999999999   20  0   37  147358  16437
YGR192C 1   989 6   927 11.731470000000003  37  0   61  163796  27388
YDR500C 9   252 8   231 2.9179100000000027  7   0   14  191185  6751
YGR192C 3   995 7   946 12.30464000000001   31  0   50  197937  28123


Options summary
    package_name: NanopolishComp
    package_version: 0.6.1
    timestamp: 2019-05-08 11:59:11.239171
    quiet: False
    verbose: True
    threads: 4
    stat_fields: ['mean', 'median', 'num_signals']
    write_samples: False
    max_reads: None
    outprefix: out
    outdir: ./output/eventalign_collapse/
    input_fn: ./data/eventalign_collapse//nanopolish_reads.tsv

Checking arguments
    Testing input file readability
    Creating output folder
    Checking number of threads
    Checking if stat_fields names are valid
Starting to process files
    [split_reads] Start reading input file/stream
    [process_read 1] Starting processing reads
    [process_read 2] Starting processing reads
    [write_output] Start rwriting output
14 reads [00:00, 74.94 reads/s] [split_reads] Done
    [process_read 1] Done
    [process_read 2] Done
21 reads [00:00, 88.36 reads/s]
    [write_output] Done
[Eventalign_collapse] total reads: 21 [86.45 reads/s]

From standard input to a file

%%bash

# Load local bashrc and activate virtual environment
source ~/.bashrc
workon Nanopolish_0.11.1

cat ./data/eventalign_collapse//nanopolish_reads_index.tsv | NanopolishComp Eventalign_collapse -o ./output/eventalign_collapse/ --verbose
head ./output/eventalign_collapse/out_eventalign_collapse.tsv
head ./output/eventalign_collapse/out_eventalign_collapse.tsv.idx

#0  YGR240C
ref_pos ref_kmer    num_events  dwell_time  NNNNN_dwell_time    mismatch_dwell_time
1656    GAAAA   1   0.00266 0.0 0.0
1657    AAAAC   1   0.00764 0.0 0.0
1658    AAACA   1   0.00398 0.0 0.0
1659    AACAA   1   0.00432 0.0 0.0
1660    ACAAA   1   0.00498 0.0 0.0
1661    CAAAG   1   0.00564 0.0 0.0
1662    AAAGA   1   0.00963 0.0 0.0
1663    AAGAT   1   0.00299 0.0 0.0
ref_id  ref_start   ref_end read_id kmers   dwell_time  NNNNN_kmers mismatch_kmers  missing_kmers   byte_offset byte_len
YGR240C 1656    2960    0   1250    13.788570000000009  35  0   54  0   38028
YCR030C 1578    2576    1   971 11.487010000000005  23  0   27  38029   29704
YHR174W 0   839 2   825 9.659210000000002   15  0   14  67734   24231
YHR174W 218 1309    3   1028    11.06325000000001   36  0   63  91966   30801
YHR174W 462 1309    4   818 10.73776000000001   18  0   29  122768  24589
YLR441C 173 764 5   554 5.556939999999999   20  0   37  147358  16437
YGR192C 1   989 6   927 11.731470000000003  37  0   61  163796  27388
YDR500C 9   252 8   231 2.9179100000000027  7   0   14  191185  6751
YGR192C 3   995 7   946 12.30464000000001   31  0   50  197937  28123


Options summary
    package_name: NanopolishComp
    package_version: 0.6.1
    timestamp: 2019-05-08 11:59:12.535254
    quiet: False
    verbose: True
    threads: 4
    stat_fields: ['mean', 'median', 'num_signals']
    write_samples: False
    max_reads: None
    outprefix: out
    outdir: ./output
    input_fn: 0

Checking arguments
    Testing input file readability
    Creating output folder
    Checking number of threads
    Checking if stat_fields names are valid
Starting to process files
    [split_reads] Start reading input file/stream
    [process_read 1] Starting processing reads
    [process_read 2] Starting processing reads
    [write_output] Start rwriting output
19 reads [00:00, 74.77 reads/s] [split_reads] Done
    [process_read 1] Done
    [process_read 2] Done
21 reads [00:00, 90.18 reads/s]
    [write_output] Done
[Eventalign_collapse] total reads: 21 [89.3 reads/s]

On the fly, from nanopolish eventalign to a file

%%bash

# Load local bashrc and activate virtual environment
source ~/.bashrc
workon Nanopolish_0.11.1

nanopolish eventalign -t 4 --samples --scale-events --print-read-name \
    --reads ./data/eventalign_collapse/reads.fastq \
    --bam ./data/eventalign_collapse/aligned_reads.bam \
    --genome ./data/eventalign_collapse/reference.fa | \
    NanopolishComp Eventalign_collapse -o ./output

head ./output/eventalign_collapse/out_eventalign_collapse.tsv
head ./output/eventalign_collapse/out_eventalign_collapse.tsv.idx

#0  YGR240C
ref_pos ref_kmer    num_events  dwell_time  NNNNN_dwell_time    mismatch_dwell_time
1656    GAAAA   1   0.00266 0.0 0.0
1657    AAAAC   1   0.00764 0.0 0.0
1658    AAACA   1   0.00398 0.0 0.0
1659    AACAA   1   0.00432 0.0 0.0
1660    ACAAA   1   0.00498 0.0 0.0
1661    CAAAG   1   0.00564 0.0 0.0
1662    AAAGA   1   0.00963 0.0 0.0
1663    AAGAT   1   0.00299 0.0 0.0
ref_id  ref_start   ref_end read_id kmers   dwell_time  NNNNN_kmers mismatch_kmers  missing_kmers   byte_offset byte_len
YGR240C 1656    2960    0   1250    13.788570000000009  35  0   54  0   38028
YCR030C 1578    2576    1   971 11.487010000000005  23  0   27  38029   29704
YHR174W 0   839 2   825 9.659210000000002   15  0   14  67734   24231
YHR174W 218 1309    3   1028    11.06325000000001   36  0   63  91966   30801
YHR174W 462 1309    4   818 10.73776000000001   18  0   29  122768  24589
YLR441C 173 764 5   554 5.556939999999999   20  0   37  147358  16437
YGR192C 1   989 6   927 11.731470000000003  37  0   61  163796  27388
YDR500C 9   252 8   231 2.9179100000000027  7   0   14  191185  6751
YGR192C 3   995 7   946 12.30464000000001   31  0   50  197937  28123


Checking arguments
Starting to process files
18 reads [00:01,  7.25 reads/s][post-run summary] total reads: 21, unparseable: 0, qc fail: 0, could not calibrate: 0, no alignment: 0, bad fast5: 0
21 reads [00:01,  7.84 reads/s]
[Eventalign_collapse] total reads: 21 [14.06 reads/s]

Python API usage

Import the package

# Import main program
from NanopolishComp.Eventalign_collapse import Eventalign_collapse

# Import helper functions
from NanopolishComp.common import jhelp, head

python API help

jhelp(Eventalign_collapse)

NanopolishComp.Eventalign_collapse.init

Collapse the nanopolish eventalign output by kmers rather that by events. kmer level statistics (mean, median, std, mad) are only computed if nanopolish is run with --samples option

input_fn : str (required)

Path to a nanopolish eventalign tsv output file.

outdir : str (default = ./)

Path to the output folder (will be created if it does exist yet)

outprefix : str (default = out)

text outprefix for all the files generated

max_reads : int (default = None)

Maximum number of read to parse. 0 to deactivate (default = 0)

write_samples : bool (default = False)

If given, will write the raw sample if nanopolish eventalign was ran with --samples option

stat_fields : list of str (default = ['mean', 'median', 'num_signals'])

List of statistical fields to compute if nanopolish eventalign was ran with --samples option. Valid values = "mean", "std", "median", "mad", "num_signals"

threads : int (default = 4)

Total number of threads. 1 thread is used for the reader and 1 for the writer (default = 4)

verbose : bool (default = False)

Increase verbosity

quiet : bool (default = False)

Reduce verbosity

Example usage

Example with minimal file

input_fn = "./data/eventalign_collapse//nanopolish_reads.tsv"
outdir = "./output/eventalign_collapse"
outprefix = "basic"

Eventalign_collapse (input_fn=input_fn, outdir=outdir, outprefix=outprefix, threads=6)

head("./output/eventalign_collapse/basic_eventalign_collapse.tsv")
head("./output/eventalign_collapse/basic_eventalign_collapse.tsv.idx")

Checking arguments
Starting to process files
21 reads [00:00, 99.56 reads/s] 
[Eventalign_collapse] total reads: 21 [90.56 reads/s]



#0  YGR240C
ref_pos ref_kmer    num_events  dwell_time  NNNNN_dwell_time    mismatch_dwell_time
1656    GAAAA   1   0.00266 0.0 0.0
1657    AAAAC   1   0.00764 0.0 0.0
1658    AAACA   1   0.00398 0.0 0.0
1659    AACAA   1   0.00432 0.0 0.0
1660    ACAAA   1   0.00498 0.0 0.0
1661    CAAAG   1   0.00564 0.0 0.0
1662    AAAGA   1   0.00963 0.0 0.0
1663    AAGAT   1   0.00299 0.0 0.0

ref_id  ref_start ref_end read_id kmers dwell_time         NNNNN_kmers mismatch_kmers missing_kmers byte_offset byte_len 
YGR240C 1656      2960    0       1250  13.788570000000009 35          0              54            0           38028    
YCR030C 1578      2576    1       971   11.487010000000005 23          0              27            38029       29704    
YHR174W 0         839     2       825   9.659210000000002  15          0              14            67734       24231    
YHR174W 218       1309    3       1028  11.06325000000001  36          0              63            91966       30801    
YHR174W 462       1309    4       818   10.73776000000001  18          0              29            122768      24589    
YLR441C 173       764     5       554   5.556939999999999  20          0              37            147358      16437    
YGR192C 1         989     6       927   11.731470000000003 37          0              61            163796      27388    
YDR500C 9         252     8       231   2.9179100000000027 7           0              14            191185      6751     
YGR192C 3         995     7       946   12.30464000000001  31          0              50            197937      28123

Example with indexes

input_fn = "./data/eventalign_collapse//nanopolish_reads_index.tsv"
outdir = "./output/eventalign_collapse"
outprefix = "index"

Eventalign_collapse (input_fn=input_fn, outdir=outdir, outprefix=outprefix, threads=6)

head("./output/eventalign_collapse/index_eventalign_collapse.tsv")
head("./output/eventalign_collapse/index_eventalign_collapse.tsv.idx")

Checking arguments
Starting to process files
21 reads [00:00, 94.30 reads/s]
[Eventalign_collapse] total reads: 21 [82.25 reads/s]



#0  YGR240C
ref_pos ref_kmer    num_events  dwell_time  NNNNN_dwell_time    mismatch_dwell_time start_idx   end_idx
1656    GAAAA   1   0.00266 0.0 0.0 63446   63454
1657    AAAAC   1   0.00764 0.0 0.0 63423   63446
1658    AAACA   1   0.00398 0.0 0.0 63411   63423
1659    AACAA   1   0.00432 0.0 0.0 63398   63411
1660    ACAAA   1   0.00498 0.0 0.0 63383   63398
1661    CAAAG   1   0.00564 0.0 0.0 63366   63383
1662    AAAGA   1   0.00963 0.0 0.0 63337   63366
1663    AAGAT   1   0.00299 0.0 0.0 63328   63337

ref_id  ref_start ref_end read_id kmers dwell_time         NNNNN_kmers mismatch_kmers missing_kmers byte_offset byte_len 
YGR240C 1656      2960    0       1250  13.788570000000009 35          0              54            0           53046    
YCR030C 1578      2576    1       971   11.487010000000005 23          0              27            53047       41374    
YHR174W 0         839     2       825   9.659210000000002  15          0              14            94422       33595    
YHR174W 218       1309    3       1028  11.06325000000001  36          0              63            128018      43155    
YHR174W 462       1309    4       818   10.73776000000001  18          0              29            171174      34423    
YLR441C 173       764     5       554   5.556939999999999  20          0              37            205598      23103    
YGR192C 1         989     6       927   11.731470000000003 37          0              61            228702      38529    
YDR500C 9         252     8       231   2.9179100000000027 7           0              14            267232      9541     
YGR192C 3         995     7       946   12.30464000000001  31          0              50            276774      39493

Example including samples

input_fn = "./data/eventalign_collapse//nanopolish_reads_index.tsv"
outdir = "./output/eventalign_collapse"
outprefix = "stats"

Eventalign_collapse (input_fn=input_fn, outdir=outdir, outprefix=outprefix, threads=6, quiet=True, stat_fields=["mean", "std", "median", "mad", "num_signals"])

head("./output/eventalign_collapse/stats_eventalign_collapse.tsv")
head("./output/eventalign_collapse/stats_eventalign_collapse.tsv.idx")

[Eventalign_collapse] total reads: 21 [86.98 reads/s]



#0  YGR240C
ref_pos ref_kmer    num_events  dwell_time  NNNNN_dwell_time    mismatch_dwell_time start_idx   end_idx
1656    GAAAA   1   0.00266 0.0 0.0 63446   63454
1657    AAAAC   1   0.00764 0.0 0.0 63423   63446
1658    AAACA   1   0.00398 0.0 0.0 63411   63423
1659    AACAA   1   0.00432 0.0 0.0 63398   63411
1660    ACAAA   1   0.00498 0.0 0.0 63383   63398
1661    CAAAG   1   0.00564 0.0 0.0 63366   63383
1662    AAAGA   1   0.00963 0.0 0.0 63337   63366
1663    AAGAT   1   0.00299 0.0 0.0 63328   63337

ref_id  ref_start ref_end read_id kmers dwell_time         NNNNN_kmers mismatch_kmers missing_kmers byte_offset byte_len 
YGR240C 1656      2960    0       1250  13.788570000000009 35          0              54            0           53046    
YCR030C 1578      2576    1       971   11.487010000000005 23          0              27            53047       41374    
YHR174W 0         839     2       825   9.659210000000002  15          0              14            94422       33595    
YHR174W 218       1309    3       1028  11.06325000000001  36          0              63            128018      43155    
YHR174W 462       1309    4       818   10.73776000000001  18          0              29            171174      34423    
YLR441C 173       764     5       554   5.556939999999999  20          0              37            205598      23103    
YGR192C 1         989     6       927   11.731470000000003 37          0              61            228702      38529    
YDR500C 9         252     8       231   2.9179100000000027 7           0              14            267232      9541     
YGR192C 3         995     7       946   12.30464000000001  31          0              50            276774      39493

Example including samples and writing samples values in ouput file

input_fn = "./data/eventalign_collapse//nanopolish_reads_samples.tsv"
outdir = "./output/eventalign_collapse"
outprefix = "samples"

Eventalign_collapse (input_fn=input_fn, outdir=outdir, outprefix=outprefix, threads=6, quiet=True, stat_fields=["mean", "std", "median", "mad", "num_signals"], write_samples=True)

head("./output/eventalign_collapse/samples_eventalign_collapse.tsv")
head("./output/eventalign_collapse/samples_eventalign_collapse.tsv.idx")

[Eventalign_collapse] total reads: 21 [12.19 reads/s]



#2  YHR174W
ref_pos ref_kmer    num_events  dwell_time  NNNNN_dwell_time    mismatch_dwell_time mean    std median  mad num_signals samples
0   ATGGC   1   0.01096 0.0 0.0 86.3893051147461    3.8192403316497803  85.67970275878906   2.618194580078125   33  96.4434,93.5342,94.5524,87.1342,85.6797,86.6978,85.0978,91.0615,89.316,84.0797,82.916,82.3342,89.316,84.3706,84.3706,85.8251,81.4615,83.0615,85.9706,85.0978,80.7342,82.4796,93.0979,88.0069,82.916,85.3887,84.0797,86.5524,84.3706,82.3342,88.2979,88.2979,85.9706
1   TGGCT   4   0.03253 0.0 0.0 104.78181457519531  6.572873592376709   106.18900299072266  3.9269943237304688  98  104.007,96.1524,106.189,99.0615,108.371,102.407,107.498,100.662,113.462,106.189,111.28,111.28,112.298,102.989,101.971,109.68,106.916,109.243,106.916,103.28,101.825,110.553,112.443,105.898,101.098,99.0615,105.462,102.843,100.516,107.643,109.098,107.789,113.316,109.098,110.116,108.807,106.916,112.443,114.916,110.843,109.098,111.716,103.862,110.262,109.098,105.171,109.68,108.807,112.589,111.425,123.498,110.116,108.371,110.843,111.134,110.843,96.007,94.5524,97.8979,90.6251,94.9888,97.0252,93.9706,104.589,109.971,106.48,108.807,101.389,103.716,96.8797,89.7524,97.4615,98.0434,86.8433,104.443,103.716,102.407,107.207,102.989,93.6797,103.571,104.007,100.08,103.862,109.68,106.625,101.534,97.3161,103.571,97.607,88.7342,109.534,106.771,113.025,108.807,93.2433,108.952,91.207
2   GGCTG   1   0.00465 0.0 0.0 96.55765533447266   6.874111652374268   97.75244903564453   5.382049560546875   14  105.752,98.9161,98.7706,96.7343,95.5706,100.08,105.607,96.4434,88.5888,101.68,85.6797,104.589,88.1524,85.2433
3   GCTGT   5   0.0385  0.0 0.0 81.69342803955078   5.14208984375   80.29779815673828   1.5273017883300781  116 94.1161,90.6251,93.2433,81.316,105.316,87.716,95.7161,90.3342,98.0434,86.116,87.716,107.789,96.2979,83.4978,84.2251,83.3524,79.5705,80.8796,79.1342,78.116,80.4433,80.5887,79.4251,79.2796,81.6069,77.9705,81.6069,81.7524,81.4615,80.4433,86.9888,79.5705,81.316,77.5342,82.6251,77.9705,82.1887,77.6796,80.7342,83.6433,77.9705,78.4069,80.0069,80.1524,80.4433,80.7342,80.2978,84.9524,78.5524,81.316,80.2978,82.1887,81.6069,77.6796,81.0251,78.4069,79.4251,80.4433,79.8615,82.1887,79.8615,80.0069,78.6978,80.7342,80.5887,78.5524,78.8433,79.4251,82.6251,80.0069,77.5342,80.0069,77.5342,83.7887,84.0797,83.0615,82.1887,81.6069,80.8796,81.316,80.0069,78.9887,82.7706,84.516,81.4615,80.5887,79.2796,78.9887,78.4069,80.5887,80.2978,83.7887,78.4069,78.116,79.5705,78.5524,79.4251,79.2796,81.0251,78.8433,78.9887,78.5524,80.1524,77.9705,78.2614,80.8796,78.4069,79.2796,78.2614,79.716,77.6796,79.4251,77.6796,78.6978,77.3887,76.9523
4   CTGTC   1   0.00232 0.0 0.0 102.4487075805664   1.5290673971176147  103.57099914550781  0.2910003662109375  7   103.571,103.716,103.862,101.243,100.371,103.862,100.516
5   TGTCT   4   0.019249999999999996    0.0 0.0 107.04158020019531  5.813553810119629   107.64350128173828  2.5454978942871094  58  113.462,109.534,107.789,112.298,110.116,112.153,109.243,111.862,107.352,112.88,103.716,102.262,103.571,101.971,104.007,104.152,99.6434,104.88,109.971,109.825,114.625,105.898,107.062,88.7342,105.462,112.589,108.516,114.771,106.916,111.716,110.116,106.334,108.08,107.207,110.262,110.116,105.898,109.825,108.08,106.771,107.207,108.08,107.498,114.48,116.807,107.062,109.68,105.607,110.407,110.553,106.189,105.025,109.389,104.88,86.6978,103.425,88.2979,97.4615
6   GTCTC   2   0.00797 0.0 0.0 76.83718872070312   2.1963260173797607  76.87960052490234   1.1636505126953125  24  78.5524,75.6432,75.7887,79.4251,82.3342,77.3887,79.4251,77.8251,76.9523,78.4069,76.9523,77.8251,76.6614,78.9887,76.6614,76.8069,75.7887,71.8614,74.7705,74.6251,77.3887,76.8069,74.3341,72.8796
7   TCTCT   1   0.00232 0.0 0.0 80.0484619140625    1.0306347608566284  80.00689697265625   0.727294921875  7   78.6978,79.2796,79.2796,80.0069,81.4615,81.6069,80.0069

ref_id  ref_start ref_end read_id kmers dwell_time         NNNNN_kmers mismatch_kmers missing_kmers byte_offset byte_len 
YHR174W 0         839     2       825   9.659210000000002  15          0              14            0           317027   
YCR030C 1578      2576    1       971   11.487010000000005 23          0              27            317028      376326   
YGR240C 1656      2960    0       1250  13.788570000000009 35          0              54            693355      461011   
YHR174W 218       1309    3       1028  11.06325000000001  36          0              63            1154367     372126   
YLR441C 173       764     5       554   5.556939999999999  20          0              37            1526494     190670   
YDR500C 9         252     8       231   2.9179100000000027 7           0              14            1717165     93821    
YHR174W 462       1309    4       818   10.73776000000001  18          0              29            1810987     342528   
YGR192C 3         995     7       946   12.30464000000001  31          0              50            2153516     392222   
YDR224C 0         389     9       376   4.2840300000000004 8           0              13            2545739     141262

Using the index to random access a specific entry in the file

Random access with standard library

output_fn = "./output/eventalign_collapse/stats_eventalign_collapse.tsv"
index_fn = "./output/eventalign_collapse/stats_eventalign_collapse.tsv.idx"

# Imports
import csv
from tabulate import tabulate
from random import sample
from itertools import islice

# read index file and select random lines
index_list = []
with open (index_fn) as fp:
    for l in csv.DictReader(fp, delimiter='\t'):
        index_list.append(l)
random_index = sample(index_list, k=5)
print ("RANDOM INDEX LINES")
print (tabulate(random_index, headers="keys"))

# Open the collapsed event align file
with open (output_fn) as fp:
    for index in random_index:
        # Access the header corresponding to the randomly selected index line using seek 
        fp.seek(0) # Return to file start
        fp.seek(int(index["byte_offset"])) # Move to the offset indicated in the index file
        print ("\n" + fp.readline().rstrip()) # Print read header
        # Get 5 first lines and print then
        data_list = []
        for l in islice(csv.DictReader(fp, delimiter='\t'), 5):
            data_list.append(l)
        print (tabulate(data_list, headers="keys"))

RANDOM INDEX LINES
ref_id      ref_start    ref_end    read_id    kmers    dwell_time    NNNNN_kmers    mismatch_kmers    missing_kmers    byte_offset    byte_len
--------  -----------  ---------  ---------  -------  ------------  -------------  ----------------  ---------------  -------------  ----------
YMR116C           180        950         16      740       9.30597             20                 0               30         482785       30690
YLR110C            14        395         15      364       4.61444              9                 0               17         450178       15023
YGR240C          1656       2960          0     1250      13.7886              35                 0               54              0       53046
YGR192C             3        995          7      946      12.3046              31                 0               50         276774       39493
YHR174W           462       1309          4      818      10.7378              18                 0               29         171174       34423

#16 YMR116C
  ref_pos  ref_kmer      num_events    dwell_time    NNNNN_dwell_time    mismatch_dwell_time    start_idx    end_idx
---------  ----------  ------------  ------------  ------------------  ---------------------  -----------  ---------
      180  TTCAA                  3       0.06408                   0                      0        41362      41555
      181  TCAAG                  1       0.00598                   0                      0        41344      41362
      182  CAAGG                  2       0.00963                   0                      0        41315      41344
      183  AAGGG                  1       0.00697                   0                      0        41294      41315
      184  AGGGT                  1       0.00199                   0                      0        41288      41294

#15 YLR110C
  ref_pos  ref_kmer      num_events    dwell_time    NNNNN_dwell_time    mismatch_dwell_time    start_idx    end_idx
---------  ----------  ------------  ------------  ------------------  ---------------------  -----------  ---------
       14  TGTCG                  2       0.00598                   0                      0        23697      23715
       15  GTCGC                  1       0.0073                    0                      0        23675      23697
       16  TCGCT                  1       0.00564                   0                      0        23658      23675
       17  CGCTT                  1       0.00398                   0                      0        23646      23658
       18  GCTTC                  2       0.01328                   0                      0        23606      23646

#0  YGR240C
  ref_pos  ref_kmer      num_events    dwell_time    NNNNN_dwell_time    mismatch_dwell_time    start_idx    end_idx
---------  ----------  ------------  ------------  ------------------  ---------------------  -----------  ---------
     1656  GAAAA                  1       0.00266                   0                      0        63446      63454
     1657  AAAAC                  1       0.00764                   0                      0        63423      63446
     1658  AAACA                  1       0.00398                   0                      0        63411      63423
     1659  AACAA                  1       0.00432                   0                      0        63398      63411
     1660  ACAAA                  1       0.00498                   0                      0        63383      63398

#7  YGR192C
  ref_pos  ref_kmer      num_events    dwell_time    NNNNN_dwell_time    mismatch_dwell_time    start_idx    end_idx
---------  ----------  ------------  ------------  ------------------  ---------------------  -----------  ---------
        3  GTTAG                  1       0.00564                   0                      0        55070      55087
        4  TTAGA                  1       0.0073                    0                      0        55048      55070
        5  TAGAG                  1       0.01162                   0                      0        55013      55048
        6  AGAGT                  1       0.00465                   0                      0        54999      55013
        7  GAGTT                  3       0.01128                   0                      0        54965      54999

#4  YHR174W
  ref_pos  ref_kmer      num_events    dwell_time    NNNNN_dwell_time    mismatch_dwell_time    start_idx    end_idx
---------  ----------  ------------  ------------  ------------------  ---------------------  -----------  ---------
      462  TTGAA                  2       0.00797                   0                      0        42640      42664
      463  TGAAC                  1       0.00531                   0                      0        42624      42640
      464  GAACG                  3       0.01726                   0                      0        42572      42624
      465  AACGG                  5       0.02423                   0                      0        42499      42572
      466  ACGGT                  1       0.00764                   0                      0        42476      42499

Random access with pandas

output_fn = "./output/eventalign_collapse/stats_eventalign_collapse.tsv"
index_fn = "./output/eventalign_collapse/stats_eventalign_collapse.tsv.idx"

# Imports
import pandas as pd
pd.set_option('display.max_columns', 6)

# read index file and select random lines
index_df = pd.read_csv (index_fn, sep="\t")
random_lines = index_df.sample(5)
print ("Random index lines")
display (random_lines)

# Open the collapsed event align file
with open (output_fn) as fp:
    for id, read in random_lines.iterrows():

        # Access the header corresponding to the randomly selected index line using seek 
        fp.seek(0) # Return to file start
        fp.seek(read.byte_offset) # Move to the offset indicated in the index file
        print (fp.readline().rstrip()) # Print read header
        df = pd.read_csv (fp, nrows=5, sep="\t") # Read lines corresponding to the read
        display(df)

Random index lines

	ref_id	ref_start	ref_end	...	missing_kmers	byte_offset	byte_len
14	YDR382W	5	329	...	16	437025	13152
6	YGR192C	1	989	...	61	228702	38529
1	YCR030C	1578	2576	...	27	53047	41374
5	YLR441C	173	764	...	37	205598	23103
12	YLR293C	1	654	...	43	382914	26166

5 rows × 11 columns

#14 YDR382W

	ref_pos	ref_kmer	num_events	...	start_idx	end_idx
0	5	ATACT	2	...	48813	48858
1	6	TACTT	2	...	48794	48813
2	7	ACTTA	4	...	48741	48794
3	8	CTTAG	1	...	48709	48741
4	9	TTAGC	1	...	48681	48709

5 rows × 8 columns

#6  YGR192C

	ref_pos	ref_kmer	num_events	...	start_idx	end_idx
0	1	TGGTT	1	...	45267	45311
1	2	GGTTA	1	...	45248	45267
2	3	GTTAG	1	...	45223	45248
3	4	TTAGA	4	...	45159	45223
4	5	TAGAG	2	...	45137	45159

5 rows × 8 columns

#1  YCR030C

	ref_pos	ref_kmer	num_events	...	start_idx	end_idx
0	1578	CCACC	1	...	47517	47528
1	1579	CACCC	15	...	47223	47517
2	1580	ACCCT	1	...	47216	47223
3	1581	CCCTC	1	...	47209	47216
4	1582	CCTCA	5	...	47162	47209

5 rows × 8 columns

#5  YLR441C

	ref_pos	ref_kmer	num_events	...	start_idx	end_idx
0	173	AGATG	4	...	27161	27241
1	174	GATGC	1	...	27150	27161
2	175	ATGCT	1	...	27105	27150
3	178	CTTTG	1	...	27078	27105
4	179	TTTGA	1	...	27061	27078

5 rows × 8 columns

#12 YLR293C

	ref_pos	ref_kmer	num_events	...	start_idx	end_idx
0	1	TGTCT	1	...	81520	81535
1	2	GTCTG	2	...	81436	81520
2	3	TCTGC	2	...	81371	81436
3	4	CTGCC	1	...	81361	81371
4	5	TGCCC	1	...	81351	81361

5 rows × 8 columns