Seedlib

Data structures

Maël Kerbiriou @ BONSAI Project coordinated by Helene Touzet, Antoine Limasset and Rayan Chikhi

Problem: Mapping with

Approximate String Matching

(Self)-Mapping nanopore reads
Large expression ranges (RNA)

⇒ Need better sensibility, while avoiding flood from false positives

Goals:

Practical implementation of ASM for large datasets
⇒ targeting Quentin's application while being reusable for other projects
Specialized (simpler than OSS): a single error is tolerated
Preserve sensibility: short seeds $\simeq 15$-mer
No full text index: never go back to the text

Methods for Approximate String Matching

[1]'s main lemma: All strings within $n$-editions neighborhood of a query pattern can be divided in $n+2$ parts, ensuring that:

2 parts are exact matches of their corresponding parts in the pattern
These exact matches frame 0 to $n$ parts, each with exactly one error

[1] Vroland, Christophe, et al. "Approximate search of short patterns with high error rates using the 01⁎0 lossless seeds." Journal of Discrete Algorithms 37 (2016): 3-16.

$\operatorname{lev}(\cdot,\cdot)=1$ approximate matches

[1]'s main lemma: All strings within $n$-editions neighborhood of a query pattern can be divided in $\mathbf{1}+2$ parts, ensuring that:

2 parts are exact matches of their corresponding parts in the pattern
These exact matches frame 0 to $\mathbf{1}$ parts, each with exactly one error

[1] Vroland, Christophe, et al. "Approximate search of short patterns with high error rates using the 01⁎0 lossless seeds." Journal of Discrete Algorithms 37 (2016): 3-16.

$\operatorname{lev}(\cdot,\cdot)=1$ approximate matches

Find $3k$-mer seeds on target matching the query pattern with at most 1 InDel or substitution

3 parts patterns: 2 exact matches framing either:

1 exact match: 000
0 approximate match: 100, 001
1 approximate match: 010

$\operatorname{lev}(\cdot,\cdot)=1$ approximate matches

000, 001, 100 $\implies$ 00 matches

000, 001, 100 cases are covered by $2k$-mer exact matches from 00 queries

010 matches

Find $3k$-mer seeds on target matching the $\left(3k+\{-1,0,1\}\right)$-mer query pattern with at most 1 InDel or substitution

010 matches

Find $3k$-mer seeds on target matching the $\left(3k+\{-1,0,1\}\right)$-mer query pattern with at most 1 InDel or substitution

General idea

Split all $3k$-seeds from the target text into $(b_1,b_2,b_3)$ equal parts.
Make a pair of indexes for looking up a $(b'_1,b'_2,b'_3)$ query pattern:

000: $(b'_1, b'_2) \rightarrow \left[ pos \right]$
010: $(b'_1, b'_3) \rightarrow \left[ (b_2, pos) \right]$

010 queries

$(b'_1, b'_3) \rightarrow \left[ (b_2, pos) \right]$

Filter $\left[ (b_2, pos) \right]$ results such that $\operatorname{lev}(b_2, b'_2)=1$
$b'_2$ from query is varying in length: $k+\{-1,0,1\}$
Allows matches with I/M/D errors
⇒ three $b'_3$ are extracted from the query at differents positions

Indexing

Partition by $b_1$

$\Omega^k$ pairs of indexes answering:

000: $(b'_1, b'_2) \rightarrow \left[ pos \right]$
010: $(b'_1, b'_3) \rightarrow \left[ (b_2, pos) \right]$

Indexing

Partition by $b_1$

$\Omega^k$ pairs of indexes answering:

000: $b'_2 \rightarrow \left[ pos \right]$
010: $b'_3 \rightarrow \left[ (b_2, pos) \right]$

Indexing

Within each $b_1$-labeled partition,
link the two indexes:

000: $b'_2 \rightarrow \left[ pos \right]$
000: $i \rightarrow (b_2, pos)$
010: $b_3 \rightarrow [ i ]$

Where $i$s are numeric indices in $(b_2, pos)$ table

Indexing

Within each $b_1$-labeled partition,
link the two indexes:

000: $b'_2 \rightarrow \left[ pos \right]$
000: $i \rightarrow (b_2, pos)$
010: $b_3 \rightarrow [ i ]$

Data structure

multimap from dense keys: $k \rightarrow [v]$

Forward lookup
lookup values by indices: $i \rightarrow v$
lookup keys by indices: $i \rightarrow k$

yield: $i \rightarrow (k, v)$

Data structure $k \rightarrow [v]$

Data structure

Implementation:

Sort $(k,v)$ pairs by keys, then store
- values ordered by keys $f\colon i \rightarrow v$
- indices of first ocurrence of each key $g\colon k \rightarrow i_{k,0}$
Forward lookup: $[f]^{g(k+1)}_{g(k)}$
lookup keys by indices: bissect $g$

yield: $i \xrightarrow[]{g^{-1}, f} (k, v)$

Data structure

Alternative implementation: rank/select

Sort the $N$ $(k,v)$ pairs by keys, then store
- values ordered by keys $f\colon i \rightarrow v$
- A (succint) $N$-bits array with 1s delimiting the runs of constant $k$
Forward lookup: $[f]^{\operatorname{select}(k+1)}_{\operatorname{select}(k)}$
lookup keys by indices: $\operatorname{rank}(i)$

Slower: rank/select doing bissections anyway. Succint representation is not that useful since $f$ takes the majority of space in both cases

Recap

Example of 010 query for CGA AGT ATA :

Recap

Partition seeds by their $b_1$ prefixes
Uses two multimaps by partition
- One for $b'_2 \rightarrow [pos]$ (00 lookups),
  along with $i \rightarrow (b_2, pos)$ lookups
- One for $b'_3 \rightarrow [i]$
- Combining them, we get the 010 lookups: $b'_3 \rightarrow \left[(b_2, pos)\right]$
- The first multimap allocates an array for accelerating the bissection

Validation

Naive seed and extend implementation in Python.

The test runner generate a random reference sequence and simulated reads. Then, mappings from both implementations are compared.

$\rightarrow$ Same mappings, with the difference that C++ implentation misses $k$ $2k$-mers at the end of reads.

Performance

Dataset BYK_CNA_ONT_1_FAH60542_A: nanopore cDNA, avg length: 697bp

	reads	kmers (M)	indexing time (s)	index size	00 results (M)	010 results (M)	query time (s)	time per match (µs)
Bench
1000	1000	0.64	0.48	10.37	0.28	2.54	0.45	160.43
10000	10000	6.82	1.60	38.00	218.18	27.91	19.22	78.10
10000, batch queries	10000	6.82	1.61	38.00	216.26	27.83	11.88	48.69
10000, batch queries, E>2.5bits	10000	6.44	1.47	35.86	157.78	15.30	9.68	55.92
100000	100000	76.55	15.68	412.00	25302.13	3144.67	1055.83	37.12

Conclusion

Software library for specialized 01⁎0 seed indexing
Re-usable key-dense multimaps that can be linked together (based on sdsl::intvector<>)
Failled to use succint indexes. (Succint permutations encoding ?)