Seedlib

Data structures

Maël Kerbiriou @ BONSAI Project coordinated by Helene Touzet, Antoine Limasset and Rayan Chikhi

Problem: Mapping with

Approximate String Matching

  • (Self)-Mapping nanopore reads
  • Large expression ranges (RNA)

⇒ Need better sensibility, while avoiding flood from false positives

Goals:
  • Practical implementation of ASM for large datasets
    ⇒ targeting Quentin's application while being reusable for other projects
  • Specialized (simpler than OSS): a single error is tolerated
  • Preserve sensibility: short seeds $\simeq 15$-mer
  • No full text index: never go back to the text
Methods for Approximate String Matching

[1]'s main lemma: All strings within $n$-editions neighborhood of a query pattern can be divided in $n+2$ parts, ensuring that:

  • 2 parts are exact matches of their corresponding parts in the pattern
  • These exact matches frame 0 to $n$ parts, each with exactly one error

[1] Vroland, Christophe, et al. "Approximate search of short patterns with high error rates using the 01⁎0 lossless seeds." Journal of Discrete Algorithms 37 (2016): 3-16.

$\operatorname{lev}(\cdot,\cdot)=1$ approximate matches

[1]'s main lemma: All strings within $n$-editions neighborhood of a query pattern can be divided in $\mathbf{1}+2$ parts, ensuring that:

  • 2 parts are exact matches of their corresponding parts in the pattern
  • These exact matches frame 0 to $\mathbf{1}$ parts, each with exactly one error

[1] Vroland, Christophe, et al. "Approximate search of short patterns with high error rates using the 01⁎0 lossless seeds." Journal of Discrete Algorithms 37 (2016): 3-16.

$\operatorname{lev}(\cdot,\cdot)=1$ approximate matches

Find $3k$-mer seeds on target matching the query pattern with at most 1 InDel or substitution

3 parts patterns: 2 exact matches framing either:

  • 1 exact match: 000
  • 0 approximate match: 100, 001
  • 1 approximate match: 010
$\operatorname{lev}(\cdot,\cdot)=1$ approximate matches
000, 001, 100 $\implies$ 00 matches

000, 001, 100 cases are covered by $2k$-mer exact matches from 00 queries

010 matches
Find $3k$-mer seeds on target matching the $\left(3k+\{-1,0,1\}\right)$-mer query pattern with at most 1 InDel or substitution
010 matches
Find $3k$-mer seeds on target matching the $\left(3k+\{-1,0,1\}\right)$-mer query pattern with at most 1 InDel or substitution
General idea

Split all $3k$-seeds from the target text into $(b_1,b_2,b_3)$ equal parts.
Make a pair of indexes for looking up a $(b'_1,b'_2,b'_3)$ query pattern:

  • 000: $(b'_1, b'_2) \rightarrow \left[ pos \right]$
  • 010: $(b'_1, b'_3) \rightarrow \left[ (b_2, pos) \right]$
010 queries

$(b'_1, b'_3) \rightarrow \left[ (b_2, pos) \right]$

  • Filter $\left[ (b_2, pos) \right]$ results such that $\operatorname{lev}(b_2, b'_2)=1$
  • $b'_2$ from query is varying in length: $k+\{-1,0,1\}$
    Allows matches with I/M/D errors
    ⇒ three $b'_3$ are extracted from the query at differents positions
Indexing

Partition by $b_1$

$\Omega^k$ pairs of indexes answering:

  • 000: $(b'_1, b'_2) \rightarrow \left[ pos \right]$
  • 010: $(b'_1, b'_3) \rightarrow \left[ (b_2, pos) \right]$
Indexing

Partition by $b_1$

$\Omega^k$ pairs of indexes answering:

  • 000: $b'_2 \rightarrow \left[ pos \right]$
  • 010: $b'_3 \rightarrow \left[ (b_2, pos) \right]$
Indexing

Within each $b_1$-labeled partition,
link the two indexes:

  • 000: $b'_2 \rightarrow \left[ pos \right]$
    000: $i \rightarrow (b_2, pos)$
  • 010: $b_3 \rightarrow [ i ]$

Where $i$s are numeric indices in $(b_2, pos)$ table

Indexing

Within each $b_1$-labeled partition,
link the two indexes:

  • 000: $b'_2 \rightarrow \left[ pos \right]$
    000: $i \rightarrow (b_2, pos)$
  • 010: $b_3 \rightarrow [ i ]$
Data structure

multimap from dense keys: $k \rightarrow [v]$

  • Forward lookup
  • lookup values by indices: $i \rightarrow v$
  • lookup keys by indices: $i \rightarrow k$

yield: $i \rightarrow (k, v)$

Data structure $k \rightarrow [v]$
Data structure $k \rightarrow [v]$
Data structure

Implementation:

  • Sort $(k,v)$ pairs by keys, then store
    • values ordered by keys $f\colon i \rightarrow v$
    • indices of first ocurrence of each key $g\colon k \rightarrow i_{k,0}$
  • Forward lookup: $[f]^{g(k+1)}_{g(k)}$
  • lookup keys by indices: bissect $g$
  • yield: $i \xrightarrow[]{g^{-1}, f} (k, v)$

Data structure

Alternative implementation: rank/select

  • Sort the $N$ $(k,v)$ pairs by keys, then store
    • values ordered by keys $f\colon i \rightarrow v$
    • A (succint) $N$-bits array with 1s delimiting the runs of constant $k$
  • Forward lookup: $[f]^{\operatorname{select}(k+1)}_{\operatorname{select}(k)}$
  • lookup keys by indices: $\operatorname{rank}(i)$
  • Slower: rank/select doing bissections anyway. Succint representation is not that useful since $f$ takes the majority of space in both cases

Recap
Example of 010 query for CGA AGT ATA :
Recap
  • Partition seeds by their $b_1$ prefixes
  • Uses two multimaps by partition
    • One for $b'_2 \rightarrow [pos]$ (00 lookups),
      along with $i \rightarrow (b_2, pos)$ lookups
    • One for $b'_3 \rightarrow [i]$
    • Combining them, we get the 010 lookups: $b'_3 \rightarrow \left[(b_2, pos)\right]$
    • The first multimap allocates an array for accelerating the bissection
Validation

Naive seed and extend implementation in Python.

The test runner generate a random reference sequence and simulated reads. Then, mappings from both implementations are compared.

$\rightarrow$ Same mappings, with the difference that C++ implentation misses $k$ $2k$-mers at the end of reads.

Performance
Dataset BYK_CNA_ONT_1_FAH60542_A: nanopore cDNA, avg length: 697bp
 
reads kmers
(M)
indexing
time (s)
index size 00 results
(M)
010 results
(M)
query time
(s)
time per
match (µs)
Bench
1000 1000 0.64 0.48 10.37 0.28 2.54 0.45 160.43
10000 10000 6.82 1.60 38.00 218.18 27.91 19.22 78.10
10000, batch queries 10000 6.82 1.61 38.00 216.26 27.83 11.88 48.69
10000, batch queries, E>2.5bits 10000 6.44 1.47 35.86 157.78 15.30 9.68 55.92
100000 100000 76.55 15.68 412.00 25302.13 3144.67 1055.83 37.12

Conclusion

  • Software library for specialized 01⁎0 seed indexing
  • Re-usable key-dense multimaps that can be linked together (based on sdsl::intvector<>)
  • Failled to use succint indexes. (Succint permutations encoding ?)