Maël Kerbiriou @ BONSAI
Project coordinated by Helene Touzet, Antoine Limasset and Rayan Chikhi
Problem: Mapping with
Approximate String Matching
(Self)-Mapping nanopore reads
Large expression ranges (RNA)
⇒ Need better sensibility, while avoiding flood from false positives
Goals:
Practical implementation of ASM for large datasets ⇒ targeting
Quentin's application while being reusable for other projects
Specialized (simpler than OSS): a single error is tolerated
Preserve sensibility: short seeds $\simeq 15$-mer
No full text index: never go back to the text
Methods for Approximate String Matching
[1]'s main lemma: All strings within $n$-editions neighborhood of a query pattern can be divided in
$n+2$ parts, ensuring
that:
2 parts are exact matches of their corresponding parts in the pattern
These exact matches frame 0 to $n$ parts, each with exactly one error
[1] Vroland, Christophe, et al. "Approximate search
of short patterns with high
error rates using the 01⁎0 lossless seeds."
Journal of Discrete Algorithms 37 (2016): 3-16.
[1]'s main lemma: All strings within $n$-editions neighborhood of a query pattern can be divided in
$\mathbf{1}+2$ parts, ensuring
that:
2 parts are exact matches of their corresponding parts in the pattern
These exact matches frame 0 to $\mathbf{1}$ parts, each with exactly one error
[1] Vroland, Christophe, et al. "Approximate search
of short patterns with high
error rates using the 01⁎0 lossless seeds."
Journal of Discrete Algorithms 37 (2016): 3-16.
Filter $\left[ (b_2, pos) \right]$ results such that $\operatorname{lev}(b_2, b'_2)=1$
$b'_2$ from query is varying in length: $k+\{-1,0,1\}$ Allows matches
with I/M/D errors
⇒ three $b'_3$ are extracted from the query at differents positions