Indels, though differing in allele sequence and position, are biologically equivalent when they lead to the same altered sequences. Storing biologically equivalent indels as distinct entries in databases causes data redundancy, and may mislead downstream analysis and interpretations. About 10% of the human indels stored in dbSNP are redundant. It is thus desirable to have a unified system for identifying and representing equivalent indels in publically available databases. Moreover, a unified system is also desirable to compare the indel calling results produced by different tools. This paper describes UPS-indel, a utility tool that creates a universal positioning system for indels so that equivalent indels can be uniquely determined by their coordinates in the new system, which also can be used to compare indel calling results produced by different tools.
UPS-indel identifies nearly 15% indels in dbSNP (version 142) as redundant across all human chromosomes, higher than previously reported. When applied to COSMIC coding and noncoding indel datasets, UPS-indel identifies nearly 29% and 13% indels as redundant, respectively. Comparing the performance of UPS-indel with existing variant normalization tools vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants shows that UPS-indel is able to identify 456,352 more redundant indels in dbSNP; 2,118 more in COSMIC coding, and 553 more in COSMIC noncoding indel dataset in addition to the ones reported jointly by these tools. Moreover, comparing UPS-indel to other state-of-the-art approaches for indel call set comparison demonstrates that UPS-indel is clearly superior to other approaches in finding indels in common among call sets.
UPS-indel is theoretically proven to find all equivalent indels, and is thus exhaustive. UPS-indel is written in C++ and the source code is freely available here.
2160K Torgersen Hall,
620 Drillfield Dr,
Blacksburg, VA 24060.
lqzhang (AT) cs (DOT) vt (DOT) edu