Since the release of the seminal Mash tool, data sketches such as MinHash have become instrumental in comparative genomics. They are used to cluster genomes from large databases, search for datasets with certain sequence content, accelerate the overlapping step in genome assemblers, map sequencing reads, and find similarity thresholds characterizing species-level distinctions. Whereas MinHash was originally developed to find similar web pages, here it is used to summarize large genomic sequence collections such as reference genomes or sequencing datasets. A collection is reduced to a set of representative k-mers and ultimately stored as a list of integers. The summary is much smaller than the original data but can be used to estimate relevant set cardinalities, such as the size of the union or the intersection between the k-mer contents of two genomes. From these cardinalities one can obtain a Jaccard coefficient (J) or a “Mash distance,” which is a proxy for Average Nucleotide Identity (ANI). These measures make it possible to cluster sequences and otherwise solve massive genomic nearest-neighbor problems. MinHash is related to other core methods in bioinformatics. Minimizers, which can be thought of as a special case of MinHash, are widely used in metagenomics classification, alignment, and assembly. More generally, MinHash can be seen as a kind of Locality-Sensitive Hashing (LSH), which involves hash functions designed to map similar inputs to the same value. LSH has also been used in bioinformatics, including in homology search and metagenomics classification.
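To make the sketching idea concrete, here is a minimal Python sketch of a bottom-s MinHash over k-mers, followed by a Jaccard estimate and a Mash-style distance. It is an illustration under stated assumptions, not how Mash itself is implemented: the function names (`kmers`, `hash_kmer`, `sketch`, `jaccard_estimate`, `mash_distance`) are hypothetical, BLAKE2b stands in for whatever hash a real tool uses, and reverse complements (canonical k-mers) are ignored for brevity. The distance uses the formula published with Mash, D = -(1/k) ln(2J / (1 + J)), capped to the range 0 to 1.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an assumption made here)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, s):
    """Bottom-s sketch: the s smallest distinct k-mer hashes, sorted."""
    return sorted({hash_kmer(x) for x in kmers(seq, k)})[:s]


def jaccard_estimate(a, b, s):
    """Estimate J from two bottom-s sketches.

    Take the s smallest hashes of the merged sketches and count how many
    of them occur in both inputs.
    """
    merged = sorted(set(a) | set(b))[:s]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Mash-style distance from a Jaccard estimate j and k-mer length k."""
    if j <= 0.0:
        return 1.0
    d = -math.log(2.0 * j / (1.0 + j)) / k
    return min(1.0, max(0.0, d))
```

For example, `jaccard_estimate(sketch(x, 21, 1000), sketch(y, 21, 1000), 1000)` compares two sequences `x` and `y` using only 1000 integers each, regardless of genome length. The values k = 21 and s = 1000 are illustrative choices here; larger sketches reduce the variance of the estimate at the cost of more storage.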