Merge cell data sets, panorama style


A new algorithm developed by MIT researchers takes cues from panoramic photography to merge massive, diverse cell datasets into a single source that can be used for medical and biological studies. Credit: Massachusetts Institute of Technology

Single-cell datasets profile the gene expression of human cells, such as neurons, muscle cells, and immune cells, to yield insights into human health and the treatment of disease. The datasets are produced by a range of labs and technologies and contain extremely diverse cell types. Combining them into a single data pool could open up new research possibilities, but doing so effectively and efficiently is difficult.

Traditional methods tend to cluster cells based on non-biological patterns, such as the lab or technology that produced them, or accidentally merge dissimilar cells that appear the same. Methods that correct these mistakes do not scale well to large datasets and require that all merged datasets share at least one cell type in common.

In a paper published today in Nature Biotechnology, the MIT researchers describe an algorithm that can efficiently merge more than 20 datasets of vastly differing cell types into a larger "panorama." The algorithm, called "Scanorama," automatically finds and stitches together cell types shared between two datasets, much like combining overlapping pixels in images to generate a panoramic photo.

As long as any other dataset shares a cell type with at least one dataset already in the final panorama, it can be merged as well. Not all of the datasets need to have a cell type in common, however. The algorithm preserves all cell types specific to each dataset.

"Traditional methods force cells to align regardless of what the cell types are. They create a structureless blob, and you lose all interesting biological differences," says Brian Hie, a PhD student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and a researcher in the Computation and Biology group. "You can give Scanorama datasets that shouldn't align together, and the algorithm will separate the datasets according to biological differences."

In their paper, the researchers successfully merged more than 100,000 cells from 26 different datasets containing a wide range of human cells, creating a single, diverse source of data. With traditional methods, that would take roughly a day of computation, but Scanorama completed the task in about 30 minutes. The researchers say the work represents the largest number of datasets ever merged together.

Joining Hie on the paper are Bonnie Berger, the Simons Professor of Mathematics at MIT, a professor of electrical engineering and computer science, and head of the Computation and Biology group; and Bryan Bryson, an assistant professor of biological engineering at MIT.

Linking "mutual neighbors"

Humans have hundreds of categories and subcategories of cells, and each cell expresses a diverse set of genes. Techniques such as RNA sequencing capture that information in a sprawling multidimensional space. Cells are points scattered through that space, and each dimension corresponds to the expression of a different gene.

Scanorama runs a modified computer-vision algorithm, called "mutual nearest neighbors matching," which finds the closest (most similar) points in two computational spaces. Developed at CSAIL, the algorithm was initially used to find pixels with matching features, such as color levels, in different photos. That could help computers match a patch of pixels representing an object in one image to the same patch of pixels in another image where the object's position has changed drastically. It could also be used to stitch vastly different images together into a panorama.

The researchers adapted the algorithm to find cells with overlapping gene expression, instead of overlapping pixel features, and to do so across multiple datasets instead of two. The level of gene expression in a cell determines its function and, in turn, its location in the computational space. If the datasets are stacked on top of one another, cells with similar gene expression, even if they come from different datasets, will sit in roughly the same locations.

For each dataset, Scanorama first links each cell to its closest neighbor among all the other datasets, meaning the two cells most likely occupy similar locations. But the algorithm retains only links where cells in both datasets are each other's nearest neighbor, a mutual link. For instance, if cell A's nearest neighbor is cell B, and cell B's nearest neighbor is cell A, the link is a keeper. If, however, cell B's nearest neighbor is a separate cell C, the link between cells A and B is discarded.

Keeping mutual links increases the likelihood that the linked cells are, in fact, the same cell type. Breaking non-mutual links, on the other hand, prevents cell types specific to each dataset from merging with incorrect cell types. Once all mutual links are found, the algorithm stitches all the datasets together. In doing so, it combines the same cell types but keeps cell types unique to any dataset separate from the merged cells. "The mutual links form anchors that enable [correct] cell alignment across datasets," says Berger.
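The mutual-link rule can be illustrated with a few lines of Python. This is only a minimal sketch, not the researchers' implementation: it assumes two small datasets of cells given as coordinate tuples, uses a brute-force search for a single nearest neighbor, and the names nearest and mutual_links are hypothetical. The example mirrors the A, B, C case: cell A points to cell B, but cell B's nearest neighbor is cell C, so only the B-C link survives.

```python
import math

def nearest(cell, others):
    """Index of the point in `others` closest to `cell` (Euclidean distance)."""
    return min(range(len(others)), key=lambda j: math.dist(cell, others[j]))

def mutual_links(dataset_a, dataset_b):
    """Return (i, j) pairs where dataset_a[i] and dataset_b[j] are each other's nearest neighbor."""
    a_to_b = [nearest(cell, dataset_b) for cell in dataset_a]
    b_to_a = [nearest(cell, dataset_a) for cell in dataset_b]
    # Keep a link only if it points the same way from both sides.
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]

# Hypothetical two-gene expression values for three cells.
cell_a, cell_c = (0.0, 0.0), (1.2, 0.0)   # first dataset
cell_b = (1.0, 0.0)                       # second dataset
print(mutual_links([cell_a, cell_c], [cell_b]))  # [(1, 0)]: only C and B are linked
```

Cells that end up with no mutual link, such as cell A here, are left unmatched, which is how dataset-specific cell types stay separate in this toy version.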

Shrinking data, scaling up

To ensure that Scanorama scales to large datasets, the researchers incorporated two optimization techniques. The first reduces the dimensionality of the data. Each cell in a dataset can have up to 20,000 gene expression measurements, and therefore as many dimensions. The researchers used a mathematical technique that summarizes high-dimensional data matrices with a small number of features while retaining vital information. In practice, this led to a 100-fold reduction in dimensions.
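The article does not name the technique, so the following is only an illustrative sketch of one way to summarize a cells-by-genes matrix with a few features: it finds the directions of greatest variance by power iteration and projects each cell onto them. The function names, the iteration count, and the toy data are all assumptions made for the example.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def principal_directions(cells, k, iters=100, seed=0):
    """Find up to k orthonormal directions of greatest variance by power iteration."""
    rng = random.Random(seed)
    n_genes = len(cells[0])
    means = [sum(c[j] for c in cells) / len(cells) for j in range(n_genes)]
    centered = [[c[j] - means[j] for j in range(n_genes)] for c in cells]
    directions = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(n_genes)]
        for _ in range(iters):
            # One power-iteration step: w = X^T (X v).
            scores = [dot(row, v) for row in centered]
            w = [sum(s * row[j] for s, row in zip(scores, centered))
                 for j in range(n_genes)]
            # Remove directions already found so the new one is orthogonal to them.
            for d in directions:
                proj = dot(w, d)
                w = [a - proj * b for a, b in zip(w, d)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:
                return means, directions  # no variance left to explain
            v = [a / norm for a in w]
        directions.append(v)
    return means, directions

def reduce_dims(cells, means, directions):
    """Project each (centered) cell onto the kept directions."""
    return [[dot([x - m for x, m in zip(c, means)], d) for d in directions]
            for c in cells]

# Toy data: three "genes", but all variation lies along a single line.
cells = [[t, 2 * t, 0.0] for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
means, directions = principal_directions(cells, k=1)
reduced = reduce_dims(cells, means, directions)  # one number per cell instead of three
```

On real data with thousands of genes, a library routine for truncated or randomized matrix decomposition would be the practical choice; the pure-Python loop here is only meant to show the idea.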

They also used a popular hashing technique to find nearest mutual neighbors more quickly. Traditionally, computing on even the reduced samples would take hours. The hashing technique instead creates buckets of likely nearest neighbors, ranked by probability. The algorithm need only search the highest-probability buckets to find mutual links, which shrinks the search space and makes the process far less computationally intensive.
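The bucket idea can be sketched with random-hyperplane hashing, one common form of locality-sensitive hashing. The article does not say which hashing scheme Scanorama uses, so this choice, along with the helper names make_hasher, build_buckets, and candidate_neighbors, is an assumption made purely for illustration. Points on the same side of every random hyperplane share a bucket, so a query only has to be compared with the cells in its own bucket.

```python
import random
from collections import defaultdict

def make_hasher(n_dims, n_bits, seed=0):
    """Build a signature function from n_bits random hyperplanes through the origin."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(n_dims)] for _ in range(n_bits)]

    def signature(point):
        # One bit per hyperplane: which side of the plane the point falls on.
        return tuple(sum(p * x for p, x in zip(plane, point)) >= 0
                     for plane in planes)

    return signature

def build_buckets(points, signature):
    """Group point indices by signature; similar directions tend to collide."""
    buckets = defaultdict(list)
    for idx, point in enumerate(points):
        buckets[signature(point)].append(idx)
    return buckets

def candidate_neighbors(query, buckets, signature):
    """Return only the indices sharing the query's bucket, not the whole dataset."""
    return buckets.get(signature(query), [])

signature = make_hasher(n_dims=2, n_bits=4, seed=1)
points = [(1.0, 2.0), (1.0, 2.0), (-1.0, -2.0)]
buckets = build_buckets(points, signature)
candidates = candidate_neighbors((2.0, 4.0), buckets, signature)
```

A single hash table like this can miss true neighbors that fall just across a hyperplane; practical systems typically use several tables or tree-based variants to trade accuracy against speed.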

In separate work, the researchers combined Scanorama with another technique they developed that generates comprehensive samples, or "sketches," of massive cell datasets, which reduced the time it took to combine more than 500,000 cells from two hours to eight minutes. To do so, they generated the "geometric sketches," ran Scanorama on them, and extrapolated what they learned about merging the geometric sketches to the larger datasets. The sketching technique itself derives from compressive genomics, developed by Berger's group.
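The sketch, integrate, and extrapolate workflow can be outlined roughly as follows. This is a loose illustration under several assumptions: the sample is drawn uniformly at random (which is not how geometric sketching selects cells), the per-cell "shifts" stand in for whatever corrections an integration step would learn on the sketch, and each remaining cell simply borrows the shift of its nearest sketch cell. None of the function names come from the researchers' software.

```python
import math
import random

def sketch(cells, size, seed=0):
    """Pick `size` cell indices as a small sample (uniform here, for simplicity)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(cells)), size))

def extrapolate(cells, sketch_idx, sketch_shifts):
    """Apply to every cell the shift learned for its nearest sketch cell."""
    corrected = []
    for cell in cells:
        k = min(range(len(sketch_idx)),
                key=lambda s: math.dist(cell, cells[sketch_idx[s]]))
        corrected.append(tuple(x + d for x, d in zip(cell, sketch_shifts[k])))
    return corrected

# Four cells in two clusters; cells 0 and 2 form the sketch.
cells = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0)]
sketch_idx = [0, 2]
# Hypothetical corrections that integration produced for the two sketch cells.
sketch_shifts = [(1.0, 1.0), (-1.0, -1.0)]
corrected = extrapolate(cells, sketch_idx, sketch_shifts)
```

The expensive integration step runs only on the sketch, while the cheap nearest-neighbor lookup carries its result over to the full dataset, which is the source of the speedup in this simplified picture.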

"Even if you need to sketch, integrate, and reapply this information to the complete datasets, it's still an order of magnitude faster than the combination of entire datasets," says Hie.

More information:
Efficient integration of heterogeneous single-cell transcriptomes using Scanorama, Nature Biotechnology (2019). DOI: 10.1038/s41587-019-0113-3

Provided by
Massachusetts Institute of Technology

Merge cell data sets, panorama style (2019, May 7)
retrieved May 7, 2019
