Sitemap

A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.

Pages

Posts

Future Blog Post

less than 1 minute read

Published:

This post will show up by default. To disable scheduling of future posts, edit _config.yml and set future: false.

Blog Post number 4

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 3

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 2

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 1

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

portfolio

publications

On random weights for texture generation in one layer CNNS

Published in ICASSP, 2017

Recent work in the literature has shown experimentally that one can use the lower layers of a trained convolutional neural network (CNN) to model natural textures. More interestingly, it has also been experimentally shown that only one layer with random filters can also model textures, although with less variability. In this paper we ask why one layer CNNs with random filters are so effective in generating textures. We theoretically show that one layer convolutional architectures (without a non-linearity), paired with an energy function used in previous literature, can in fact preserve and modulate frequency coefficients in a manner so that random weights and pretrained weights will generate the same type of images. Based on the results of this analysis we question whether similar properties hold in the case where one uses one convolution layer with a non-linearity. We show that in the case of a ReLU non-linearity there are situations where only one input will give the minimum possible energy, whereas in the case of no non-linearity there are always infinitely many solutions that will give the minimum possible energy. Thus we can show that in certain situations adding a ReLU non-linearity generates less variable images.

Recommended citation: Mongia, M., Kumar, K., Erraqabi, A. and Bengio, Y., 2017, March. On random weights for texture generation in one layer CNNS. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2207-2211). IEEE. https://ieeexplore.ieee.org/document/7952548

Efficient Database Search via Tensor Distribution Bucketing

Published in Advances in Knowledge Discovery and Data Mining, 2020

In mass spectrometry-based proteomics, one needs to search billions of mass spectra against the human proteome with billions of amino acids, where many of the amino acids go through post-translational modifications. In order to account for novel modifications, we need to search all the spectra against all the peptides using a joint probabilistic model that can be learned from training data. Assuming M spectra and N possible peptides, current state-of-the-art search methods have a runtime of O(MN). Here, we propose a novel bucketing method that sends pairs with high likelihood under the joint probabilistic model to the same bucket with higher probability than those pairs with low likelihood. We demonstrate that the runtime of this method grows sub-linearly with the data size, and our results show that our method is orders of magnitude faster than methods from the locality sensitive hashing literature.

Recommended citation: Mongia, M., Soudry, B., Davoodi, A.G. and Mohimani, H., 2020. Efficient Database Search via Tensor Distribution Bucketing. In Advances in Knowledge Discovery and Data Mining: 24th Pacific-Asia Conference, PAKDD 2020, Singapore, May 11–14, 2020, Proceedings, Part II 24 (pp. 341-353). Springer International Publishing. https://link.springer.com/chapter/10.1007/978-3-030-47436-2_26

ForestDSH: a universal hash design for discrete probability distributions

Published in Data Mining and Knowledge Discovery, 2021

In this paper, we consider the problem of classification of high dimensional queries to high dimensional classes from discrete alphabets where the probabilistic model that relates data to the classes is known. This problem has applications in various fields including the database search problem in mass spectrometry. The problem is analogous to the nearest neighbor search problem, where the goal is to find the data point in a database that is the most similar to a query point. The state of the art method for solving an approximate version of the nearest neighbor search problem in high dimensions is locality sensitive hashing (LSH). LSH is based on designing hash functions that map near points to the same buckets with a probability higher than random (far) points. To solve our high dimensional classification problem, we introduce distribution sensitive hashes that map jointly generated pairs to the same bucket with probability higher than random pairs. We design distribution sensitive hashes using a forest of decision trees and we analytically derive the complexity of search. We further show that the proposed hashes perform faster than state of the art approximate nearest neighbor search methods for a range of probability distributions, in both theory and simulations. Finally, we apply our method to the spectral library search problem in mass spectrometry, and show that it is an order of magnitude faster than the state of the art methods.

Recommended citation: Davoodi, A.G., Chang, S., Yoo, H.G., Baweja, A., Mongia, M. and Mohimani, H., 2021. ForestDSH: a universal hash design for discrete probability distributions. Data Mining and Knowledge Discovery, 35, pp.748-795. https://link.springer.com/article/10.1007/s10618-020-00732-6
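The ForestDSH abstract above describes locality sensitive hashing as mapping near points to the same bucket with higher probability than far points. As a rough illustration of that general idea only, here is a minimal sketch of classic bit-sampling LSH for Hamming distance. It is not the ForestDSH construction, and every function name and parameter below is hypothetical, chosen for this example.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def make_bit_sampling_hash(dim, num_bits, rng):
    """Return a hash that reads `num_bits` randomly chosen coordinates.

    Points that agree on most coordinates are likely to agree on the
    sampled ones, so near points collide more often than far points.
    """
    positions = rng.sample(range(dim), num_bits)
    return lambda point: tuple(point[i] for i in positions)


def build_tables(points, dim, num_tables, num_bits, seed=0):
    """Index `points` in several independent hash tables."""
    rng = random.Random(seed)
    tables = []
    for _ in range(num_tables):
        h = make_bit_sampling_hash(dim, num_bits, rng)
        buckets = defaultdict(list)
        for idx, p in enumerate(points):
            buckets[h(p)].append(idx)
        tables.append((h, buckets))
    return tables


def query(tables, points, q):
    """Return the index of the closest colliding point, or None if no collision."""
    candidates = set()
    for h, buckets in tables:
        candidates.update(buckets.get(h(q), []))
    if not candidates:
        return None
    return min(candidates, key=lambda i: hamming(points[i], q))
```

Only points that share a bucket with the query are compared exactly, which is where the saving over a full scan comes from. Using more tables raises the chance that a true near neighbor collides at least once, at the cost of more memory.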

Repository scale classification and decomposition of tandem mass spectral data

Published in Nature Scientific Reports, 2021

Various studies have shown associations between molecular features and phenotypes of biological samples. These studies, however, focus on a single phenotype per study and are not applicable to repository scale metabolomics data. Here we report MetSummarizer, a method for predicting (i) the biological phenotypes of environmental and host-oriented samples, and (ii) the raw ingredient composition of complex mixtures. We show that the aggregation of various metabolomic datasets can improve the accuracy of predictions. Since these datasets have been collected using different standards at various laboratories, in order to get unbiased results it is crucial to detect and discard standard-specific features during the classification step. We further report high accuracy in prediction of the raw ingredient composition of complex foods from the Global Foodomics Project.

Recommended citation: Mongia, M. and Mohimani, H., 2021. Repository scale classification and decomposition of tandem mass spectral data. Scientific Reports, 11(1), pp.1-8. https://www.nature.com/articles/s41598-021-87796-6

An interpretable machine learning approach to identify mechanism of action of antibiotics

Published in Nature Scientific Reports, 2022

As antibiotic resistance is becoming a major public health problem worldwide, one of the approaches for novel antibiotic discovery is re-purposing drugs available on the market for treating antibiotic resistant bacteria. The main economic advantage of this approach is that since these drugs have already passed all the safety tests, it vastly reduces the overall cost of clinical trials. Recently, several machine learning approaches have been developed for predicting promising antibiotics by training on bioactivity data collected on a set of small molecules. However, these methods report hundreds/thousands of bioactive molecules, and it remains unclear which of these molecules possess a novel mechanism of action. While the cost of high-throughput bioactivity testing has dropped dramatically in recent years, determining the mechanism of action of small molecules remains a costly and time-consuming step, and therefore computational methods for prioritizing molecules with novel mechanisms of action are needed. The existing approaches for predicting bioactivity of small molecules are based on uninterpretable machine learning, and therefore are not capable of determining known mechanism of action of small molecules and prioritizing novel mechanisms. We introduce InterPred, an interpretable technique for predicting bioactivity of small molecules and their mechanism of action. InterPred has the same accuracy as the state of the art in bioactivity prediction, and it enables assigning chemical moieties that are responsible for bioactivity. After analyzing bioactivity data of several thousand molecules against bacterial and fungal pathogens available from Community for Open Antimicrobial Drug Discovery and a US Food and Drug Administration-approved drug library, InterPred identified five known links between moieties and mechanism of action.

Recommended citation: Mongia, Mihir, Mustafa Guler, and Hosein Mohimani. "An interpretable machine learning approach to identify mechanism of action of antibiotics." Scientific Reports 12, no. 1 (2022): 10342. https://www.nature.com/articles/s41598-022-14229-3

Large scale sequence alignment via efficient inference in generative models

Published in Nature Scientific Reports, 2023

Finding alignments between millions of reads and genome sequences is crucial in computational biology. Since the standard alignment algorithm has a large computational cost, heuristics have been developed to speed up this task. Though orders of magnitude faster, these methods lack theoretical guarantees and often have low sensitivity, especially when reads have many insertions, deletions, and mismatches relative to the genome. Here we develop a theoretically principled and efficient algorithm that has high sensitivity across a wide range of insertion, deletion, and mutation rates. We frame sequence alignment as an inference problem in a probabilistic model. Given a reference database of reads and a query read, we find the match that maximizes a log-likelihood ratio of a reference read and query read being generated jointly from a probabilistic model versus independent models. The brute force solution to this problem computes joint and independent probabilities between each query and reference pair, and its complexity grows linearly with database size. We introduce a bucketing strategy where reads with higher log-likelihood ratio are mapped to the same bucket with high probability. Experimental results show that our method is more accurate than the state-of-the-art approaches in aligning long reads from Pacific Biosciences sequencers to genome sequences.

Recommended citation: Mongia, Mihir, Chengze Shen, Arash Gholami Davoodi, Guillaume Marçais, and Hosein Mohimani. "Large scale sequence alignment via efficient inference in generative models." Scientific Reports 13, no. 1 (2023): 7285. https://www.nature.com/articles/s41598-023-34257-x
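The abstract above frames matching as maximizing a log-likelihood ratio between a joint model and independent models, and notes that the brute force solution scores every query and reference pair. A minimal sketch of that brute force baseline follows, under strong simplifying assumptions that are not from the paper: sequences have equal length, positions are independent, and there are no insertions or deletions. The toy substitution model, its `match_prob` parameter, and all function names are hypothetical.

```python
import math


def log_likelihood_ratio(x, y, joint, marginal_x, marginal_y):
    """Sum over positions of log P(x_i, y_i) - log P(x_i) - log P(y_i)."""
    if len(x) != len(y):
        raise ValueError("this toy model assumes equal-length sequences")
    return sum(
        math.log(joint[(a, b)]) - math.log(marginal_x[a]) - math.log(marginal_y[b])
        for a, b in zip(x, y)
    )


def toy_joint(alphabet="ACGT", match_prob=0.85):
    """Hypothetical substitution-only model.

    The first symbol is uniform over the alphabet; the second equals it with
    probability `match_prob` and is otherwise uniform over the other symbols.
    Both marginals are therefore uniform.
    """
    k = len(alphabet)
    joint = {}
    for a in alphabet:
        for b in alphabet:
            cond = match_prob if a == b else (1 - match_prob) / (k - 1)
            joint[(a, b)] = cond / k
    marginal = {a: 1 / k for a in alphabet}
    return joint, marginal, marginal


def best_match(query, references, joint, marginal_x, marginal_y):
    """Brute force: score every reference against the query, keep the best."""
    return max(
        references,
        key=lambda r: log_likelihood_ratio(r, query, joint, marginal_x, marginal_y),
    )
```

The cost of `best_match` grows linearly with the number of references, which is the scaling the abstract's bucketing strategy is meant to avoid.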

AdenPredictor: accurate prediction of the adenylation domain specificity of nonribosomal peptide biosynthetic gene clusters in microbial genomes

Published in Bioinformatics (also presented at ISMB), 2023

Microbial natural products represent a major source of bioactive compounds for drug discovery. Among these molecules, Non-Ribosomal Peptides (NRPs) represent a diverse class that includes antibiotics, immunosuppressants, anticancer agents, toxins, siderophores, pigments, and cytostatics. The discovery of novel NRPs remains a laborious process because many NRPs consist of non-standard amino acids that are assembled by Non-Ribosomal Peptide Synthetases (NRPSs). Adenylation domains (A-domains) in NRPSs are responsible for selection and activation of monomers appearing in NRPs. During the past decade, several support vector machine-based algorithms have been developed for predicting the specificity of the monomers present in NRPs. These algorithms utilize physicochemical features of the amino acids present in the A-domains of NRPSs. In this paper, we benchmarked the performance of various machine learning algorithms and features for predicting specificities of NRPSs and we showed that the extra trees model paired with one-hot encoding features outperforms the existing approaches. Moreover, we show that unsupervised clustering of 453,560 A-domains reveals many clusters that correspond to potentially novel amino acids. While it is challenging to predict the chemical structure of these amino acids, we developed novel techniques to predict their various properties, including polarity, hydrophobicity, charge, and presence of aromatic rings, carboxyl, and hydroxyl groups.

Recommended citation: Mongia, M., Baral, R., Adduri, A., Yan, D., Liu, Y., Bian, Y., ... & Mohimani, H. (2023). AdenPredictor: accurate prediction of the adenylation domain specificity of nonribosomal peptide biosynthetic gene clusters in microbial genomes. Bioinformatics, 39(Supplement_1), i40-i46. https://academic.oup.com/bioinformatics/article/39/Supplement_1/i40/7210450

Fast mass spectrometry search and clustering of untargeted metabolomics data

Published in Nature Biotechnology, 2024

The throughput of mass spectrometers and the amount of publicly available metabolomics data are growing rapidly, but analysis tools such as molecular networking and Mass Spectrometry Search Tool do not scale to searching and clustering billions of mass spectral data in metabolomics repositories. To address this limitation, we designed MASST+ and Networking+, which can process datasets that are up to three orders of magnitude larger than those processed by state-of-the-art tools.

Recommended citation: Mongia, M., Yasaka, T. M., Liu, Y., Guler, M., Lu, L., Bhagwat, A., ... & Mohimani, H. (2024). Fast mass spectrometry search and clustering of untargeted metabolomics data. Nature Biotechnology, 1-6. https://www.nature.com/articles/s41587-023-01985-4

talks

teaching

Analog Circuits Lab

Undergraduate laboratory physics course covering analog circuits, Stanford University, Physics, 2017

Led 12 undergraduate physics majors through a laboratory course on the fundamentals of analog design and of implementing and measuring circuits in physical hardware.

Introduction to Mathematics and Statistics for Scientists

Introduction to Proof Based Mathematics and Statistics for Scientists, Carnegie Mellon University, Computer Science, 2019

Led study and homework sessions for 80 students on proof-based topics in mathematics and statistics. The course is intended for master's students in Computational Biology.

Algorithms and Advanced Data Structures

Theoretical Masters Course in Algorithms for Engineering Students, Carnegie Mellon University, Computer Science, 2020

Led study and homework sessions for 80 students on theoretical topics in algorithms and data structures.