Genomics
AAOT
This is a SAS program that converts an ordinal trait to a quantitative trait and prepares a data set that can be fed into FBAT for association analysis.
Faculty: Heping Zhang, PhD
Download: AAOT package
Platform: SAS
Reference: doi.org (AAOT)
ARMI
Gene expression studies have been playing a critical role in cancer research. Despite tremendous effort, the analysis results are still often unsatisfactory because of weak signals and high data dimensionality. Analysis is often further challenged by the long-tailed distributions of the outcome variables. In recent multidimensional studies, data have been collected on gene expressions as well as their regulators (for example, copy number alterations, methylation, and microRNAs), which can provide additional information on the associations between gene expressions and cancer outcomes. In this study, we develop an ARMI (Assisted Robust Marker Identification) approach for analyzing cancer studies with measurements on gene expressions as well as regulators. The proposed approach borrows information from regulators and can be more effective than analyzing gene expression data alone. A robust objective function is adopted to accommodate long-tailed distributions. Marker identification is effectively realized using penalization. The proposed approach has an intuitive formulation and is computationally affordable.
Faculty: Shuangge Steven Ma, PhD; Yuan Huang, PhD
Download: ARMI package
Platform: R
Reference: doi.org (ARMI)
BAGEL
Bayesian Analysis of Gene Expression Levels (BAGEL) is a program that allows statistical inferences to be made regarding differential gene expression between two or more samples measured on spotted (two-channel) microarrays. BAGEL makes these inferences from normalized ratio data, on a gene-by-gene basis. The advantages of BAGEL include ease of use, straightforward interpretation of results, statistical robustness, flexibility in accepting different experimental designs, and free availability. BAGEL was written by Jeffrey Townsend, who periodically updates and improves the program, and to whom bugs should be reported. BAGEL is available for Windows, Mac OS 9, Mac OS X, and Linux. BAGEL can be downloaded from the Townsend Lab web site, http://www.yale.edu/townsend/software.html.
Faculty: Jeffrey Townsend, PhD
Download: BAGEL package
Platform: Unix
Reference: doi.org (BAGEL)
BV-LDER-GE
BV-LDER-GE harnesses both correlations with additive genetic effects and full LD information to enhance the statistical power to detect genome-scale G×E interactions.
Faculty: Hongyu Zhao, PhD
Download: BV-LDER-GE package
Platform: R
cancereffectsizeR
Welcome to cancereffectsizeR! This R package provides a variety of tools for analyzing somatic variant data and characterizing the evolutionary trajectories of cancers.
Faculty: Jeffrey Townsend, PhD
Download: GitHub / cancereffectsizeR package
Platform: R
Reference: doi.org (cancereffectsizeR)
Composite-trait LDSC
Estimating correlation between composite phenotypes and traits.
Faculty: Hongyu Zhao, PhD
Download: GitHub / Composite-trait LDSC package
Platform: Python
Reference: doi.org (Composite-trait LDSC)
CosGeneGate
CosGeneGate selects multi-functional and credible biomarkers for single-cell analysis.
Faculty: Hongyu Zhao, PhD
Download: GitHub / CosGeneGate package
Platform: Python
Reference: academic.oup.com (CosGeneGate)
cWAS
cWAS is a statistical framework to identify cell types whose genetically regulated proportions are associated with complex diseases.
Faculty: Hongyu Zhao, PhD
Download: GitHub / cWAS package
Platform: R
Reference: journals.plos.org (cWAS)
DENV_pipeline
This pipeline takes raw Illumina read data in the form of fastq files, maps them against provided bed files and then provides a series of outputs for further analysis including consensus sequences. The following pipeline works with ONT data: https://github.com/josephfauver/DENV_MinION_Script
IMPORTANT: the bed files must correspond to the wet lab protocol that you used and to the reference sequence used to generate them; otherwise, the generated sequences will be incorrect.
The pipeline calls an input file as a given virus type if the file has more than 50% coverage of the corresponding reference genome provided. If running on a server, it is highly recommended to run the pipeline inside screen, tmux, or a similar tool.
Faculty: Nathan Grubaugh, PhD; Verity Hill, PhD; Chrispin Chaguza, PhD
Download: GitHub / DENV_pipeline package
Platform: Python
Reference: doi.org (DENV_pipeline)
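The 50% coverage rule described for DENV_pipeline can be illustrated with a minimal sketch. This is not the pipeline's own code: the function names, the `min_depth` parameter, the reference labels, and the toy depth values are all assumptions made for illustration.

```python
def coverage_fraction(depths, min_depth=1):
    """Fraction of reference positions covered by at least min_depth reads."""
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


def call_virus_types(depths_by_reference, threshold=0.5, min_depth=1):
    """Return the reference names whose coverage fraction exceeds the threshold."""
    return [
        name
        for name, depths in depths_by_reference.items()
        if coverage_fraction(depths, min_depth) > threshold
    ]


# Toy per-position read depths for two hypothetical references.
example = {
    "DENV1": [12, 30, 0, 25, 18, 22, 0, 9, 14, 11],  # 8 of 10 positions covered
    "DENV2": [0, 0, 3, 0, 0, 0, 1, 0, 0, 0],         # 2 of 10 positions covered
}
calls = call_virus_types(example)
```

In this toy input only the first reference passes the threshold, so it is the only type called.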
diTARV
diTARV is a tree-based method to explore the association between rare variants and certain human diseases, and to find potential gene-gene interactions. It considers depth importance in the tree model to measure the strength of association of each variant. This program implements the method described in Hu J., Li T., Wang S., and Zhang H. Supervariants Identification for Breast Cancer. Genetic Epidemiology 44(8), 934-947, 2020.
Faculty: Heping Zhang, PhD
Download: diTARV package
Platform: R
Reference: doi.org (diTARV)
dcGSA
Distance-correlation based Gene Set Analysis for longitudinal gene expression profiles. In longitudinal studies, gene expression profiles are collected from each subject at each visit, so each subject has multiple measurements of the gene expression profiles. The dcGSA package can be used to assess the associations between gene sets and clinical outcomes of interest by taking full advantage of the longitudinal nature of both the gene expression profiles and the clinical outcomes.
Faculty: Hongyu Zhao, PhD
Download: Bioconductor / dcGSA Package
Platform: R
Reference: doi.org (dcGSA)
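For readers unfamiliar with the statistic behind dcGSA, the following is a minimal sketch of the ordinary sample distance correlation between two one-dimensional variables. It does not reproduce the package's longitudinal, gene-set-level extension; the function names and toy data are assumptions for illustration.

```python
import math


def _centered_distances(values):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    dcov2_xy = sum(a[i][j] * b[i][j] for i, j in pairs) / n**2
    dvar2_x = sum(a[i][j] ** 2 for i, j in pairs) / n**2
    dvar2_y = sum(b[i][j] ** 2 for i, j in pairs) / n**2
    denom = math.sqrt(dvar2_x * dvar2_y)
    if denom == 0:
        return 0.0
    return math.sqrt(max(dcov2_xy, 0.0) / denom)


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = distance_correlation(x, [2 * v + 1 for v in x])  # exact linear relation
quadratic = distance_correlation(x, [v * v for v in x])   # Pearson correlation is 0 here
```

Unlike Pearson correlation, the statistic stays clearly above zero for the quadratic relationship, which is why it suits nonlinear associations.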
EpiGePT
EpiGePT is a transformer-based model for cross-cell-line prediction of chromatin states that takes a long DNA sequence and a transcription factor profile as inputs. This is a script for reproducing EpiGePT using TensorFlow.
Faculty: Qiao Liu, PhD
Download: Liu Lab / EpiGePT package
Platform: Python
Reference: doi.org (EpiGePT)
GENJI
Estimating genetic correlation jointly using individual-level and summary-level GWAS data.
Faculty: Hongyu Zhao, PhD
Download: GitHub / GENJI package
Platform: Python
Reference: biorxiv.org (GENJI)
GPA
Implements three approaches for gene-environment (G-E) interaction analysis. All of them adopt the Sparse Group Minimax Concave Penalty to identify important G variables and G-E interactions while respecting the hierarchy between main G effects and G-E interaction effects. All three approaches are available for linear, logistic, and Poisson regression. The package also provides tools to mine and construct prior information for G variables and G-E interactions.
Faculty: Hongyu Zhao, PhD
Download: GitHub / GPA Package
Platform: R and RStudio
Reference: doi.org (GPA)
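As background for the penalty named in the GPA description, here is a minimal sketch of the standard scalar minimax concave penalty (MCP). How the GPA package composes it into a sparse group penalty that respects the G and G-E hierarchy is not shown; the function name and parameter values are assumptions for illustration.

```python
def mcp_penalty(t, lam, gamma):
    """Scalar minimax concave penalty with tuning parameter lam and concavity gamma."""
    if lam < 0 or gamma <= 0:
        raise ValueError("lam must be non-negative and gamma must be positive")
    a = abs(t)
    if a <= gamma * lam:
        # Starts like the lasso penalty (slope lam) and flattens as |t| grows.
        return lam * a - a * a / (2.0 * gamma)
    # Beyond gamma * lam the penalty is constant, so large effects are not shrunk further.
    return 0.5 * gamma * lam * lam


small = mcp_penalty(1.0, lam=1.0, gamma=3.0)
at_knot = mcp_penalty(3.0, lam=1.0, gamma=3.0)
large = mcp_penalty(10.0, lam=1.0, gamma=3.0)
```

The flat region for large coefficients is what distinguishes MCP from the lasso penalty, which keeps growing linearly.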
GRAPE
Gene-Ranking Analysis of Pathway Expression (GRAPE) is a tool for summarizing the consensus behavior of biological pathways in the form of a template, and for quantifying the extent to which individual samples deviate from the template. GRAPE templates are based only on the relative rankings of the genes within the pathway and can be used for classification of tissue types or disease subtypes. GRAPE can be used to represent gene-expression samples as vectors of pathway scores, where each pathway score indicates the departure from a given collection of reference samples. The resulting pathway-space representation can be used as the feature set for various applications, including survival analysis and drug-response prediction.
Faculty: Hongyu Zhao, PhD
Download: Cran R / GRAPE Package
Platform: R
Reference: doi.org (GRAPE)
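The idea of a ranking-based template in the GRAPE description can be sketched as follows. This is a simplified illustration, not the package's algorithm: the majority-vote template and the unweighted disagreement fraction used as the score are assumptions, as are the function names and toy expression values.

```python
from itertools import combinations


def pairwise_orderings(expression):
    """For every gene pair (i, j) with i < j: 1 if gene i is expressed below gene j, else 0."""
    indices = range(len(expression))
    return [1 if expression[i] < expression[j] else 0 for i, j in combinations(indices, 2)]


def build_template(reference_samples):
    """Majority-vote ordering of each gene pair across the reference samples."""
    orderings = [pairwise_orderings(sample) for sample in reference_samples]
    n = len(orderings)
    return [1 if 2 * sum(column) > n else 0 for column in zip(*orderings)]


def pathway_score(sample, template):
    """Fraction of gene pairs whose ordering in the sample disagrees with the template."""
    if not template:
        return 0.0
    observed = pairwise_orderings(sample)
    return sum(o != t for o, t in zip(observed, template)) / len(template)


# Toy pathway with three genes measured in three reference samples.
references = [[1.0, 2.0, 3.0], [1.5, 2.5, 2.0], [0.5, 3.0, 4.0]]
template = build_template(references)
reversed_score = pathway_score([3.0, 2.0, 1.0], template)
partial_score = pathway_score([1.0, 3.0, 2.0], template)
```

Because only relative rankings enter the calculation, the scores are unchanged by any monotone rescaling of a sample's expression values.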
HapForest
This program implements a forest-based approach that accommodates haplotype uncertainty and uses variable importance to identify significant haplotypes and their interactions in genome-wide case-control association studies.
Faculty: Heping Zhang, PhD
Download: HapForest package
Platform: Java
Reference: doi.org (HapForest)
ITEB
Iterated & truncated empirical Bayes for strong signal detection (ITEB) is a modified two-group model where the null group corresponds to genes which are not direct targets, but can have small non-zero effects.
Faculty: Leying Guan, PhD
Download: GitHub / ITEB package
Platform: R
Reference: doi.org (ITEB)
iVar
iVar is a computational package that contains functions broadly useful for viral amplicon-based sequencing. Additional tools for metagenomic sequencing are actively being incorporated into iVar. While each of these functions can be accomplished using existing tools, iVar contains an intersection of functionality from multiple tools that are required to call iSNVs and consensus sequences from viral sequencing data across multiple replicates. We implemented the following functions in iVar: (1) trimming of primers and low-quality bases, (2) consensus calling, (3) variant calling - both iSNVs and insertions/deletions, and (4) identifying mismatches to primer sequences and excluding the corresponding reads from alignment files.
Faculty: Nathan Grubaugh, PhD
Download: GitHub / iVar package
Platform: Other
Reference: doi.org (iVar)
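Consensus calling, one of the iVar functions listed above, can be illustrated with a deliberately simplified sketch. It is not iVar's implementation and does not use its command-line options: the pileup representation, the `min_depth` and `min_freq` parameters, and their default values are assumptions for illustration.

```python
def call_consensus(pileup, min_depth=10, min_freq=0.5):
    """Call one base per position from base counts; emit 'N' when support is too weak."""
    consensus = []
    for counts in pileup:
        depth = sum(counts.values())
        if depth < min_depth:
            consensus.append("N")  # not enough reads to make a call
            continue
        base, count = max(counts.items(), key=lambda item: item[1])
        consensus.append(base if count / depth >= min_freq else "N")
    return "".join(consensus)


# Toy pileup: one dictionary of base counts per reference position.
pileup = [
    {"A": 20, "G": 1},          # clear majority
    {"C": 4},                   # depth below min_depth
    {"T": 5, "C": 4, "A": 3},   # no base reaches min_freq
    {"G": 15},                  # single supported base
]
sequence = call_consensus(pileup)
```

A production tool would also handle insertions, deletions, base qualities, and ambiguity codes, none of which are modeled here.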
LGEWIS
Functions for genome-wide association studies (GWAS) and gene-environment-wide interaction studies (GEWIS) with longitudinal outcomes and exposures. The methods are described in He et al. (2017), "Set-Based Tests for Gene-Environment Interaction in Longitudinal Studies," and He et al. (2017), "Rare-variant association tests in longitudinal studies, with an application to the Multi-Ethnic Study of Atherosclerosis (MESA)."
Faculty: Bhramar Mukherjee, PhD
Download: Cran R / LGEWIS package
Platform: R
Reference: doi.org (LGEWIS)
LOT
This program performs linkage analysis of ordinal traits for pedigree data. It implements a latent-variable proportional-odds logistic model that relates inheritance patterns to the distribution of the ordinal trait.
Faculty: Heping Zhang, PhD
Download: LOT package
Platform: Java
Reference: doi.org (LOT)
LOX
Implementation of LOX in R. LOX uses Markov Chain Monte Carlo to estimate Levels Of gene eXpression from high-throughput expressed sequence data sets with multiple treatments or samples. LOXR can be used to run LOX from the R console, as well as to parse and plot LOX's output.
Faculty: Jeffrey Townsend, PhD
Download: LOX package
Platform: R
MACML
MACML is a program that clusters sequences into heterogeneous regions with specific site types, without requiring any prior knowledge such as cluster count or length. It features maximum likelihood estimation, model selection, and model averaging, and adopts a divide-and-conquer approach to hierarchically detect clusters within sequences.
Faculty: Jeffrey Townsend, PhD
Download: MACML package
Platform: R
Reference: doi.org (MACML)
MASS-PRF
MASS-PRF (Model Averaged Site Selection with Poisson Random Field theory): A program that quantifies heterogeneity of selection intensity across sites within coding sequences by using polymorphism and divergence data.
Faculty: Jeffrey Townsend, PhD; Michael Cappello, MD
Download: GitHub / MASS-PRF package
Platform: C++
Reference: doi.org (MASS-PRF)
massprf-protein-coloring
Maps the output of massprf into a file that can be used in Chimera to color proteins. Execute source("batchMASSPRF_To_Chimera.R") from R, which creates the batchMASSPRF_Chimera command in your local environment.
Faculty: Jeffrey Townsend, PhD
Download: GitHub / massprf-protein-coloring package
Platform: R
modSaRa
The modified Screening and Ranking algorithm (modSaRa) can detect chromosome copy number variants with high sensitivity and specificity. For a sequence of intensity values, modSaRa applies quantile normalization, searches for change-point candidates, eliminates unlikely change-points, and then outputs the potential CNV segments, reporting the start point and end point of each segment by SNP or CNV marker index.
Faculty: Heping Zhang, PhD
Download: modSaRa package
Platform: R
Reference: doi.org (modSaRa)
modSaRa2
Although it has been shown that the widely used change-point based methods can increase statistical power to identify variants, it remains challenging to effectively identify CNVs with weak signals due to the noisy nature of genotyping intensity data. modSaRa2 improves on our previously developed modified Screening and Ranking algorithm (modSaRa) by integrating the relative allelic intensity with prior empirical statistics. modSaRa2 markedly improves both sensitivity and specificity over existing methods. The improvement is most substantial for detecting weak CNV signals, and stability also improves when CNV size varies.
Faculty: Heping Zhang, PhD
Download: modSaRa2 package
Platform: R
Reference: doi.org (modSaRa2)
multiSaRa
This enhanced Screening and Ranking algorithm can detect chromosome copy number variants in multiple sequences based on the method introduced by Song, Min, and Zhang (Annals of Applied Statistics, to appear).
Faculty: Heping Zhang, PhD
Download: multiSaRa package
Platform: R
MuSe-GNN
Learning Unified Gene Representation From Multimodal Biological Graph Data.
Faculty: Hongyu Zhao, PhD
Download: GitHub / MuSe-GNN package
Platform: Python
Reference: proceedings.neurips.cc (MuSe-GNN)
Pathscore
PathScore quantifies the level of enrichment of somatic mutations within curated pathways, applying a novel approach that identifies pathways enriched across patients. The application provides several user-friendly, interactive graphic interfaces for data exploration, including tools for comparing pathway effect sizes, significance, gene-set overlap and enrichment differences between projects.
Faculty: Jeffrey Townsend, PhD
Download: GitHub / Pathscore package
Platform: Web app
Reference: doi.org (Pathscore)
PhyDesign
The PhyDesign web application consists of 3 components:
1) A form to upload information and choose an application to calculate the evolutionary rates for each alignment site,
2) A table listing the site rates produced by the analysis, including links for download, and
3) A user-friendly graphical interface to plot the phylogenetic profiles and calculate integration values. This latter component produces profiles of phylogenetic informativeness, calculates net and relative (per site) informativeness over specified time intervals, and offers the ability to integrate informativeness over epochs of interest.
Faculty: Jeffrey Townsend, PhD
Download: PhyDesign package
Platform: Web app
Reference: doi.org (PhyDesign)
PhyInformR
Enables rapid calculation of phylogenetic information content using the latest advances in phylogenetic informativeness based theory. These advances include modifications that incorporate uneven branch lengths and any model of nucleotide substitution to provide assessments of the phylogenetic utility of any given dataset or dataset partition. Also provides new tools for data visualization and routines optimized for rapid statistical calculations, including approaches making use of Bayesian posterior distributions and parallel processing. Users can apply these approaches toward screening datasets for phylogenetic/genomic information content.
Faculty: Jeffrey Townsend, PhD
Download: Cran R / PhyInformR package
Platform: R
Reference: doi.org (PhyInformR)
PRSweb
To facilitate scientific collaboration on polygenic risk scores (PRSs) research, we created an extensive PRS online repository for 35 common cancer traits integrating freely available genome-wide association studies (GWASs) summary statistics from three sources: published GWASs, the NHGRI-EBI GWAS Catalog, and UK Biobank-based GWASs. Our framework condenses these summary statistics into PRSs using various approaches such as linkage disequilibrium pruning/p value thresholding (fixed or data-adaptively optimized thresholds) and penalized, genome-wide effect size weighting. We evaluated the PRSs in two biobanks: the Michigan Genomics Initiative (MGI), a longitudinal biorepository effort at Michigan Medicine, and the population-based UK Biobank (UKB). For each PRS construct, we provide measures on predictive performance and discrimination. Besides PRS evaluation, the Cancer-PRSweb platform features construct downloads and phenome-wide PRS association study results (PRS-PheWAS) for predictive PRSs. We expect this integrated platform to accelerate PRS-related cancer research.
Faculty: Bhramar Mukherjee, PhD
Download: PRSweb package
Platform: R Shiny
Reference: doi.org (PRSweb)
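The p-value thresholding approach mentioned in the PRSweb description can be sketched as a weighted sum of allele dosages over variants that pass a significance cutoff. This sketch omits linkage disequilibrium pruning and the penalized weighting methods; the function name, variant labels, effect sizes, and p-values are all made up for illustration.

```python
def polygenic_risk_score(dosages, effects, p_values, p_threshold=5e-8):
    """Sum of effect size times allele dosage over variants with p-value <= p_threshold."""
    score = 0.0
    for variant, beta in effects.items():
        if variant in dosages and p_values[variant] <= p_threshold:
            score += beta * dosages[variant]
    return score


# Hypothetical summary statistics and one individual's allele dosages (0, 1, or 2).
effects = {"variant_1": 0.2, "variant_2": -0.1, "variant_3": 0.5}
p_values = {"variant_1": 1e-9, "variant_2": 1e-3, "variant_3": 2e-8}
dosages = {"variant_1": 2, "variant_2": 1, "variant_3": 1}

strict_score = polygenic_risk_score(dosages, effects, p_values)
relaxed_score = polygenic_risk_score(dosages, effects, p_values, p_threshold=0.01)
```

Relaxing the threshold admits the second variant, whose negative effect lowers the score, which shows why the threshold is treated as a tuning choice.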
P0PCorNS
P0PCorNS (Perturbation to 0 to Predict Correlated Network Stability) performs informative gene knockouts (one gene is knocked out in silico at a time), uses BNW (Ziebarth et al., Bioinformatics, 2013) to generate the Bayesian networks, and then calculates the Jensen-Shannon divergence (JSD) between networks. Finally, P0PCorNS provides an ordered list of gene impacts. A gene that exhibits a higher informative impact is expected to play a more important role in the gene regulatory network (potentially functioning further upstream in a linear regulatory order or acting as the hub gene) and is ranked higher for gene manipulation verification experiments.
Faculty: Jeffrey Townsend, PhD; Meng Liu, PhD; Zheng Wang, PhD
Download: GitHub / P0PCorNS package
Platform: R
Reference: doi.org (P0PCorNS)
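The Jensen-Shannon divergence used by P0PCorNS to compare networks can be sketched for two discrete probability distributions. How P0PCorNS derives distributions from the Bayesian networks is not described here, so the toy distributions, gene labels, and function names below are assumptions for illustration.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two discrete distributions, between 0 and 1 in bits."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical baseline distribution and distributions after each in silico knockout.
baseline = [0.5, 0.3, 0.2]
knockouts = {
    "geneA": [0.5, 0.3, 0.2],  # knockout changes nothing
    "geneB": [0.1, 0.1, 0.8],  # knockout changes the distribution strongly
    "geneC": [0.4, 0.4, 0.2],  # knockout changes the distribution slightly
}
ranked = sorted(
    knockouts,
    key=lambda gene: jensen_shannon_divergence(baseline, knockouts[gene]),
    reverse=True,
)
```

Sorting by divergence gives the kind of ordered impact list the description refers to, with the most disruptive knockout first.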
RAS
Genome-wide association studies (GWAS) are crucial for identifying numerous single nucleotide polymorphisms (SNPs) linked to various diseases. However, current methods struggle with regional associations due to small effects and the high number of variants, leading to suboptimal power and inflated type I error. To tackle these challenges, we propose a powerful and visualizable method which quantifies regional association strengths at individual SNPs, converts these into time series data, and uses change point detection algorithms to identify key association regions. Extensive simulations demonstrate that our method not only increases detection power but also maintains a significantly lower false positive rate compared to existing techniques, positioning it as a promising tool for regional association detection in GWAS.
Faculty: Heping Zhang, PhD; Yiran Jiang, PhD
Download: GitHub / RAS package
Platform: R
Reference: doi.org (RAS)
ResPAN
A powerful batch correction model for scRNA-seq data through residual adversarial networks.
Faculty: Hongyu Zhao, PhD
Download: GitHub / ResPAN package
Platform: Python
Reference: doi.org (ResPAN)
RTREE
Program that analyzes relative risk and conducts sib pair linkage analysis using tree-based methods. This program can be executed to automatically generate a tree structure or allow the user to construct a tree of his or her choice.
Faculty: Heping Zhang, PhD
Download: RTREE package
Platform: Unix
Reference: doi.org (RTREE)
scAAnet
Non-linear archetypal analysis of single-cell RNA-seq data by deep autoencoders.
Faculty: Hongyu Zhao, PhD
Download: GitHub / scAAnet package
Platform: Python
Reference: doi.org (scAAnet)
scDEC
scDEC is a computational tool for single cell ATAC-seq data analysis with deep generative neural networks. scDEC enables simultaneously learning the deep embedding and clustering of the cells in an unsupervised manner. scDEC is also applicable to multi-modal single cell data. We tested it on the PBMC paired data (scRNA-seq and scATAC-seq) from 10x Genomics (see Tutorials).
Faculty: Qiao Liu, PhD
Download: Liu Lab / scDEC package
Platform: Python
Reference: doi.org (scDEC)
SeroCall
SeroCall can identify and quantitate the capsular serotypes in Illumina whole-genome sequencing samples of S. pneumoniae, calculating abundances of each serotype in mixed cultures. The software is written in Python (compatible with Python 2 or 3), and is freely available under an open source GPLv3 license.
Faculty: James Knight, PhD; Daniel Weinberger, PhD
Download: GitHub / SeroCall package
Platform: Python
Reference: doi.org (SeroCall)
SeqPop
SeqPop: A program for computing population genetic statistics on sequence data, including Pn, Theta, Pi(i,j), Kst(*), Fst(*), and their Monte Carlo significance for population subdivision.
Faculty: Jeffrey Townsend, PhD
Download: SeqPop package
Platform: Unix
SUPERGNOVA
Local genetic correlation analysis reveals heterogeneous etiologic sharing of complex traits.
Faculty: Hongyu Zhao, PhD
Download: GitHub / SUPERGNOVA package
Platform: Python
Reference: genomebiology.biomedcentral.com (SUPERGNOVA)
SaRa
The Screening and Ranking algorithm can detect chromosome copy number variants quickly and accurately, with O(n) computational complexity. This program implements the methods described in: Niu and Zhang. The screening and ranking algorithm to detect DNA copy number variations. Ann. Appl. Stat. 6, 1306-1326 (2012). Hao, Niu and Zhang. Multiple change-point detection via a screening and ranking algorithm. Statistica Sinica 23 (2013).
Faculty: Heping Zhang, PhD
Download: SaRa package
Platform: R
Reference: doi.org (SaRa)
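The screening step of a screening-and-ranking change-point search, as named in the SaRa description, can be sketched with a local diagnostic that compares the means on either side of each position. This is a simplified illustration under stated assumptions, not the SaRa package: the window size `h`, the fixed `threshold`, the function names, and the toy signal are hypothetical, and the published ranking and selection steps are not reproduced.

```python
def local_diagnostic(y, h):
    """d[x] = |mean(y[x:x+h]) - mean(y[x-h:x])|, computed in O(n) with prefix sums."""
    n = len(y)
    prefix = [0.0]
    for value in y:
        prefix.append(prefix[-1] + value)
    d = [0.0] * n
    for x in range(h, n - h + 1):
        right_mean = (prefix[x + h] - prefix[x]) / h
        left_mean = (prefix[x] - prefix[x - h]) / h
        d[x] = abs(right_mean - left_mean)
    return d


def screen_change_points(y, h, threshold):
    """Keep positions that maximize the diagnostic within their h-neighborhood and exceed the threshold."""
    d = local_diagnostic(y, h)
    n = len(y)
    candidates = []
    for x in range(h, n - h + 1):
        # This naive neighborhood scan costs O(h) per position; it is kept simple for clarity.
        window = d[max(0, x - h + 1):min(n, x + h)]
        if d[x] > threshold and d[x] == max(window):
            candidates.append(x)
    return candidates


# Noise-free toy signal with a raised segment between indices 30 and 59.
signal = [0.0] * 30 + [2.0] * 30 + [0.0] * 30
change_points = screen_change_points(signal, h=10, threshold=1.0)
```

Each reported index is the first position of a new segment, so the raised segment is recovered as starting at 30 and ending just before 60.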
simuRare
simuRare is a regression-based algorithm that imputes rare variants in currently available SNP array data, and performs a resampling approach to simulate samples that contain both common and rare SNPs.
Faculty: Heping Zhang, PhD
Download: simuRare package
Platform: R
Reference: doi.org (simuRare)
SSSS
This package provides a fast nonparametric method for short segment detection.
Faculty: Heping Zhang, PhD
Download: SSSS package
Platform: R
STB-STC
STB-STC is a statistical method that can identify joint effects of microbes in human disease, accounting for sparsity and using the hierarchical information of taxonomy annotation. STB and STC yield better detection performance than state-of-the-art differential abundance analysis approaches in situations where microbes are highly correlated. We distribute one core R function (SVB-SVC.R) related to STB and STC. It performs the method (STB or STC) on a group of microbes. The script utilizes.R includes all the code needed to run SVB-SVC.R.
Faculty: Heping Zhang, PhD
Download: STB-STC package
Platform: R
subgxe
Classical methods for combining summary data from genome-wide association studies (GWAS) only use marginal genetic effects, and power can be compromised in the presence of heterogeneity. 'subgxe' is an R package that implements p-value assisted subset testing for association (pASTA), a method developed by Yu et al. (2019) <doi:10.1159/000496867>. pASTA generalizes association analysis based on subsets by incorporating gene-environment interactions into the testing procedure.
Faculty: Bhramar Mukherjee, PhD
Download: Cran R / subgxe package
Platform: R
Reference: doi.org (subgxe)
TARV
TARV is a tree-based method to explore the association between rare variants and complex diseases, and find potential genetic and environmental factors and their interactions.
Faculty: Heping Zhang, PhD
Download: TARV package
Platform: R
Reference: doi.org (TARV)
T-GEN
T-GEN (Transcriptome-mediated identification of disease-associated Genes with Epigenetic aNnotation) is a framework to identify disease-associated genes leveraging epigenetic information.
Faculty: Hongyu Zhao, PhD
Download: GitHub / T-GEN package
Platform: R
Reference: journals.plos.org (T-GEN)
Twin Analysis
This program uses SAS PROC NLMIXED and PROC MIXED to conduct twin analysis to estimate heritability of binary and quantitative traits.
Faculty: Heping Zhang, PhD
Download: Twin Analysis package
Platform: SAS
Reference: doi.org (Twin Analysis)
Willows
Willows is a software package that includes three classifiers: classification tree, random forest, and deterministic forest. This package is built on the basis of Heping Zhang's RTREE program with two distinctive features. First, the accumulation of data on single nucleotide polymorphisms (SNPs) has created data sets so large that specific steps are needed to improve the memory use of the existing software. Willows implements the most efficient memory use for SNP data, while maintaining its general functionality. The second important feature of Willows is a friendly graphical user interface.
Faculty: Heping Zhang, PhD
Download: Willows package
Platform: Unix
Reference: doi.org (Willows)