Genomics

AAOT

This is a SAS program that converts an ordinal trait to a quantitative trait and prepares a data set that can be fed into FBAT for association analysis.

Faculty: Heping Zhang, PhD

Download: AAOT package

Platform: SAS

Reference: doi.org (AAOT)

ARMI

Gene expression studies have been playing a critical role in cancer research. Despite tremendous effort, the analysis results are still often unsatisfactory because of weak signals and high data dimensionality. Analysis is often further challenged by the long-tailed distributions of the outcome variables. In recent multidimensional studies, data have been collected on gene expressions as well as their regulators (for example, copy number alterations, methylation, and microRNAs), which can provide additional information on the associations between gene expressions and cancer outcomes. In this study, we develop an ARMI (Assisted Robust Marker Identification) approach for analyzing cancer studies with measurements on gene expressions as well as regulators. The proposed approach borrows information from regulators and can be more effective than analyzing gene expression data alone. A robust objective function is adopted to accommodate long-tailed distributions. Marker identification is effectively realized using penalization. The proposed approach has an intuitive formulation and is computationally affordable.

Faculty: Shuangge Steven Ma, PhD; Yuan Huang, PhD

Download: ARMI package

Platform: R

Reference: doi.org (ARMI)

BAGEL

Bayesian Analysis of Gene Expression Levels (BAGEL) is a program that allows statistical inferences to be made regarding differential gene expression between two or more samples measured on spotted (two-channel) microarrays. BAGEL makes these inferences from normalized ratio data, on a gene-by-gene basis. The advantages of BAGEL include ease of use, straightforward interpretation of results, statistical robustness, flexibility in accepting different experimental designs, and free availability. BAGEL was written by Jeffrey Townsend, who periodically updates and improves the program, and to whom bugs should be reported. BAGEL is available for Windows, Mac OS9, Mac OSX, and Linux. BAGEL can be downloaded from the Townsend Lab web site, http://www.yale.edu/townsend/software.html.

Faculty: Jeffrey Townsend, PhD

Download: BAGEL package

Platform: Unix

Reference: doi.org (BAGEL)

BV-LDER-GE

BV-LDER-GE harnesses both correlations with additive genetic effects and full LD information to enhance the statistical power to detect genome-scale G×E interactions.

Faculty: Hongyu Zhao, PhD

Download: BV-LDER-GE package

Platform: R

cancereffectsizeR

Welcome to cancereffectsizeR! This R package provides a variety of tools for analyzing somatic variant data and characterizing the evolutionary trajectories of cancers.

Faculty: Jeffrey Townsend, PhD

Download: GitHub / cancereffectsizeR package

Platform: R

Reference: doi.org (cancereffectsizeR)

Composite-trait LDSC

Estimating correlation between composite phenotypes and traits.

Faculty: Hongyu Zhao, PhD

Download: GitHub / Composite-trait LDSC package

Platform: Python

Reference: doi.org (Composite-trait LDSC)

CosGeneGate

CosGeneGate selects multi-functional and credible biomarkers for single-cell analysis.

Faculty: Hongyu Zhao, PhD

Download: GitHub / CosGeneGate package

Platform: Python

Reference: academic.oup.com (CosGeneGate)

cWAS

cWAS is a statistical framework to identify cell types whose genetically regulated proportions are associated with complex diseases.

Faculty: Hongyu Zhao, PhD

Download: GitHub / cWAS package

Platform: R

Reference: journals.plos.org (cWAS)

DENV_pipeline

This pipeline takes raw Illumina read data in the form of fastq files, maps them against the provided bed files, and then produces a series of outputs for further analysis, including consensus sequences. The following pipeline works with ONT data: https://github.com/josephfauver/DENV_MinION_Script

IMPORTANT: the bed files must correspond to the wet lab protocol that you used and to the reference sequence used to generate them; otherwise, the generated sequences will be incorrect.

It calls an input file as a virus type if it has more than 50% coverage of the provided reference genome. If running on a server, it is highly recommended to run it using screen/tmux or similar.
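The 50% coverage rule can be illustrated with a minimal Python sketch. This is not the pipeline's own code: the function names are hypothetical, and it assumes that coverage is measured as the fraction of consensus positions holding a called base (A, C, G, or T) rather than N or a gap.

```python
def coverage_fraction(consensus: str) -> float:
    """Fraction of positions with a called base (assumed definition of coverage)."""
    if not consensus:
        return 0.0
    called = sum(1 for base in consensus.upper() if base in "ACGT")
    return called / len(consensus)


def call_virus_types(consensus_by_reference: dict, threshold: float = 0.5) -> list:
    """Return the reference names whose consensus exceeds the coverage threshold."""
    return sorted(
        name
        for name, sequence in consensus_by_reference.items()
        if coverage_fraction(sequence) > threshold
    )


# Hypothetical consensus sequences, one per reference genome.
example = {
    "DENV1": "ACGTNNACGT",  # 8 of 10 positions called
    "DENV2": "NNNNNNNACG",  # 3 of 10 positions called
}
calls = call_virus_types(example)
```

With the strict "more than 50%" comparison used here, a sample at exactly 50% coverage is not called.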

Faculty: Nathan Grubaugh, PhD; Verity Hill, PhD; Chrispin Chaguza, PhD

Download: GitHub / DENV_pipeline package

Platform: Python

Reference: doi.org (DENV_pipeline)

diTARV

diTARV is a tree-based method to explore the association between rare variants and certain human diseases, and to find potential gene-gene interactions. It considers depth importance in the tree model to measure the strength of association of each variant. This program implements the method described in Hu J., Li T., Wang S., and Zhang H. Supervariants Identification for Breast Cancer, Genetic Epidemiology 44(8), 934-947, 2020.

Faculty: Heping Zhang, PhD

Download: diTARV package

Platform: R

Reference: doi.org (diTARV)

dcGSA

Distance-correlation based Gene Set Analysis for longitudinal gene expression profiles. In longitudinal studies, gene expression profiles are collected at each visit from each subject, so there are multiple measurements of the gene expression profiles for each subject. The dcGSA package can be used to assess the associations between gene sets and clinical outcomes of interest by taking full advantage of the longitudinal nature of both the gene expression profiles and the clinical outcomes.

Faculty: Hongyu Zhao, PhD

Download: Bioconductor / dcGSA package

Platform: R

Reference: doi.org (dcGSA)

EpiGePT

EpiGePT is a transformer-based model for cross-cell-line prediction of chromatin states by taking long DNA sequence and transcription factor profile as inputs. This is a script for reproducing EpiGePT using TensorFlow.

Faculty: Qiao Liu, PhD

Download: Liu Lab / EpiGePT package

Platform: Python

Reference: doi.org (EpiGePT)

GENJI

Estimating genetic correlation jointly using individual-level and summary-level GWAS data.

Faculty: Hongyu Zhao, PhD

Download: GitHub / GENJI package

Platform: Python

Reference: biorxiv.org (GENJI)

GPA

GPA implements three approaches for gene-environment (G-E) interaction analysis. All of them adopt the Sparse Group Minimax Concave Penalty to identify important G variables and G-E interactions while respecting the hierarchy between main G effects and G-E interaction effects. All three approaches are available for linear, logistic, and Poisson regression. The package also provides tools to mine and construct prior information for G variables and G-E interactions.

Faculty: Hongyu Zhao, PhD

Download: GitHub / GPA package

Platform: R and RStudio

Reference: doi.org (GPA)

GRAPE

Gene-Ranking Analysis of Pathway Expression (GRAPE) is a tool for summarizing the consensus behavior of biological pathways in the form of a template, and for quantifying the extent to which individual samples deviate from the template. GRAPE templates are based only on the relative rankings of the genes within the pathway and can be used for classification of tissue types or disease subtypes. GRAPE can be used to represent gene-expression samples as vectors of pathway scores, where each pathway score indicates the departure from a given collection of reference samples. The resulting pathway-space representation can be used as the feature set for various applications, including survival analysis and drug-response prediction.

Faculty: Hongyu Zhao, PhD

Download: Cran R / GRAPE package

Platform: R

Reference: doi.org (GRAPE)

HapForest

This program implements a forest-based approach that accommodates haplotype uncertainties and uses variable importance to sort out significant haplotypes and their interactions in genome-wide case-control association studies.

Faculty: Heping Zhang, PhD

Download: HapForest package

Platform: Java

Reference: doi.org (HapForest)

ITEB

Iterated & truncated empirical Bayes for strong signal detection (ITEB) is a modified two-group model where the null group corresponds to genes that are not direct targets but can have small non-zero effects.

Faculty: Leying Guan, PhD

Download: GitHub / ITEB package

Platform: R

Reference: doi.org (ITEB)

iVar

iVar is a computational package that contains functions broadly useful for viral amplicon-based sequencing. Additional tools for metagenomic sequencing are actively being incorporated into iVar. While each of these functions can be accomplished using existing tools, iVar contains an intersection of functionality from multiple tools that are required to call iSNVs and consensus sequences from viral sequencing data across multiple replicates. We implemented the following functions in iVar: (1) trimming of primers and low-quality bases, (2) consensus calling, (3) variant calling - both iSNVs and insertions/deletions, and (4) identifying mismatches to primer sequences and excluding the corresponding reads from alignment files.

Faculty: Nathan Grubaugh, PhD

Download: GitHub / iVar package

Platform: Other

Reference: doi.org (iVar)

LGEWIS

Functions for genome-wide association studies (GWAS)/gene-environment-wide interaction studies (GEWIS) with longitudinal outcomes and exposures. He et al. (2017) "Set-Based Tests for Gene-Environment Interaction in Longitudinal Studies" and He et al. (2017) "Rare-variant association tests in longitudinal studies, with an application to the Multi-Ethnic Study of Atherosclerosis (MESA)".

Faculty: Bhramar Mukherjee, PhD

Download: Cran R / LGEWIS package

Platform: R

Reference: doi.org (LGEWIS)

LOT

This program performs linkage analysis of ordinal traits for pedigree data. It implements a latent-variable proportional-odds logistic model that relates inheritance patterns to the distribution of the ordinal trait.

Faculty: Heping Zhang, PhD

Download: LOT package

Platform: Java

Reference: doi.org (LOT)

LOX

Implementation of LOX in R. LOX uses Markov Chain Monte Carlo to estimate Levels Of gene eXpression from high-throughput expressed sequence data sets with multiple treatments or samples. LOXR can be used to run LOX from the R console, as well as to parse and plot LOX's output.

Faculty: Jeffrey Townsend, PhD

Download: LOX package

Platform: R

MACML

MACML is a program that clusters sequences into heterogeneous regions with specific site types, without requiring any prior knowledge, such as cluster count or length, etc. It features maximum likelihood estimation, model selection and model averaging and adopts a divide-and-conquer approach to hierarchically detect clusters within sequences.

Faculty: Jeffrey Townsend, PhD

Download: MACML package

Platform: R

Reference: doi.org (MACML)

MASS-PRF

MASS-PRF (Model Averaged Site Selection with Poisson Random Field theory): A program that quantifies heterogeneity of selection intensity across sites within coding sequences by using polymorphism and divergence data.

Faculty: Jeffrey Townsend, PhD; Michael Cappello, MD

Download: GitHub / MASS-PRF package

Platform: C++

Reference: doi.org (MASS-PRF)

massprf-protein-coloring

Maps the output of MASS-PRF into a file that can be used in Chimera to color proteins. Execute source("batchMASSPRF_To_Chimera.R") from R, which creates the batchMASSPRF_Chimera command in your local environment.

Faculty: Jeffrey Townsend, PhD

Download: GitHub / massprf-protein-coloring package

Platform: R

modSaRa

The modified Screening and Ranking algorithm (modSaRa) can detect chromosome copy number variants with high sensitivity and specificity. For a sequence of intensity values, the modified SaRa will process it by quantile normalization, search for change-point candidates, eliminate unlikely change-points, and then output the potential CNV segments by presenting the start point and end point by SNP or CNV marker index.

Faculty: Heping Zhang, PhD

Download: modSaRa package

Platform: R

Reference: doi.org (modSaRa)

modSaRa2

Although it has been shown that the widely used change-point based methods can increase statistical power to identify variants, it remains challenging to effectively identify CNVs with weak signals due to the noisy nature of genotyping intensity data. modSaRa2 is a novel improvement of our previously developed method modified Screening and Ranking algorithm (modSaRa) by integrating the relative allelic intensity with prior empirical statistics. modSaRa2 markedly improved both sensitivity and specificity over existing methods. The improvement for detecting weak CNV signals is the most substantial, while simultaneously improving stability when CNV size varies.

Faculty: Heping Zhang, PhD

Download: modSaRa2 package

Platform: R

Reference: doi.org (modSaRa2)

multiSaRa

This enhanced Screening and Ranking algorithm can detect chromosome copy number variants in multiple sequences based on the method introduced by Song, Min, and Zhang (Annals of Applied Statistics, to appear).

Faculty: Heping Zhang, PhD

Download: multiSaRa package

Platform: R

MuSe-GNN

Learning Unified Gene Representation From Multimodal Biological Graph Data.

Faculty: Hongyu Zhao, PhD

Download: GitHub / MuSe-GNN package

Platform: Python

Reference: proceedings.neurips.cc (MuSe-GNN)

Pathscore

PathScore quantifies the level of enrichment of somatic mutations within curated pathways, applying a novel approach that identifies pathways enriched across patients. The application provides several user-friendly, interactive graphic interfaces for data exploration, including tools for comparing pathway effect sizes, significance, gene-set overlap and enrichment differences between projects.

Faculty: Jeffrey Townsend, PhD

Download: GitHub / Pathscore package

Platform: Website

Reference: doi.org (Pathscore)

PhyDesign

The PhyDesign web application consists of 3 components:

1) A form to upload information and choose an application to calculate the evolutionary rates for each alignment site,

2) A table listing the site rates produced by the analysis, including links for download, and

3) A user-friendly graphical interface to plot the phylogenetic profiles and calculate integration values. This latter component produces profiles of phylogenetic informativeness, calculates net and relative (per site) informativeness over specified time intervals, and offers the ability to integrate informativeness over epochs of interest.

Faculty: Jeffrey Townsend, PhD

Download: PhyDesign package

Platform: Website

Reference: doi.org (PhyDesign)

PhyInformR

Enables rapid calculation of phylogenetic information content using the latest advances in phylogenetic informativeness based theory. These advances include modifications that incorporate uneven branch lengths and any model of nucleotide substitution to provide assessments of the phylogenetic utility of any given dataset or dataset partition. Also provides new tools for data visualization and routines optimized for rapid statistical calculations, including approaches making use of Bayesian posterior distributions and parallel processing. Users can apply these approaches toward screening datasets for phylogenetic/genomic information content.

Faculty: Jeffrey Townsend, PhD

Download: Cran R / PhyInformR package

Platform: R

Reference: doi.org (PhyInformR)

PRSweb

To facilitate scientific collaboration on polygenic risk scores (PRSs) research, we created an extensive PRS online repository for 35 common cancer traits integrating freely available genome-wide association studies (GWASs) summary statistics from three sources: published GWASs, the NHGRI-EBI GWAS Catalog, and UK Biobank-based GWASs. Our framework condenses these summary statistics into PRSs using various approaches such as linkage disequilibrium pruning/p value thresholding (fixed or data-adaptively optimized thresholds) and penalized, genome-wide effect size weighting. We evaluated the PRSs in two biobanks: the Michigan Genomics Initiative (MGI), a longitudinal biorepository effort at Michigan Medicine, and the population-based UK Biobank (UKB). For each PRS construct, we provide measures on predictive performance and discrimination. Besides PRS evaluation, the Cancer-PRSweb platform features construct downloads and phenome-wide PRS association study results (PRS-PheWAS) for predictive PRSs. We expect this integrated platform to accelerate PRS-related cancer research.
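As a rough illustration of the p-value thresholding step described above, the following Python sketch scores each individual as a weighted sum of allele dosages over the variants that pass a fixed threshold. It is not the PRSweb implementation: the function name, the inputs, and the default threshold are assumptions for illustration, and the linkage disequilibrium pruning that would normally precede this step is omitted.

```python
def polygenic_risk_score(dosages, betas, p_values, p_threshold=5e-8):
    """Weighted sum of allele dosages over variants with p <= p_threshold.

    dosages:  one list per individual, with one allele dosage (0-2) per variant
    betas:    GWAS effect-size estimates, one per variant
    p_values: GWAS p-values, one per variant
    """
    if not (len(betas) == len(p_values)):
        raise ValueError("betas and p_values must have the same length")
    # Keep only the variants that pass the fixed p-value threshold.
    kept = [j for j, p in enumerate(p_values) if p <= p_threshold]
    return [sum(betas[j] * person[j] for j in kept) for person in dosages]


# Hypothetical summary statistics for three variants and two individuals.
example_betas = [0.5, -0.2, 0.1]
example_p_values = [1e-9, 0.3, 1e-8]
example_dosages = [[2, 1, 0], [0, 2, 1]]
scores = polygenic_risk_score(example_dosages, example_betas, example_p_values)
```

A data-adaptive variant would repeat this calculation over a grid of thresholds and keep the one with the best predictive performance in a tuning sample.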

Faculty: Bhramar Mukherjee, PhD

Download: PRSweb package

Platform: R Shiny

Reference: doi.org (PRSweb)

P0PCorNS

P0PCorNS (Perturbation to 0 to Predict Correlated Network Stability) performs informative gene knockouts (one gene is knocked out in silico at a time), uses BNW (Ziebarth et al., Bioinformatics, 2013) to generate the Bayesian networks, and then calculates the Jensen-Shannon divergence (JSD) between networks. Finally, P0PCorNS provides an ordered list of gene impacts. A gene that exhibits a higher informative impact is expected to play a more important role in the gene regulatory network (potentially functioning further upstream in a linear regulatory order or acting as a hub gene) and is ranked higher for gene manipulation verification experiments.
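The ranking step can be sketched in Python as follows. This is only an illustration of the Jensen-Shannon divergence itself, not the P0PCorNS code: how the package turns each Bayesian network into a probability distribution is not described here, so the sketch assumes each network has already been summarized as a discrete distribution over the same outcomes, and all names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two discrete distributions (0 to 1 in base 2)."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


def rank_knockouts(baseline, knockout_distributions):
    """Order genes by how far their knockout distribution moves from the baseline."""
    scores = {
        gene: jensen_shannon_divergence(baseline, distribution)
        for gene, distribution in knockout_distributions.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical distributions: the geneB knockout shifts the network the most.
baseline = [0.5, 0.5]
ranking = rank_knockouts(
    baseline,
    {"geneA": [0.5, 0.5], "geneB": [0.9, 0.1], "geneC": [0.6, 0.4]},
)
```

Genes at the top of the ranking are the ones whose in silico knockout changes the summarized network the most.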

Faculty: Jeffrey Townsend, PhD; Meng Liu, PhD; Zheng Wang, PhD

Download: GitHub / P0PCorNS package

Platform: R

Reference: doi.org (P0PCorNS)

RAS

Genome-wide association studies (GWAS) are crucial for identifying numerous single nucleotide polymorphisms (SNPs) linked to various diseases. However, current methods struggle with regional associations due to small effects and the high number of variants, leading to suboptimal power and inflated type I error. To tackle these challenges, we propose a powerful and visualizable method which quantifies regional association strengths at individual SNPs, converts these into time series data, and uses change point detection algorithms to identify key association regions. Extensive simulations demonstrate that our method not only increases detection power but also maintains a significantly lower false positive rate compared to existing techniques, positioning it as a promising tool for regional association detection in GWAS.

Faculty: Heping Zhang, PhD; Yiran Jiang, PhD

Download: GitHub / RAS package

Platform: R

Reference: doi.org (RAS)

ResPAN

A powerful batch correction model for scRNA-seq data through residual adversarial networks.

Faculty: Hongyu Zhao, PhD

Download: GitHub / ResPAN package

Platform: Python

Reference: doi.org (ResPAN)

RTREE

Program that analyzes relative risk and conducts sib pair linkage analysis using tree-based methods. This program can be executed to automatically generate a tree structure or allow the user to construct a tree of his or her choice.

Faculty: Heping Zhang, PhD

Download: RTREE package

Platform: Unix

Reference: doi.org (RTREE)

scAAnet

Non-linear archetypal analysis of single-cell RNA-seq data by deep autoencoders.

Faculty: Hongyu Zhao, PhD

Download: GitHub / scAAnet package

Platform: Python

Reference: doi.org (scAAnet)

scDEC

scDEC is a computational tool for single cell ATAC-seq data analysis with deep generative neural networks. scDEC enables simultaneously learning the deep embedding and clustering of the cells in an unsupervised manner. scDEC is also applicable to multi-modal single cell data. We tested it on the PBMC paired data (scRNA-seq and scATAC-seq) from 10x Genomics (see Tutorials).

Faculty: Qiao Liu, PhD

Download: Liu Lab / scDEC package

Platform: Python

Reference: doi.org (scDEC)

SeroCall

SeroCall can identify and quantitate the capsular serotypes in Illumina whole-genome sequencing samples of S. pneumoniae, calculating abundances of each serotype in mixed cultures. The software is written in Python (compatible with Python 2 or 3), and is freely available under an open source GPLv3 license.

Faculty: James Knight, PhD; Daniel Weinberger, PhD

Download: GitHub / SeroCall package

Platform: Python

Reference: doi.org (SeroCall)

SeqPop

SeqPop: A program for computing population genetic statistics on sequence data, including Pn, Theta, Pi(i,j), Kst(*), Fst(*), and their Monte Carlo significance for population subdivision.

Faculty: Jeffrey Townsend, PhD

Download: SeqPop package

Platform: Unix

SUPERGNOVA

Local genetic correlation analysis reveals heterogeneous etiologic sharing of complex traits.

Faculty: Hongyu Zhao, PhD

Download: GitHub / SUPERGNOVA package

Platform: Python

Reference: genomebiology.biomedcentral.com (SUPERGNOVA)

SaRa

The Screening and Ranking algorithm (SaRa) can detect chromosome copy number variants quickly and accurately, with computational complexity of order O(n). This program implements the methods described in: Niu and Zhang. The screening and ranking algorithm to detect DNA copy number variations. Ann. Appl. Stat. 6, 1306-1326 (2012). Hao, Niu and Zhang. Multiple change-point detection via a screening and ranking algorithm. Statistica Sinica 23 (2013).
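The screening idea can be sketched in Python under simplifying assumptions. This is not the SaRa package code: it uses a basic local statistic (the absolute difference between the means of the h points after and the h points before each position), keeps positions that are local maxima above a user-chosen threshold, and omits the ranking and model-selection steps. The window size h, the threshold, and the function names are illustrative choices.

```python
from itertools import accumulate


def local_diagnostic(values, h):
    """|mean(values[x:x+h]) - mean(values[x-h:x])| at each position x, via prefix sums."""
    n = len(values)
    prefix = [0.0] + list(accumulate(values))
    scores = [0.0] * n
    for x in range(h, n - h + 1):
        left = prefix[x] - prefix[x - h]    # sum of the h points before x
        right = prefix[x + h] - prefix[x]   # sum of the h points from x onward
        scores[x] = abs(right - left) / h
    return scores


def screen_change_points(values, h, threshold):
    """Positions whose score exceeds the threshold and is the largest within +/- h."""
    n = len(values)
    scores = local_diagnostic(values, h)
    candidates = []
    for x in range(h, n - h + 1):
        window = scores[max(0, x - h):min(n, x + h + 1)]
        if scores[x] > threshold and scores[x] == max(window):
            candidates.append(x)
    return candidates


# Hypothetical intensity sequence with a gain spanning indices 10 to 19.
signal = [0] * 10 + [2] * 10 + [0] * 10
change_points = screen_change_points(signal, h=3, threshold=1.0)
```

The prefix sums let every local score be computed in constant time, so the scoring pass is linear in the sequence length. The local-maximum check in this sketch rescans each window, which adds a factor of h; a sliding-window maximum would restore linear time overall.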

Faculty: Heping Zhang, PhD

Download: SaRa package

Platform: R

Reference: doi.org (SaRa)

simuRare

simuRare is a regression-based algorithm that imputes rare variants in currently available SNP array data and uses a resampling approach to simulate samples that contain both common and rare SNPs.

Faculty: Heping Zhang, PhD

Download: simuRare package

Platform: R

Reference: doi.org (simuRare)

SSSS

This package provides a fast nonparametric method for short segment detection.

Faculty: Heping Zhang, PhD

Download: SSSS package

Platform: R

STB-STC

STB-STC is a statistical method that can identify joint effects of microbes in human disease by accounting for sparsity and using the hierarchical information in taxonomy annotation. STB and STC yield better detection performance than state-of-the-art differential abundance analysis approaches in situations where microbes are highly correlated. We distribute one core R function (SVB-SVC.R) related to STB and STC. It applies the method (STB or STC) to a group of microbes. The script utilizes.R includes all the code needed to run SVB-SVC.R.

Faculty: Heping Zhang, PhD

Download: STB-STC package

Platform: R

subgxe

Classical methods for combining summary data from genome-wide association studies (GWAS) only use marginal genetic effects, and power can be compromised in the presence of heterogeneity. 'subgxe' is an R package that implements p-value assisted subset testing for association (pASTA), a method developed by Yu et al. (2019) <doi:10.1159/000496867>. pASTA generalizes association analysis based on subsets by incorporating gene-environment interactions into the testing procedure.

Faculty: Bhramar Mukherjee, PhD

Download: Cran R / subgxe package

Platform: R

Reference: doi.org (subgxe)

TARV

TARV is a tree-based method to explore the association between rare variants and complex diseases, and find potential genetic and environmental factors and their interactions.

Faculty: Heping Zhang, PhD

Download: TARV package

Platform: R

Reference: doi.org (TARV)

T-GEN

T-GEN (Transcriptome-mediated identification of disease-associated Genes with Epigenetic aNnotation) is a framework to identify disease-associated genes leveraging epigenetic information.

Faculty: Hongyu Zhao, PhD

Download: GitHub / T-GEN package

Platform: R

Reference: journals.plos.org (T-GEN)

Twin Analysis

This program uses SAS PROC NLMIXED and PROC MIXED to conduct twin analysis to estimate heritability of binary and quantitative traits.

Faculty: Heping Zhang, PhD

Download: Twin Analysis package

Platform: SAS

Reference: doi.org (Twin Analysis)

Willows

Willows is a software package that includes three classifiers: classification tree, random forest, and deterministic forest. This package is built on the basis of Heping Zhang's RTREE program with two distinctive features. First, the accumulation of data on single nucleotide polymorphisms (SNPs) has created data sets so large that we have to take specific steps to improve the memory use of the existing software. Willows implements the most efficient memory use for SNP data, while maintaining its general functionality. The second important feature of Willows is a friendly graphical user interface.

Faculty: Heping Zhang, PhD

Download: Willows package

Platform: Unix

Reference: doi.org (Willows)