# Chapter 1 Introduction

Viruses are arguably the simplest biological agents that may be considered as having some form of “life”, and yet they are capable of influencing our planet and impacting our lives in many different ways. Existing as very small particles with limited genetic code, viruses are incapable of surviving on their own and rely on resources and biological environments within other living organisms such as human beings. Once a virus infects a human cell, it hijacks the cell’s apparatus to make copies of itself and propagate, causing disruptions of various biological functions both inside and outside the cell. These disruptions can in turn lead to onset of disease, ranging in severity from being asymptomatic to life-threatening. On the other hand, the human immune system has evolved complex but extremely effective ways to counter such threats (Owen et al., 2013; Murphy and Weaver, 2016). Remarkably, the human immune system is capable of recognizing any virus that it has dealt with before by maintaining sets of trained immune cells specific to each virus — memory B and T cells that constitute immunological memory. When the immune system recognizes an invading virus, it quickly recalls the specific repertoire of cells that were trained to overcome that virus. It is precisely this characteristic of the immune system, known as ‘acquired/adaptive immunity’ (Owen et al., 2013), which has led to the concept of vaccination. By presenting some form or part of the virus to the immune system through a vaccine, specific anti-viral immune responses are generated and committed to memory which are then rapidly recalled whenever the virus attempts to invade in the future (Murphy and Weaver, 2016).

Vaccination has perhaps saved more lives than any other public health intervention and has helped to eradicate severe diseases (e.g., smallpox) or severely limit the spread of infectious diseases (e.g., polio and measles) in the human population. However, there are still no effective vaccines against a number of infectious viruses (e.g., HIV, Ebola, Zika, Dengue, SARS, and MERS) (Maslow, 2019; Deng et al., 2020; Gaardbo et al., 2012), some of which regularly cause disease outbreaks and can lead to disastrous public health consequences, especially in areas with poor healthcare systems. Moreover, frequent outbreaks of such infectious diseases also carry a huge economic burden (Stanaway et al., 2016). Traditional approaches to vaccine design have not been successful against some of these viruses, and more innovative approaches are therefore needed to aid the rational design of vaccines (Chakraborty and Barton, 2017).

With advancements in both genetic sequencing technologies and immunological experiments, two main types of data are now being generated in large quantity which can directly inform rational design of vaccines against viruses: (i) genetic sequences of viruses that are extracted from patients, and (ii) genetic sequences of experimentally-resolved fragments of viruses that are targeted by the human immune system. As a result, a new paradigm of rational vaccine design leveraging genetic information is emerging, which is sometimes referred to as reverse vaccinology (Moxon et al., 2019) (or synthetic vaccine design (Palatnik-de-Sousa et al., 2018)).

Aligned with this paradigm and adapting different statistical methods, this thesis looks into how analyzing available genetic data can inform design of vaccines against three very different infectious viruses:

1. Human immunodeficiency virus (HIV),
2. Dengue virus (DENV), and
3. Severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2).

The interplay between each of these viruses and the human immune system is unique; however, some basic principles remain the same, which helps in identifying effective vaccine targets against these viruses within a unified framework.

To introduce these principles, a brief overview of the most relevant concepts in virology, human immunology and vaccine design is presented next.

## 1.1 Brief overview of relevant concepts in virology, human immunology and vaccine design

### 1.1.1 Viruses

Viruses are simple biological entities that depend upon the cellular machinery of other living organisms (hosts) to function and make copies of themselves. The newly formed viral copies exit the infected host cell as virus particles (also called virions) which then go on to infect other cells. All virus particles comprise two main components:

1. Genetic code - which is the information payload and is generally located at the core of the virus.

2. Outer shell - which serves as the packaging to protect the payload and mediates viral entry into the target host cell.

The genetic code of a virus exists in the form of a DNA or RNA molecule which has either a single or double stranded structure. These features of viruses help to classify them; for example, the three viruses that are addressed in this thesis—HIV, DENV and SARS-CoV-2—are all single stranded RNA viruses. The RNA molecules serve as blueprints for making all viral proteins inside the host cell, such that specific regions of the RNA are translated into different proteins that in turn assemble to form the complete virus particle, while encapsulating a copy of the viral RNA (Owen et al., 2013; Murphy and Weaver, 2016).

The outer shell of a virus is formed by assemblies of one or more viral proteins which gives each virus its typical shape. These shapes, such as the ‘spiky’ shape of SARS-CoV-2 and the ‘rounded’ shape of dengue, play an important role in targeting the specific set of host cells by interacting with specific molecules expressed on the surface of these cells (cell tropism). For example, the spike protein on the surface of SARS-CoV-2 enables it to bind with the ACE2 (angiotensin-converting enzyme 2) molecules on the surface of cells that are abundantly present in the human respiratory tract and lungs (Lan et al., 2020).

Each viral protein can be represented by the sequence of its constituent amino acid residues, and this representation is known as the protein’s primary structure, or simply, protein sequence. The sequence data is extensively studied to help understand different aspects of the viruses, including their 3D structure, function, pathogenicity as well as evolutionary histories.

### 1.1.2 Extraction and sequencing of viruses

Virus particles are first extracted from blood or other clinical samples obtained from patients. These samples are collected from different sites of the body depending upon the type and site of infection, e.g., nasal swabs are commonly used for extracting SARS-CoV-2 particles. From these samples, the viral RNA is then purified and the genetic code is sequenced using any of the various sequencing platforms (Fig. 1.1) (Murphy and Weaver, 2016).

The sequencing platforms provide the readout of the genetic code as a sequence of nucleotides (encoded by A, C, G, or T). Each set of three consecutive nucleotides (a codon) represents one of the 20 amino acids into which the viral RNA is translated inside the host cell. These amino acids (more precisely, their residues) join together to form the protein’s primary structure. Therefore, to obtain the sequences of the viral proteins, the codons within the nucleotide sequence readouts are translated in silico, i.e., computationally.
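As an illustration of the in silico translation step described above, the following is a minimal Python sketch. It assumes the standard genetic code (NCBI translation table 1), a reading frame that starts at the first nucleotide, and translation that stops at the first stop codon. The function name `translate` and the use of `X` for codons containing ambiguous bases are illustrative choices, not taken from the thesis.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with codons listed in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides: str, stop_symbol: str = "*") -> str:
    """Translate a nucleotide string codon by codon, stopping at the first stop codon."""
    seq = nucleotides.upper().replace("U", "T")  # accept RNA or DNA letters
    protein = []
    for i in range(0, len(seq) - 2, 3):
        # Codons with ambiguous bases (e.g. N) are not in the table and become "X".
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == stop_symbol:
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGGCTGCTGGCAGTAA"))  # Met-Gly-Cys-Trp-Gln followed by a stop codon
```

In practice, established toolkits handle frame selection, ambiguous bases and alternative genetic codes more carefully than this sketch does.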

A large number of viral protein sequences are deposited by researchers from across the globe in various publicly available repositories, including NCBI Virus, NIAID Virus Pathogen Database and Analysis Resource (ViPR) (Pickett et al., 2012), Los Alamos National Laboratory (LANL) and Global Initiative on Sharing All Influenza Data (GISAID).

These repositories serve as the source of one of the main types of data—viral genetic sequences of patient-derived viruses—analyzed in this thesis.

### 1.1.3 Human immune system

The human immune system is complex and comprises various mechanisms that work together to tackle the large number of pathogens, including viruses, that try to invade the human body. These mechanisms are broadly classified into innate and adaptive immune systems. The innate immune system comprises first responders and can provide a non-specific defense against various types of viruses. The adaptive immune system, however, comprises specialized immune cells that can provide a virus-specific defense. It has the capacity to develop such immune cells, equipped to precisely recognize the invading virus, and to commit a subset of these cells to memory (Owen et al., 2013; Murphy and Weaver, 2016). Although the adaptive immune response can take some time to develop in response to a new virus, once it has developed and trained the specialized virus-specific immune cells, it can focus the immune attack on clearing the virus and, in addition, quickly launch defenses whenever that virus strikes again in the future.

Thus, an effective vaccine against any virus aims to train the adaptive immune system to recognize the virus, such that it is well-equipped to tackle that virus – if and when it attacks naturally. Therefore, in the following we briefly review important concepts related to the human adaptive immune system.

#### 1.1.3.1 B and T cells

B and T cells are the two specialized types of immune cells that are directly involved in the adaptive immune system. Generally, B cells are involved in recognizing the virus when it is outside the host cell, while T cells are involved in recognizing the virus after it has entered a host cell (Owen et al., 2013; Murphy and Weaver, 2016). Both types of cells are further divided into:

• Effector cell subsets – which are short-lived and actively participate in immune defense against the virus, and
• Memory cell subsets – which are long-lived and generate effector cells upon subsequent encounters with the virus.

Effector B cells, also called plasma cells, generate proteins called antibodies which generally recognize structural fragments on the outer shell of virus particles and neutralize the virus by latching onto these fragments (Fig. 1.2A).

Effector T cells recognize linear fragments of viral proteins that are processed and presented by the host cells as short sequences called peptides. Which sub-type of T cell recognizes a presented peptide depends upon the molecule on the cell surface that presents it. These peptide-presenting molecules, known as human leukocyte antigen (HLA) molecules, are classified as either class I or class II, and the peptides they present to T cells are referred to as HLA class I-restricted or HLA class II-restricted, respectively. One type of T cell, the cytotoxic T cell (CTL), which plays an effective role in destroying infected cells, interacts with HLA class I molecules and is therefore able to recognize HLA class I-restricted peptides. Another type of T cell, the helper T cell (HTL), which helps cytotoxic T cells as well as B cells, interacts with HLA class II molecules and is therefore able to recognize HLA class II-restricted peptides (Fig. 1.2B).

A subset of B and T cells that successfully recognize a virus are stored by the immune system as memory B and memory T cells respectively. These memory cells help to mount a rapid and virus-specific immune response against the same virus in the future.

The interplay between different parts of the adaptive immune system is very complex. When B or T cells recognize a virus, they are activated and release various chemical signals to influence and recruit other immune cells. Moreover, B cells can present viral peptides to both cytotoxic T cells and helper T cells, and activated helper T cells help activate B cells for secreting antibodies. Nevertheless, looking beyond the intricacies of interactions within the immune system, the entities that are central in determining an immune response against a virus are the epitopes: specific fragments of the virus which the B or T cells recognize.

#### 1.1.3.2 Epitopes

Epitopes are the fundamental determinants of viral recognition by the adaptive immune system. B and T cells recognize a virus by recognizing its epitopes (B cell epitopes and T cell epitopes, respectively) using specific molecules on their surface.

B cell epitopes are recognized by the B cells using B cell receptors (BCR) (Fig. 1.2A). The antibodies secreted by an activated B cell recognize the same epitope that the corresponding BCR recognizes. Generally, B cell epitopes are structural epitopes that help in recognizing the virus particles present outside the cells. The linear fragments of the virus which the B cells recognize are referred to as linear or continuous B cell epitopes, while the precise sites (protein residues) on the virus surface that the BCRs bind to are referred to as conformational or discontinuous B cell epitopes.

T cell epitopes are recognized by the T cells using T cell receptors (TCR) (Fig. 1.2B). All T cell epitopes are linear fragments of the virus that are intra-cellularly processed and presented on the cell surface by HLA molecules. Shorter epitopes (8–11 residues) are HLA class I-restricted peptides that bind to the TCR on cytotoxic T cells and are therefore referred to as CTL epitopes. Longer epitopes (13–17 residues) are HLA class II-restricted peptides that bind to the TCR on helper T cells and are therefore referred to as HTL epitopes (Sidney et al., 2020).

### 1.1.4 Immunological experiments to study adaptive immune responses

Human adaptive immune responses are generally studied using immune cells (white blood cells) that are extracted from blood samples drawn from patients or healthy donors. Specifically, peripheral blood mononuclear cells (PBMCs), which form a subset of white blood cells, are separated from the blood samples to study B and T cells. In samples drawn from healthy donors, PBMCs make up $$\sim 2 \times 10^{6}$$ cells per mL of blood and comprise $$\sim$$5–10% B cells, $$\sim$$25–60% CD4+ T cells and $$\sim$$5–30% CD8+ T cells (Owen et al., 2013; Murphy and Weaver, 2016). Generally, experiments conducted to characterize the adaptive immune responses following natural infection or vaccination measure the frequency of activated B and T cells, the magnitude of their activation, and the breadth of their response against viral proteins. In addition, detailed experimental studies are carried out to determine the viral epitopes (Fig. 1.3).

• B cell epitopes

• Discontinuous B cell epitopes are determined after resolving the 3D structure of antibody-bound virus particles. Methods employed to resolve the 3D structure of the complex include electron microscopy (EM), nuclear magnetic resonance (NMR) and X-ray crystallography (Vita et al., 2015). These methods yield the exact sites (residues) on the virus surface that bind to the antibodies.

• Linear B cell epitopes are determined using various methods that identify coarse-level linear regions of the virus that activate B cells or attract antibodies. Most of these methods are variants of the popular Enzyme-Linked Immunosorbent Assay (ELISA) (Vita et al., 2015). Other assays include Western blot, phage display, hemagglutination inhibition and chromatography. These methods generally use peptide libraries to screen for regions that bind to B cells or antibodies and identify such linear fragments of the virus.

• T cell epitopes

Broadly, there are two types of assays used to determine T cell epitopes:

1. HLA binding assays, which quantitatively measure the binding strength of the epitope-HLA complex (Peters et al., 2020). However, these assays do not necessarily report whether the epitope bound to the HLA molecule is able to activate a T cell.

2. T cell assays, which detect activation of T cells by measuring the release of specific chemicals (such as interferon-gamma (IFN-$$\gamma$$) and other cytokines, including interleukins) (Peters et al., 2020). Most of these methods use peptide libraries to screen for peptides that activate T cells and use Enzyme-Linked Immunosorbent Spot (ELISpot) assays to quantitatively measure the activation, while others detect cell-based activation-induced markers (AIM). Methods to ascertain the HLA restrictions of the epitopes include the use of HLA-multimers, mono-allelic cell line assays, and HLA haplotype information of donors that assists with statistical inference.

A large number of experimentally-determined epitopes, including both B cell and T cell epitopes derived from different viruses, are deposited by researchers from across the globe in publicly available repositories, including Immune Epitope Database (IEDB) (Vita et al., 2019), LANL and NIAID ViPR (Pickett et al., 2012).

These repositories serve as the source of one of the main types of data—genetic sequences of experimentally-determined virus epitopes—analyzed in this thesis.

### 1.1.5 Vaccines against viruses

There are about 29 human diseases against which one or more vaccines have been approved for use in humans, and at least 18 of these diseases are caused by viruses. Most vaccines against viruses are designed as prophylactic vaccines, which aim to prevent viral infection (e.g., polio vaccine). However, the design of therapeutic vaccines, which aim to prevent disease progression and control viral infection at early stages, is also important against some viruses (e.g., against viruses that cause chronic infections such as HIV).

• Poliovirus
• Rabies virus
• Rotavirus
• Measles virus
• Mumps virus
• Rubella virus
• Influenza virus
• Hepatitis A virus (HepA)
• Hepatitis B virus (HepB)
• Hepatitis E virus (HepE)
• Dengue virus (DENV)
• Yellow fever virus (YFV)
• Japanese encephalitis virus (JEV)
• Tick-borne encephalitis virus (TBEV)
• Human papillomavirus (HPV)
• Variola (smallpox) virus
• Varicella zoster (chickenpox) virus
• Severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2)

We can broadly classify vaccine design approaches into three generations.

1. First-generation vaccine designs use whole viruses, which are either live or inactivated. The live viruses used in such vaccines are able to replicate but are weakened to limit their infectivity; these vaccines are commonly referred to as live attenuated virus (LAV)-based vaccines. The inactivated virus-based vaccines comprise whole virus particles that are chemically inactivated (killed) to abolish their replication capacity. Most of the licensed human vaccines are first-generation vaccines (Palatnik-de-Sousa et al., 2018). The MMR (measles-mumps-rubella) vaccine is LAV-based, while the Salk polio vaccine is an inactivated virus-based vaccine.
2. Second-generation vaccine designs use sub-units of viruses as targets, which comprise entire viral proteins, protein domains, or specific viral peptides (Nevagi et al., 2017). The viral sub-units are delivered in various ways, e.g., as payload on the backbone of another virus, or encapsulated inside nanoparticles (Malonis et al., 2020). A few second-generation vaccines are approved for use, e.g., against HepB and HPV.
3. Third-generation vaccine designs use nucleic acids comprising parts of the viral genetic code (DNA/RNA) (Pardi et al., 2018). The only such vaccines approved for use in humans are against COVID-19. There are many others being actively developed against a number of viruses, e.g., Zika, Middle-East respiratory syndrome (MERS), and HIV.

All vaccines are designed with the aim to evoke potent adaptive immune responses against the targeted virus (Palatnik-de-Sousa et al., 2018). Thus, it is important that the specific B and T cell epitopes of the virus against which a vaccine elicits immune responses are effective targets (De Groot et al., 2020).

## 1.2 Challenges and opportunities

Determining the B and T cell epitopes of viruses which can serve as effective vaccine targets is challenging. Large numbers of experiments are required to test individual epitopes; e.g., for a 500-residue-long protein, the number of possible 8–11-residue-long epitopes to be tested would be 1966. Moreover, such experiments require considerable time and resources, which can severely limit their capacity and slow down the overall process of vaccine design.
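The figure of 1966 follows from counting every contiguous window of each allowed length: a protein of length $$L$$ contains $$L - k + 1$$ peptides of length $$k$$, so the total is 493 + 492 + 491 + 490. A short Python sketch of this count, with illustrative function names that are not from the thesis, is:

```python
def count_peptides(protein_length: int, min_len: int = 8, max_len: int = 11) -> int:
    """Number of contiguous peptides with lengths min_len..max_len in a protein."""
    return sum(max(protein_length - k + 1, 0) for k in range(min_len, max_len + 1))


def enumerate_peptides(protein: str, min_len: int = 8, max_len: int = 11):
    """Yield every contiguous peptide of each allowed length (a sliding window)."""
    for k in range(min_len, max_len + 1):
        for start in range(len(protein) - k + 1):
            yield protein[start:start + k]


print(count_peptides(500))  # 1966
```

The count grows linearly with protein length, so a full viral proteome quickly yields thousands of candidate peptides to test.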

Computational approaches can, however, help to narrow down the test set and guide these experiments, potentially speeding up the search for effective vaccine targets. High-throughput next-generation sequencing has enabled rapid collection of virus sequences at continuously decreasing costs. Large numbers of viral sequences collected from different regions of the world are now available in various repositories, such as NCBI Virus and GISAID. Furthermore, the growing amount of data from immunological experiments cataloging virus-specific immune responses (being deposited in repositories such as IEDB (Vita et al., 2019)) is valuable in determining epitopes that may serve as effective vaccine targets (Fleri et al., 2017).

In addition to the experimental challenges, determining effective vaccine targets is also complicated by the nature of the diseases caused by different viruses:

1. In chronic infections caused by highly mutable viruses, the genetic diversity of the virus enables it to escape from immune pressure (Sanjuán and Domingo-Calap, 2016). HIV infection is one such case, where coordinated mutations help HIV to evade immune responses (Goulder and Watkins, 2004). We attempt to better understand the underlying co-evolutionary networks of the HIV Gag protein by developing and applying a modified sectoring approach that is better equipped to identify constrained evolutionary networks.

This topic is addressed in Chapter 2 of the thesis. The developed computational framework helped to identify regions within the HIV Gag protein that are multi-dimensionally constrained (across protein residues) and that comprise a set of immunologically vulnerable residues.
2. Closely related viruses, or sub-types, that co-circulate in the same human population may pose a unique challenge where immune response developed against one may play a role in worsening the disease caused by the other (Welsh et al., 2010; Agrawal, 2019; Screaton et al., 2015; Rathore and St. John, 2020). The severe form of dengue disease, caused by serial infections from different serotypes of dengue virus, is one such case. We attempt to address this challenge by analyzing the conservation profile of all known T cell epitopes of dengue virus.

This topic is addressed in Chapter 3 of the thesis. The computational analyses helped to identify multi-dimensionally—across virus sub-types—constrained epitopes that are likely to be robust vaccine targets.
3. Emerging infectious diseases which spread rapidly across the human population may be effectively controlled by vaccines, but designing such vaccines is hampered by the lack of knowledge about effective immune targets (Diamond and Pierson, 2020). The COVID-19 pandemic, caused by a new virus (SARS-CoV-2), is one such case which has rapidly spread across the world. We attempt to address this challenge for SARS-CoV-2, by analyzing the epitope data of SARS-CoV for genetic similarity with SARS-CoV-2 and identifying potentially effective immune targets.

This topic is addressed in Chapter 4 of the thesis. The computational analyses and the developed system helped to identify multi-dimensionally—across related viruses—constrained epitopes that are likely to be effective immune targets.

## 1.3 Methodological aspects

An overview of the methodological approaches adopted in this thesis to address the three challenges outlined above is presented next.

### 1.3.1 Data preliminaries

The two main types of data used in this thesis (protein sequences and B/T cell epitopes) are represented as ordered lists of characters, where each character represents one of the 20 natural amino acids. For example, the string GCWQ represents four amino acid residues (Glycine, Cysteine, Tryptophan and Glutamine, respectively). Such data can be considered as either:

1. Strings of characters and analyzed using string-based algorithms or

2. Multi-dimensional categorical variables and (after encoding) analyzed using statistical methods.

In Chapter 2, we encoded the sequences as multi-dimensional binary variables (where the number of residues in a sequence determined the number of dimensions) to study the mutational patterns using correlations.

In Chapters 3 and 4, we applied string-matching methods to map epitopes against the protein sequences to determine the presence of an epitope within a sequence and to compute the conservation of epitopes.

### 1.3.2 Analysis of mutations

In order to analyze the patterns of mutations in viral proteins, we encoded the sequences as binary variables where, at each position, ‘1’ represented the presence of a mutation and ‘0’ its absence. The patterns of mutational correlations were then studied using spectral decomposition of the sample correlation matrix.
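The encoding and correlation steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the thesis implementation: a "mutation" is taken to mean any residue that differs from the most frequent (consensus) residue at that position, positions that never vary are assigned a correlation of 0.0, and the names `binarize` and `correlation_matrix` are hypothetical. The spectral decomposition itself is omitted here; it would be applied to the resulting matrix with a numerical linear algebra library.

```python
from collections import Counter
from math import sqrt


def binarize(msa):
    """Encode aligned sequences as 0/1 rows: 1 where a residue differs from the consensus."""
    # Assumption: the reference at each position is the most frequent residue.
    consensus = [Counter(column).most_common(1)[0][0] for column in zip(*msa)]
    return [[int(res != ref) for res, ref in zip(seq, consensus)] for seq in msa]


def correlation_matrix(binary):
    """Sample (Pearson) correlation between positions; constant positions get 0.0."""
    n = len(binary)
    columns = list(zip(*binary))
    means = [sum(col) / n for col in columns]
    centered = [[x - m for x in col] for col, m in zip(columns, means)]
    norms = [sqrt(sum(x * x for x in col)) for col in centered]
    size = len(columns)
    corr = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if norms[i] > 0 and norms[j] > 0:
                dot = sum(a * b for a, b in zip(centered[i], centered[j]))
                corr[i][j] = dot / (norms[i] * norms[j])
    return corr


# Toy alignment: six sequences, four positions (positions 2 and 3 never vary).
msa = ["GCWQ", "GCWQ", "ACWQ", "ACWH", "GCWH", "GCWQ"]
binary = binarize(msa)
corr = correlation_matrix(binary)
print(binary[3])             # [1, 0, 0, 1]
print(round(corr[0][3], 2))  # 0.25
```

With real data the matrix has one row and column per residue, which is why the number of sequences relative to the number of residues matters for the reliability of the estimated correlations.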

However, uncovering correlation patterns in viral protein data is challenging, specifically when the number of samples is insufficient, as is the case for HIV proteins. This is because the dimension of the data (i.e., the number of residues) is comparable to the number of sequences, which results in the appearance of spurious correlations due to sampling noise. Broadly, concepts from random matrix theory have helped to address the effects of sampling noise when studying correlation patterns (Plerou et al., 2002; Laloux et al., 2000; Bouchaud and Potters, 2009). Notably, a sparse principal component analysis (SPCA)-based sectoring method has been very useful in understanding coevolutionary networks within different viral proteins (Quadeer et al., 2018). However, this method adopts a strict criterion to determine the number of informative principal components by relying on aggressive thresholding of the eigenvalues of the sample correlation matrix.

In Chapter 2, we present a principled modification of the SPCA-based sectoring method that identifies important information in the sub-dominant principal components of a correlation matrix. This modification was guided by detailed study of ground-truth models. Using such models, we demonstrated the capability of the modified method to better identify “sectors” enriched in negative correlations. We applied our modified sectoring method on HIV data, which led to improved identification of a constrained coevolutionary network of sites within HIV’s Gag protein.

Our modified “sectoring” approach provides a systematic way to determine the number of informative principal components, helping to recover important correlation patterns that are likely to be missed by the dominant principal components and are represented only by sub-dominant principal components. The fundamental observation expounded in this analysis (described in Chapter 2), that certain patterns of correlations enriched in negative correlations cannot be represented by a single principal component of a correlation matrix, is likely to have important implications for correlation-based sectoring methods applied in different fields such as finance, evolutionary biology and protein engineering (Bouchaud and Potters, 2009; Halabi et al., 2009; Plerou et al., 2002; Pincus et al., 2017).

### 1.3.3 Mapping of epitopes

We adopted an exact string matching approach when mapping the epitopes against the protein sequences. We considered the set of epitopes as a dictionary of words and searched for their occurrences across each protein sequence using brute-force search.
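A minimal sketch of this dictionary-style, brute-force exact matching is given below. The conservation measure used here (the fraction of sequences that contain an identical copy of the epitope) is an assumption made for illustration, since the exact definition is given in the later chapters, and all function names are hypothetical.

```python
def find_occurrences(epitope: str, sequence: str) -> list:
    """Brute-force exact matching: return every start index where epitope occurs."""
    if not epitope:
        return []
    m = len(epitope)
    hits = []
    for start in range(len(sequence) - m + 1):
        # Compare the window residue by residue; all() stops at the first mismatch.
        if all(sequence[start + k] == epitope[k] for k in range(m)):
            hits.append(start)
    return hits


def conservation(epitope: str, sequences: list) -> float:
    """Assumed measure: fraction of sequences containing an exact copy of the epitope."""
    if not sequences:
        return 0.0
    present = sum(1 for seq in sequences if find_occurrences(epitope, seq))
    return present / len(sequences)


def map_epitopes(epitopes: list, sequences: list) -> dict:
    """Map a dictionary of epitopes against a set of protein sequences."""
    return {epitope: conservation(epitope, sequences) for epitope in epitopes}


print(find_occurrences("CWQ", "GCWQACWQ"))                   # [1, 5]
print(conservation("CWQ", ["GCWQ", "GCWH", "ACWQ", "GCWQ"]))  # 0.75
```

Brute-force search costs on the order of the epitope length times the sequence length per pair, which is acceptable for short epitopes but motivates the optimizations discussed in Section 1.3.4 as the number of sequences grows.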

In Chapter 4, we mapped all the previously known SARS-CoV epitopes against the SARS-CoV-2 sequences. We identified the SARS-CoV epitopes that were identically present within SARS-CoV-2 proteins, an important finding both given the lack of SARS-CoV-2-specific epitope information during the early part of the pandemic and for guiding pan-coronavirus research in the long run. We complemented this with an online platform that regularly updates the analyses by incorporating the rapidly increasing SARS-CoV-2 sequence data. Moreover, we also developed a system to map SARS-CoV-2-specific T cell epitope data that helps with understanding the emerging immunological landscape against COVID-19 across convalescent patients.

In Chapter 3, we mapped all the known DENV epitopes against all available DENV sequences. We determined the comprehensive conservation profiles for all DENV epitopes across the serotypes of DENV and identified distinct conservation patterns.

### 1.3.4 Addressing big data and computational challenges

The analysis of SARS-CoV-2 genetic sequences for regularly updating the web platform, COVIDep (see section 4.5), involves acquiring sequence data from the GISAID server, processing the multiple sequence alignment (MSA), and mapping the epitopes. However, the increasing amount of data makes this task computationally demanding, in particular given the large genome of SARS-CoV-2 (each sequence greater than $$3 \times 10^{4}$$ bases). Over the one year that COVIDep has been online, the number of sequences has increased by more than an order of magnitude, from $$\sim 6.7 \times 10^{4}$$ sequences in May 2020 to more than $$1.1 \times 10^{6}$$ sequences in May 2021 (each raw data file is greater than 40 GB). This poses a computational challenge: as the size of the MSA increases, the mapping of epitopes requires more resources. To address these challenges, the code was thoroughly revised to align with principles of big data processing and implemented using parallel programming routines utilizing 24 CPUs. Moreover, to facilitate a seamless user experience, which involves interactive visualization, memory utilization was also optimized according to the available computing resources.
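One way such a parallel mapping could be organized is sketched below using Python's standard `concurrent.futures` module. This is not the COVIDep implementation: the chunk size, the function names and the use of Python's built-in substring test are all assumptions made for illustration, and only the default of 24 worker processes is taken from the text above.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def count_hits(task):
    """Worker: for one chunk, return its size and per-epitope counts of matching sequences."""
    sequences, epitopes = task
    counts = {e: sum(e in s for s in sequences) for e in epitopes}
    return len(sequences), counts


def parallel_conservation(sequences, epitopes, workers=24, chunk_size=10_000):
    """Fraction of sequences containing each epitope, with chunks processed in parallel."""
    totals = {e: 0 for e in epitopes}
    n_sequences = 0
    tasks = ((chunk, epitopes) for chunk in chunked(sequences, chunk_size))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for n, counts in pool.map(count_hits, tasks):
            n_sequences += n
            for epitope, count in counts.items():
                totals[epitope] += count
    return {e: (totals[e] / n_sequences if n_sequences else 0.0) for e in epitopes}


if __name__ == "__main__":
    toy = ["GCWQ", "GCWH", "ACWQ", "GCWQ"]
    print(parallel_conservation(toy, ["CWQ"], workers=2, chunk_size=2))
```

Because each chunk is processed independently and only small count dictionaries are returned, the work divides cleanly across CPUs. Note that `Executor.map` submits all chunks up front, so a data set larger than memory would additionally need bounded submission or per-worker file reading.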

## 1.4 Outcome

### 1.4.1 Contributions

The main contributions of this thesis are:

1. Modification of a correlation-based sectoring method that recognizes useful information in the weaker (sub-dominant) principal components of a sample correlation matrix.

2. Improved identification of an evolutionarily constrained set of sites within HIV Gag protein by adopting the modified sectoring method. This identified sector, enriched with negatively correlated sites, is immunologically relevant and helps to identify potentially effective targets for a T cell-based vaccine against HIV.

3. Identification of a distinct pattern of conservation among the DENV T cell epitopes across the serotypes.

4. Recommendation of a set of cross-serotypically conserved DENV epitopes as targets for a universal T cell-based vaccine. A large fraction of the global population is estimated to be able to target this set of epitopes.

5. Identification of potential T cell epitopes of SARS-CoV-2, by leveraging SARS-CoV data, which can help guide experimental studies in determining epitopes and may be recommended as universal vaccine targets.

6. Development of a web-based system to regularly monitor and process new SARS-CoV-2 genetic sequence data, which is being shared rapidly from across the world. This system reports any mutations observed in SARS-CoV-2 sequences, tracks the conservation of epitopes, and helps to update the list of epitopes recommended as vaccine targets.

7. Comprehensive review and performance comparison of the state-of-the-art epitope prediction methods that have been widely employed to assist experimental studies on SARS-CoV-2.

8. Compilation and meta-analysis of SARS-CoV-2-specific T cell epitope data from multiple immunological studies. This is complemented by the development of a web platform which helps to describe the emerging landscape of SARS-CoV-2 T cell epitopes.

#### 1.4.1.1 Practical impact

The work on HIV Gag identified a set of networked epitopes which was later independently found to overlap with those based on structural analyses and experiments in mice (Gaiha et al., 2019). These epitopes have been incorporated in the design of an adenovirus-vectored vaccine against HIV which has been shown to elicit a strong T cell response in monkeys (Murakowski et al., 2021).

The work on SARS-CoV-2 led to the identification of a set of epitopes as potentially effective immune targets. These epitopes were later tested in a number of experimental studies and helped to identify immune targets in COVID-19 (see sections 4.7.4, 4.5.1). This work has influenced the design of specific assays to measure T cell responses against SARS-CoV-2 developed by multiple commercial entities (Mabtech, Miltenyi Biotec, and Anaspec). The results from this work have also directly helped to guide the design of vaccine candidates which are already in trials against COVID-19 (Gauttier et al., 2020; Guirakhoo et al., 2020) (namely UB-612 and CoVepiT from Vaxxinity and OSE Immunotherapeutics, respectively).

### 1.4.2 Publications

The following is the list of publications produced from the research work during the PhD:

1. S. F. Ahmed, A. A. Quadeer, D. Morales-Jimenez, and M. R. McKay, “Sub-dominant principal components inform new vaccine targets for HIV Gag,” Bioinformatics, vol. 35, no. 20, Oct. 2019.

2. S. F. Ahmed, A. A. Quadeer, and M. R. McKay, “Preliminary identification of potential vaccine targets for the COVID-19 coronavirus (SARS-CoV-2) based on SARS-CoV immunological studies,” Viruses, vol. 12, no. 3, Feb. 2020.

3. S. F. Ahmed, A. A. Quadeer, and M. R. McKay, “COVIDep: a web-based platform for real-time reporting of vaccine target recommendations for SARS-CoV-2,” Nature Protocols, vol. 15, no. 7, Jul. 2020.

4. S. F. Ahmed, A. A. Quadeer, J. P. Barton, and M. R. McKay, “Cross-serotypically conserved epitope recommendations for a universal T cell-based dengue vaccine,” PLOS Neglected Tropical Diseases, vol. 14, no. 9, Sep. 2020.

5. M. S. Sohail, S. F. Ahmed, A. A. Quadeer, and M. R. McKay, “In silico T cell epitope identification for SARS-CoV-2: progress and perspectives,” Advanced Drug Delivery Reviews, vol. 171, Apr. 2021.

6. A. Quadeer, S. F. Ahmed, and M. R. McKay, “Landscape of epitopes targeted by T cells in 852 convalescent COVID-19 patients: meta-analysis, immunoprevalence and web platform,” Cell Reports Medicine, vol. 2, no. 6, Jun. 2021.

7. A. Quadeer, S. F. Ahmed, and M. R. McKay, “Epitopes targeted by T cells in convalescent COVID-19 patients,” bioRxiv 2020.08.26.267724; doi: 10.1101/2020.08.26.267724 (preprint)