Chapter 2 Refined spectral analysis of HIV-Gag mutational correlation matrix informs design of novel vaccine immunogens

Abstract

Patterns of mutational correlations, learnt from patient-derived sequences of HIV proteins, are informative of biochemically-linked networks of interacting sites that may enable viral escape from the host immune system. Previous computational methods have partly identified such networks by examining the principal components (PCs) of the mutational correlation matrix of HIV Gag proteins. In this chapter, by using sequence data for HIV Gag, complemented by model-based simulations, we revealed that certain networks of interacting sites that appear important for vaccine design purposes are not accurately reflected by the dominant PCs. Rather, these networks are encoded jointly by both dominant and sub-dominant PCs. Thus, by incorporating information from the sub-dominant PCs, we identified a network of interacting sites of HIV Gag that associated very strongly with viral control. Based on this network we propose several new candidates for a potent T-cell-based HIV vaccine.

2.1 Introduction

The Joint United Nations Programme on HIV/AIDS (UNAIDS) estimates that more than 36 million individuals are currently living with the human immunodeficiency virus (HIV) (UNAIDS, 2018). Despite multiple efforts over the last three decades, an effective vaccine against HIV is still not available (Safrit et al., 2016). One of the major challenges for vaccine design is that HIV mutates and replicates at a high rate, with the resulting diversity enabling it to escape host immune responses (Safrit et al., 2016; Allen et al., 2005; Esparza, 2013). The observed escape mechanism in HIV often involves a network of interacting sites: escape mutations at sites targeted by the immune cells may be accompanied by mutations elsewhere in the protein that compensate for the fitness loss incurred by those escape mutations (Goulder and Watkins, 2004; Dahirel et al., 2011; Noviello et al., 2011; Schommers et al., 2016; Barton et al., 2016). These networks of interacting sites are quite complicated and not well-understood. For designing an effective HIV vaccine, accurately determining such networks is important to mitigate the possibility of viral escape.

Several studies have attempted to learn the networks of interacting sites in HIV proteins by studying the mutational correlation patterns observed in patient-derived sequence data (Liu et al., 2008; Dahirel et al., 2011; Quadeer et al., 2018). In (Liu et al., 2008), the study focused specifically on HIV-1 protease to differentiate the networks of interacting sites within drug-naïve patients and those undergoing therapy. In contrast, by performing a spectral analysis of the mutational correlation matrix of multiple individual HIV proteins (constructed from the sequence data of drug-naïve patients) and studying the structure embedded within the respective principal components (PCs), (Dahirel et al., 2011; Quadeer et al., 2018) identified distinct biochemically-linked groups of co-evolving sites, termed “sectors”, which apparently reflect underlying intrinsic networks of interacting sites in HIV proteins. An important feature of these methods (Dahirel et al., 2011; Quadeer et al., 2018), which drew upon earlier study of Halabi et al. (2009), was the use of ideas from random matrix theory (RMT) to distinguish true correlations from those which are seemingly due to the statistical noise caused by limited data.

For HIV Gag, a highly immunogenic polyprotein, one of the inferred sectors (Dahirel et al., 2011) was found to be enriched with sites that lie within the known protective epitopes which, when targeted by cytotoxic T lymphocytes (CTLs), are known to correlate with HIV control (Pereyra et al., 2010, 2014). An important feature of this sector, as compared to other inferred sectors (Dahirel et al., 2011), was the preponderance of site pairs whose mutations were negatively correlated; meaning that the frequency of simultaneous mutations at such site pairs is lower than the frequency that would be expected if the mutations at the individual sites were independent. This suggested that the sites within this sector may be collectively more constrained; i.e., mutating them simultaneously may incur a high fitness cost to the virus (evidenced in (Mann et al., 2014) by studying the effect of such mutations on in-vitro replication capacity of HIV), and thus represent potential candidates for effective immune targeting.

The sectoring inference method of (Dahirel et al., 2011) was improved in (Quadeer et al., 2018) by introducing a robust co-evolutionary analysis method (RoCA) that yielded more accurate biochemically-linked HIV Gag sectors by providing enhanced resilience to statistical noise in the estimation of the PCs. However, RoCA used a conservative approach which considered only the few dominant (strongest) PCs of the correlation matrix for sector inference, notwithstanding that some relevant mutational information could still be present in the sub-dominant (relatively weaker) PCs.

In our study presented in this chapter, by looking beyond the dominant PCs of the correlation matrix, we demonstrate that sub-dominant PCs may carry complementary biochemically and immunologically important information. Our study identifies a basic “sector splitting” phenomenon that can essentially affect all existing sector-based co-evolutionary analysis methods, which typically infer co-evolutionary sectors from individual principal components of mutational correlation matrices of proteins (Dahirel et al., 2011; Quadeer et al., 2014, 2018; Rivoire et al., 2016). Model-based tests show that networks involving negatively correlated sites (precisely those which appear most relevant for vaccine design (Dahirel et al., 2011)), tend to be under-represented by the dominant PCs, and that subdominant PCs play a critical role in accurately inferring such networks. To incorporate this information in the sector inference procedure, we present a principled modification of the approach of (Quadeer et al., 2018) which considers PCs of the correlation matrix beyond the few dominant ones and, where relevant, combines information from multiple PCs to form sectors. This leads us to identify a refined sector for HIV Gag having a much stronger association with protective epitopes than that reported in previous studies (Dahirel et al., 2011; Quadeer et al., 2018). Based on the sites of this refined sector, we identify new Gag candidate immunogens as potential targets for an effective T-cell-based HIV vaccine.

2.2 Acquisition and pre-processing of data

We obtained the patient derived sequence data for the HIV-1 clade B Gag protein from the LANL HIV sequence database21. One sequence per patient was downloaded while excluding any sequence that was marked as problematic. The HIV Gag sequences, including the HXB2 reference sequence were downloaded as an alignment from the Los Alamos National Laboratory database. Using an in-house code, the alignment was reduced to the length of the HXB2 sequence by removing all positions at which there was a gap in the HXB2 sequence.

We downloaded a total of 2069 sequences from the database. To control sequence quality, we used a PCA plot of the associated similarity matrix (Halabi et al., 2009; Quadeer et al., 2018) to identify outlying sequences (Fig. 2.1), potentially resulting from non-uniform sampling or misclassification. Specifically, a group of 172 sequences appeared distinct from the rest in the PCA plot and was thus removed. After removing the outlying sequences, N=1897 sequences remained in the MSA. The accession numbers of sequences from both these sets are deposited in the online repository22.

(ref:CAPgagS6) PCA plot of the similarity matrix associated with the 2069 sequences of HIV Gag downloaded from HIV Los Alamos database. A total of 172 outlying sequences (represented by red circles) were excluded from further analysis.

(ref:CAPgagS6)

Figure 2.1: (ref:CAPgagS6)

To control site (position) quality, the sites in the MSA which were 100% conserved and those which had gaps in more than 12.5% of the sequences were removed, resulting in a total of M=451 sites. Then, the amino acid MSA matrix was converted to a binary MSA matrix as follows:

  • The most-frequent residue (wild-type) at each site was encoded as ‘0’, and

  • Any mutation (a gap or an amino acid residue other than the wild-type at each site) was encoded as ‘1’.

Correlation in the data arising from phylogenetic effects (common ancestry) was partly removed by using linear regression (see Section 4 in the Supporting Information of Dahirel et al. (2011) for details). The resulting data matrix was then used to compute the Gag sample correlation matrix for sector inference.

2.3 Study of ground-truth models

For our model-based ground-truth tests, correlation matrices were constructed from two synthetic data models: (i) a simple model comprising a single network of 3 interacting sites and (ii) a larger more complex model comprising two distinct networks of 14 and 43 interacting sites respectively. We tested multiple ground-truth models to understand the phenomenon of information splitting to multiple PCs. In this section, we describe the procedure we followed to construct these models.

For a single-network model, we constructed different (3 × 3) correlation matrices by systematically varying the magnitude for all the pairwise correlations (representative cases are shown in Fig. 2.2) corresponding to each of the four distinct cases Figs. 2.2(A-D). Different correlation matrices were modeled, shown in Figs. 2.2(A-D), such that: (i) all pairs of sites were positively correlated (top left panels), (ii) two pairs were negatively while one pair was positively correlated (top right panels), (iii) two pairs were positively while one pair was negatively correlated (bottom left panels), and (iv) all pairs of sites were negatively correlated (bottom right panels). In the first two cases (top panels), the 3 interacting sites of the network were fully represented in PC1 (corresponding to the largest eigenvalue); this contrasts with the remaining two cases (bottom panels), where the network was only partially represented in PC1. Thus, the main finding from these ground-truth models, that the dominant PC may not sufficiently represent all sites within a sector involving negative correlations, is quite robust to specific model details.

The specific correlation values that we used were chosen according to the correlations observed in the Gag data. Note that these magnitudes of correlations were varied such that the resulting correlation matrix was valid (positive semi-definite). We considered all the sites within the sector to be equally important and first modeled this by assigning the same magnitude for all the pairwise correlations. We generated models to test for various strong as well as weak correlations (representative cases with specific values of the correlations shown in Figs. 2.2A-B) . Then, we slightly varied the magnitudes of individual correlations (\(\pm 25\%\)) in each of these models and noted that this does not affect the results qualitatively (Figs. 2.2C-D).

(ref:CAPgagS1) Representations of model-based correlation matrices and their PCs capturing a network of 3 interacting sites in proteins. (A) Strong pairwise correlations with same magnitudes. (B) Weak pairwise correlations with same magnitudes. (C) Strong pairwise correlations with different magnitudes. (D) Weak pairwise correlations with different magnitudes.

(ref:sCAPgagS1) Representations of model-based correlation matrices and their PCs capturing a network of 3 interacting sites in proteins.

(ref:CAPgagS1)

Figure 2.2: (ref:CAPgagS1)

For the two-network model, we constructed the correlation matrix by using the pairwise correlations observed in the Gag data. Specifically, the frequencies of mutations occurring at individual sites and at pairs of sites in the first network were modeled according to Gag sector 2 (Fig. 2.19) which comprised 14 sites (listed in Table 7.1) that were mutually positively correlated, while those in the second network were modeled according to Gag sector 3 (Fig. 2.19) which comprised 43 sites (listed in Table 7.1) and included both positively and negatively correlated pairs of sites.

2.4 Methods

In this section, we describe the methodological details of determining the informative principal components (PCs) of the Gag sample correlation matrix followed by the proposed sector inference approach.

Spectral analysis of the correlation matrices was carried out to identify networks of interacting sites by estimating the PCs using Corr-ITSPCA (Correlation-based iterative thresholding sparse PCA), a key component of the RoCA method (Quadeer et al., 2018). However, in contrast to that method, which involved a conservative threshold and yielded six dominant PCs for the correlation matrix of HIV Gag (Quadeer et al., 2018), we progressively relaxed the threshold. In (Quadeer et al., 2018), this threshold was set according to the maximum eigenvalue observed in a large ensemble of randomized MSAs, \(\lambda_{max}^{rnd}\), to distinguish PCs reflecting true correlations from those which were supposedly corrupted by statistical noise.

To investigate if the PCs corresponding to the sub-dominant eigenvalues captured further information about the underlying networks of interacting sites, especially the ones involving negatively correlated pairs which were suggested to be important for vaccine design (Dahirel et al., 2011), we introduced an approach that successively included multiple sub-dominant PCs in the Corr-ITSPCA algorithm until the resulting estimates of the dominant PCs became unstable; i.e., they were largely distorted by statistical noise.

2.4.1 Determining the informative PCs

The \((i,j)^{th}\) element of the Gag sample correlation matrix, \(C_{ij}\), was defined as

\[ C_{ij} = \frac{f_{ij} - f_{i} f_{j}}{\sqrt{(f_i (1 - f_{i}) f_{j} (1-f_{j})}} \tag{2.1} \]

where \(f_i\) is the fraction of the sequences in the MSA having a non-wild-type residue at the \(i^{th}\) position and \(f_{ij}\) is the fraction of the sequences in the MSA having non-wild-type residues simultaneously at positions \(i\) and \(j\). The most frequent residue at any position was considered as the wild-type residue for that position. By definition, \(-1 \leq C_{ij} \leq 1\). The correlation matrix, C, was decomposed as

\[ \mathbf{C} = [\mathbf{u}_1 \; \mathbf{u}_2 \cdots \mathbf{u}_M ] \; \lambda \; [\mathbf{u}_1\; \mathbf{u}_2 \cdots \mathbf{u}_M ]^T \]

where the superscript \(T\) denotes vector transposition, \(\Lambda\) is a diagonal matrix containing the M sample eigenvalues (\(\lambda_1 > \lambda_2 > \cdots > \lambda_M\)) of C on the diagonal, and \(\{u_i\}_{i=1}^M\) are the corresponding sample eigenvectors (sample PCs).

Due to the limited number of available Gag sequences, many of the sample eigenvalues and their corresponding sample PCs are corrupted by statistical noise. Thus, it is important to discriminate informative eigenvalues (representing true correlations) and their corresponding PCs from those which are corrupted by noise. In (Quadeer et al., 2018), this discrimination was done based on a conservative threshold, \(\lambda_{max}^{rnd}\), that was set according to the maximum eigenvalue observed across the spectral decomposition of a large ensemble of randomized MSAs. Six dominant eigenvalues were obtained following this procedure. However, it was not clear if and to what extent the eigenvalues below the threshold used in (Quadeer et al., 2018) \(\{\lambda_i\}_{i=7}^M < \lambda_{max}^{rnd}\) and their corresponding PCs \(\{u_i\}_{i=7}^M\) carry useful information. To understand this systematically, we progressively included PCs corresponding to the sub-dominant eigenvalues and determined if they revealed any additional information about the networks of Gag’s interacting sites.

In the following, we present the iterative procedure followed to determine informative PCs of the Gag sample correlation matrix which are subsequently used for sector inference. As mentioned above, the sample PCs (\(u_i\)’s) are also corrupted by statistical noise. Thus, we employ Corr-ITSPCA (Quadeer et al., 2018) in each iteration of this procedure for obtaining robust estimates of the PCs (\(v_i\)’s) of the sample correlation matrix. We use \(v_i^k\) to denote the estimated PC corresponding to the \(i^{th}\) eigenvalue (\(\lambda_i\)) when \(k\) strongest PCs (i.e., corresponding to the eigenvalues \(\{\lambda_i\}_{i=1}^k\)) are estimated using Corr-ITSPCA. The stopping criterion for the algorithm depends upon the metric \(\mu(k)\) defined below, which measures the averaged similarity between the 6 dominant PCs estimated at the initial step, \(\{v_i^6\}_{i=1}^6\), and the corresponding 6 dominant PCs estimated in each iteration, \(\{v_i^6\}_{i=1}^6 \;\;\; (k = 7, 8, \cdots)\).

The steps involved in the algorithm are as follows:

Inputs: M sample eigenvalues, \(\{\lambda_i\}_{i=1}^M\), and their corresponding sample PCs, \(\{u_i\}_{i=1}^M\).

Output: The set of \(k\) estimated PCs to be used for sector inference, \(\{v_i^k\}_{i=1}^k\).

1: Initialization: \(k=6\);

2: Obtain robust estimates of the 6 dominant PCs, \(\{v_i^6\}_{i=1}^6\), using Corr-ITSPCA;

3: repeat

4: \(\;\;\;\;\) Increment: \(k = k+1\);

5: \(\;\;\;\;\) Obtain robust estimates of the \(k\) strongest PCs, \(\{v_i^k\}_{i=1}^k\), using Corr-ITSPCA;

6: \(\;\;\;\;\) Compute: \(\mu(k) = \frac{1}{6} \sum_{i=1}^6 | \langle v_i^6, v_i^k \rangle |\);

7: while \(\mu(k)>0.9\)

Note that the 6 dominant PCs (corresponding to \(\lambda_1, \lambda_2, \cdots, \lambda_6\)) are re-estimated after adding a sub-dominant eigenvalue (\(\lambda_k\)) and its corresponding PC (\(\mathbf{u}_k\)) in each iteration at step 5. By incorporating sub-dominant eigenvalues in the estimation process, one expects the effect of noise to increase and subsequently corrupt the PC estimates. Thus, the stopping criterion ensures that sub-dominant eigenvalues are only included as long as the estimates of the 6 dominant PCs remain largely intact (see section 2.4.3 for details on the effectiveness of this criterion).

(ref:CAPgagS7) Comparison of the averaged similarity, \(\mu(k)\), between the 6 dominant PCs estimated initially (for k=6) and the corresponding 6 dominant PCs estimated for k=7,8,…,12.

(ref:CAPgagS7)

Figure 2.3: (ref:CAPgagS7)

Note that \(\mu(k)\) drops significantly beyond PC 9 (Fig. 2.3) indicating that PCs beyond PC9 may not be informative. It is also important to note that variations of the original criterion to set the relevant eigenvalue threshold do not give similar results. For example, if we set the threshold as the mean of the maximal eigenvalues obtained from correlation matrices of randomized MSAs, \(\lambda_{mean}^{rnd}\), the obtained number of dominant eigenvalues is very large. Compared to 6 eigenvalues which were greater than the threshold (\(\lambda_{max}^{rnd}\)) used in (Quadeer et al., 2018), there are 34 eigenvalues above \(\lambda_{mean}^{rnd}\), with most of them lying very close to or within the bulk of the eigenvalues (Fig. 2.4). Moreover, due to the distribution of the maximal eigenvalues being heavily skewed, setting the median or 75th percentile of the maximal eigenvalues as the threshold also returned more than 30 dominant eigenvalues. Based on RMT concepts, we expect the weak eigenvalues and their corresponding PCs to be severely corrupted by sampling noise, and thus uninformative for sector inference. This is corroborated by the observation that the noise contribution from weaker PCs (i.e., those beyond the strongest 9 PCs), severely distorts the estimates of the 6 dominant PCs (Fig. 2.3), which were previously shown in (Quadeer et al., 2018) to be informative.

(ref:CAPgagS8) Eigenvalue distribution of the Gag sample correlation matrix. The two dashed lines indicate the threshold based on the mean of the maximal eigenvalues obtained from randomized MSAs (\(\lambda_{mean}^{rnd}\)) and that based on the maximum of these maximal eigenvalues (\(\lambda_{max}^{rnd}\)) adopted by (Quadeer et al., 2018)

(ref:sCAPgagS8) Eigenvalue distribution of the Gag sample correlation matrix.

(ref:CAPgagS8)

Figure 2.4: (ref:CAPgagS8)

Note that we used the Corr-ITSPCA method which yields estimates of the PCs that are robust to sampling noise compared to the simple sample PCs (shown in Fig. 2.5). This is consistent with the extensive tests based on ground-truth models performed in (Quadeer et al., 2018) that demonstrated the robustness (to sampling noise) of the Corr-ITSPCA method (see Supplementary Text S2 and Supplementary Figure S11 in (Quadeer et al., 2018) for further details).

(ref:CAPgagS9) Comparison of the 9 leading sample PCs and their robust estimates obtained using Corr-ITSPCA (Quadeer et al., 2018) for the Gag correlation matrix.

(ref:sCAPgagS9) Comparison of the 9 leading sample PCs and their robust estimates

(ref:CAPgagS9)

Figure 2.5: (ref:CAPgagS9)

Furthermore, to demonstrate that the sub-dominant PCs that we incorporated (PC7, PC8 and PC9) based on our criterion \(\mu(k)\) are informative, we performed additional statistical tests. Specifically, we examined the structure of the sub-dominant PCs that we included in the HIV Gag analysis and compared this with the one expected for PCs arising due to sampling noise only. We performed the comparison based on two metrics: (i) the distribution of elements in the PCs, and (ii) the inverse participation ratio (IPR) for the PCs. IPR, often used in localization theory (Guhr et al., 1998), quantifies the reciprocal of the number of PC elements that contribute significantly and indicates the extent of sparsity within a PC. Specifically, for an (M \(\times\) 1) PC, \(\boldsymbol{v}\), the IPR is computed as, IPR = \(\sum_{i=1}^M{v_i}^4\), where \(v_i\) is the \(i^{th}\) element of \(\boldsymbol{v}\). From random matrix theory, it is known that for a PC reflecting only spurious correlations (arising from sampling noise), the elements conform to a standard Gaussian distribution (Guhr et al., 1998), while the value of its IPR is expected to be close to that obtained for sample PCs of a random correlation matrix as they lack any structure (Plerou et al., 2002; Cocco et al., 2013; Laloux et al., 1999). We computed the random correlation matrix from an (M \(\times\) N) matrix (where M = 451 and N = 1897, similar to the Gag data) of independent and identically distributed Gaussian elements with zero mean and unit variance. From these comparisons (Figs. 2.6 and 2.7), it is evident that, similar to the six dominant PCs (PC1-6), the three sub-dominant PCs (PC7, PC8 and PC9) that were selected based on our criterion (\(\mu(k) > 0.9\)) are significantly distinct from PCs which result due to noise, with this difference most pronounced for PC7 and PC8. In contrast, the much weaker sub-dominant PCs are very similar to those resulting from noise only (Figs. 2.6 and 2.7).

(ref:CAPgagS10) Quantile-Quantile (Q-Q) plots for the elements of the sample PCs of the Gag correlation matrix, in comparison with those expected for sample PCs due to random noise only i.e., PCs of a random correlation matrix. For the three sub-dominant PCs (PC7, PC8, and PC9) of Gag which were identified as informative using our criterion, the elements deviate from those expected for PCs arising due to random noise, similar to those for the six dominant PCs (PC1-6). In contrast, for the much weaker sub-dominant PCs (PC100, PC200 and PC300), the elements are distributed as expected for PCs which are due to random noise. For a fair comparison, the elements of the sample PCs are scaled by a factor of \(\sqrt{M}\). Blue color represents elements of sample PCs corresponding to the Gag correlation matrix and light grey color represents elements of sample PCs corresponding to a random correlation matrix. The theoretical quantile values are those expected from a standard Gaussian distribution.

(ref:sCAPgagS10) Quantile-Quantile (Q-Q) plots for the elements of the sample PCs of the Gag correlation matrix, in comparison with those expected for sample PCs due to random noise only.

(ref:CAPgagS10)

Figure 2.6: (ref:CAPgagS10)

(ref:CAPgagS11) Inverse participation ratio (IPR) values for the sample PCs of Gag correlation matrix compared to those obtained from a random correlation matrix. For the three sub-dominant PCs (PC7, PC8, and PC9), which were identified as informative using our criterion, the IPR values are larger than those expected from the PCs of a random correlation matrix, similar to those for the six dominant PCs (PC1-6). In contrast, for the much weaker sub-dominant PCs (PC100, PC200 and PC300) the IPR values align with those from a random correlation matrix. For a PC of length M=451, the value of its IPR equals \(\frac{1}{M}=0.0022\) when all of its elements contribute identically (Plerou et al., 2002) while being greater when fewer elements contribute significantly.

(ref:sCAPgagS11) Inverse participation ratio (IPR) values for the sample PCs of Gag correlation matrix compared to those obtained from a random correlation matrix.

(ref:CAPgagS11)

Figure 2.7: (ref:CAPgagS11)

2.4.2 Modified sector inference

Following the procedure outlined above, we identified that the first 9 estimated PCs \(\{v_i^9\}_{i=1}^9\) seemingly encoded true correlations present in the Gag correlation matrix, since including additional PCs significantly distorted the estimates of the 6 dominant PCs (Fig. 2.3). Here we present the details of the procedure followed for inferring sectors from the 9 informative PCs obtained above. In contrast to (Quadeer et al., 2018) in which the top 6 PCs that were used to form Gag sectors were largely distinct, we observed multiple overlapping PCs among our selected set of PCs of the Gag correlation matrix (see section 2.6.3). From our model-based studies, a significant amount of overlap among the PCs was found to be suggestive of these being representative of the same underlying network of interacting sites (see 2.6.2 in Results). Thus, we proposed to form sectors from the unique combination of large-magnitude indices of PCs that had a significant amount of overlap. The details of the sector inference procedure are discussed as follows.

First, the set of estimated PCs obtained using Corr-ITSPCA were not strictly sparse and had several non-zero indices with very small magnitude due to residual noise inherent in the procedure (see (Quadeer et al., 2018) for details). However, the large-magnitude indices of the 9 estimated PCs of the Gag correlation matrix were clearly distinguishable from these (very small-magnitude) noisy indices. These sets of indices for each PC were obtained as

\[V_i := \{ m \in \{1, \cdots, M\} : |v_i^9(m)| > \gamma\} \;\;\; i=1,2,...,9,\] where \(v_i^9(m)\) is the \(m^{th}\) index of \(v_i^9\) and \(\gamma\) is a threshold which is selected such that it excludes only the very small-magnitude indices of the estimated PCs. We used a value of \(\gamma = \frac{1}{\sqrt M}\) , similar to (Quadeer et al., 2018).

We then calculated the overlap \(\Omega\) among the estimated PCs of the correlation matrix. Specifically, the overlap between the \(i^{th}\) and \(j^{th}\) estimated PC was computed as

\[\Omega(i,j) = \frac{ \#(V_i \cap V_j)} {min(\#(V_i), \#(V_j)} \;\;\;\;\;\; i,j=1,2, \cdots, 9\] where \(\#(V_i)\) represents the cardinality of the set \(V_i\). For any two estimated PCs with significant overlap (greater than 50%), the union of their large-magnitude indices, \(\{V_i \cup V_j\}\), was used to form a single sector. For each of the remaining estimated PCs (i.e., those without significant overlap with another PC), its large-magnitude indices were used to form a sector.

(ref:CAPgagS2) Percentage of overlap, \(\Omega\), among the 9 PCs estimated using the Corr-ITSPCA algorithm.

(ref:sCAPgagS2) Percentage of overlap among the PCs estimated using the Corr-ITSPCA algorithm.

(ref:CAPgagS2)

Figure 2.8: (ref:CAPgagS2)

By following the above outlined procedure, we found that PC7 significantly overlapped with PC3 (Fig. 2.8). Thus, we formed sector 3 (see Fig. 2.19) from the indices in \(\{V_3 \cup V_7\}\). Analysis of this sector showed that it reflected a biochemically-linked network of interacting sites of Gag (Fig. 2.9) which also appeared immunologically important (Fig. 2.20). Similarly, PC9 and PC6 were also found to overlap, although the amount of overlap was noticeably less than that between PC7 and PC3 (Fig. 2.8). Nonetheless, we formed sector 6 from the indices in \(\{V_6 \cup V_9\}\). Further analysis of this sector, which comprised of weakly correlated sites (Fig. 2.10), suggested that it may be associated with a population-averaged footprint of CTL immune pressure: reflected by the set of sites harboring polymorphism associated with specific HLA alleles in HIV patients (Brumme et al., 2009) (Fig. 2.11). The remaining estimated PCs (PC1, PC2, PC4, PC5, and PC8) were largely distinct (Fig. 2.8) and five sectors were formed from the indices in the corresponding sets \(\{V_i\}\) for \(i \in\{1,2,4,5,8\}\) (see Fig. 2.19).

(ref:CAPgagS12) Association of the inferred sectors with the sites involved in forming the P24 intrahexamer interface critical for stability of HIV capsid. This interface was defined based on a contact distance of less than 9 Å between carbon-alpha atoms on adjacent chains of the interface (PDB ID:3GV2). The red dashed line represents the standard threshold of statistical significance (p-value = 0.05).

(ref:sCAPgagS12) Association of the inferred sectors with the sites involved in forming the P24 intrahexamer interface of HIV capsid.

(ref:CAPgagS12)

Figure 2.9: (ref:CAPgagS12)

(ref:CAPgagS3) Magnitude of statistically significant pairwise correlations among sites within Sector 3 and Sector 6 that were inferred from overlapping PCs of the Gag correlation matrix. These pairwise correlations were identified as being statistically significant based on thresholds for positive and negative pairwise correlations (\(\sigma^{+}\) and \(\sigma^{-}\)). These thresholds were defined such that the pairwise correlations with magnitudes larger than \(\sigma^{+}\) and smaller than \(\sigma^{-}\) appeared with probability less than \(10^{-2}\) in an ensemble of randomized MSAs (Dahirel et al., 2011; Quadeer et al., 2014).

(ref:sCAPgagS3) Magnitude of statistically significant pairwise correlations among sites within Sector 3 and Sector 6.

(ref:CAPgagS3)

Figure 2.10: (ref:CAPgagS3)

(ref:CAPgagS4) Association of the inferred sectors with immune footprints of HIV-infected patients. The red dashed line represents the standard threshold of statistical significance (p-value = 0.05).

(ref:sCAPgagS4) Association of the inferred sectors with immune footprints of HIV-infected patients.

(ref:CAPgagS4)

Figure 2.11: (ref:CAPgagS4)

Lastly, although we had excluded the fully conserved sites in the preprocessing step, these appear useful from a vaccine design perspective; the fact that no mutations are observed at these sites in the MSA is suggestive of their deleterious nature for viral fitness. Given the propensity of the neighboring sites in the primary structure to interact with each other, we incorporated any fully conserved site into the above-formed sectors if it was in the immediate neighborhood of any sector site (see Table 7.2 for the complete list of sites within the inferred sectors). The incorporation of fully conserved neighboring sites within the sectors did not affect the qualitative results (Fig. 2.12). We note that fully conserved neighboring sites were also incorporated in sectors inferred in previous related reports (Dahirel et al., 2011; Quadeer et al., 2018). As discussed therein, given the lack of information regarding their potential interactions with other residues (since no mutations have been observed) and considering the tendency of neighboring residues in the primary structure to interact with each other, incorporation of any fully conserved residue in the immediate neighborhood of a sector residue into that sector seems reasonable. The numbers of fully conserved sites included in each inferred sector are provided in Table 7.3.

(ref:CAPgagS13) Robustness of results to incorporation of neighboring conserved sites into the inferred sectors. (A) Association of the sectors with the set of sites that lie within the known protective epitopes derived from Gag (Pereyra et al., 2010, 2014). (B) Association of the sectors with the sites involved in forming the P24 intrahexamer interface critical for stability of HIV capsid. (C) Association of the sectors with HLA associated polymorphic sites of Gag (Brumme et al., 2009). The p-values measuring the statistical significance of associations were computed using Fisher’s exact test. The red dashed line represents the common threshold of statistical significance (p-value = 0.05).

(ref:sCAPgagS13) Robustness of results to incorporation of neighboring conserved sites into the inferred sectors.

(ref:CAPgagS13)

Figure 2.12: (ref:CAPgagS13)

2.4.3 Effectiveness of the adopted criterion in the inference of sectors

We conducted an additional ground-truth study to demonstrate the effectiveness of our criterion \(\mu(k)\) in identifying sub-dominant PCs that are informative and lead to accurate inference of sectors in the presence of sampling noise.

The simulation in this study involved generating correlated binary samples from a two-network model, with network 1 comprising 14 sites and network 2 comprising 43 sites similar to the model illustrated in Fig. 2.18. Motivated by the network structure observed in HIV Gag, we constructed networks that give rise to multi-rank correlation structures, i.e., those that require multiple PCs to explain the complete correlation structure. Specifically, the pairwise correlations within the two networks were adjusted to yield a specific rank in the respective blocks of the correlation matrix. These ranks were set as follows: network 1, which involved only positively correlated site-pairs, was modeled as a rank-one block (corresponding to a single PC) while network 2, which involved both positively and negatively correlated site-pairs, was modeled as a rank-two block (corresponding to 2 PCs). We studied this model under sampling conditions similar to those observed in the Gag data (i.e., with similar ratio of the number of sequences N to the protein length M). Specifically, we used the Emrich-Piedmonte method (Emrich and Piedmonte, 1991) to generate synthetic N=460 binary sequences of length M=120 with pairwise correlations as specified by the model. We then computed the associated sample correlation matrix and applied our proposed sector inference approach as follows.

First, we determined the number of informative PCs of the correlation matrix based on our metric \(\mu(k)\) by following the steps outlined in section 2.4.1. Here, since the correlation matrix involved 2 dominant PCs by construction (corresponding to the two networks), we initialized with \(k=2\) and computed the metric \(\mu(k)\) as

\[ \mu(k) = \frac{1}{2} \sum_{i=1}^2 |\langle \boldsymbol{v}_i^2, \boldsymbol{v}_i^k \rangle | \;\;\; k = 3,4 \]

Due to its multi-rank construction, we focused only on network 2, since network 1, being of rank one, was represented by a single dominant PC (PC1). Therefore, for each value of k, we examined the overlap between the dominant PC2 (representing network 2) and the incorporated sub-dominant PCs to infer sectors according to the procedure described in section 2.4.2. For assessing the performance of our approach in accurately identifying a multi-rank network of interacting sites, we employed the standard accuracy metric, \[ \textrm{Accuracy} = (\textrm{TP} + \textrm{TN})/M \] where \(\textrm{TP}\) (true positive) represents the number of network 2 sites that were included in the corresponding inferred sector and \(\textrm{TN}\) (true negative) represents the number of sites not belonging to network 2 that were not included in the corresponding inferred sector. We computed sample correlation matrices for 500 different realizations of N binary sequences and, using each matrix, we computed: (i) the metric \(\mu(k)\), and (ii) the accuracy in detecting network 2 for \(k = 2,3,4\).

(ref:CAPgagS14) Effect of including sub-dominant PCs on the similarity metric and on the accuracy of sector inference for (multi-rank) network 2, including both positively and negatively correlated pairs of sites. Boxplots for the metrics, \(\mu(k)\) and accuracy, computed from 500 sample correlation matrices under the specified ground-truth model: (A) Averaged similarity between the 2 dominant PCs estimated initially (k=2) and after incorporating one (k=3) or two (k=4) sub-dominant PCs. (B) Accuracy in the inference of network 2 from k estimated PCs. In each box plot, the bold horizontal line indicates the median, the edges of the box represent the first and third quartiles, and whiskers extend to span a 1.5 inter-quartile range from the edges.

(ref:sCAPgagS14) Effect of including sub-dominant PCs on the similarity metric and on the accuracy of sector inference for (multi-rank) network, including both positively and negatively correlated pairs of sites.

(ref:CAPgagS14)

Figure 2.13: (ref:CAPgagS14)

As expected, we observed that the metric \(\mu(k)\) remained close to one (greater than 0.9) for \(k=3\) (Fig. 2.13A), suggesting that the sub-dominant PC3 is informative of the higher rank correlations of network 2. For k=4 the median value of \(\mu(k)\) dropped below 0.9, indicating that including the next sub-dominant PC (PC4) does not add much information, but significantly distorts the estimation of the dominant PCs. Importantly, including the sub-dominant PC3 in the sector inference (\(k=3\)) led to a more accurate identification of network 2 as compared to inferring the sector using only the dominant PC (\(k=2\)) (Fig. 2.13B). Thus, the additional higher-rank information of network 2, captured by PC3, helps in accurately inferring the corresponding interacting sites. In contrast, including the sub-dominant PC4 (\(k=4\)) adversely affected the inference of network 2 (Fig. 2.13B), as also suggested by the drop in the \(\mu(k)\) value (Fig. 2.13A). Similar results were observed when network 2 was modeled with only positively correlated site-pairs (Fig. 2.14); however, the improvement in inferring this network by incorporating the informative sub-dominant PC3 was less pronounced. Taken together, these tests demonstrate that our criterion \(\mu(k)>0.9\) is effective in identifying informative sub-dominant PCs for networks that give rise to multi-rank correlation structures and in enabling more accurate inference of the interacting sites involved in such networks.

(ref:CAPgagS15) Effect of including sub-dominant PCs on the similarity metric and on the accuracy of sector inference for (multi-rank) network 2, including only positively correlated pairs of sites. Boxplots for the metrics, \(\mu(k)\) and accuracy, computed from 500 sample correlation matrices under the specified ground-truth model: (A) Averaged similarity between the 2 dominant PCs estimated initially (k=2) and after incorporating one (k=3) or two sub-dominant PCs (k=4). (B) Accuracy obtained in inferring network 2 from k estimated PCs. In each box plot, the bold horizontal line indicates the median, the edges of the box represent the first and third quartiles, and whiskers extend to span a 1.5 inter-quartile range from the edges.

(ref:sCAPgagS15) Effect of including sub-dominant PCs on the similarity metric and on the accuracy of sector inference for (multi-rank) network, including only positively correlated pairs of sites.

(ref:CAPgagS15)

Figure 2.14: (ref:CAPgagS15)

The conclusion drawn from these ground-truth tests is consistent with our observation for Gag sector 3 (formed using the dominant PC3 and the sub-dominant PC7), for which the statistical significance of both its biochemical and immunological associations increased by several orders of magnitude as compared to the case when it was formed using only the dominant PC3 (Fig. 2.15A-B). This implies that the sub-dominant PC7, identified using our criterion, was informative and helped to more accurately detect the underlying network of interacting sites. Similar improvement in immunological significance was observed for sector 6 (formed using the dominant PC6 and the sub-dominant PC9) as compared to the case when it was formed using only the dominant PC6 (Fig. 2.15C).

(ref:CAPgagS5) Comparison of the statistical significance of respective biochemical and immunological associations of Sector 3 and Sector 6, inferred using a single PC versus using multiple PCs. (A) Association of the inferred sectors with the set of sites that lie within the known protective epitopes derived from Gag (Pereyra et al., 2010, 2014). (B) Association of the inferred sectors with the sites involved in forming the P24 intrahexamer interface critical for stability of HIV capsid (C) Association of the inferred sectors with immune footprints of HIV-infected patients (Brumme et al., 2009). The p-values measuring the statistical significance of associations were computed using Fisher’s exact test. The red dashed line represents the standard threshold of statistical significance (p-value = 0.05).

(ref:sCAPgagS5) Comparison of the statistical significance of respective biochemical and immunological associations of Sector 3 and Sector 6, inferred using a single PC versus using multiple PCs.

(ref:CAPgagS5)

Figure 2.15: (ref:CAPgagS5)

2.4.4 Mutual independence of the estimated PCs and of the inferred sectors

To examine the mutual independence of the PCs that form the same network (PC 3 and PC7) and to further contrast it against the mutual independence of other PC pairs, we inferred a separate sector from each of the 9 estimated PCs (using the sets of indices corresponding to the large-magnitude elements; for details see section 2.4.2) and computed values of their pairwise normalized entropy deviation (NED), a measure introduced in (Quadeer et al., 2018) to quantify the independence between pairs of sectors. NED is a non-negative number which is zero only when the two sectors are independent and becomes larger as the dependency between sectors increases. For all pairs of resulting sectors, except for the pair of PC3 and PC7, inter-sector NED values were close to those obtained from a randomized MSA (null case) while being much smaller than the maximum intra-sector NED value of any sector within the considered pair (Fig. 2.16A). For PC3 and PC7, the inter-sector NED value was large and close to the corresponding intra-sector NED value, implying that sectors inferred separately from PC3 and PC7 are mutually dependent; thereby indicating the that the sites identified by PC3 and PC7 should form a single network of interactions (sector), rather than 2 separate ones. Thus, it would be reasonable to infer a sector of co-evolving sites jointly from these two PCs. This agrees with our conclusion, initially motivated by the observation of a highly substantial overlap (78%) between PC3 and PC7 (Fig. 2.8).

(ref:CAPgagS16) Measure of statistical independence of sectors using the NED metric. (A) Sectors are formed separately from each PC (see section 2.4.2). (B) Sectors are formed jointly from PC3 and PC7 (sector 3), PC6 and PC9 (sector 6), and separately otherwise.

(ref:CAPgagS16)

Figure 2.16: (ref:CAPgagS16)

In addition, among all remaining pairs of sectors, the inter-sector NED value corresponding to the pair PC6-PC9 was also noticeably closer to their maximum intra-sector NED value (Fig. 2.16A). This suggested that sectors inferred separately from PC6 and PC9 may not be as mutually independent as sectors inferred from other PCs. This is also in line with our results where a sector was formed jointly from PC6 and PC9, based on the observed overlap between these PCs (35%), which was substantially noticeable as compared with others (Fig. 2.8).

For completeness, we also examined the mutual independence of the 7 sectors inferred in this study, where we had jointly inferred sector 3 from PC3 and PC7 and sector 6 from PC6 and PC9, by computing the NED values for all the resulting pairs of sectors (Fig. 2.16B). The inter-sector NED for all these pairs of sectors is small and generally close to that obtained from a randomized MSA, while being much smaller than the maximum intra-sector NED of any sector within a pair. This implies that all the inferred sectors are mutually nearly independent.

2.4.5 Cleaning of the Gag sample correlation matrix for visualization

In Fig. 2.19, the Gag sample correlation matrix is depicted as a heatmap. Since the sample correlation matrix is corrupted by statistical noise due to the limited availability of Gag sequence data. Thus, we attempted to “clean” this noisy matrix to more clearly visualize the strong correlations within the inferred sectors. This cleaning of the correlation matrix was performed by retaining the \(k=9\) strongest eigenvalues and setting the remaining eigenvalues (which apparently do not reflect “true” correlations) to a constant (Bouchaud and Potters, 2009). This step is motivated theoretically from RMT-based results for models where the covariance spectrum is reminiscent of spiked models (Donoho et al., 2018).

Specifically, the cleaned Gag correlation matrix was obtained as \[ \mathbf{C}_{clean} = \sum_{i=1}^9 \lambda_i \left[\mathbf{u}_i^9\right] \left[\mathbf{u}_i^9\right]^T + \zeta \sum_{i=10}^M \left[\mathbf{u}_i^9\right] \left[\mathbf{u}_i^9\right]^T \], where \(\zeta\) is set to a constant value such that the trace of \(\mathbf{C}_{clean}\) is equal to the trace of C. Note that \(\mathbf{C}_{clean}\) is not a standard correlation matrix as its diagonal elements (\(\mathbf{C}_{clean}(i,i)\)) are not equal to 1. Using \(\mathbf{C}_{clean}\) to visualize correlations may potentially lead to confusion as one would expect a constant ‘1’ across the diagonal of a correlation matrix. Thus, to avoid this, we preferred to standardize \(\mathbf{C}_{clean}\) before visualization (Fig. 2.19) and we computed the standardized cleaned correlation matrix as \[ \mathbf{\tilde{C}}_{clean} = \mathbf{D}^{-\frac{1}{2}} \mathbf{C}_{clean} \mathbf{D}^{-\frac{1}{2}} \] where \(\mathbf{D}\) is a diagonal matrix with \(\mathbf{D}(i,i) = \mathbf{C}_{clean}(i,i)\).

2.4.6 Summary

In a previous study (Quadeer et al., 2018), sectors were formed from the 6 dominant PCs by selecting the sites that corresponded to large-magnitude indices in these PCs; these indices were largely distinct across the PCs and, thus, each PC yielded a distinct sector. However, the sets of large-magnitude entries may show significant overlap, particularly when sub-dominant PCs are considered.

In this study, guided by insights from our model-based ground-truth tests, which suggested that multiple overlapping PCs (i.e., PCs having common large-magnitude indices) may in fact represent the same network of interacting sites (see Results), we proposed a modified sector inference approach. Specifically, in addition to forming sectors from distinct PCs based on their large-magnitude indices, we proposed, where appropriate, to form sectors based on the large-magnitude indices of multiple PCs having significant overlap (see section 2.4.2 for details). The improved accuracy of the proposed method in the identification of complex interaction networks was confirmed using ground-truth models (section 2.4.3).

2.5 Design of immunogen

In section 2.6.4, based on our findings of a refined HIV-control-associated sector, we propose candidate immunogens for an effective T-cell-based HIV vaccine for a large fraction of the European Caucasian population in the United States. We describe in this section the detailed procedure of selecting the candidate immunogens. Specifically, for this design procedure, we compiled information from two additional sources.

First, information about the HLA haplotypes present in the selected population was obtained from the high-resolution haplotype frequencies reported in the United States donor registry (Gragert et al. (2013)). We focused only on the three HLA loci: HLA-A, HLA-B and HLA-C, as alleles at these loci are involved in the processing and presentation of epitopes targeted by CTLs. For ease of computation, we limited our analysis to the 25 most-frequent HLA haplotypes that represent approximately 48% of the selected population. Second, in order to construct immunogens composed of Gag-specific CTL epitopes that are presented by the selected population, we downloaded a complete list of “best-defined” HIV CTL epitopes from the LANL database23. From this list, we selected the HIV clade B Gag-specific CTL epitopes that are restricted by the HLA alleles found in the selected 25 most-frequent haplotypes of the European Caucasian population (listed in Table 2.1). This procedure yielded 45 distinct epitopes.

To determine potentially effective combinations of these 45 epitopes to include in a vaccine immunogen, we were guided by insights from our analysis of the Gag correlation matrix. Specifically, our inferred sector 3 was found to be strongly associated with HIV control as it was enriched with sites that lie within the known protective epitopes of HIV Gag (Pereyra et al., 2010, 2014) (see Fig. 2.20). Moreover, in line with a previous related study (Dahirel et al., 2011), this sector comprised a large proportion of sites with negatively correlated mutations (see Fig. 2.19), suggesting that simultaneously targeting multiple sites within this sector may limit the number of mutational escape pathways available to the virus. Therefore, our proposed immunogen design objective involved maximizing the proportion of negatively correlated sites belonging specifically to sector 3 across different epitopes forming the immunogen. Moreover, the design objective also involved minimizing the proportion of positively correlated sites across the immunogen epitopes, since the virus appears to mutate such sites with little or no loss in fitness (Dahirel et al., 2011) thereby providing seemingly easy immune escape pathways. Note that in an attempt to restrict escape pathways as much as possible (i.e., as any pair of positively correlated protein sites can potentially lead to immune escape), we did not limit ourselves to only sector 3 sites when minimizing the proportion of positively correlated sites within candidate epitopes. Furthermore, in line with conventional vaccine design approaches, e.g. (Wedemeyer et al., 2009; Heiny et al., 2007; Rolland et al., 2007; Kothe et al., 2006), our design objective also sought to maximize the proportion of fully conserved sites within the immunogen epitopes, since no observed mutation at these sites is suggestive of their functional/structural importance (Rolland et al., 2007).

Our immunogen design objective can be represented mathematically as follows:
\[ \textrm{L} = \textrm{PCP} - \textrm{PPCP} + \textrm{PNCP} \] where for a particular immunogen, \(\textrm{PCP}\) represents the percentage of pairs of sites for which at least one site is 100% conserved, \(\textrm{PPCP}\) represents the percentage of inter-epitope positive pairwise correlations, and \(\textrm{PNCP}\) represents the percentage of inter-epitope negative pairwise correlations among sites belonging to sector 3.

Candidate immunogens (comprising 5 epitopes each) were determined and ranked as follows:

  1. All possible 5 epitope-based candidate immunogens (total of \(C_5^{45} = 1,221,759\) combinations) were screened and ranked according to the \(L\) score.

  2. The top 1% candidate immunogens were chosen and ranked according to their double coverage (DCov) – the fraction of the target population that comprise HLA alleles which elicit immune response against at least two epitopes present within the candidate immunogen (see Table 2.3 for the top 20 candidate immunogens). In the case when multiple candidate immunogens had the same DCov, these were ranked according to the corresponding \(L\) score.

Note that, in general, the number of epitopes to be considered for a T-cell-based vaccine is a design parameter which is governed by several factors including challenges in manufacturing specific peptides, properties of vaccine adjuvant etc. (Testa et al., 2012). Our proposed framework can be applied for designing an immunogen with any number of epitopes. In section 2.6.4, we specifically consider groups of 5 epitopes focusing on the “top 5” candidate immunogens to facilitate a direct comparison with the corresponding candidate immunogens previously proposed in (Dahirel et al., 2011).

2.6 Results

2.6.1 Sub-dominant PCs may provide information about networks of biologically interacting sites of HIV Gag

In (Quadeer et al., 2018), six sectors reflecting networks of interacting sites were identified for HIV Gag, with each sector uniquely associated to a distinct structural or functional domain. For example, one of the inferred sectors was found to strongly associate with the membrane-binding domain of p17, while sites forming another sector strongly associated with the intrahexamer and intrapentamer interface of p24. Due to the conservative threshold adopted for identifying the number of informative PCs of the Gag sample correlation matrix (Fig. 2.17A), it was not clear if useful information was being missed by excluding the sub-dominant PCs that corresponded to eigenvalues close to the specified threshold. Therefore, we first examined the PC corresponding to the seventh largest eigenvalue (\(\lambda_7\))—situated very close to the threshold in (Quadeer et al., 2018) (Fig. 2.17A)—which revealed a distinct pattern to that reflected by the first six PCs. Specifically, while the sets of large-magnitude indices of the first six PCs were mutually quite distinct, the set of large-magnitude indices of PC7 significantly overlapped (78%) with that of another PC (PC3) (Fig. 2.17B).

(ref:CAPgag1) Spectral analysis of HIV Gag sequence data. (A) Eigenvalue distribution of the Gag sample correlation matrix. The dashed line indicates the threshold \((\lambda^{rnd}_{max})\) adopted by (Quadeer et al., 2018). (B) Illustration of PC3 (top) and PC7 (bottom) corresponding to \(\lambda_3\) and \(\lambda_7\), respectively. The PC indices are numbered according to the HXB2 reference sequence (https://www.hiv.lanl.gov/content/sequence/HIV/REVIEWS/HXB2.html). (C) Crystal structures showing the P24 hexamer (PDB ID: 3GV2) (left panel) and the P24 pentamer (PDB ID:3P05) (right panel) involved in the formation of HIV capsid. The sites on the structure are colored according to the color scale shown in (B).

(ref:CAPgag1)

Figure 2.17: (ref:CAPgag1)

The sector previously inferred from PC3 in (Quadeer et al., 2018), based on its large-magnitude indices, had the important characteristic of comprising a large proportion of negatively correlated pairs of sites. This sector was also shown to be strongly associated with the P24 intrahexamer and intrapentamer interface sites, known to be important for the structural stability of the viral capsid (Pornillos et al., 2011, 2009). Interestingly, the additional set of sites indicated by PC7 (indices having large-magnitude in PC7 and not in PC3) mostly belonged to the core of the P24 hexamer/pentamer structure (Fig. 2.17C), which is also known to be critical for the stability of the viral capsid (Rihn et al., 2013; Manocheewa et al., 2013; Campbell and Hope, 2015; Schommers et al., 2016). The fact that sites within the P24 core and those within the intrahexamer and intrapentamer interfaces are associated to the same function suggests that PC3 and PC7 may together correspond to a single underlying network of functionally-linked interacting sites.

2.6.2 Model-based studies reveal that sub-dominant PCs can capture distinct information for networks involving negatively correlated pairs of sites

To help explain the empirical observations above, we constructed systematic tests to investigate whether and under what circumstances a network of interacting sites may not be inferable from a single PC of the correlation matrix, but rather, would require multiple PCs. Specifically, we first constructed a simple single-network model for a protein comprising three interacting sites. Multiple test cases were considered, with each case involving a different proportion of negatively and positively correlated pairs within the network (Fig. 2.18A). For simplicity, we assumed similar magnitude for all pairwise correlations (for details on the model construction, see section 2.3). We noticed that, while in some cases the most-dominant PC (PC1) assigned a large magnitude to the indices corresponding to the three interacting sites (Fig. 2.18A top panel), this was not always true when some sites in the network were negatively correlated (Fig. 2.18A bottom panel). This is a consequence of an intrinsic limitation of the PC representation; in particular, the number of pairwise negative correlations that can be reflected in a single PC is limited. For example, a PC can represent a negatively correlated pair of sites by having opposite signs on its corresponding indices. However, if all pairs of sites in a network are mutually negatively correlated, it is impossible for a single PC to represent these pairwise interactions (see Fig. 2.18A bottom right panel for example), since that would require all indices of a PC to have distinct signs. As a result, for such cases, some of the interacting sites may be represented by indices with small magnitude in a dominant PC; such indices can then be easily missed when forming sectors (Quadeer et al., 2018). Interestingly, our tests suggested that a sub-dominant PC compensates for the under-representation of sites in the dominant PC, such that a site represented by an index with small magnitude in PC1 was strongly represented—by an index with large magnitude—in PC2 (Fig. 2.18A bottom panel). Thus, the two PCs (PC1 and PC2) together were required to identify all interacting sites of the underlying network. Note that this result is robust to specific values of correlations used to construct the networks (Fig. 2.2).

(ref:CAPgag2) Representations of model-based correlation matrices and their PCs capturing networks of interacting sites in protein models. (A) A single network model with 3 interacting sites and different correlation structures: (i) all pairs of sites are positively correlated (top left panel), (ii) two pairs are negatively while one pair is positively correlated (top right panel), (iii) two pairs are positively while one pair is negatively correlated (bottom left panel), and (iv) all pairs of sites are negatively correlated (bottom right panel). In the first two cases (top panel), the 3 interacting sites are fully represented in PC1 (corresponding to the largest eigenvalue); this contrasts with the last two cases (bottom panel), where the network is only partially represented in PC1. (B) A two-network model comprising pairwise correlations close to those observed in the Gag data. The first network (corresponding to the upper left block) comprises 14 sites involving only positively correlated pairs of sites while the second network (corresponding to the lower right block) comprises 43 sites involving a combination of both positively and negatively correlated pairs of sites. PCs corresponding to the 3 largest eigenvalues of the correlation matrix (namely, PC1, PC2 and PC3) are shown after removing the weak PC indices (see section 2.4.2 for details). On the bottom panel, the magnitudes of the PC indices are represented by the color of the corresponding cells. The same color scale was used to represent all the PCs in (A) and (B) which is shown at the bottom of panel (B).

(ref:sCAPgag2) Representations of model-based correlation matrices and their PCs capturing networks of interacting sites in protein models.

(ref:CAPgag2)

Figure 2.18: (ref:CAPgag2)

These insights were further corroborated for more complex interaction networks. We constructed a ground-truth model for a protein comprising two distinct, larger networks of interacting sites (Fig. 2.18B); one comprising sites with only positive pairwise correlations, the other comprising sites with both positive and negative pairwise correlations (see section 2.3 for details). Of the inferred PCs, PC2 uniquely represented the network comprising all positively correlated pairs of sites, however, two overlapping PCs (PC1 and PC3) jointly represented the network comprising both positive and negative pairwise correlations (Fig. 2.18B bottom panel). This is in line with the aforementioned splitting of information of the interacting sites over multiple PCs for a network comprising negative pairwise correlations. The two PCs (PC1 and PC3) overlapped significantly (67%), suggesting that a large overlap among PCs of the correlation matrix may be a good marker for identifying multiple PCs that reflect the same underlying network.

These ground-truth results reveal an important message: by forming sectors based on only the dominant PCs, as in previous methods(Dahirel et al., 2011; Quadeer et al., 2014, 2018), one may potentially miss some sites of networks that involve negatively correlated interactions—precisely those suggested to be most important for vaccine design (Dahirel et al., 2011).

2.6.3 Sector inference incorporating weaker PCs reveals a HIV Gag sector with enhanced immunological significance

Guided by insights from our ground-truth studies, we modified the procedure of (Quadeer et al., 2018) for selecting the number of informative PCs and applied this modified approach to infer new sectors for HIV Gag. Specifically, after progressively relaxing the eigenvalue threshold to incorporate additional PCs—while keeping the additional statistical noise to a level such that the estimates of the 6 dominant PCs (demonstrated previously (Quadeer et al., 2018) to be informative of Gag biochemical domains) are not significantly distorted—the 9 strongest PCs of the Gag correlation matrix were selected for sector inference. We used a modified sector inference procedure that considers the PCs with large overlaps to infer sectors comprising all the sites indicated by the overlapping PCs (see section 2.4.2 for details). Among the 9 PCs, PC7 overlapped substantially (78%) with PC3, and a significant overlap (35%) was observed between PC6 and PC9, while the remaining five PCs (namely PC1, PC2, PC4, PC5 and PC8) were largely distinct (Fig. 2.19 and Fig. 2.8). PCs in each overlapping pair (PC3-PC7 and PC6-PC9) were also found to be statistically mutually dependent as opposed to all other possible pairs of PCs (see section 2.4.4 for details). Five sectors were uniquely inferred from the distinct five PCs, of which four corresponded one-to-one with sectors previously inferred in (Quadeer et al., 2018) while one sector, inferred from PC8 and comprising 12 sites, was largely distinct (Fig. 2.19). In addition, two new sectors (sectors 3 and 6) were inferred by combining sites indicated by the overlapping PCs: PC7 with PC3 (sector 3) and PC9 with PC6 (sector 6), respectively (Fig. 2.19). Unlike the others, these two sectors comprised negatively correlated pairs of sites (Fig. 2.19), which is consistent with our findings from the ground-truth tests; i.e., the information of interacting sites in a network involving negative pairwise correlations can split across multiple PCs.

(ref:CAPgag3) Representation of the cleaned Gag sample correlation matrix and its informative PCs. PCs corresponding to the 9 largest eigenvalues of the correlation matrix (namely, PC1 through PC9) are shown after removing the weak PC indices (see section 2.4.2 for details). Rows and columns of the matrix as well as PC indices are ordered such that the inferred sectors appear clearly as blocks, e.g. PC7 is placed after PC3 since they are jointly used to infer a single sector. On the bottom panel, the magnitudes of the PC indices are represented by the color of the corresponding cells. Note that negative pairwise correlations appear only within two sectors: sector 3 inferred jointly from PC3 and PC7, and sector 6 inferred jointly from PC6 and PC9. (For details on matrix cleaning, see section 2.4.5).

(ref:sCAPgag3) Representation of the cleaned Gag sample correlation matrix and its informative PCs.

(ref:CAPgag3)

Figure 2.19: (ref:CAPgag3)

The presence of negatively correlated sites within sectors 3 and 6 pointed to the potential immunological importance of these sectors. Further analysis revealed that sector 3 was strongly associated with the sites that lie within the known protective epitopes (Pereyra et al., 2010, 2014), while sector 6 was reminiscent of a quasi-sector reported in (Dahirel et al., 2011), comprising weakly correlated sites (see Fig. 2.10) that appear to represent an averaged footprint of the diverse immune pressure exerted by the HIV-infected population (Brumme et al., 2009) (see Fig. 2.11). Notably, the statistical significance of the association of sector 3 with protective epitopes was nearly two orders of magnitude higher (Fig. 2.20) than that reported for the suggested immunologically vulnerable sectors in previous studies (Dahirel et al., 2011; Quadeer et al., 2018). Specifically, by incorporating PC7, six additional sites (149,150,151,152,153,335) were included in sector 3 which lied within at least one of the known protective epitopes. Moreover, five of these six sites are involved in the structural core of the P24 intrahexamer and intrapentamer interfaces (Fig. 2.17B-C) where mutations have been found to result in defective capsid assembly and impairment of infective capacity of the virus (Jacques et al., 2016; Burdick et al., 2017; López et al., 2011; Manocheewa et al., 2013). Thus, incorporating these additional sites into sector 3 increased the statistical significance of both its biochemical and immunological associations by several orders of magnitude (see Fig. 2.15).

The fact that sector 3 was enriched with negatively correlated pairs of sites suggested that it represented a multi-dimensionally constrained network within Gag, wherein, simultaneous mutations on pairs of sites are less tolerated by HIV due to their seemingly deleterious effect on viral fitness (Mann et al., 2014). Hence, eliciting an immune response against multiple sites within this sector may potentially restrict viral escape as it would force the virus to mutate in order to avoid being recognized by the immune system, with the resulting mutant viruses likely having a severely compromised fitness (Mann et al., 2014; Ferguson et al., 2013). Thus, sites within this sector may serve as important targets for an effective T-cell-based HIV vaccine.

(ref:CAPgag4) Association of the inferred sectors with the set of sites that lie within the known protective epitopes. These epitopes are derived from Gag (Pereyra et al., 2010, 2014). The p-values measuring the statistical significance of associations were computed using Fisher’s exact test. The red dashed line represents the common threshold of statistical significance (p-value = 0.05).

(ref:sCAPgag4) Association of the inferred sectors with the set of sites that lie within the known protective epitopes.

(ref:CAPgag4)

Figure 2.20: (ref:CAPgag4)

2.6.4 T-cell-based HIV vaccine candidates

For designing a T-cell-based vaccine that specifically targets multiple sites in the inferred sector 3, we considered as a case study a target population of European Caucasian descent having one of the 25 most-frequent haplotypes across the three common HLA loci (A, B, and C) involved in the processing and presentation of CTL epitopes (Gragert et al., 2013). The selected haplotypes cumulatively cover nearly 48% of the targeted population. To form candidate immunogens, the list of immunogenic epitopes of Gag that are restricted by the HLA alleles present within the targeted population was obtained from the “best-defined” CTL epitopes list for HIV-1 in the LANL Molecular Immunology database24 (see Table 2.1).

(ref:CAPgagST1) List of epitopes from the LANL “best-defined” Gag epitopes that are targeted by the 25 most-frequent haplotypes of the European-Caucasian population (Gragert et al., 2013).

(ref:sCAPgagST1) List of epitopes from the LANL “best-defined” Gag epitopes.

Table 2.1: (ref:CAPgagST1)
Haplotypes
Epitopes targeted by the corresponding HLA
Rank HLA-A HLA-B HLA-C HLA-A HLA-B HLA-C
1 A*01:01 B*08:01 C*07:01 24-32, 74-82, 260-267, 329-337
2 A*03:01 B*07:02 C*07:02 18-26, 20-28, 20-29 148-156, 180-188, 355-363
3 A*02:01 B*44:02 C*05:01 77-85, 77-86, 433-442 294-304, 306-316
4 A*02:01 B*07:02 C*07:02 77-85, 77-86, 433-442 148-156, 180-188, 355-363
5 A*02:01 B*40:01 C*03:04 77-85, 77-86, 433-442 92-101, 176-184, 481-489 296-304
6 A*29:02 B*44:03 C*16:01 78-86 78-86
7 A*01:01 B*57:01 C*06:02 162-172
8 A*03:01 B*35:01 C*04:01 18-26, 20-28, 20-29 36-44, 124-132, 216-224, 253-262, 254-262
9 A*11:01 B*35:01 C*04:01 84-91, 349-359 36-44, 124-132, 216-224, 253-262, 254-262
10 A*02:01 B*15:01 C*03:04 77-85, 77-86, 433-442 269-277 296-304
11 A*02:01 B*15:01 C*03:03 77-85, 77-86, 433-442 269-277 296-304
12 A*24:02 B*07:02 C*07:02 28-36 148-156, 180-188, 355-363
13 A*02:01 B*08:01 C*07:01 77-85, 77-86, 433-442 24-32, 74-82, 260-267, 329-337
14 A*25:01 B*18:01 C*12:03 145-155, 203-212 293-302
15 A*02:01 B*57:01 C*06:02 77-85, 77-86, 433-442 162-172
16 A*01:01 B*07:02 C*07:02 148-156, 180-188, 355-363
17 A*26:01 B*38:01 C*12:03 167-175
18 A*30:01 B*13:02 C*06:02 34-44 135-143, 226-236, 429-437
19 A*23:01 B*44:03 C*04:01 78-86
20 A*01:01 B*37:01 C*06:02
21 A*03:01 B*14:02 C*08:02 18-26, 20-28, 20-29 298-306, 405-413 180-188
22 A*31:01 B*40:01 C*03:04 92-101, 176-184, 481-489 296-304
23 A*02:01 B*18:01 C*07:01 77-85, 77-86, 433-442 293-302
24 A*02:01 B*13:02 C*06:02 77-85, 77-86, 433-442 135-143, 226-236, 429-437
25 A*24:02 B*15:01 C*03:03 28-36 269-277 296-304
Note:
Sites within epitopes are numbered according to the HXB2 reference sequence.

We identified 45 distinct Gag epitopes and, among these, all possible groups of 5 epitopes were evaluated according to our design objective (section 2.5). In particular, to identify candidate immunogens that would likely establish viral control, we used a suitably adapted version of the procedure described in (Dahirel et al., 2011; Quadeer et al., 2014) for screening groups of 5 epitopes that: (i) maximize the proportion of sites that are fully conserved, (ii) maximize the proportion of negatively correlated pairs of sector 3 sites across epitopes, and (iii) minimize the proportion of positively correlated pairs of sites across epitopes (see section 2.5 for further details). The top five candidate immunogens (out of \(\sim 10^6\) combinations) that maximize this design objective (\(L\) score) as well as the double coverage (DCov)—the fraction of the target population that comprise HLA alleles which elicit immune response against at least two epitopes present within the candidate immunogen—are presented in Table 2.2. Each of the top five candidates differ by at least four out of five epitopes compared with the top five candidate immunogens proposed in (Dahirel et al., 2011). Specifically, our top five candidates are composed of 7 distinct epitopes, of which only one is present among the top five candidate immunogens proposed in (Dahirel et al., 2011). Furthermore, the DCov is greater than that reported for the candidates in (Dahirel et al., 2011) (see Table 2.3 for the list of top 20 candidate immunogens).

(ref:CAPgagT1) List of Gag derived CTL epitope-based candidate immunogens (top 5)

Table 2.2: (ref:CAPgagT1)
Epitope1 Epitope2 Epitope3 Epitope4 Epitope5 DCov L score
148-156 269-277 294-304 355-363 433-442 0.404 0.39
148-156 269-277 306-316 355-363 433-442 0.404 0.38
148-156 180-188 269-277 294-304 433-442 0.404 0.37
180-188 269-277 294-304 355-363 433-442 0.404 0.37
180-188 269-277 306-316 355-363 433-442 0.404 0.34
Note:
All epitopes are numbered according to the HXB2 reference sequence.

Overall, our analysis shows that refined identification of networks of interacting sites in HIV Gag, incorporating sub-dominant PCs, reveals new candidate immunogens with potentially enhanced efficiency. Our study can guide further experimental work designed to test the robustness of immune responses elicited by these candidate immunogens.

(ref:CAPgagST2) List of Gag derived CTL epitope-based candidate immunogens (top 20)

(ref:sCAPgagST2) Top 20 Gag derived CTL epitope-based candidate immunogens

Table 2.3: (ref:CAPgagST2)
Epitope1 Epitope2 Epitope3 Epitope4 Epitope5 DCov score
148-156 269-277 294-304 355-363 433-442 0.404 0.39
148-156 269-277 306-316 355-363 433-442 0.404 0.38
148-156 180-188 269-277 294-304 433-442 0.404 0.37
180-188 269-277 294-304 355-363 433-442 0.404 0.37
180-188 269-277 306-316 355-363 433-442 0.404 0.34
148-156 180-188 269-277 306-316 433-442 0.404 0.33
148-156 176-184 269-277 296-304 355-363 0.374 0.35
148-156 269-277 296-304 355-363 481-489 0.374 0.35
148-156 180-188 269-277 296-304 481-489 0.374 0.33
180-188 269-277 296-304 355-363 481-489 0.374 0.32
148-156 269-277 296-304 355-363 433-442 0.358 0.30
148-156 180-188 269-277 296-304 433-442 0.358 0.42
180-188 269-277 296-304 355-363 433-442 0.358 0.39
148-156 269-277 294-304 306-316 355-363 0.346 0.39
180-188 269-277 294-304 306-316 355-363 0.346 0.37
148-156 180-188 269-277 294-304 306-316 0.346 0.32
148-156 176-184 269-277 355-363 433-442 0.343 0.31
148-156 269-277 355-363 433-442 481-489 0.343 0.34
148-156 180-188 269-277 433-442 481-489 0.343 0.34
180-188 269-277 355-363 433-442 481-489 0.343 0.32
Note:
Sites within epitopes are numbered according to the HXB2 reference sequence.

2.6.5 Correspondence with recent results from literature

In a recent study (Gaiha et al., 2019), authors developed a structure-based network analysis using protein structure data and network theory to quantify the topological importance of each amino acid residue to tertiary and quaternary protein structure. This method was developed to systematically include the structural constraints imposed by the residues and assigns higher network centrality scores to more constrained residues. They found that residues with high network scores had reduced tolerance to mutations and identified a set of highly networked residues within HIV Gag protein which distinguishes CTL epitopes associated with protective, and risk-associated HLA alleles. Remarkably, their method indicates the Gag epitope 147ISPRTLNAW155 as comprising constrained residues and this epitope significantly overlaps the epitope 148SPRTLNAWV156 which is among the epitopes proposed to be included in candidate immunogens based on our modified sectoring approach (Table 2.3).

In another recent work (Murakowski et al., 2021), the authors designed a single long peptide immunogen comprising parts of HIV proteins and was expressed in adenovirus vectors and used to immunize monkeys. The immunized monkeys exhibited strong T cell responses directed toward regions contained in the immunogen that encompass multiple epitopes proposed in our candidate immunogens as potentially effective vaccine targets e.g., 176SEGATPQDL184 and 269LNKIVRMY277.