ResProx - Resolution by Proxy - protein model validation

Figure 1. An outline of the ResProx algorithm. ResProx starts by assessing multiple parameters of protein quality using sub-programs such as VADAR (Willard et al. 2003), MolProbity (Chen et al. 2010), RosettaHoles (Sheffler and Baker 2009) and PROSESS (Berjanskii et al. 2010). The resulting quality scores are used to predict an equivalent resolution with a support vector regression (SVR) model, which was trained on a set of high-quality X-ray structures. Additionally, the mean values and standard deviations of the quality parameters for a database of high-resolution structures are used to generate Z-scores, which are subsequently converted to an equivalent resolution value via a Z-Mean protocol. Finally, a decision-making module selects one of the two equivalent resolution values as the final result, based on the difference between the predicted values and the raw scores of protein quality.

Figure 2. Correlation between ResProx equivalent resolution and X-ray experimental resolution for the ResProx training and testing sets. (A) Final ResProx values for the ResProx training set. (B) Final ResProx values for the ResProx testing set. (C) Z-Mean equivalent resolution for the ResProx training set. (D) Z-Mean equivalent resolution for the ResProx testing set. (E) SVR predictions for the ResProx training set. (F) SVR predictions for the ResProx testing set. R and Err indicate the Pearson correlation coefficient and the mean absolute error of resolution prediction, respectively.
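The R and Err statistics named in this caption can be computed as in the minimal sketch below. The function names and the sample values are hypothetical and are not data from the paper; the sketch only illustrates the standard definitions of the Pearson correlation coefficient and the mean absolute error.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_error(predicted, observed):
    """Average of |predicted - observed| over all structures."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


# Hypothetical resolutions in Å, for illustration only.
experimental = [1.0, 1.5, 2.0, 2.5, 3.0]
predicted = [1.2, 1.4, 2.1, 2.4, 3.3]

r = pearson_r(experimental, predicted)
err = mean_absolute_error(predicted, experimental)
```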

Figure 3. Correlation between equivalent resolution and X-ray experimental resolution as calculated by Procheck-NMR, MolProbity, and RosettaHoles2. (A) Procheck-NMR equivalent resolution for the ResProx training set. (B) Procheck-NMR equivalent resolution for the ResProx testing set. (C) RosettaHoles2 S_RESL equivalent resolution for the ResProx training set. (D) RosettaHoles2 S_RESL equivalent resolution for the ResProx testing set. (E) MolProbity score for the ResProx training set. (F) MolProbity score for the ResProx testing set. R and Err indicate the Pearson correlation coefficient and the mean absolute error of resolution prediction, respectively.

Figure 4. Correlation between completeness of experimental information (distance restraints) and equivalent resolution of ubiquitin. (A) ResProx score. (B) Procheck-NMR equivalent resolution. (C) RosettaHoles2 S_RESL. (D) MolProbity score. Different levels of completeness of the distance restraints were achieved by randomly removing 5 distance restraints from the total restraint set. The distance restraints consisted of the NOE-based distance restraints and hydrogen bond distance restraints of the ubiquitin NMR ensemble 1D3Z.

Figure 5. Correlation between equivalent resolution and the ensemble precision of ubiquitin. (A) ResProx score. (B) Procheck-NMR equivalent resolution. (C) RosettaHoles2 S_RESL. (D) MolProbity score. Ensemble precision was assessed by calculating the backbone RMSD of ubiquitin NMR ensembles with MolMol (Koradi et al. 1996). The Spearman rank-order correlation coefficients are 0.95, 0.69, 0.84, and 0.90 for ResProx, Procheck-NMR, MolProbity, and RosettaHoles2, respectively.
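A Spearman rank-order correlation coefficient of the kind quoted here can be sketched as the Pearson correlation of the ranks, with tied values sharing their average rank. The function names below are hypothetical, and this is not necessarily how the coefficients in the caption were computed.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman coefficient: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)
```

Because only ranks are used, any monotonic relationship between equivalent resolution and ensemble RMSD gives a coefficient of 1, whether or not it is linear.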

Figure 6. Correlation of equivalent resolution with backbone proton chemical shifts. (A) ResProx score. (B) Procheck-NMR equivalent resolution. (C) RosettaHoles2 S_RESL. (D) MolProbity score. The agreement between ubiquitin models and backbone proton chemical shifts was assessed by predicting the chemical shifts from different NMR models with ShiftX2 (Han et al. 2011) and calculating the mean absolute difference between predicted and experimentally measured chemical shifts. The Spearman rank-order correlation coefficients are 0.95, 0.73, 0.85, and 0.95 for ResProx, Procheck-NMR, MolProbity, and RosettaHoles2, respectively.

Figure 7. Correlation between equivalent resolution of ubiquitin and the number of distance violations. (A) ResProx score. (B) Procheck-NMR equivalent resolution. (C) RosettaHoles2 S_RESL. (D) MolProbity score.

Figure 8. Correlation between the equivalent resolution of ubiquitin and model accuracy. (A) ResProx resolution. (B) Procheck-NMR equivalent resolution. (C) RosettaHoles2 S_RESL. (D) MolProbity score. Model accuracy was measured by calculating the backbone RMSD of ubiquitin models with respect to the ubiquitin X-ray structure 1UBQ. NMR models of ubiquitin with different distance restraint violations were analyzed (see text for details).
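The backbone RMSD used as the accuracy measure can be sketched as follows. This minimal example assumes that the model and the reference have already been optimally superimposed and that their backbone atoms are listed in the same order; the superposition step itself is not shown. The function name and coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical, pre-superimposed backbone atom positions in Å.
model = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 0.0, 1.0)]

backbone_rmsd = rmsd(model, reference)
```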

Table 1. Correlation coefficients and mean absolute errors of ResProx, Procheck-NMR, MolProbity, and RosettaHoles2 for obsolete and current PDB entries of NMR structures.

Protein	Version	PDB	ResProx (Å)	Procheck (Å)	MolProbity (Å)	RosettaHoles2 (Å)
AbrB N-terminal domain	Obsolete	1EKT	5.14	3.20	4.73	3.58
AbrB N-terminal domain	Current	1Z0R	2.68	1.95	3.76	2.62
Ets-1	Obsolete	1ETC	6.29	3.00	5.03	3.63
Ets-1	Current	1R36	2.77	1.78	3.53	2.74
CcmE	Obsolete	1LIZ	4.91	2.42	3.24	2.20
CcmE	Current	1SR3	2.22	2.40	2.94	2.14
Domain IV from the YbbR	Obsolete	2KPS	3.22	2.05	2.90	2.66
Domain IV from the YbbR	Current	2L3U	2.86	1.75	2.78	2.62
SH3 of phospholipase C-gamma	Obsolete	1HSP	5.80	2.90	4.40	2.94
SH3 of phospholipase C-gamma	Current	2HSP	4.78	3.13	4.16	2.85
MRF-2 DNA-Binding Domain	Obsolete	1BMY	5.05	3.13	4.62	3.47
MRF-2 DNA-Binding Domain	Current	1IG6	1.80	1.60	1.88	2.34
E. coli thioredoxin	Obsolete	1TRX	2.03	1.50	2.07	2.25
E. coli thioredoxin	Current	1XOB	1.41	1.35	1.27	1.8

Table 2. Improvements in the quality of water-refined models: comparison between ResProx values and DRESS Z-scores.

Protein	PDB	Refined	DRESS Z-score	ResProx (Å)
Intestinal fatty acid-binding protein	1A57	-	-4.46	5.50
Intestinal fatty acid-binding protein	1A57	+	-2.72	2.56
Designed protein G core variant	1FD6	-	-1.4	2.49
Designed protein G core variant	1FD6	+	0.33	1.42
Rho GDP-dissociation inhibitor	1AJW	-	-2.79	2.90
Rho GDP-dissociation inhibitor	1AJW	+	-1.24	2.03
Nudix enzyme hydrolase	1F3Y	-	-2.27	3.19
Nudix enzyme hydrolase	1F3Y	+	-1.31	2.29
MTH1175	1EO1	-	-3.31	3.69
MTH1175	1EO1	+	-1.63	2.42

Table 3. Structure quality parameters used in the calculation of ResProx's equivalent resolution.

Score Name	Correlation Coefficient¹	Logarithm form²	Lower Bound³	Upper Bound⁴	Z-score for Z-Mean⁵	Source	Description⁶
Standard deviation of χ1 pooled	0.78	Yes	0	25	Both	Vadar	Standard deviation of the χ1 angles among all 3 (gauche-, gauche+, and trans) configurations.
Clash score	0.77	Yes	0	250	Positive	MolProbity	Number of non-hydrogen bond atomic overlaps > 0.4 Å per thousand atoms.
Percentage of < 1% side-chain rotamer outliers	0.77	Yes	0	1200	Not used	MolProbity	Percentage of residues with side-chain rotamers that lie outside of 99% of side-chain rotamer distribution in Richardson penultimate rotamer library (Lovell et al. 2000).
Ramachandran outside most favored	0.77	Yes*	0.5	None	Positive	GeNMR	Percentage of residues outside of the most favored regions of the Ramachandran plot.
Ramachandran outliers	0.75	Yes	0	500	Positive	MolProbity	Fraction of residues in the Ramachandran plot that are stereo-chemically not allowed or not observed in high quality structures.
RosettaHoles score	0.71	No	None	None	Negative	Rosetta	A measure of underpacking in the protein core.
Mean trans χ1 angle	0.68	Yes	145	180	Not used	Vadar	Average of χ1 angles in trans configuration.
Deviation of Θ angles	0.68	Yes	10	35	Positive	PROSESS	Standard deviation of the angle between the C-O bond vector of the H-bond acceptor and the O-H(N) bond vector.
Rama score	0.65	Yes	0	10	Not used	GeNMR	Fraction of residues in the most favored regions of the Ramachandran plot multiplied by a weighting coefficient.
Radius gyration score	0.53	Yes	0	900	Positive	GeNMR	Scaled difference between the expected radius of gyration and the observed one. The expected radius of gyration is determined using: Rg = 0.395N*0.6 + 7.257.
χ1 score	0.43	No	-1.3	-0.6	Positive	PROSESS	Scaled difference between the standard deviation of the observed χ1 angles and the expected one obtained from high quality protein structures.
Percentage of 95% buried residues	0.42	No	0	2	Both	Vadar	Percentage of residues with fractional accessible areas < 0.05. This score reports the extent of residue burial. Most globular proteins must have a fraction >0.05 to be stable. Divided by the expected value.
Bump score	0.35	Yes	0	1	Positive	GeNMR	The bump score is calculated from the total number of non-bonded atom contacts below 1.3 Å, divided by the total number of non-bonded contacts in the protein.
Mean gau- χ1 angle	0.34	Yes	40	90	Not used	Vadar	This is the average χ1 angle for residues (excluding Proline) having χ1 angles that are closest to -60° (the gauche- conformation). Higher quality structures have χ1 angles very close to the canonical -60°, +60° and 180° values.
Mean H-bond energy	0.34	No	-2.5	-0.5	Both	Vadar	The average hydrogen bond energy is calculated using the H-bond energy function used in DSSP program.
Mean Κ (kappa) angle	0.33	Yes	15	50	Not used	PROSESS	Kappa angle measures the angle between the plane of the C=O peptide bond of the H-bond acceptor and the vector formed by the H-O bond of the H-bond donor. The closer the Kappa angle is to 25°, the better.
Percentage of packing defects	0.33	Yes	0	800	Positive	Vadar	This is the percentage of residues with fractional residue volumes greater than 1.20 or less than 0.80. Packing defects indicate the presence of cavities or compressions that are not natural.
Percentage of bad bond angles	0.29	Yes	0	45	Positive⁷	MolProbity	This parameter is calculated as the number of bond angles (divided by the total number of bond angles in the polypeptide) that exceed, by more than 5 standard deviations, the typical bond angles seen in high resolution, high quality structures.
Mean gau+ χ1 angle	0.28	No	-80	-50	Not used	Vadar	This is the average χ1 angle for residues (excluding Proline) having χ1 angles that are closest to 60° (the gauche+ conformation). Higher quality structures have χ1 angles very close to the canonical -60°, +60° and 180° values.
Percentage of generously allowed Ω angles	0.25	Yes	0	30	Positive	Vadar	This corresponds to the percentage of residues having Ω (omega) angles within 15° to 20° of the ideal trans (180°) and cis (0°) values.
Percentage of buried charges	0.15	Yes	0	45	Not used	Vadar	Percentage of charged residues that have fractional accessible areas below 0.05.
Deviation of Κ (kappa) angles	0.08	Yes	0	14	Not used	PROSESS	This parameter reports standard deviation of the angle between the plane of the C=O peptide bond of the H-bond acceptor and the vector formed by the H-O bond of the H-bond donor.
Percentage of disallowed Ω angles	0.08	Yes	0	20	Not used	Vadar	This corresponds to the percentage of residues having Ω (omega) angles more than 20° from the ideal trans (180°) and cis (0°) values. Structures with a high proportion of residues with disallowed omega angles have poor geometry and stereo-chemistry.
Percentage of bad bond lengths	0.01	Yes	0	30	Positive⁷	MolProbity	This is calculated as the number of backbone bond lengths (divided by the total number of backbone distances in the polypeptide) that exceed, by more than 5 standard deviations, the typical bond lengths seen in high resolution, high quality structures.
Percentage of Ω angles < 90°	0.01	Yes	0	12	Not used	Vadar	Percentage of Ω (omega) angles below 90°. This identifies the fraction of residues that have a cis-peptide bond.

¹ - Coefficient of correlation between the score and X-ray resolution for the ResProx training set.

² - This column specifies whether a score was used in its logarithmic form ("Yes") or not ("No"). An asterisk (*) indicates scores whose logarithm was taken 16 times.

³,⁴ - Lower and upper bounds indicate the minimal and the maximal values, respectively, that scores were allowed to have in ResProx calculations.

⁵ - This column specifies whether a score's Z-value was used for Z-Mean calculations and, if so, which Z-values were considered: only positive, only negative, or both positive and negative (see text for more details).

⁶ - More information about scores can be found in the corresponding publications and/or on the websites of RosettaHoles (Sheffler and Baker 2009), PROSESS (Berjanskii et al. 2010), GeNMR (Berjanskii et al. 2009), and MolProbity (Chen et al. 2010; Davis et al. 2007).

⁷ - The percentages of bad bond lengths and bad bond angles are used only when their values exceed 4 standard deviations
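The bounds and Z-score columns described in the footnotes above can be illustrated with a minimal sketch. Several points are assumptions made only for illustration: the reference means and standard deviations are invented, a Z-value with the "wrong" sign is dropped rather than set to zero, bounds are applied to the value as supplied, and the logarithmic transform is left out. The names `Score`, `z_value` and `mean_absolute_z` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Score:
    name: str
    value: float
    ref_mean: float          # hypothetical mean over high-resolution structures
    ref_sd: float            # hypothetical standard deviation over the same set
    lower: Optional[float]   # "Lower Bound" column; None if the table says "None"
    upper: Optional[float]   # "Upper Bound" column; None if the table says "None"
    z_mode: str              # "Positive", "Negative", "Both" or "Not used"


def clamp(value, lower, upper):
    """Restrict a value to the allowed [lower, upper] range."""
    if lower is not None and value < lower:
        return lower
    if upper is not None and value > upper:
        return upper
    return value


def z_value(score):
    """Z-value of a bounded score, or None if it is not considered."""
    if score.z_mode == "Not used":
        return None
    z = (clamp(score.value, score.lower, score.upper) - score.ref_mean) / score.ref_sd
    # Assumption: Z-values with a sign that is not considered are dropped.
    if score.z_mode == "Positive" and z <= 0:
        return None
    if score.z_mode == "Negative" and z >= 0:
        return None
    return z


def mean_absolute_z(scores):
    """Average |Z| over the scores that are considered; None if there are none."""
    zs = [z for z in (z_value(s) for s in scores) if z is not None]
    return sum(abs(z) for z in zs) / len(zs) if zs else None


# Invented example values; only the bounds and modes follow Table 3.
scores = [
    Score("Clash score", 300.0, 50.0, 100.0, 0.0, 250.0, "Positive"),
    Score("RosettaHoles score", 1.0, 2.0, 1.0, None, None, "Negative"),
    Score("Ramachandran outliers", 1.0, 3.0, 2.0, 0.0, 500.0, "Positive"),
    Score("Mean trans χ1 angle", 170.0, 175.0, 5.0, 145.0, 180.0, "Not used"),
]
z_mean = mean_absolute_z(scores)
```

In this example the clash score is first capped at its upper bound of 250 and the negative Ramachandran Z-value is dropped, so only two Z-values contribute to the average.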

Figure 9. Resolution histogram of the ResProx training/testing set. Proteins were grouped in 0.25 Å bins. At least 100 structures were placed in each resolution bin, spanning the range between 1.0 Å and 3.75 Å.
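The binning described in this caption can be sketched as below. Treating each bin as half-open (lower edge included, upper edge excluded) is an assumption, as are the function names; the bin width, range and minimum count are taken from the caption.

```python
import math
from collections import Counter

BIN_WIDTH = 0.25    # Å
RANGE_START = 1.0   # Å
RANGE_END = 3.75    # Å


def bin_resolutions(resolutions):
    """Count structures per bin, keyed by the lower edge of each bin."""
    counts = Counter()
    for res in resolutions:
        if RANGE_START <= res < RANGE_END:
            index = math.floor((res - RANGE_START) / BIN_WIDTH)
            counts[RANGE_START + index * BIN_WIDTH] += 1
    return counts


def underfilled_bins(counts, minimum=100):
    """Lower edges of bins holding fewer than `minimum` structures."""
    n_bins = round((RANGE_END - RANGE_START) / BIN_WIDTH)
    edges = [RANGE_START + i * BIN_WIDTH for i in range(n_bins)]
    return [edge for edge in edges if counts[edge] < minimum]


# Hypothetical resolutions in Å; 0.9 and 3.75 fall outside the range.
counts = bin_resolutions([1.0, 1.1, 1.25, 3.7, 3.75, 0.9])
```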

Figure 10. Relationship between X-ray resolution and several ResProx protein quality scores for the ResProx training set. (A) Standard deviation of χ1 pooled from VADAR. (B) Clash score from MolProbity. (C) Percentage of <1% side-chain rotamer outliers from MolProbity. (D) Rama score from GeNMR. (E) Ramachandran outliers from MolProbity. (F) RosettaHoles score. (G) Deviation of Kappa angles from PROSESS. (H) Percentage of disallowed Ω angles from VADAR.

Figure 11. Curve-fitting of a plot of X-ray resolution vs. average absolute Z-score. Only the linear part of the plot, spanning the range of mean absolute Z-scores from 0 to 1.2, was used for curve-fitting. The curve-fitting was done with QtiPlot (Vasilief 2011).
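Restricting a fit to the linear region can be sketched with an ordinary least-squares line. This example assumes a straight-line model with resolution as the dependent variable and uses invented data points; it does not reproduce the QtiPlot fit or its coefficients. The function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_linear_region(points, z_min=0.0, z_max=1.2):
    """Fit only the (mean |Z|, resolution) points with z_min <= mean |Z| <= z_max."""
    kept = [(z, res) for z, res in points if z_min <= z <= z_max]
    return fit_line([z for z, _ in kept], [res for _, res in kept])


# Invented (mean |Z|, resolution in Å) pairs; the last one lies outside
# the linear region and is therefore ignored by the fit.
points = [(0.0, 1.0), (0.4, 1.8), (0.8, 2.6), (1.2, 3.4), (2.0, 3.6)]
slope, intercept = fit_linear_region(points)
```

With such a fit, a mean absolute Z-score inside the fitted range would map to an equivalent resolution of `slope * z + intercept`.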

Figure 12. GeNMR-based threshold for detecting poor-quality protein structures. The total GeNMR knowledge-based score, excluding the radius of gyration score, is shown with blue diamonds for 50,000 protein structures from the PDB. The solid line indicates the selected threshold, which separates 99.9% of the structures from a few poor-quality outliers.

Figure 13. Equivalent resolution of "intact" and "broken" models of the obsolete NMR ensemble of the E. coli heme chaperone CcmE, 1LIZ. (A) "Intact" model 1 of 1LIZ. (B) "Broken" model 3 of 1LIZ. The misplaced Glu105 residue is colored green. Vectors of broken bonds between Glu105 and adjacent residues are shown with red lines. The figure was generated using MolMol (Koradi et al. 1996).

Figure 14. Histogram of ResProx equivalent resolution for NMR models and experimental resolution for X-ray structures. 500 NMR ensembles and 500 X-ray structures were randomly selected from the PDB.

References:

Berjanskii M, Liang Y, Zhou J, Tang P, Stothard P, Zhou Y, Cruz J, MacDonell C, Lin G, Lu P, Wishart DS (2010) PROSESS: a protein structure evaluation suite and server. Nucleic Acids Res 38 (Web Server issue):W633-640

Berjanskii M, Tang P, Liang J, Cruz JA, Zhou J, Zhou Y, Bassett E, MacDonell C, Lu P, Lin G, Wishart DS (2009) GeNMR: a web server for rapid NMR-based protein structure determination. Nucleic Acids Res 37 (Web Server issue):W670-677

Chen VB, Arendall WB 3rd, Headd JJ, Keedy DA, Immormino RM, Kapral GJ, Murray LW, Richardson JS, Richardson DC (2010) MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr D Biol Crystallogr 66 (Pt 1):12-21

Davis IW, Leaver-Fay A, Chen VB, Block JN, Kapral GJ, Wang X, Murray LW, Arendall WB 3rd, Snoeyink J, Richardson JS, Richardson DC (2007) MolProbity: all-atom contacts and structure validation for proteins and nucleic acids. Nucleic Acids Res 35 (Web Server issue):W375-383

Koradi R, Billeter M, Wüthrich K (1996) MOLMOL: a program for display and analysis of macromolecular structures. J Mol Graph 14 (1):51-55, 29-32

Lovell SC, Word JM, Richardson JS, Richardson DC (2000) The penultimate rotamer library. Proteins 40 (3):389-408

Sheffler W, Baker D (2009) RosettaHoles: rapid assessment of protein core packing for structure prediction, refinement, design, and validation. Protein Sci 18 (1):229-239

Vasilief I (2011) QtiPlot - Data Analysis and Scientific Visualisation, 0.9.8.4 edn. http://soft.proindependent.com/qtiplot.html

Willard L, Ranjan A, Zhang H, Monzavi H, Boyko RF, Sykes BD, Wishart DS (2003) VADAR: a web server for quantitative evaluation of protein structure quality. Nucleic Acids Res 31 (13):3316-3319