Research Article

How do eubacterial organisms manage aggregation-prone proteome?

[version 1; peer review: 2 approved]
PUBLISHED 27 Jun 2014

This article is included in the Machine learning: life sciences collection.

Abstract

Eubacterial genomes vary considerably in their nucleotide composition. The percentage of genetic material constituted by guanosine and cytosine (GC) nucleotides ranges from 20% to 70%.  It has been posited that GC-poor organisms are more dependent on protein folding machinery. Previous studies have ascribed this to the accumulation of mildly deleterious mutations in these organisms due to population bottlenecks. This phenomenon has been supported by protein folding simulations, which showed that proteins encoded by GC-poor organisms are more prone to aggregation than proteins encoded by GC-rich organisms. To test this proposition using a genome-wide approach, we classified different eubacterial proteomes in terms of their aggregation propensity and chaperone-dependence using multiple machine learning models. In contrast to the expected decrease in protein aggregation with an increase in GC richness, we found that the aggregation propensity of proteomes increases with GC content. A similar and even more significant correlation was obtained with the GroEL-dependence of proteomes: GC-poor proteomes have evolved to be less dependent on GroEL than GC-rich proteomes. We thus propose that a decrease in eubacterial GC content may have been selected in organisms facing proteostasis problems.

Introduction

Eubacterial genomes vary widely in nucleotide composition. In this kingdom, GC content ranges from 20% to 70% of the genome, and a number of reports have documented this variation and sought to explain it [1–3]. Because of this variation in GC content, amino acid composition also differs between eubacterial proteomes [4]. These differences in amino acid composition have been reported to alter the characteristics of proteomes; as a consequence, proteins of GC-poor genomes are thought to be more prone to misfolding and aggregation than those of GC-rich genomes [5,6]. It has been hypothesized that GroEL plays a major, if not essential, role in the evolution of GC-poor organisms by buffering deleterious mutations that are fixed due to population bottlenecks [7–9]. This is supported by the observation that many small GC-poor endosymbionts tend to overexpress GroEL [10–12].

However, the proposed chaperone dependence of GC-poor organisms does not explain why some GC-poor endosymbionts of the mycoplasma group have lost groEL from their genomes [13]. Notably, these are the only known eubacterial organisms to have lost this gene. This observation led us to test the proposed relationship between genomic GC poorness and the aggregation propensity of the encoded proteome.

Obtaining information on the aggregation propensity of proteins from different organisms is challenging. However, the aggregation propensity of Escherichia coli proteins has already been carefully characterized in a high-throughput manner [14–16]. Kerner et al. classified GroEL substrates into Class I, II or III based on interaction strength and on the stringency of their requirement for GroEL. Class III (C3) substrates were completely dependent on GroEL for folding, whereas Class II (C2) substrates were partially dependent. Class I (C1) proteins interacted weakly with GroEL and were able to fold spontaneously. In a simple approach, homologs of GroEL-dependent proteins may be identified in other organisms [13,17]. This approach, however, fails to predict the evolution of GroEL dependence correctly, because sequence differences between species can introduce or remove kinetic traps in folding pathways and thereby alter dependence on GroEL. In addition to the solubility of the E. coli proteome in a chaperone-free system, substrates of another chaperone, DnaK, have been identified by two independent research groups [18,19]. Applications based primarily on machine learning algorithms that classify soluble proteins or GroEL substrates [16,18,20–24] are already available. However, these classifiers were not trained on curated data compiled from multiple experimental results [14,15,18,19]. In this study, we constructed a more reliable training dataset and built classifiers to determine the aggregation propensity and GroEL dependency of 1132 eubacterial proteomes, based solely on amino acid sequences. We show a distinct trend in the aggregation propensity of an organism's proteins in relation to GC content. Surprisingly, aggregation propensity decreased with lower GC content independent of symbiotic characteristics, suggesting that GC-poor organisms have indeed evolved a proteome that is depleted of aggregation-prone proteins.

Materials and methods

Data source

The aggregation-prone proteins of the eSOL database [18,25] depend on the chaperone network of E. coli to attain their native three-dimensional structure. GroEL and DnaK are two important components of this network, and their substrates have been extensively studied by different experimental methods [14,15,19,26]. Integrating all the available information reveals that about half (457) of the soluble or chaperone-independent proteins identified by Niwa et al. [18] were found to be GroEL- or DnaK-dependent (Figure 1). To construct a more reliable training set, we removed these proteins from the soluble set. Only proteins identified as chaperone-dependent by more than one study were considered aggregation-prone. Furthermore, proteins sharing more than 30% amino acid sequence similarity with others in the remaining set were removed using the CD-HIT [27] clustering program. The final training set therefore comprised 502 aggregation-prone and 475 soluble proteins.
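The curation logic described above can be sketched with plain set operations. This is a minimal illustration, not the authors' pipeline: the function name `curate_training_sets`, the `min_support` parameter, and the choice to count the solubility screen as one line of evidence alongside each substrate study are assumptions made for the example. Redundancy filtering with CD-HIT is a separate step and is not shown.

```python
from collections import Counter


def curate_training_sets(soluble, aggregation_prone, substrate_studies, min_support=2):
    """Return (curated_soluble, curated_aggregation_prone) as sets of protein ids.

    soluble, aggregation_prone: ids called soluble / aggregation-prone by the
        solubility screen.
    substrate_studies: list of sets, one per chaperone-substrate study.
    min_support: hypothetical threshold; a protein is kept as aggregation-prone
        only if at least this many independent sources report it.
    """
    # Every protein reported as a GroEL or DnaK substrate by any study.
    substrates = set().union(*substrate_studies)

    # Soluble proteins that are also chaperone substrates are dropped.
    curated_soluble = set(soluble) - substrates

    # Count how many independent sources flag each protein (assumption: the
    # solubility screen counts as one source, each substrate study as one).
    support = Counter()
    for evidence in [set(aggregation_prone), *(set(s) for s in substrate_studies)]:
        support.update(evidence)

    curated_prone = {pid for pid, n in support.items() if n >= min_support}
    return curated_soluble, curated_prone
```

With toy identifiers, a soluble protein that appears in any substrate study is removed from the negative set, and a protein needs two independent sources to enter the positive set.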


Figure 1. Integration of independent studies.

A Venn diagram of proteins of E. coli identified by different experimental studies shows that ~45% of soluble proteins reported by Niwa et al. overlap with GroEL/S or DnaK substrates (soluble proteins are defined as having solubility >70% and aggregation-prone proteins have solubility <30%).

Classifier building

The classifiers in this study were built with the Pro-Gyan [28] software. Pro-Gyan builds classifiers directly from a training data set given in FASTA format by selecting relevant features from a large number of unbiased features. Pro-Gyan reports the following metrics, which are useful for evaluating the performance of machine learning classifiers.

Accuracy (Acc) = (TP + TN)/(TP + TN + FP + FN)

Sensitivity or Recall (Sn) = TP/(TP + FN)

Specificity (Sp) = TN/(FP + TN)

Matthews correlation coefficient (MCC) = (TP*TN - FP*FN)/sqrt{(TP+FP)*(TN+FN)*(TP+FN)*(TN+FP)}

where TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative predicted by the classifier.

Additionally, receiver operating characteristic (ROC) curves and the area under the curve (AUC) [29] were generated.
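As a worked illustration of the four confusion-matrix metrics defined above, the following sketch computes them from raw counts. The function name and the convention of returning 0.0 when a denominator is zero are assumptions for the example; Pro-Gyan's own handling of such edge cases is not described here.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, specificity and MCC from confusion counts."""

    def ratio(numerator, denominator):
        # Assumed convention: report 0.0 when a ratio is undefined.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, fp + tn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }
```

For example, `classification_metrics(40, 45, 5, 10)` gives an accuracy of 0.85, a sensitivity of 0.8 and a specificity of 0.9.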

Analysis on microbial genomes

The protein sequences of microbial genomes were downloaded from the Microbial Genome Database [30] (archive no. mbgd_2011-01). To identify chaperonins in the microbial organisms, chaperonin homologs were searched for using BLAST (e-value 1 × 10^-4) against the chaperonin database cpnDB [31], downloaded in June 2011. The 16S rRNA nucleotide sequence of E. coli was acquired from SILVA [32], and homologs were searched for in other microbial organisms using BLAST (e-value 1 × 10^-4). The GC content of microbial genomes was calculated using the following equation:

GC content = (G + C)/(total bases),

where G = number of guanosine and C = number of cytosine.
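The GC content formula can be illustrated with a short function. This is a sketch under one stated assumption: only unambiguous A, C, G and T characters are counted as "total bases", so ambiguity codes such as N are ignored. The source does not say how such characters were treated.

```python
def gc_content(sequence):
    """Return (G + C) / (A + C + G + T) for a nucleotide sequence string."""
    seq = sequence.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())  # assumption: ambiguous bases are excluded
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T characters")
    return (counts["G"] + counts["C"]) / total
```

Multiplying the result by 100 gives the percentage scale used elsewhere in the article.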

Statistical analysis

The Kendall correlation and analysis of covariance were performed in the R [33] statistical computing environment using the package ‘stats’ version 2.15.3. To account for the effect of shared evolutionary history on different traits of bacterial genomes, we performed phylogenetically independent contrasts using the PDAP [34] module of the Mesquite [35] application.
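For readers unfamiliar with the rank-based Kendall correlation, the sketch below computes the tau-b variant (which adjusts for ties) by direct pair counting. It is a plain-Python illustration of the statistic, not the R routine used in the study, and it does not compute a p-value. The quadratic pair loop is adequate for roughly a thousand genomes but is not optimized.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equally long numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither term
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denominator = math.sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denominator if denominator else float("nan")
```

In the setting of this article, `x` would hold the GC content of each genome and `y` the corresponding fAg or fC3 value.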

Results and discussion

Development of machine learning tool to identify aggregation-prone proteins

Protein solubility has recently been carefully measured in a chaperone-free system, and the information has been made available through the eSOL database [18]. A few classification models developed on this database can separate soluble proteins from chaperone-dependent proteins [22–24]. However, these web-based classifiers are not suitable for classifying large numbers of proteomes, and their soluble (negative) training datasets are not carefully curated, as about half of the soluble proteins in the eSOL database are substrates of DnaK [19] or GroEL [14,15] (Figure 1). We therefore built a classifier from a curated list of aggregation-prone and soluble proteins. The classifier was built using Pro-Gyan [28], which generates 5038 different features from a set of class-labelled protein sequences and selects the “maximum relevant minimum redundant” feature subset. Finally, the tool built a support vector machine (SVM) [36] classifier using five-fold cross-validation. The classifier attained an accuracy of 83.21% with an MCC of 0.66 (Table 1). The Pro-Gyan-generated classifier, trained on a rigorously curated data set, performs comparably to the classifier of Fang et al. and better than the others [22–24]. The receiver operating characteristic (ROC) curves of the classifier are shown in Figure 2. For interested users, the classifier is available on Zenodo (https://zenodo.org/record/10442).
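Five-fold cross-validation, as mentioned above, partitions the training set into five parts and holds out each part once for testing. The sketch below generates such splits in plain Python. Stratifying by class label (so that each fold keeps roughly the same ratio of aggregation-prone to soluble proteins) and the fixed random seed are assumptions of this example; how Pro-Gyan forms its folds is not stated in the text.

```python
import random


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Indices of each class are shuffled and dealt round-robin into k folds,
    so class proportions are approximately preserved (an assumed design choice).
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(i for f, fold in enumerate(folds) if f != held_out for i in fold)
        yield train, test
```

A classifier would be fitted on each `train` split and scored on the matching `test` split, and the reported accuracy or MCC would summarize the five held-out evaluations.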

Table 1. Comparison of previous classifiers with our classifier.

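Method | Sensitivity | Specificity | Accuracy | AUC | MCC
SVM [25] | - | - | 80 | - | -
J48 (decision tree algorithm) [23] | - | - | 72 | 0.72 | -
VTJ48 (visually tuned J48) [23] | - | - | 76 | 0.81 | -
Fang et al. [22] | 82.00 | 85.00 | 84 | 0.91 | 0.67
SolubEcoli.pgc* | 86.25 | 80.00 | 83.21 | 0.88 | 0.66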

* Built on a curated training data set.


Figure 2. Receiver operating characteristic (ROC) curves.

ROC curves of the soluble protein classifier (SolubEcoli.pgc) and the GroEL obligate protein classifier (GDP1.pgc). The areas under the curves (AUC) are given in the legend.

Discriminating features of aggregation prone proteins

To build the classifier, Pro-Gyan [28] selected 24 relevant features through an automated process. The ten most significant features (by Mann-Whitney test) were sequence patterns, the pseudo amino acid composition [37] of phenylalanine (F), aspartic acid (D) and glutamic acid (E), the distribution of positively charged amino acids, features calculated from FoldIndex [38], and the auto-correlation of hydrophobicity and relative mutability (Table 2). The remaining selected features (Table 2) were enriched in auto-correlation measures of amino acid indices such as steric parameter, free energy, accessible surface area, polarizability and residue volume. Features representing patterns of physico-chemical properties encoded in protein sequences were unique to SolubEcoli.pgc when compared with earlier methods.
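To make the auto-correlation features more concrete, the sketch below computes a length-normalized Moreau-Broto autocorrelation of a per-residue property at a given lag, that is, the mean of P(i) × P(i + lag) along the sequence. The Kyte-Doolittle hydropathy scale is used purely as a stand-in: the exact amino acid index, and any standardization of its values applied by Pro-Gyan, are not specified here, so the numbers produced will not match the tool's feature values.

```python
# Kyte-Doolittle hydropathy values (stand-in index; an assumption of this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def moreau_broto(sequence, lag, index=KYTE_DOOLITTLE):
    """Length-normalized Moreau-Broto autocorrelation of a property at `lag`.

    Raises KeyError for residues missing from `index` (e.g. X) and
    ValueError if the lag is not smaller than the sequence length.
    """
    values = [index[residue] for residue in sequence.upper()]
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(sequence)")
    products = (values[i] * values[i + lag] for i in range(n - lag))
    return sum(products) / (n - lag)
```

A feature such as "Moreau-Broto auto correlation (lag 1) of hydrophobicity" in Table 2 corresponds to a call of the form `moreau_broto(protein_sequence, 1)` with the appropriate index.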

Table 2. Selected features of proteins used to build the “SolubEcoli.pgc” classifier.

Serial no. | Feature id† | Description | p-value*
1 | SW_SOC2 | Quasi-sequence-order calculated from physicochemical distance matrix [50]. | 2.20E-16
2 | PPR | Distribution of positively charged amino acids in sequence pattern [51]. | 2.20E-16
3 | H(8)M | Amino acid pair composition of histidine to methionine with 8 gaps [52]. | 2.33E-15
4 | M-B(Hydr)1 | Moreau-Broto auto correlation (lag 1) of amino acid index; hydrophobicity [53]. | 2.24E-08
5 | PseAAC_T1_3 | Pseudo amino acid composition of aspartic acid (D) [37]. | 9.45E-06
6 | PseAAC_T1_5 | Pseudo amino acid composition of phenylalanine (F) [37]. | 6.87E-05
7 | FI_16_psavgl | Average length of folded segments of proteins according to FoldIndex [38]. | 8.14E-05
8 | PseAAC_T1_4 | Pseudo amino acid composition of glutamic acid (E) [37]. | 0.000542
9 | Dstrbu_Pol_2:3 | Distribution of amino acids according to polarizability [54]. | 0.001289
10 | M-B(mutblty)6 | Moreau-Broto auto correlation (lag 6) of amino acid index; relative mutability [53]. | 3.65E-03
11 | T | Composition of amino acid Threonine [53]. | 5.00E-03
12 | Mrn(vlum)27 | Moran auto correlation (lag 27) of amino acid index; residue volume [53]. | 0.00926
13 | Mrn(Polar)22 | Moran auto correlation (lag 22) of amino acid index; polarizability [53]. | 0.013
14 | M-B(mutblty)9 | Moreau-Broto auto correlation (lag 9) of amino acid index; relative mutability [53]. | 0.01988
15 | Geary(sterc)4 | Geary auto correlation (lag 4) of amino acid index; steric parameter [53]. | 0.03536
16 | M-B(mutblty)24 | Moreau-Broto auto correlation (lag 24) of amino acid index; relative mutability [53]. | 0.05416
17 | M-B(Hydr)12 | Moreau-Broto auto correlation (lag 12) of amino acid index; hydrophobicity [53]. | 5.92E-02
18 | Mrn(RsdAcc)24 | Moran auto correlation (lag 24) of amino acid index; residue accessible surface area in tripeptide [53]. | 0.1077
19 | Mrn(Hydr)23 | Moran auto correlation (lag 23) of amino acid index; hydrophobicity [53]. | 0.2106
20 | Geary(Free)13 | Geary auto correlation (lag 13) of amino acid index; free energy [53]. | 0.3271
21 | Comp_Vol_2 | Composition of normalized van der Waals volume of amino acids of range 2.95–4.0 [53]. | 0.4631
22 | Geary(vlum)20 | Geary auto correlation (lag 20) of amino acid index; residue volume [53]. | 4.95E-01
23 | Geary(Free)14 | Geary auto correlation (lag 14) of amino acid index; free energy [53]. | 0.499
24 | M-B(vlum)30 | Moreau-Broto auto correlation (lag 30) of amino acid index; residue volume [53]. | 0.9559

†Internal feature id of the Pro-Gyan application.

Genome wide prediction of aggregation prone proteins

The feature analysis showed that amino acid composition differs significantly between aggregation-prone and soluble proteins. Sequence features of amino acids have been used to understand protein overexpression related to toxicity [39]. Additionally, it has been shown that amino acid composition is drastically altered in organisms with GC-poor genomes [4,40]. Multiple amino acids change in frequency as a function of GC content (Figure 3), and this change has been attributed to differences in the GC content of the codons for these amino acids. On the basis of these differences, it has been reported that proteins encoded by GC-poor organisms should be more prone to aggregation than proteins encoded by GC-rich organisms [5,6]. However, the GC composition of the training data showed that aggregation-prone proteins were significantly more GC-rich than soluble proteins (Figure 4, Mann-Whitney test p-value = 1.3e-15). We therefore sought to determine the fraction of aggregation-prone proteins across different bacterial proteomes. We used the SolubEcoli.pgc classifier to predict aggregation-prone proteins in 1132 eubacterial species. Our predictions showed that the fAg (aggregation-prone proteins as a fraction of the proteome) of a genome correlates positively with GC composition (Kendall tau = 0.38, p-value < 2.2e-16) (Figure 5A). Because the Kendall correlation assumes that observations are independent, whereas organisms are linked through common ancestors [41], we further examined the correlation with respect to phylogenetic ancestry using the Mesquite system [35]. The required phylogenetic tree was constructed from the 16S rRNA gene sequences of 570 bacteria [42]. We found a significant correlation (0.4, p-value < 2.2e-16) between the independent contrasts of GC content and fAg (Figure 5B). This agrees well with the difference seen between soluble and aggregation-prone proteins in E. coli (Figure 4). Thus, an increase in the GC composition of a genome may yield a proteome that harbours a higher fraction of aggregation-prone proteins.


Figure 3. Composition of basic amino acids over ~1100 eubacterial genomes.

The x-axis of each subplot shows the GC composition of each genome, whereas the y-axis shows the corresponding amino acid composition.


Figure 4. Aggregation-prone proteins are richer in GC-content than soluble proteins.

In E. coli, aggregation-prone proteins contain higher GC-content than soluble proteins. Mann-Whitney test p-value (*) is 1.3e-15.


Figure 5. GC content is associated with fAg.

(A) GC content of the genome correlates with the fraction of the proteome that is aggregation-prone (fAg) (analysis of 570 bacterial genomes using the classifier). Rank-based correlation is provided along with the p-value. The black line shows a linear regression model. (B) The relationship between GC content and fAg was obtained through a phylogenetically independent contrast method (570 bacteria). A positive correlation (0.4) was identified between GC content and fAg (p-value < 2.2e-16).

This is in contrast to previous reports hypothesizing that GC-poor organisms have unstable and aggregation-prone proteomes. Notably, the earlier hypothesis that GC poorness is associated with GroEL-dependent, aggregation-prone proteomes was based on the observation that GroEL is overexpressed in GC-poor organisms. Therefore, to distinguish GroEL-dependent proteins within aggregation-prone proteomes, we developed another classifier, GDP1.pgc (Zenodo, https://zenodo.org/record/10442/), trained with 475 curated soluble and 83 GroEL-obligate (Class III or C3) proteins [14]. The classifier achieved an accuracy of 92.29% with an MCC of 0.69. We used GDP1.pgc to identify the C3 proteins among the aggregation-prone proteins (predicted by SolubEcoli.pgc) in order to examine the evolution of the GroEL-dependent proteome with GC composition. Indeed, we found that the fC3 (fraction of C3 proteins) of a bacterial proteome is more strongly correlated with GC content than the fAg (Figure 6A). The phylogenetically independent contrasts of fC3 and GC content also correlated strongly (0.7, p-value < 2.2e-16, Figure 6B). The phylum Tenericutes, members of which have GC-poor genomes, was predicted to encode fewer GroEL-dependent proteins. Mycoplasma and Ureaplasma are the main genera of the phylum Tenericutes, and many species of these groups lack GroEL [43]. In our analysis, we also observed that the Tenericutes without GroEL (red dots in Figure 6A) had a very low fC3. This motivated us to investigate the relationship between groEL copy number and the fraction of GroEL-dependent proteins. Interestingly, there was a strong correlation between groEL copy number and the fraction of the genome coding for C3 proteins (Figure 6C). Because the experimental data are noisy, we sought to benchmark the classifiers. Fujiwara et al. reported that five C3 homologs from groEL-lacking Ureaplasma urealyticum are soluble in GroEL-depleted cells [26]. Hence, we examined the robustness of our classifiers by predicting the GroEL dependency of these homologs. Four of these homologs were predicted to be GroEL-independent with a high confidence score (Table 3). Overall, the results indicate that the fraction of C3 proteins, and of aggregation-prone proteins in general, decreases as the GC content of genomes decreases.
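The two-stage prediction described above (first flag aggregation-prone proteins, then look for GroEL-obligate proteins among them) can be sketched as follows. The two predicate arguments are hypothetical stand-ins for SolubEcoli.pgc and GDP1.pgc, whose programmatic interfaces are not described here, and expressing both fractions relative to the whole proteome is an assumption based on how fAg and fC3 are defined in the text.

```python
def proteome_fractions(proteins, is_aggregation_prone, is_groel_obligate):
    """Return (fAg, fC3) for one proteome.

    proteins: list of protein sequences (or ids) for one organism.
    is_aggregation_prone, is_groel_obligate: callables returning True/False;
        hypothetical wrappers around the two trained classifiers.
    """
    total = len(proteins)
    if total == 0:
        raise ValueError("proteome is empty")
    # Stage 1: proteins predicted to be aggregation-prone.
    prone = [p for p in proteins if is_aggregation_prone(p)]
    # Stage 2: GroEL-obligate (C3) proteins among the aggregation-prone ones.
    c3 = [p for p in prone if is_groel_obligate(p)]
    # Assumption: both fractions use the full proteome as denominator.
    return len(prone) / total, len(c3) / total
```

Applying this function to each proteome yields one (fAg, fC3) pair per organism, which can then be correlated with genomic GC content.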


Figure 6. Decrease in GC content is associated with decrease in fC3.

(A) Correlation of GC content with the fraction of the proteome that is GroEL obligate (fC3) over 570 bacterial genomes. Members of the phylum Tenericutes with and without the groEL gene are coloured in blue and red, respectively. Rank-based correlation is provided along with p-value. The black line shows a logarithmic regression model. (B) A positive correlation (0.7) was identified between independent contrast of GC content and fC3 with respect to phylogenetic information of bacterial genomes (570 bacteria, p-value < 2.2e-16). (C) The organisms were classified based on the number of groEL genes present in the genome. fC3 exhibited a significant increase with an increase in the number of genome-encoded groEL copies. The p-values were calculated by Mann-Whitney test using two-sided hypothesis.

Table 3. Evaluation of classifiers on five C3 homologous proteins of groEL-lacking Ureaplasma urealyticum.

The homologs were identified in U. urealyticum by NCBI BLAST at an E value threshold of 1e45. The aggregation propensity and GroEL dependency of these proteins were then classified by SolubEcoli.pgc and GDP1.pgc.

C3 homologous proteins in Ureaplasma urealyticum | E value | Accession | Is aggregation prone? (classifier: SolubEcoli.pgc) | Is GroEL dependent? (classifier: GDP1.pgc)
UuMetK | 2e-99 | YP_002284849.1 | Yes (0.739) | No (0.804)
UuDeoA | 2e-80 | WP_004026878.1 | Yes (0.586) | No (0.938)
UuCsdB | 4e-62 | D82890 | Yes (0.884) | Yes (0.665)
UuGatY | 8e-46 | H82870 | Yes (0.672) | No (0.973)
UuYcfH | 7e-41 | E82944 | Yes (0.518) | No (0.881)

Correlation of GC content with protein solubility is independent of the population bottleneck

Endosymbionts are crucial to this study, as the literature suggests that these organisms have undergone bottlenecks during evolution [44]. It is hypothesized that these organisms have accumulated more deleterious mutations than non-endosymbionts [8]. If this were true, endosymbionts should show a greater aggregation propensity or dependence on GroEL than that predicted from the GC content of free-living eubacterial species. To measure the impact of a symbiotic relationship on fAg and fC3, we performed an analysis of covariance (ANCOVA) on 570 eubacterial species [42]. There was no significant effect of a symbiotic relationship on fAg/fC3 (p-value = 0.24/0.65, Dataset 1) and no significant interaction with GC composition (p-value = 0.36/0.38) (Figure 7). Thus, we found no evidence for an association between an evolutionary bottleneck and protein aggregation propensity. We therefore rule out bottleneck evolution as the reason for the evolution of GroEL-independent proteomes such as those of Ureaplasma and the GroEL-lacking Mycoplasma species.


Figure 7. fAg and fC3 correlate with GC content independently of species habitat.

The ANCOVA test on 570 organisms showed that a symbiotic relationship has no significant effect or interaction with GC content on the aggregation propensity or GroEL-dependency of the proteins of an organism.

Name | Genome size (Mb) | GC content (%) | GroEL copy number | fC3 | fAg
A.arabaticum | 3.205686 | 61.892088 | 1 | 0.26 | 0.56
A.aeolicus | 1.590791 | 43.30160279 | 1 | 0.21 | 0.55
A.ferrooxidans_DSM10331 | 3.918192 | 59.27175085 | 3 | 0.16 | 0.48
A.ferrooxidans_ATCC23270 | 2.313035 | 42.23096494 | 1 | 0.16 | 0.47
A.fermentans | 2.469596 | 36.63129516 | 1 | 0.07 | 0.32
A.asiaticus | 1.884364 | 35.04503376 | 1 | 0.18 | 0.53
A.ferrooxidans | 2.204665 | 44.28568513 | 1 | 0.14 | 0.45
A.cellulolyticus | 5.226648 | 62.42985944 | 2 | 0.15 | 0.52
A.phosphatis | 5.352772 | 68.53230438 | 1 | 0.27 | 0.59
A.bacterium | 5.650368 | 58.38384332 | 1 | 0.22 | 0.59
A.baumannii_AB307-0294 | 3.760981 | 39.04369631 | 1 | 0.15 | 0.50
A.baumannii_ACICU | 3.996761 | 38.93301351 | 1 | 0.15 | 0.49
A.baumannii_SDF | 3.477996 | 39.11962521 | 1 | 0.23 | 0.54
A.baumannii_AB0057 | 4.059242 | 39.19490388 | 1 | 0.14 | 0.49
Acinetobacter_ADP1 | 3.120143 | 54.7277801 | 1 | 0.23 | 0.55
A.dehalogenans | 2.341251 | 27.04515663 | 1 | 0.06 | 0.41
A.baumannii_AYE | 4.048735 | 39.31590978 | 3 | 0.15 | 0.51
A.baumannii | 4.001457 | 38.92091056 | 1 | 0.13 | 0.46
A.vinosum | 4.152543 | 38.73207333 | 1 | 0.14 | 0.50
A.mirum | 2.44354 | 66.90735572 | 2 | 0.20 | 0.58
A.mediterranei | 4.98087 | 65.95859759 | 2 | 0.14 | 0.48
A.ehrlichei | 3.598621 | 40.4295701 | 1 | 0.16 | 0.52
A.laidlawii | 1.496992 | 31.92775913 | 1 | 0.12 | 0.48
A.cryptum | 1.206806 | 49.98491887 | 1 | 0.21 | 0.56
A.colombiense | 1.980592 | 45.30685775 | 2 | 0.21 | 0.59
Anaeromyxobacter_Fw109-5 | 5.029329 | 74.71648802 | 2 | 0.26 | 0.60
A.radiobacter | 3.96308 | 67.09526429 | 2 | 0.27 | 0.62
A.dehalogenans_2CP-1 | 5.013479 | 74.90527037 | 2 | 0.28 | 0.62
A.acidocaldarius | 2.157067 | 59.46514411 | 2 | 0.29 | 0.58
Acinetobacter_DR1 | 3.275944 | 67.53125206 | 1 | 0.33 | 0.57
A.tumefaciens | 4.308776 | 59.50808304 | 1 | 0.21 | 0.56
A.pleuropneumoniae | 2.885038 | 58.85049001 | 1 | 0.24 | 0.55
A.oremlandii | 2.846746 | 41.77675142 | 1 | 0.20 | 0.50
A.metalliredigens | 2.329769 | 55.84270372 | 2 | 0.23 | 0.58
A.haemolyticum | 2.158157 | 68.28933206 | 2 | 0.19 | 0.55
A.pleuropneumoniae_AP76 | 2.982397 | 58.77245719 | 1 | 0.22 | 0.51
A.butzleri | 5.27799 | 73.53140116 | 2 | 0.28 | 0.61
A.pleuropneumoniae_JL03 | 4.744448 | 61.54840352 | 4 | 0.23 | 0.58
A.arilaitensis | 1.986154 | 53.13324143 | 1 | 0.14 | 0.50
Acidovorax_JS42 | 4.585154 | 66.08755126 | 1 | 0.25 | 0.61
V.fischeri_MJ11 | 3.669074 | 64.1861407 | 2 | 0.28 | 0.56
A.vitis | 1.197687 | 49.76266754 | 1 | 0.20 | 0.56
A.succinogenes | 4.44898 | 44.8854007 | 1 | 0.14 | 0.46
A.aurescens | 10.236715 | 71.29279266 | 4 | 0.16 | 0.53
A.centrale | 1.202435 | 49.77449925 | 1 | 0.21 | 0.57
A.chlorophenolicus | 8.248144 | 73.70887317 | 3 | 0.17 | 0.50
A.marina | 8.361599 | 46.95589324 | 4 | 0.12 | 0.41
A.degensii | 4.929566 | 36.82202449 | 2 | 0.11 | 0.41
A.muciniphila | 2.664102 | 55.76235444 | 1 | 0.28 | 0.62
Cyanothece_PCC7424 | 7.211789 | 41.26833162 | 3 | 0.12 | 0.48
A.nitrofigilis | 5.061632 | 74.8406838 | 2 | 0.27 | 0.61
Anaeromyxobacter_K | 3.192235 | 28.36085063 | 1 | 0.08 | 0.43
A.prevotii | 3.123558 | 36.25730657 | 1 | 0.10 | 0.45
A.salmonicida | 2.345435 | 41.22629704 | 1 | 0.16 | 0.51
Azospirillum_B510 | 2.753527 | 48.85370654 | 1 | 0.15 | 0.47
A.marginale_Florida | 1.471282 | 41.63593383 | 1 | 0.12 | 0.40
A.actinomycetemcomitans | 2.242062 | 41.23284726 | 1 | 0.15 | 0.51
A.hydrophila | 2.274482 | 41.29951347 | 1 | 0.16 | 0.52
A.xylosoxidans | 5.306133 | 63.96925972 | 1 | 0.27 | 0.60
A.flavithermus | 1.998633 | 35.63890919 | 1 | 0.10 | 0.36
A.pseudotrichonymphae | 1.224919 | 32.94781124 | 1 | 0.16 | 0.60
A.marginale | 3.340249 | 53.09951444 | 1 | 0.24 | 0.57
Arthrobacter_FB24 | 1.543805 | 45.68659902 | 1 | 0.12 | 0.48
A.phagocytophilum | 7.2733 | 59.8742249 | 3 | 0.21 | 0.57
A.parvulum | 5.070478 | 65.38845056 | 3 | 0.16 | 0.52
A.aphrophilus | 5.040536 | 58.18095536 | 3 | 0.22 | 0.54
A.salmonicida_LFI1238 | 2.319663 | 44.91859378 | 1 | 0.17 | 0.56
B.amyloliquefaciens | 2.931662 | 35.19164215 | 2 | 0.12 | 0.50
A.excentricus | 5.674258 | 59.04114688 | 1 | 0.19 | 0.54
Cyanothece_ATCC51142 | 7.105752 | 41.40974805 | 4 | 0.13 | 0.50
A.caulinodans | 6.320946 | 57.47074884 | 2 | 0.21 | 0.57
V.fischeri | 5.365318 | 65.67554803 | 2 | 0.29 | 0.56
A.avenae | 7.359146 | 65.7789912 | 2 | 0.24 | 0.63
Phytoplasma_AYWB | 0.72397 | 26.81174634 | 1 | 0.07 | 0.35
A.pasteurianus | 5.369772 | 67.31902211 | 2 | 0.22 | 0.61
B.bacilliformis | 7.599738 | 67.61279139 | 3 | 0.25 | 0.60
Diaphorobacter | 4.37604 | 67.91644958 | 3 | 0.27 | 0.61
A.vinelandii | 0.618379 | 25.34157855 | 1 | 0.09 | 0.61
Azoarcus_BH72 | 7.642536 | 66.39171867 | 3 | 0.23 | 0.57
B.cavernae | 2.089645 | 59.18433514 | 1 | 0.18 | 0.53
B.cereus | 4.168266 | 43.21929071 | 1 | 0.14 | 0.40
B.afzelii | 1.232503 | 27.70321857 | 1 | 0.06 | 0.39
B.anthracis_CI | 5.506763 | 35.24665943 | 1 | 0.10 | 0.35
B.anthracis_Ames0581 | 5.503926 | 35.24295203 | 1 | 0.10 | 0.37
B.anthracis_Sterne | 5.486649 | 35.25310258 | 1 | 0.11 | 0.39
B.ambifaria_MC40-6 | 7.528567 | 66.77483245 | 4 | 0.23 | 0.58
B.anthracis_A0248 | 5.227293 | 35.37683845 | 1 | 0.10 | 0.37
B.anthracis | 3.980199 | 46.08234412 | 1 | 0.15 | 0.43
A.macleodii | 0.642122 | 26.28565911 | 1 | 0.11 | 0.61
B.anthracis_CDC684 | 5.503926 | 35.24295203 | 1 | 0.10 | 0.36
B.aphidicola_5A | 0.641454 | 25.3305771 | 1 | 0.12 | 0.60
B.atrophaeus | 5.228663 | 35.37944595 | 1 | 0.10 | 0.38
B.aphidicola_Bp | 0.641895 | 26.29246216 | 1 | 0.11 | 0.60
B.avium | 3.732255 | 61.58258211 | 1 | 0.23 | 0.59
B.amyloliquefaciens_DSM7 | 3.918589 | 46.47565744 | 1 | 0.16 | 0.44
B.bacteriovorus | 3.78295 | 50.64550153 | 1 | 0.19 | 0.56
B.cereus_03BB102 | 6.296436 | 47.27355285 | 1 | 0.15 | 0.46
B.animalis_lactis_DSM10140 | 2.186882 | 62.75843873 | 1 | 0.16 | 0.44
B.quintana | 1.445021 | 38.23868304 | 1 | 0.18 | 0.51
Blattabacterium_Blattella_germanica | 0.63685 | 27.14171312 | 1 | 0.16 | 0.66
B.bifidum | 2.214656 | 62.66643668 | 1 | 0.17 | 0.47
B.bronchiseptica | 5.339179 | 68.07702083 | 2 | 0.27 | 0.63
B.suis_ATCC23445 | 8.493513 | 64.80619974 | 6 | 0.21 | 0.55
B.burgdorferi | 1.519856 | 28.17066222 | 1 | 0.06 | 0.35
B.burgdorferi_ZS7 | 1.345494 | 28.26709001 | 1 | 0.07 | 0.42
B.cereus_NVH | 5.419036 | 35.30356691 | 1 | 0.10 | 0.38
B.floridanus | 0.422434 | 20.20007859 | 1 | 0.07 | 0.55
B.cereus_AH187 | 5.427083 | 35.28834919 | 1 | 0.10 | 0.38
B.cereus_Q1 | 5.736823 | 35.01549551 | 1 | 0.10 | 0.37
B.pertussis | 7.70284 | 66.78837935 | 3 | 0.23 | 0.58
B.licheniformis | 4.303871 | 44.75331626 | 1 | 0.15 | 0.45
B.petrii | 7.971389 | 66.59509403 | 3 | 0.23 | 0.58
B.parapertussis | 7.279116 | 66.92514586 | 2 | 0.23 | 0.58
B.clausii | 5.506207 | 35.46797278 | 1 | 0.10 | 0.39
B.cereus_G9842 | 5.599857 | 35.51051393 | 1 | 0.10 | 0.37
B.tribocorum | 3.312769 | 57.24265712 | 1 | 0.19 | 0.52
B.bifidum_PRL2010 | 4.669183 | 73.11658164 | 2 | 0.13 | 0.45
B.cereus_B4264 | 5.449308 | 35.33263306 | 1 | 0.10 | 0.37
B.cereus_ZK | 4.094159 | 35.86699979 | 1 | 0.13 | 0.42
B.halodurans | 5.843235 | 35.12720265 | 2 | 0.11 | 0.40
B.dentium | 2.636367 | 58.53672876 | 1 | 0.17 | 0.48
B.duttonii | 1.574881 | 28.01157675 | 1 | 0.08 | 0.43
B.longum | 3.614992 | 72.04577493 | 2 | 0.18 | 0.48
B.aphidicola_Cc | 0.705557 | 27.38318803 | 1 | 0.12 | 0.62
B.fragilis | 5.31099 | 43.20429901 | 2 | 0.19 | 0.55
B.fragilis_NCTC9343 | 5.2417 | 43.11418051 | 2 | 0.20 | 0.58
B.garinii | 1.230533 | 28.16974433 | 1 | 0.07 | 0.47
Burkholderia_383 | 7.884858 | 63.26677031 | 3 | 0.24 | 0.56
Burkholderia_CCGE1002 | 7.043595 | 63.24567213 | 3 | 0.24 | 0.60
B.cenocepacia | 7.284683 | 67.92633255 | 3 | 0.27 | 0.59
B.indica | 2.36952 | 38.03732401 | 1 | 0.18 | 0.49
B.licheniformis_DSM13 | 4.202352 | 43.6855123 | 1 | 0.18 | 0.44
B.japonicum | 1.931047 | 38.23034861 | 1 | 0.18 | 0.51
B.hermsii | 0.922307 | 29.82770379 | 1 | 0.11 | 0.54
B.recurrentis | 3.036634 | 27.00071856 | 1 | 0.05 | 0.38
Bradyrhizobium_BTAi1 | 4.418616 | 56.98295575 | 2 | 0.22 | 0.57
Bradyrhizobium_ORS278 | 9.105828 | 64.05938043 | 7 | 0.23 | 0.58
B.adolescentis | 1.933695 | 60.49035655 | 1 | 0.20 | 0.49
B.longum_longum_JDM301 | 2.265943 | 59.94603571 | 1 | 0.15 | 0.47
B.animalis_lactis | 1.938709 | 60.48213012 | 1 | 0.19 | 0.49
B.megaterium_QM_B1551 | 4.222645 | 46.19474287 | 1 | 0.16 | 0.44
B.megaterium_DSM319 | 4.222597 | 46.19439175 | 1 | 0.16 | 0.43
B.longum_infantis_ATCC15697 | 2.389526 | 60.16389861 | 1 | 0.15 | 0.45
B.faecium | 2.477838 | 59.81488701 | 1 | 0.18 | 0.51
B.longum_longum_BBMN68 | 2.832748 | 59.86314349 | 1 | 0.16 | 0.45
B.longum_DJO10A | 2.260266 | 60.12734784 | 1 | 0.16 | 0.50
B.animalis_lactis_Bl-04 | 1.938483 | 60.48216569 | 1 | 0.19 | 0.49
B.cenocepacia_HI2424 | 5.835527 | 68.48551982 | 2 | 0.25 | 0.54
B.grahamii | 3.286445 | 57.21933579 | 1 | 0.19 | 0.51
B.henselae | 3.283936 | 57.22166936 | 1 | 0.20 | 0.55
B.pseudofirmus | 5.097447 | 38.13163727 | 1 | 0.12 | 0.40
B.subvibrioides | 3.294931 | 57.22244259 | 1 | 0.20 | 0.55
B.abortus | 3.278307 | 57.22185262 | 1 | 0.19 | 0.51
B.abortus_S19 | 3.311219 | 57.22122276 | 1 | 0.19 | 0.52
B.mallei_NCTC10229 | 7.00881 | 66.6863419 | 3 | 0.23 | 0.57
B.cenocepacia_MC0-3 | 5.742303 | 68.46732052 | 2 | 0.24 | 0.51
B.cepacia | 5.84838 | 68.48459744 | 1 | 0.23 | 0.50
B.pumilus | 5.523192 | 37.92643819 | 1 | 0.11 | 0.40
B.canis | 3.337369 | 57.25433717 | 1 | 0.19 | 0.52
B.microti | 3.315175 | 57.25130649 | 1 | 0.19 | 0.50
B.ovis | 3.324607 | 57.20522757 | 1 | 0.19 | 0.51
B.mallei | 7.008622 | 66.68609036 | 3 | 0.24 | 0.57
B.glumae | 5.232401 | 68.4037978 | 1 | 0.23 | 0.50
B.melitensis | 3.27559 | 57.19455121 | 1 | 0.20 | 0.54
B.mallei_NCTC10247 | 4.773551 | 68.09995326 | 2 | 0.27 | 0.64
B.selenitireducens | 4.404886 | 40.00847014 | 1 | 0.13 | 0.44
B.pseudomallei_1710b | 7.040403 | 68.28915902 | 2 | 0.22 | 0.47
B.mallei_SAVP1 | 4.086189 | 67.72058267 | 2 | 0.31 | 0.65
B.subtilis | 4.249248 | 39.86055886 | 2 | 0.13 | 0.37
B.multivorans_T | 8.676562 | 62.29240337 | 2 | 0.23 | 0.57
Blattabacterium_Periplaneta_americana | 0.640442 | 28.21223468 | 1 | 0.21 | 0.68
B.pseudomallei | 7.089249 | 68.25797768 | 2 | 0.22 | 0.46
B.pseudomallei_1106a | 7.308054 | 67.97815123 | 2 | 0.31 | 0.58
B.hyodysenteriae | 2.586443 | 27.89699986 | 1 | 0.06 | 0.40
B.pseudomallei_668 | 4.098576 | 67.70876031 | 2 | 0.21 | 0.47
B.phytofirmans | 7.247547 | 68.05872387 | 3 | 0.27 | 0.59
B.multivorans | 5.28795 | 65.48282416 | 2 | 0.25 | 0.60
B.subtilis_spizizenii | 3.704465 | 41.28690648 | 2 | 0.15 | 0.42
B.phymatum | 8.214658 | 62.28682923 | 1 | 0.23 | 0.57
B.melitensis_Abortus | 1.581384 | 38.79683872 | 1 | 0.21 | 0.59
C.crescentus | 7.456587 | 65.51038699 | 4 | 0.20 | 0.55
B.murdochii | 1.242163 | 27.52565485 | 1 | 0.08 | 0.47
B.pseudomallei_MSHR346 | 3.750138 | 60.70797021 | 1 | 0.22 | 0.46
B.turicatae | 3.241804 | 27.75143099 | 1 | 0.06 | 0.40
B.melitensis_ATCC23457 | 3.445263 | 68.42270097 | 1 | 0.15 | 0.49
B.thuringiensis | 3.592487 | 48.66996039 | 1 | 0.16 | 0.40
B.thuringiensis_BMB171 | 4.027676 | 43.89067045 | 2 | 0.15 | 0.41
B.thuringiensis_AlHakam | 4.215606 | 43.51440813 | 1 | 0.14 | 0.41
B.proteoclasticus | 5.643051 | 35.17646748 | 1 | 0.11 | 0.39
B.rhizoxinica | 6.723972 | 67.62812219 | 4 | 0.28 | 0.59
B.thetaiotaomicron | 6.293399 | 42.86028901 | 1 | 0.22 | 0.60
B.weihenstephanensis | 5.314794 | 35.36523711 | 1 | 0.11 | 0.40
B.brevis | 5.31303 | 35.43373555 | 1 | 0.12 | 0.44
B.suis | 2.642404 | 38.82222401 | 1 | 0.17 | 0.47
A.thermophilum | 3.384766 | 59.11413078 | 1 | 0.27 | 0.56
B.pilosicoli | 0.91733 | 29.11896482 | 1 | 0.12 | 0.53
B.aphidicola_Sg | 0.655725 | 26.36486332 | 1 | 0.11 | 0.59
B.xenovorans | 8.676277 | 66.27244612 | 4 | 0.23 | 0.58
B.thailandensis | 8.39107 | 65.73805248 | 3 | 0.23 | 0.54
This is a portion of the data; to view all the data, please download the file.
Dataset 1. Application of SolubEcoli.pgc and GDP1.pgc classifiers.
Proteome wide prediction of GroEL obligatory protein fraction (fC3) and aggregation prone protein fraction (fAg) in 1132 eubacterial genomes with their genome size, GC content and GroEL copy number.

Conclusions

Several machine learning (ML) classifiers have been developed to predict aggregation-prone or GroEL-dependent proteins, but very few of them used data sets generated and curated from multiple experimental studies. Our classifiers were based on curated data from multiple studies and also performed well on the false-positive C3 homologs of Ureaplasma, demonstrating accuracy and noise tolerance. According to previous theories, GC-poor organisms might have evolved through population bottlenecks, which allow mildly deleterious mutations to be fixed in the population with high probability [2,44]. It has been hypothesized that GC-poor genomes accumulated a large number of deleterious mutations through such bottlenecks in the course of evolution and hence harbour aggregation-prone proteins. Although overexpression of chaperones is observed in GC-poor organisms that have reduced genomes, there are also other GC-poor organisms that lack GroEL. Our work provides strong evidence that the general stability of the proteome increases as the GC content of eubacterial genomes decreases. A decrease in GC content restricts the amino acid space that the organism can sample, thereby compromising protein evolution. We hypothesise that, even with this limited amino acid space, GC-poor organisms remain abundant because their growth is facilitated under conditions that compromise protein folding capacity. This antagonism between the ability to evolve and a folding advantage could be crucial in facilitating protein evolution in the presence of chaperones and other folding machineries [45–48].

Our work suggests that organisms facing continuous proteostasis problems would tend to shift towards a more GC-poor genome. This is supported by the findings of Xia et al.49, who reported a preponderance of GC to AT conversions during high-temperature laboratory adaptation of Pasteurella multocida. Further in vitro evolution experiments will be required to demonstrate whether laboratory adaptation to low GC content provides a folding advantage.

Data availability

F1000Research: Dataset 1. Application of SolubEcoli.pgc and GDP1.pgc classifiers, 10.5256/f1000research.4307.d2962455.

ZENODO: Training data of protein classifier SolubEcoli.pgc and GDP1.pgc, doi: 10.5281/zenodo.1044256.

how to cite this article
Das Roy R, Bhardwaj M, Bhatnagar V et al. How do eubacterial organisms manage aggregation-prone proteome? [version 1; peer review: 2 approved]. F1000Research 2014, 3:137 (https://doi.org/10.12688/f1000research.4307.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to Reviewer Statuses
Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Reviewer Report 01 Oct 2014
Annalisa Pastore, MRC National Institute for Medical Research, London, UK 
Approved
The genesis of this paper is the proposal that genomes containing a poor percentage of guanosine and cytosine (GC) nucleotide pairs lead to proteomes more prone to aggregation than those encoded by GC-rich genomes. As a consequence these organisms are ...
HOW TO CITE THIS REPORT
Pastore A. Reviewer Report For: How do eubacterial organisms manage aggregation-prone proteome? [version 1; peer review: 2 approved]. F1000Research 2014, 3:137 (https://doi.org/10.5256/f1000research.4611.r6283)
Reviewer Report 15 Jul 2014
Amnon Horovitz, Department of Structural Biology, Weizmann Institute of Science, Rehovot, Israel 
Approved
In this study, the authors describe machine-learning classifiers for predicting aggregation propensities of proteins. A novel aspect of this work is that the classifiers are based on experimental data obtained from different sources regarding chaperone dependence (GroEL or DnaK) and ...
HOW TO CITE THIS REPORT
Horovitz A. Reviewer Report For: How do eubacterial organisms manage aggregation-prone proteome? [version 1; peer review: 2 approved]. F1000Research 2014, 3:137 (https://doi.org/10.5256/f1000research.4611.r5273)