Proline cis/trans Isomerization in Intrinsically Disordered Proteins and Peptides

Background: Intrinsically disordered proteins and protein regions (IDPs/IDRs) are important in diverse biological processes. Lacking a stable secondary structure, they display an ensemble of conformations. One factor contributing to this conformational heterogeneity is proline cis/trans isomerization. Knowing the cis/trans proline ratio is paramount, as the different conformational states can be responsible for different biological functions. Nuclear Magnetic Resonance (NMR) spectroscopy is the only method to characterize the two co-existing isomers on an atomic level, and only a few works report on these data. Methods: After collecting the available experimental literature findings, we conducted a statistical analysis regarding the influence of the neighboring amino acid types (i ± 4 regions) on forming a cis-Pro isomer. Based on this, several regularities were formulated. NMR spectroscopy was then used to determine the cis-Pro content of model peptides and designed point mutants. Results: Analysis of the NMR spectra proves the dependence of the cis-Pro content on the type of the neighboring amino acid, with special attention to aromatic and positively charged sidechains. Conclusions: Our results may benefit the design of protein regions with a given cis-Pro content, and contribute to a better understanding of the roles and functions of IDPs.


Introduction
Despite the lack of a stable secondary structure, intrinsically disordered proteins/protein regions (IDPs/IDRs) play fundamental roles in many biological processes. One factor contributing to the hindrance of secondary structure formation is the high content of proline residues. Proline is the only naturally occurring amino acid that can exist in two conformations (Fig. 1), as in this case, the free energy difference between the trans and cis conformers is lower than in all other, non-prolyl peptide bonds, while the two states are separated by a rotational barrier with typical values of about 20 kcal/mol [1]. In proteins, peptide bonds of X-non-Pro connections are predominantly in the trans state (>99.5%), whereas the trans population is only around 95% in the case of the X-Pro bond [1,2]. In folded proteins, due to the relatively high steric hindrance between the X-Cα and Pro-Cδ atomic environments, the cis isomer is less frequent [3]. However, spontaneous isomerization can occur in highly flexible proteins such as IDPs, and as a consequence both conformers are present in the solution. The situation may become more complicated as IDPs are generally abundant in proline residues [4].
Therefore, it is important to determine the conformation of the proline residues. However, it is not straightforward to detect and characterize the isomeric ratio. As IDPs are highly mobile systems, atomic-resolution methods such as X-ray crystallography or cryo-EM cannot be used [13]. Therefore, Nuclear Magnetic Resonance (NMR) spectroscopy is the only method of investigation capable of discerning between the two conformations. In addition, it can detect several conformations co-existing in the solution, which appear as peak multiplication in the NMR spectra. In the trans and cis-Pro isomers, the chemical environment is different, and the exchange is slow (10⁻³ to 10⁻² s⁻¹) [14]. Thus, two separate peaks are detected for the Pro as well as for the neighboring residues.
Using 2D ¹H,¹H-NOESY measurements, the two proline isomers can be distinguished based on the intensity of the Hα-Hα NOE peak between the Pro and the preceding residue [15]. However, the Pro ¹³Cβ-¹³Cγ chemical shift difference is a more reliable indicator of the Pro isomer form. This method is commonly used for ¹³C/¹⁵N-labeled protein samples, where specific 3D experiments can detect the Pro sidechain Cβ and Cγ peaks: hCCCONH [16] in ¹HN-detected, Pro-(H)CBCGCAHA [17] in ¹Hα-detected, and 3D (H)CCCON [18] in ¹³C-detected approaches. In the case of samples with natural isotopic abundance, the approach based on the Pro sidechain ¹³Cβ, ¹³Cγ chemical shift determination is rarely used. However, the 2D ¹H,¹³C-HSQC spectrum with an appropriate signal-to-noise ratio can be used, even for the low-concentration minor form.
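The Cβ-Cγ criterion can be illustrated with a minimal Python sketch. The two reference differences used below (roughly 4.5 ppm for trans-Pro and 9.5 ppm for cis-Pro) are approximate values commonly quoted in the NMR literature; they are assumptions of this sketch and are not taken from the present study, and the function name is hypothetical.

```python
# Assumed reference 13Cbeta-13Cgamma differences in ppm (approximate literature
# values, NOT results of this study).
TRANS_REF_PPM = 4.5
CIS_REF_PPM = 9.5


def classify_proline(c_beta_ppm, c_gamma_ppm):
    """Return (label, delta), where delta = Cbeta - Cgamma in ppm.

    The label is whichever assumed reference difference lies closer to delta.
    """
    delta = c_beta_ppm - c_gamma_ppm
    if abs(delta - CIS_REF_PPM) < abs(delta - TRANS_REF_PPM):
        return "cis", delta
    return "trans", delta


# Hypothetical chemical shifts, for illustration only.
major_label, major_delta = classify_proline(32.0, 27.4)
minor_label, minor_delta = classify_proline(34.4, 24.8)
```

A nearest-reference rule is the simplest possible choice; borderline differences between the two reference values would need confirmation from NOE data.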
Possible cis-Pro peak assignment methods for proteins are Pro-Ala mutations [19] or site-specific labeling [20]. Furthermore, the application of proline analogs has gained popularity [12]. Fluorinated amino acids are widely used analogs for assessing the presence of cis/trans-Pro, since ¹⁹F NMR spectra suffer from less signal overlap [21].
As a consequence of these experimental difficulties, there are relatively few publications regarding the characterization of the cis-Pro isomers in IDPs. Previous studies showed that the cis-trans proline isomer ratio depends on the sequence of the Pro neighboring regions in IDPs [22,23]. In our earlier study, we performed a statistical analysis using the available experimental data to determine the effect of the amino acid type of Pro neighboring residues on forming a cis-Pro conformer [17]. This analysis was based on 10 IDPs containing 101 Pro neighboring regions (i ± 3 regions, i representing the Pro). Three main groups were considered according to the cis-Pro amount: >5%, >10%, and <5% cis-Pro containing amino acid sequences.
It was shown at the p = 0.1 significance level that high (>10%) cis-Pro content is favored if aromatic residues (Phe, Tyr, Trp) are present in the i ± 1 positions or negatively charged residues (Asp, Glu) are located in the i-2, i-1, and i + 3 positions. Positively charged residues in the i-3 and i-1 positions can indicate decreased cis-Pro content (<5%).
As a continuation of this study, here we propose a more extensive investigation of the Pro occurrence in IDPs from DisProt database (Database of Protein Disorder, https://disprot.org/) and further characterize the amino acid composition of the Pro and Pro-Pro neighborhood [24,25].
As DisProt does not contain information on the amount of cis-Pro isomers, we updated and expanded our previous dataset [17] to the i ± 4 proline neighbors to test whether long-range interactions affect cis-Pro formation. While the regularities obtained from the statistical analysis seemed valid for our studied p53(1-60) region, experimental proof by investigations on well-designed mutations has not been performed yet. Therefore, in this study, we fill this research gap: using short, 12-15 residue long peptide sequences, we use NMR to analyze the variation of cis-proline amounts with well-designed amino acid mutations. For this purpose, we used peptides from the lysine-rich K-segments of Early Response to Dehydration 14 (ERD14) dehydrin from Arabidopsis thaliana and the C-terminal region of metastasis-associated human S100A4. We showed earlier that these peptides are capable of cell penetration and might be suitable candidates for drug delivery [26].

Statistical Analysis
Amino acid composition of intrinsically disordered proteins was collected from DisProt database (release version 2022_12) and was analyzed using in-house built Python scripts and Microsoft Excel. The DisProt database amino acid composition was determined for the whole dataset (10,544 records) and a filtered dataset (4158 records), from which duplicates based on the sequence and region ID were removed (Supplementary Table 1).
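A composition analysis of this kind can be sketched in a few lines of Python. The sketch below is not the in-house script used in this work; the record layout (pairs of region ID and sequence), the function names, and the toy data are assumptions made for illustration, and duplicates are taken to be records sharing both the region ID and the sequence.

```python
from collections import Counter


def filter_records(records):
    """Keep the first occurrence of each (region_id, sequence) pair."""
    seen = set()
    unique = []
    for region_id, sequence in records:
        key = (region_id, sequence)
        if key not in seen:
            seen.add(key)
            unique.append((region_id, sequence))
    return unique


def composition_percent(records):
    """Return the percentage occurrence of every residue type in the records."""
    counts = Counter()
    for _, sequence in records:
        counts.update(sequence.upper())
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in counts.items()}


# Toy records (hypothetical, not DisProt entries).
records = [("r1", "MPPAK"), ("r1", "MPPAK"), ("r2", "GPSE")]
unique = filter_records(records)
composition = composition_percent(unique)
```

With the toy data, one duplicate is dropped and proline accounts for 3 of the 9 remaining residues.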

Peptide Synthesis and Purification
The designed peptides (Table 1) were synthesized using solid-phase peptide synthesis, applying the Fmoc/tBu strategy using a CEM microwave-assisted fully automated peptide synthesizer. The syntheses were carried out at a 0.25 mmol synthesis scale using a TentaGel S Ram resin (Rapp Polymere GmbH, Tübingen, Germany) with a loading of 0.23 mmol/g amino function. The crude peptides were detached from the solid support using TFA (90%) in the presence of water (5%), 1,4-dithiothreitol (DTT, 2.5%), and TIS (2.5%). The crude peptides were purified

Nuclear Magnetic Resonance Experiments and Data Analysis
Typical NMR samples contained 1 mM peptide in 10% D₂O and 0.05 mM 3-(trimethylsilyl)-1-propanesulfonic acid sodium salt (DSS), and the pH was adjusted to 3.0. All NMR spectra were recorded on a Bruker Avance III 700 spectrometer (Bruker GmbH, Ettlingen, Germany) (700.05 MHz for ¹H; 70.94 MHz for ¹⁵N; 176.03 MHz for ¹³C) using a Prodigy TCI H&F-C/N-D, 5 mm z-gradient probe head. The measurements were conducted at 298 K. ¹H chemical shifts were referenced to the internal DSS standard, whereas ¹⁵N and ¹³C chemical shifts were referenced indirectly via the gyromagnetic ratios.
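Indirect referencing can be illustrated with a short Python sketch. It assumes the commonly used frequency ratios (Ξ) recommended for biomolecular NMR with DSS as the ¹H reference; these constants and the function name are assumptions of the sketch rather than values reported in this work, and the nominal 700.05 MHz ¹H frequency stands in for the measured DSS frequency.

```python
# Assumed frequency ratios (Xi) relative to the 1H frequency of DSS at 0 ppm.
XI = {
    "13C": 0.251449530,  # 13C, DSS-based scale
    "15N": 0.101329118,  # 15N, liquid NH3-based scale
}


def zero_ppm_frequency_mhz(proton_zero_mhz, nucleus):
    """Frequency (MHz) of 0 ppm for a heteronucleus from the 1H DSS frequency."""
    return proton_zero_mhz * XI[nucleus]


# Nominal 1H frequency of the spectrometer, used here as an approximation.
c13_mhz = zero_ppm_frequency_mhz(700.05, "13C")
n15_mhz = zero_ppm_frequency_mhz(700.05, "15N")
```

The two results agree, to two decimals, with the ¹³C and ¹⁵N frequencies quoted for the spectrometer.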
Resonance assignment and sequential connectivities were determined from classical 2D homonuclear ¹H,¹H-TOCSY (mixing time: 80 ms) and ¹H,¹H-NOESY (mixing time: 250 ms) measurements. The typical spectral resolution was 2048 × 512, and the measurements were acquired with 8 and 16 transients, respectively. ¹H,¹⁵N SOFAST-HMQC and ¹H,¹³C-HSQC measurements were performed on peptides with natural isotope abundance. The ¹H,¹⁵N SOFAST-HMQC spectra were acquired with 2048 × 128 resolution, and the number of scans varied between 160 and 480 to detect the sparsely populated minor forms. ¹H,¹³C-HSQC spectra were acquired using a 2048 × 256 resolution with 32 transients. All spectra were processed using TopSpin 3.6.0 (Bruker GmbH, Ettlingen, Germany). Peak assignment was completed using Sparky (University of California, San Francisco, CA, USA) [37].

Proline Neighboring Residues in DisProt
Proline, with 7.41% occurrence, is the 5th most frequent amino acid in IDPs according to DisProt (release version 2022_12) (Fig. 2, Supplementary Table 1). In order to confirm the amino acid preference, the residue types preceding and succeeding the proline were collected for the Pro residues and Pro-Pro motifs (Fig. 2, Supplementary Tables 3,4). Statistical analysis of these data shows that the distribution of amino acid types in the Pro ± 1 positions differs significantly from the DisProt amino acid distribution (Table 2). Aliphatic and Pro residues are significantly more frequent in these positions, whereas charged and aromatic residues are significantly less frequent. The numbers of Gly and polar residues are significantly different in the Pro preceding position, but no significant differences in occurrence can be found for the i + 1 position. The composition of the X-Pro-Pro and Pro-Pro-X motifs differs as well: in both cases, the number of aliphatic residues and Pro is increased, and interestingly the number of negatively charged residues shows significant deviation from the DisProt database (Supplementary Table 5).
It is important to note that prolines are often situated in proline-rich regions with repetitive Pro-containing motifs that often form polyproline II-type helices. While the number of polyproline (containing consecutive proline residues) sequences longer than 20 residues in the UniProt database (https://www.uniprot.org/) is more than 6000, these motifs are underrepresented in the DisProt database, as here the longest polyproline sequence is only 13 residues long (DisProt ID: DP02591r001).
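The length of the longest stretch of consecutive prolines in a sequence can be found with a simple regular expression. The following Python sketch is only an illustration; the function name and the example sequence are hypothetical.

```python
import re


def longest_proline_run(sequence):
    """Length of the longest stretch of consecutive Pro residues (0 if none)."""
    runs = re.findall(r"P+", sequence.upper())
    return max((len(run) for run in runs), default=0)


# Hypothetical sequence, for illustration only.
longest = longest_proline_run("APPPGPPA")
```

Applying the function to every record of a database and keeping the maximum gives the longest polyproline stretch in that database.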

Proline Neighboring Residues in Intrinsically Disordered Proteins with Known cis-Pro Content
In order to determine the sequence dependence of the cis-Pro amount (calculated as [cis]/([cis] + [trans])), we updated our previously published dataset of IDPs and expanded it to the Pro neighboring i ± 4 range [17]. The updated dataset now contains 15 IDPs (Supplementary Table 2) with 167 central proline residues. The cis-Pro content depends on the peptide sequence length. Therefore, peptides (less than 20 residues) are not included in the dataset [34]. Note that polyproline sequences with several consecutive prolines are also excluded due to ambiguous peak assignment.
The i ± 4 Pro neighboring sequences contain a total of 1312 amino acids (not considering Pro in the i position), and the amino acid type occurrence slightly differs from DisProt (Fig. 3, Supplementary Table 6); thus, we use the amino acid composition of our overall dataset as a reference.
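Collecting the i ± 4 neighbors of each proline can be sketched as follows. This Python example is an illustration under stated assumptions: the function name and the toy sequence are hypothetical, positions falling outside the sequence are simply omitted (which is one reason the total can be lower than eight neighbors per proline), and the exclusion of polyproline stretches is not implemented.

```python
def proline_windows(sequence, flank=4):
    """Return a list of (index, neighbors) for every Pro in the sequence.

    neighbors maps the offset relative to the Pro (-flank..+flank, without 0)
    to the residue found there; offsets outside the sequence are left out.
    """
    seq = sequence.upper()
    windows = []
    for i, residue in enumerate(seq):
        if residue != "P":
            continue
        neighbors = {}
        for offset in range(-flank, flank + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(seq):
                neighbors[offset] = seq[j]
        windows.append((i, neighbors))
    return windows


# Toy sequence (hypothetical): one Pro with Phe in i-1 and Asp in i+1.
windows = proline_windows("AKFPDG")
pro_index, pro_neighbors = windows[0]
```

Counting the residues stored under each offset over all windows yields the position-specific occurrences that are compared with the reference composition.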
The Pro neighboring sequences were divided into two groups according to the cis-Pro isomer content (Fig. 3A). For cis-Pro content <10%, the amino acid type occurrence is similar to the overall dataset. However, significant differences are found for the sequences with >10% cis-Pro content: positively charged residues are significantly less frequent, whereas prolines occur significantly more often than expected (Fig. 3A, Supplementary Table 7).
In the individual positions for the complete dataset (Fig. 3B), the occurrence of some residue types deviates significantly. The largest differences can be found for Pro and polar residues, where more than a 12% deviation between the most and the least populated positions can be observed. In addition, two-sided binomial tests at a significance level of 0.05 were conducted to investigate which amino acid types (Fig. 3B) deviate from the reference in the individual positions. We found that the distant positions (i-3, i-4) do not deviate significantly from the reference. Negatively charged residues are significantly more frequent in the i-2 position. There are significantly more polar residues in the i-1 and aliphatic amino acids in the i + 1 position. Prolines occur significantly more often in the i + 4 position and less often in the i ± 1 positions. Aromatic amino acids are more frequent in the i + 2, and less frequent in the i + 3 position.
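A two-sided binomial test of this kind can be sketched with the Python standard library alone. The version below uses the common exact definition, summing the probabilities of all outcomes that are no more likely than the observed one; whether the original analysis used exactly this definition is an assumption, and the function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def binom_test_two_sided(k, n, p):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are at most as likely as the
    observed count k (a small relative tolerance absorbs rounding errors).
    """
    threshold = binom_pmf(k, n, p) * (1.0 + 1e-7)
    p_value = 0.0
    for i in range(n + 1):
        prob = binom_pmf(i, n, p)
        if prob <= threshold:
            p_value += prob
    return min(1.0, p_value)


# Hypothetical counts: residue type seen k times among n sequences at one
# position, tested against its reference frequency p in the overall dataset.
p_value = binom_test_two_sided(k=0, n=10, p=0.5)
significant = p_value < 0.05
```

In practice, k would be the count of a residue type at a given position, n the number of sequences with a residue at that position, and p the frequency of that residue type in the reference composition.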
In sequences with more than 10% cis-Pro content, considering a 0.05 significance level, the positively charged residues are significantly less frequent in the i-3 and i-1 positions (Fig. 4). In the i-1 position, aromatic and polar amino acids occur significantly more often, whereas the number of Pro is reduced. Aromatic residues are common as well in the i + 1 position. Note that Pro occurrence is significantly higher in the i + 1, i + 2, and even i + 4 positions, and the number of Gly is also increased at i ± 4.
Compared to our previous work, the general rules hold [17]: The number of aromatic amino acids adjacent to Pro (i ± 1) is increased. In Pro-Pro motifs, the cis-Pro amount is higher for cis-Pro-trans-Pro than for trans-Pro-cis-Pro. Positive charges at i-3 and i-1 indicate a decreased cis-Pro content.
The increased number of Asp and Glu in i-2, i-1, and i + 3 positions does not hold at 0.05 significance level, only at p = 0.1.

Effect of Mutations on the cis-Pro Content
To validate our findings, model peptides with designed mutations were synthesized (Table 1). All peptides were 12-15 residues long and were enriched in Lys and Arg residues to test the effect of the positive charge. Since aromatic residues in the i ± 1 positions of the Pro have the strongest effect, Peptide I.A contains two Phe in the Pro vicinity. To test the effect of a positive charge instead of the aromatic sidechain on the cis-Pro content, a Phe5Arg mutation (Peptide I.B) was designed. Peptide II.A contains two prolines. In the case of Pro6, two Phe residues are located in the Pro preceding region, in the i-1 and the more distant i-4 positions, and the i + 1 residue is a negatively charged Asp. In the Pro10 neighboring region, there are no aromatic rings, the i-4 to i-1 region contains residues that do not significantly affect the cis-Pro ratio, and the i + 1 position holds a residue that should not influence the cis-Pro content.
NMR measurements were performed to test these assumptions. Peak assignment was performed using 2D homo- and heteronuclear measurements on peptide samples with natural isotope abundance. Since the concentration of the minor form is low, approximately 50-200 µM for a 1 mM sample, and the ¹³C isotope abundance is ~1%, measurement times are lengthy. Pro isomerization results in a peak multiplication for the Pro neighboring residues (Fig. 5, Supplementary Fig. 3). For quantitative cis-Pro determination, the Pro Hδ2-Cδ and Hδ3-Cδ cross peaks were integrated on the ¹H,¹³C-HSQC spectrum, since these peaks could be unambiguously assigned as signal overlap rarely occurs in this region (Table 3).
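The quantification from cross-peak integrals follows the ratio [cis]/([cis] + [trans]). A minimal Python sketch is given below; the function names and the integral values are hypothetical and serve only to illustrate the arithmetic, with the Hδ2-Cδ and Hδ3-Cδ integrals of each isomer summed before taking the ratio.

```python
def cis_fraction(cis_integrals, trans_integrals):
    """cis-Pro fraction, [cis] / ([cis] + [trans]), from cross-peak integrals."""
    cis = sum(cis_integrals)
    trans = sum(trans_integrals)
    return cis / (cis + trans)


def minor_concentration_um(total_mm, fraction):
    """Concentration of the minor form in micromolar for a total in millimolar."""
    return total_mm * 1000.0 * fraction


# Hypothetical integrals (arbitrary units) of the Hd2-Cd and Hd3-Cd cross peaks.
fraction = cis_fraction(cis_integrals=[1.0, 1.0], trans_integrals=[4.0, 4.0])
minor_um = minor_concentration_um(total_mm=1.0, fraction=fraction)
```

For these illustrative integrals the cis fraction is 0.2, which corresponds to 200 µM of the minor form in a 1 mM sample, the upper end of the range mentioned above.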
In Peptide I.A, Pro6 has phenylalanine in the i ± 1 positions; therefore, the cis-Pro content is increased, reaching 26%. If the disadvantageous Phe to Arg mutation occurs at the i-1 position (Peptide I.B), the cis-Pro ratio decreases to 9% (Fig. 5C).
Peptide II.A contains two prolines, and two sets of minor peaks of different signal intensities are detected. Here, the major Pro6 and Pro10 Hδ2-Cδ and Hδ3-Cδ cross peaks overlap on the ¹H,¹³C-HSQC spectrum. Therefore, Pro-Ala mutants were synthesized for unambiguous peak assignment. The Pro6Ala and Pro10Ala mutations cause the absence of the corresponding cis-Pro isomer, indicating that the high-intensity cis-Pro peak (20%) belongs to Pro6, while the cis-Pro10 content is significantly lower (9%). This agrees with our previous observations: the phenylalanine in the i-1 position indicates an increased minor content, whereas sequential enrichment in positively charged residues hinders the cis-Pro formation.
Peptide III contains two proline residues. Pro6 succeeds an aromatic phenylalanine, which is advantageous for a high cis/trans ratio (Supplementary Fig. 3B). However, the several positively charged lysines (i-4, i-3, i + 2, i + 4) reduce this effect, resulting in a 16% cis-Pro amount. The Pro14 neighboring sequence lacks aromatic residues, and the unfavorable positively charged sidechain is located in the neutral i-2 position, producing a 10% minor isomer.

Conclusions
Considering the DisProt database, this study investigated the residue types and their distribution in the Pro neighborhood (i-4 to i + 4 region). Pro ± 1 positions have significantly higher Pro content, despite polyproline sequences being underrepresented in DisProt. Furthermore, a dataset of IDPs was collected to investigate the effect of the residue types in the i ± 4 regions on the cis-Pro content. We found that our earlier formulated observations are also valid for the extended dataset. Moreover, the i + 4 and i-4 positions are significantly enriched in prolines, and the glycine occurrence is higher.
In order to bring experimental proof to our observations based on the statistical analysis, synthetic peptides were designed. The cis-Pro content was determined by NMR spectroscopy. We found that the sidechain of the amino acids placed in the i ± 1 positions has the largest effect on the cis-Pro formation. Aromatic residues have a larger effect on the cis-Pro content if they are situated in the proline preceding rather than the succeeding position. Positively charged residues in these positions shift the equilibrium toward a larger trans-Pro content, whereas they have no or only a moderate effect in other positions. We note that although the initial cis-Pro amount in short, mobile peptides is higher than the values found in a protein, the regularities should still hold.
In conclusion, we prove that rationally designed mutations give rise to a desired increase or decrease of cis-Pro content. Our results can greatly benefit biotechnological purposes in the design of preferred proline conformations for functional tests.

Availability of Data and Materials
The data presented in this study are available on request from the corresponding author.

Author Contributions
AB designed the research study. FS, JS, NP, GT and AB performed the research. FS and AB analyzed the data. FS, GT and AB wrote the manuscript. All authors contributed to editorial changes in the manuscript. All authors read and approved the final manuscript. All authors have participated sufficiently in the work and agreed to be accountable for all aspects of the work.

Ethics Approval and Consent to Participate
Not applicable.