2 Cluster analysis of UNRES simulation results
3 ---------------------------------------------
12 3. Functions of the program
16 5. Running the program
18 6. Input and output files
20 6.2. The main input file
23 6.2.3. Energy-term weights and parameter files
25 6.2.4.1. Sequence information
26 6.2.4.2. Dihedral angle restraint information
27 6.2.4.3. Disulfide-bridge data
28 6.2.5. Reference structure
29 6.3. Main output file (out)
30 6.4. Output coordinate files
31 6.4.1. The internal coordinate (int) files
32 6.4.2. The Cartesian coordinate (x) files
34 6.4.3.1. CLUST-UNRES runs
35 6.4.3.2. CLUST-WHAM runs
36 6.4.3.2.1. Conformation family files
37 6.4.3.2.2. Average-structure file
38 6.5. The conformation-distance file
39 6.6. The clustering-tree PicTeX file
46 * This software is provided free of charge to academic users, subject to the
47 condition that no part of it be sold or used otherwise for commercial
48 purposes, including, but not limited to its incorporation into commercial
49 software packages, without written consent from the authors. For permission
50 contact Prof. H. A. Scheraga, Cornell University.
52 * This software package is provided on an "as is" basis. We in no way warrant
53 either this software or results it may produce.
55 * Reports or publications using this software package must contain an
56 acknowledgment to the authors and the NIH Resource in the form commonly
63 The program incorporates the hierarchical-clustering subroutine, hc.f written
64 by G. Murtagh (refs 1 and 2). The subroutine contains seven methods of
65 hierarchical clustering.
67 [1] F. Murtagh. Multidimensional clustering algorithms; Physica-Verlag:
68 Vienna, Austria, 1985.
69 [2] F. Murtagh, A. Heck. MultiVariate data analysis; Kluwer Academic:
70 Dordrecht, Holland, 1987.
71 [3] A. Liwo, M. Khalili, C. Czaplewski, S. Kalinowski, S. Oldziej, K. Wachucik,
73 Modification and optimization of the united-residue (UNRES) potential
74 energy function for canonical simulations. I. Temperature dependence of the
75 effective energy function and tests of the optimization method with single
76 training proteins. J. Phys. Chem. B, 2007, 111, 260-285.
77 [4] S. Oldziej, A. Liwo, C. Czaplewski, J. Pillardy, H.A. Scheraga.
78 Optimization of the UNRES force field by hierarchical design of the
79 potential-energy landscape. 2. Off-lattice tests of the method with single
80 proteins. J. Phys. Chem. B., 2004, 108, 16934-16949.
82 3. FUNCTIONS OF THE PROGRAM
83 ---------------------------
85 The program runs cluster analysis of UNRES simulation results. There are two
86 versions of the program depending on the origin of input conformation:
88 1) CLUST-UNRES: performs cluster analysis of conformations that are obtained
89 directly from UNRES runs (CSA, MCM, MD, (M)REMD, multiple-conformation
90 energy minimization). The source code and other important files are
91 deposited in CLUST-UNRES subdirectory
93 The source code of this version is deposited in clust-unres/src
95 2) CLUST-WHAM: performs cluster analysis of conformations obtained in UNRES
96 MREMD simulations and then processed with WHAM (weighted histogram analysis
97 method). This enables the user to obtain clusters as conformational
98 ensembles at a given temperature and to compute their probabilities
99 (section 2.5 of ref 3). This version is deposited in the CLUST-WHAM
100 subdirectory. This version has single- and multichain variants, whose
101 source codes are deposited in the following subdirectories:
103 a) clust-wham/src single-chain proteins
105 b) clust-wham/src-M oligomeric proteins
107 The version developed for oligomeric proteins treats whole system as a single
108 chain with dummy residues inserted. It also works for single chains but is
109 not fully checked and it is recommended to use single-chain version for
110 single-chain proteins.
115 Customize Makefile to your system. See section 7 of the description of UNRES
116 for compiler flags that are used to created executables for a particular
117 force field. There are already several Makefiles prepared for various
118 systems and force fields.
120 Run make in the appropriate source directory version. CLUST-UNRES runs
121 only in single-processor mode an CLUST-WHAM runs in both serial and parallel
122 mode [only conformation-distance (rmsd) calculations are parallelized].
123 The parallel version uses MPI.
125 5. RUNNING THE PROGRAM
126 ----------------------
128 The program requires a parallel system to run. Depending on system,
129 either the wham.csh C-shell script (in WHAM/bin directory) can be started
130 using mpirun or the binary in the C-shell script must be executed through
131 mpirun. See the wham.csh C-shell script and section 6 for the files
132 processed by the program.
134 6. INPUT AND OUTPUT FILES
135 -------------------------
137 6.1. SUMMARY OF THE FILES
138 -------------------------
140 The C-shell script wham.csh is used to run the program (see the
141 bin/WHAM directory). The data files that the script needs are mostly the same as
142 for UNRES (see section 6 of UNRES description). In addition, the environmental
143 variable CONTFUN specifies the method to assess whether two side chains
144 are at contact; if EONTFUN=GB, the criterion defined by eq 8 of ref 4 is
145 used to assess whether two side chains are at contact. Also, the parameter
146 files from the C-shell scripts are overridden if the data from Hamiltonian
147 MREMD are processed; if so, the parameter files are defined in the main
150 The main input file must have inp extension. If it is INPUT.inp, the output
151 files are as follows:
153 Coordinate input file COORD.ext, where ext denotes file extension in one of the
156 INT (extension int; UNRES angles theta, gamma, alpha, and beta),
157 X (extension x; UNRES Cartesian coordinate format; from MD),
158 PDB (extension pdb; Protein Data Bank format; fro MD),
159 CX (extension cx; xdrf format; from WHAM).
161 INPUT_clust.out (single-processor mode) or INPUT_clust.out_xxx (parallel mode) -
162 output file(s) (INPUT.out_000 is the main output file for parallel mode).
164 COORD_clust.int: leading (lowest-energy) members of the families
165 in internal-coordinate format.
166 COORD_clust.x: leading members of the families in UNRES Cartesian coordinate
168 COORD_xxxx.pdb or COORD_xxxx_yyy.pdb (CLUST-UNRES): PDB file of member yyy
169 of family xxxx; yyy is omitted if the family contains only one member
170 within a given energy cut-off.
171 COORD_TxxxK_yyyy.pdb: concatenated conformations in PDB format of the
172 members of family yyyy clustered at T=xxxK ranked by probabilities in
173 descending order at this temperature (CLUST-WHAM).
174 COORD_T_xxxK_ave.pdb: cluster-averaged coordinates and coordinates of a
175 member of each family that is closest to the cluster average in PDB
176 format, concatenated in a single file (CLUST-WHAM).
178 INPUT_clust.tex: PicTeX code of the cluster tree.
180 INPUT.rms: rmsds between conformations.
185 This file has the same structure as the UNRES input file; most of the data are
186 input in a keyword-based form (see section 7.1 of UNRES description). The data
187 are grouped into records, referred to as lines. Each record, except for the
188 records that are input in non-keyword based form, can be continued by placing
189 an ampersand (&) in column 80. Such a format is referred to as the data list
192 In the following description, the default values are given in parentheses.
194 6.2.1. Title (80-character string)
195 ----------------------------------
197 6.2.2. General data (data list format)
198 --------------------------------------
200 NRES (0) - the number of residues
202 ONE_LETTER - if present, the sequence is input in one-letter code.
204 SYM (1) - number of chains with same sequence (for oligomeric proteins only),
206 WITH_DIHED_CONSTR - if present, dihedral-angle restraints were imposed in the
207 processed MREMD simulations
209 RESCALE (1) - Choice of the type of temperature dependence of the force field.
210 0 - no temperature dependence
211 1 - homographic dependence (not implemented yet with any force field)
212 2 - hyperbolic tangent dependence [3].
214 DISTCHAINMAX (50.0) - for oligomeric proteins, distance between the chains
215 above which restraints will be switched on to keep the chains at a
218 PDBOUT - clusters will be printed in PDB format.
220 ECUT - energy cut-off criterion to print conformations (UNRES-CLUST runs).
221 Only those families will be output the energy of the lowest-energy
222 conformation of which is within ECUT kcal/mol above that of the
223 lowest-energy conformation and for a family only those members will be
224 output which have energy within ECUT kcal/mol above the energy of the
225 lowest-energy member of the family.
227 PRINT_CART - output leading members of the families in UNRES x format.
229 PRINT_INT - output leading members of the families in UNRES int format.
231 REF_STR - if present, reference structure is input and rmsd will be computed
232 with respect to it (CLUST-UNRES only; rmsd is provided in the cx file
233 from WHAM for CLUST-WHAM runs).
235 PDBREF - if present, reference structure will be read in from a pdb file.
237 SIDE - side chains will be considered in superposition when calculating rmsd
239 CA_ONLY - only the Calpha atoms will be used in rmsd calculation
241 NSTART (0) - first residue to superpose
243 NEND (0) - last residue to superpose
245 NTEMP (1) - number of temperatures at which probabilities will be calculated
246 and clustering performed (CLUST-WHAM)
248 TEMPER (NTEMP tiles) - temperatures at which clustering will be performed
251 EFREE - if present, conformation entropy factor is read if the conformation
252 is input from an x or pdb file
254 PROB (0.99) - cut-off on the summary probability of the conformations that
255 are clustered at a given temperature (CLUST-WHAM)
257 IOPT (2) - clustering algorithm:
259 1 - Ward's minimum variance method
260 2 - single link method
261 3 - complete link method
262 4 - average link (or group average) method
263 5 - McQuitty's method
264 6 - Median (Gower's) method
267 Instead of IOPT=1, MINTREE and instead of IOPT=2 MINVAR can be specified
269 NCUT (1) - number of cut-offs in clustering
271 CUTOFF (-1.0; NCUT values) cut-offs at which clustering will be performed;
272 at the cut-off flagged by a "-" sign clustering will be performed with
273 cutoff value=abs(cutoff(i)) and conformations corresponding to clusters
274 will be output in the desired format.
276 MAKE_TREE - if present, produce a clustering-tree graph
278 PLOT_TREE - if present, the tree is written in PicTeX format to a file
280 PRINT_DIST - if present, distance (rmsd) matrix is printed to main output
282 PUNCH_DIST - if present, the upper-triangle of the distance matrix will be
285 6.2.3. Energy-term weights and parameter files
286 ----------------------------------------------
288 WSC (1.0) - side-chain-side-chain interaction energy
290 WSCP (1.0) - side chain-peptide group interaction energy
292 WELEC (1.0) - peptide-group-peptide group interaction energy
294 WEL_LOC (1.0)- third-order backbone-local correlation energy
296 WCORR (1.0) - fourth-order backbone-local correlation energy
298 WCORR5 (1.0) - fifth-order backbone-local correlation energy
300 WCORR6 (1.0) - sixth-order backbone-local correlation energy
302 WTURN3 (1.0) - third-order backbone-local correlation energy of pairs of
303 peptide groups separated by a single peptide group
305 WTURN4 (1.0) - fourth-order backbone-local correlation energy of pairs of
306 peptide groups separated by two peptide groups
308 WTURN6 (1.0) - sixth-order backbone-local correlation energy for pairs of
309 peptide groups separated by four peptide groups
311 WBOND (1.0) - virtual-bond-stretching energy
313 WANG (1.0) - virtual-bond-angle-bending energy
315 WTOR (1.0) - virtual-bond-torsional energy
317 WTORD (1.0) - virtual-bond-double-torsional energy
319 WSCCOR (1.0) - sequence-specific virtual-bond-torsional energy
321 WDIHC (0.0) - dihedral-angle-restraint energy
323 WHPB (1.0) - distance-restraint energy
325 SCAL14 (0.4) - scaling factor of 1,4-interactions
327 6.2.4. Molecule information
328 -----------------------------
330 6.2.4.1. Sequence information
331 -----------------------------
335 3-letter code: Sequence is input in format 20(1X,A3)
337 1-letter code: Sequence is input in format 80A1
339 6.2.4.2. Dihedral angle restraint information
340 ---------------------------------------------
342 This is the information about dihedral-angle restraints, if any are present.
343 It is specified only when WITH_DIHED_CONSTR is present in the first record.
345 1st line: ndih_constr - number of restraints (free format)
347 2nd line: ftors - force constant (free format)
349 Each of the following ndih_constr lines:
351 idih_constr(i),phi0(i),drange(i) (free format)
353 idih_constr(i) - the number of the dihedral angle gamma corresponding to the
356 phi0(i) - center of dihedral-angle restraint
358 drange(i) - range of flat well (no restraints for phi0(i) +/- drange(i))
360 6.2.4.3. Disulfide-bridge data
361 ------------------------------
363 1st line: NS, (ISS(I),I=1,NS) (free format)
365 NS - number of cystine residues forming disulfide bridges
367 ISS(I) - the number of the Ith disulfide-bonding cystine in the sequence
369 2nd line: NSS, (IHPB(I),JHPB(I),I=1,NSS) (free format)
371 NSS - number of disulfide bridges
373 IHPB(I),JHPB(I) - the first and the second residue of ith disulfide link
375 Because the input is in free format, each line can be split
377 6.2.5. Reference structure
378 --------------------------
380 If PDBREF is specified, filename with reference (experimental) structure,
381 otherwise UNRES internal coordinates as the theta, gamma, alpha, and beta
384 6.3. Main output file (out)
385 ------------------------------------------------
387 The main (with name INPUT_clust.out or INPUT_clust.out_000 for parallel runs)
388 output file contains the results of clustering (numbers of families
389 at different cut-off values, probabilities of clusters, composition of
390 families, and rmsd values corresponding to families (0 if rmsd was not
391 computed or read from WHAM-generated cx file).
393 The output files corresponding to non-master processors
394 (INPUT_clust.out_xxx where xxx>0 contain only the information up to the
395 clustering protocol. These files can be deleted right after the run.
397 Excerpts from the a sample output file are given below:
401 THERE ARE 20 FAMILIES OF CONFORMATIONS
403 FAMILY 1 CONTAINS 2 CONFORMATION(S):
404 42 -2.9384E+03 50 -2.9134E+03
407 Max. distance in the family: 14.0; average distance in the family: 14.0
409 FAMILY 2 CONTAINS 3 CONFORMATION(S):
410 13 -2.9342E+03 7 -2.8827E+03 10 -2.8682E+03
415 Maximum distance found: 137.82
416 Free energies and probabilities of clusters at 325.0 K
417 clust efree prob sumprob
418 1 -76.5 0.25035 0.25035
419 2 -76.5 0.24449 0.49484
420 3 -76.4 0.21645 0.71129
421 4 -76.4 0.20045 0.91174
422 5 -75.8 0.08826 1.00000
425 THERE ARE 5 FAMILIES OF CONFORMATIONS
427 FAMILY 1 WITH TOTAL FREE ENERGY -7.65228E+01 CONTAINS 548 CONFORMATION(S):
428 8363 -7.332E+013939 -7.332E+012583 -7.332E+017395 -7.332E+019932 -7.332E+01
429 5816 -7.332E+013096 -7.332E+012663 -7.332E+014099 -7.332E+016822 -7.332E+01
430 3176 -7.332E+017542 -7.332E+018933 -7.332E+017315 -7.332E+01 200 -7.332E+01.
432 5637 -7.062E+018060 -7.061E+013797 -7.060E+018800 -7.057E+016295 -7.057E+01
433 6298 -7.057E+012332 -7.057E+012709 -7.057E+01
435 Max. distance in the family: 16.5; average distance in the family: 8.8
438 6.4. Output coordinate files
439 ----------------------------
441 6.4.1. The internal coordinate (int) files
442 ------------------------------------------
444 The file with name COORD_clust.int contains the angles theta, gamma, alpha,
445 and beta of all residues of the leaders (lowest UNRES energy conformations
446 from consecutive families for CLUST-UNRES runs and lowest free energy
447 conformations for CLUST-WHAM runs). The format is the same as that of the
448 file output by UNRES; see section 9.1.1 of UNRES description.
450 For CLUST-WHAM runs, the first line contains more items:
452 number of family (format i5)
453 UNRES free energy of the conformation (format f12.3)
454 Free energy of the entire family (format f12.3)
455 number of disulfide bonds (format i2)
456 list disulfide-bonded pairs (format 2i3)
457 conformation class number (0 if not provided) (format i10)
459 6.4.2. The Cartesian coordinate (x) files
460 -----------------------------------------
462 The file with name COORD_clust.x contains the Cartesian coordinates of the
463 alpha-carbon and side-chain-center coordinates. The coordinate format is
464 as in section 9.1.2 of UNRES description and the first line contains the
467 Number of the family (format I5)
468 UNRES free energy of the conformation (format f12.3)
469 Free energy of the entire family (format f12.3)
470 number of disulfide bonds (format i2)
471 list disulfide-bonded pairs (format 2i3)
472 conformation class number (0 if not provided) (format i10)
477 The PDB files are in standard format (see
478 ftp://ftp.wwpdb.org/pub/pdb/doc/format_descriptions/Format_v33_Letter.pdf).
479 The ATOM records contain Calpha coordinates (CA) or UNRES side-chain-center
480 coordinates (CB). For oligomeric proteins chain identifiers are present
481 (A, B, ..., etc.) and each chain ends with a TER record. Coordinates of a
482 single conformation or multiple conformations The header (REMARK) records
483 and the contents depends on cluster run type. The next subsections are devoted
484 to different run types.
486 6.4.3.1. CLUST-UNRES runs
487 ---------------------------
489 The files contain the members of the families obtained from clustering such
490 that the lowest-energy conformation of a family is within ECUT kcal/mol higher
491 in energy than the lowest-energy conformation. Again, within a family, only
492 those conformations are output whose energy is within ECUT kcal/mol above
493 that of the lowest-energy member of the family. Families and the members
494 of a family within a family are ranked by increasing energy. The file names are:
496 COORD_xxxx.pdb where xxxx is the number of the family, if the family contains
497 only one member of if only one member is output.
499 COORD_xxxx_yyy.pdb where xxxx is the number of the family and yyy is the number
500 of the member of this family.
502 An example is the following:
504 REMARK R0001 ENERGY -2.93843E+03
505 ATOM 1 CA GLY 1 0.000 0.000 0.000
506 ATOM 2 CA HIS 2 3.800 0.000 0.000
507 ATOM 3 CB HIS 2 5.113 1.656 0.015
508 ATOM 4 CA VAL 3 5.927 -3.149 0.000
512 ATOM 346 CB GLU 183 -43.669 -32.853 -7.320
523 where ENERGY is the UNRES energy. The CONECT records defined the Calpha-Calpha
524 and Calpha-SC connection.
526 6.4.3.2. CLUST-WHAM runs
527 --------------------------
529 The program generates a file for each family with its members and a summary
530 file with ensemble-averaged conformations for all families. These are described
531 in the two next sections.
533 6.4.3.2.1. Conformation family files
534 ------------------------------------
536 For each family, the file name is COORD_TxxxK_yyyy.pdb, where yyyy is the
537 number of the family and xxx is the integer part of the temperature (K).
538 The first REMARK line in the file contains the information about the free
539 energy and average rmsd of the entire cluster and, for each conformation,
540 the initial REMARK line contains these quantities for this conformation.
541 Same applies to oligomeric proteins, for which the TER records separate the
542 chains and the ENDMDL record separates conformations.
543 An example is given below.
545 REMARK CLUSTER 1 FREE ENERGY -7.65228E+01 AVE RMSD 8.22
546 REMARK 1BDD L18G full clust ENERGY -7.33241E+01 RMS 10.40
547 ATOM 1 CA VAL 1 18.059 -33.585 4.616 1.00 5.00
548 ATOM 2 CB VAL 1 18.720 -32.797 3.592 1.00 5.00
552 ATOM 115 CA LYS 58 29.641 -44.596 -8.159 1.00 5.00
553 ATOM 116 CB LYS 58 27.593 -45.927 -8.930 1.00 5.00
562 REMARK 1BDD L18G full clust ENERGY -7.33240E+01 RMS 10.04
563 ATOM 1 CA VAL 1 3.174 2.833 -34.386 1.00 5.00
564 ATOM 2 CB VAL 1 3.887 2.811 -33.168 1.00 5.00
567 ATOM 115 CA LYS 58 16.682 6.695 -20.438 1.00 5.00
568 ATOM 116 CB LYS 58 18.925 5.540 -20.776 1.00 5.00
576 6.4.3.2.2. Average-structure file
577 ---------------------------------
579 The file name is COORD_T_xxxK_ave.pdb. The entries are in pairs; the first
580 one is cluster-averaged conformation and the second is a family member which
581 has the lowest rmsd from this average conformation. Computing average
582 conformations is explained in section 2.5 of ref 3. Example excerpts from
583 an entry corresponding to a given family are shown below. The last
584 number in each ATOM record is the rmsd of the mean coordinate of a given
585 atom averaged over the cluster.
587 REMAR AVERAGE CONFORMATIONS AT TEMPERATURE 300.00
589 REMARK 2HEP clustering 300K ENERGY -8.22572E+01 RMS 3.29
590 ATOM 1 CA MET 1 -17.748 48.148 -19.284 1.00 5.96
591 ATOM 2 CB MET 1 -17.373 47.911 -19.294 1.00 6.34
592 ATOM 3 CA ILE 2 -18.770 49.138 -18.133 1.00 3.98
596 ATOM 80 CB PHE 41 -14.353 44.680 -15.642 1.00 2.62
597 ATOM 81 CA ARG 42 -11.619 41.645 -13.117 1.00 4.06
598 ATOM 82 CB ARG 42 -11.330 40.378 -13.313 1.00 5.19
610 REMARK 2HEP clustering 300K ENERGY -8.22572E+01 RMS 3.29
611 ATOM 1 CA MET 1 -37.698 40.489 -32.408 1.00 5.96
612 ATOM 2 CB MET 1 -38.477 39.426 -34.159 1.00 6.34
616 ATOM 80 CB PHE 41 -35.345 50.342 -31.371 1.00 2.62
617 ATOM 81 CA ARG 42 -33.603 54.332 -27.130 1.00 4.06
618 ATOM 82 CB ARG 42 -33.832 53.074 -24.415 1.00 5.19
632 6.5. The conformation-distance file
633 -----------------------------------
635 The file name is INPUT_clust.rms. It contains the upper-diagonal part of
636 the matrix of rmsds between conformations and differences between their
639 i,j,rmsd,energy(j)-energy(i) (format 2i5,2f10.5)
641 where i and j, j>i are the numbers of the conformations, rmsd is the rmsd
642 between conformation i and conformation j and energy(i) and energy(j) are
643 the UNRES energies of conformations i and j, respectively.
645 6.6. The clustering-tree PicTeX file
646 ------------------------------------
648 This file contains the PicTeX code of the clustering tree. The file name is
649 INPUT_clust.tex. It should be supplemented with LaTeX preamble and final
650 commands or incorporated into a LaTeX source and compiled with LaTeX. The
651 picture is produced by running LaTeX followed by dvips, dvipdf or other command
652 to convert LaTeX-generated dvi files into a human-readable files.
658 Faculty of Chemistry, University of Gdansk
659 ul. Sobieskiego 18, 80-952 Gdansk Poland.
660 phone: +48 58 523 5430
662 e-mail: adam@chem.univ.gda.pl
664 Dr. Cezary Czaplewski
665 Faculty of Chemistry, University of Gdansk
666 ul. Sobieskiego 18, 80-952 Gdansk Poland.
667 phone: +48 58 523 5430
669 e-mail: czarek@chem.univ.gda.pl
671 Prepared by Adam Liwo, 02/19/12