Biophysics module 149

X-ray Crystallography of macromolecules

Additional reading materials:

1. For Crystallization

(a) "Protein Crystallography", Blundell and Johnson, (1976) chapter 3, pp. 59-82, Academic Press, London. (b) McPherson in "Crystallization of Membrane Proteins" (ed. by H. Michel), pp. 1-52, CRC Press, 1990.

2. For Crystal symmetry

(a) "Protein Crystallography", Blundell and Johnson, (1976) chapter 4, pp. 83-106, Academic Press, London. (b) "Principles of Protein X-ray Crystallography", Jan Drenth, (1994) chapter 3, pp. 54-72, Springer-Verlag, New York. (c) "X-ray Structure Determination", 2nd edition, Stout and Jensen, (1989) chapter 3, pp. 41-73, Wiley & Sons, New York.

3. For Principles of X-ray diffraction

(a) "Biophysical Chemistry", Cantor and Schimmel, (1980) pp. 687-729, Freeman, New York. (b) "Principles of Protein X-ray Crystallography", Jan Drenth, (1994) chapter 4, pp. 73-116, Springer-Verlag, New York. (c) "Protein Crystallography", Blundell and Johnson, (1976) chapter 5, pp. 107-140, Academic Press, London. (d) "X-ray Structure Determination", 2nd edition, Stout and Jensen, (1989) chapter 2, pp. 18-40, Wiley & Sons, New York.

4. For Phase Problem

(a) "X-ray Structure Determination", 2nd edition, Stout and Jensen, (1989) chapter 12, pp. 279-291, Wiley & Sons, New York. (b) "Protein Crystallography", Blundell and Johnson, (1976) pp. 151-164, and pp. 337-381, Academic Press, London.

5. For Preparation of heavy atom derivative

(a) "Protein Crystallography", Blundell and Johnson, (1976) chapter 8, pp. 183-239, Academic Press, London.

6. For Molecular replacement

(a) "Protein Crystallography", Blundell and Johnson, (1976) chapter 16, pp. 443-465, Academic Press, London.

7. For Electron density interpretation

(a) "Protein Crystallography", Blundell and Johnson, (1976) pp. 381-403, Academic Press, London.

8. For Structure refinement

(a) "Principles of Protein X-ray Crystallography", Jan Drenth, (1994) chapter 13, pp. 242-263, Springer-Verlag, New York. (b) "Protein Crystallography", Blundell and Johnson, (1976) pp. 404-436, Academic Press, London.

Determination of a macromolecular structure can be essentially divided into seven steps.

1. Crystallization. Single crystals are required for the analysis of three-dimensional structures by X-ray diffraction. Vapor diffusion and dialysis are the two major methods for obtaining single crystals of biological macromolecules. A crystal diffracting to 3 Å resolution or better is suitable for resolving the positions of amino acid residues or individual atoms.

2. Preliminary characterization of crystals. This step has two purposes: (1) determination of the crystal system, cell parameters, and space group, and (2) examination of crystal quality and diffraction power.

3. Data collection. Collection of diffraction data has become a more or less automatic procedure. The commonly used detectors are multiwire detectors (Hamlin and Siemens), phosphor image plates (Rigaku and MAR Research), TV detectors (FAST), and charge-coupled devices (CCD, in development). Film methods are out of date, but are sometimes still used for space group characterization.

4. Heavy atom derivative preparation. Compounds of heavy metals such as mercury and platinum can bind specifically to certain residues such as cysteine and histidine, and thus form a heavy atom derivative of a native protein. Heavy atom derivatives are mostly prepared by soaking the native crystal in a buffer containing a heavy metal compound, and sometimes by cocrystallizing the protein with a heavy metal compound.

5. Phase determination. Phasing is the central problem in the structure determination. The methods include (1) Patterson function, (2) multiple isomorphous replacement, (3) multiwavelength anomalous diffraction, (4) molecular replacement, (5) direct methods and maximum entropy.

6. Electron density map interpretation and model building. This step depends largely on personal experience because an electron density map reveals only the heights of density peaks, not an atomic model. A wrong model may occasionally be built from an electron density map.

7. Structure refinement. A raw model built from an electron density map is refined by a least-squares or molecular dynamics protocol. A correct structure of a macromolecule typically has an R-factor below 20% as well as good molecular geometry, such as bond lengths and bond angles close to their ideal values.
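The R-factor quoted in step 7 is not defined in these notes. As a minimal sketch, assuming the conventional definition $R = \sum ||F_{obs}| - |F_{calc}|| / \sum |F_{obs}|$ over all reflections, it can be computed as follows. The function name `r_factor` and the amplitudes are made up for illustration and do not come from a real data set.

```python
def r_factor(f_obs, f_calc):
    """Conventional R-factor: sum(||Fo| - |Fc||) / sum(|Fo|) over reflections."""
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    if denominator == 0:
        raise ValueError("sum of |F_obs| is zero")
    return numerator / denominator


# Hypothetical structure factor amplitudes, for illustration only.
f_obs = [100.0, 50.0, 25.0, 25.0]
f_calc = [90.0, 55.0, 30.0, 20.0]

r = r_factor(f_obs, f_calc)
print(f"R = {r:.3f} ({100 * r:.1f}%)")
```

With these made-up numbers the summed discrepancy is 25 out of a total observed amplitude of 200, so the value falls below the 20% guideline mentioned above.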

In this introductory course on X-ray diffraction, we will concentrate on the principles of protein crystallography and aim at the basic concepts of X-ray crystallography. For the practical determination of macromolecular structures, you are advised to train in a crystallographic laboratory.

Macromolecules in biological systems can be grouped into four major classes: (1) proteins, (2) nucleic acids such as DNA and RNA, (3) lipids, and (4) carbohydrates. Other biological molecules such as vitamins and metals are small molecular compounds, so their crystallization will not be discussed in this chapter. Lipids and carbohydrates often have variable chemical composition and molecular weight, and are difficult to crystallize because of this heterogeneity. We will focus on the crystallization of proteins and nucleic acids in this chapter.

2.1. General consideration of crystallization

The first step towards crystallization of a macromolecule is to prepare a solution of the macromolecule. Water is the most common solvent for proteins and nucleic acids, while detergents are required to solubilize hydrophobic molecules such as membrane proteins. Once a stable solution has been prepared, crystallization can be carried out by adding precipitants to it. Crystals may be obtained if the precipitation rate and other crystallization conditions are carefully controlled. If precipitation is too fast, an amorphous precipitate is obtained. Thus, the key to obtaining crystals is control of the precipitation process. (From McPherson in "Crystallization of Membrane Proteins" (ed. by H. Michel), pp. 1-52, CRC Press, 1990.)

Crystallization of macromolecules can be divided into three main stages: (1) bringing a macromolecular solution to saturation and supersaturation, (2) formation of crystal seeds or nuclei, and (3) crystal growth. Supersaturation is a metastable solution state from which macromolecules will precipitate within a certain period of time. Small crystal nuclei are not thermodynamically stable and exist in an equilibrium between formation and dissolution. This stage of crystallization is called nucleation. The free energy of crystallization at the nucleation stage increases until the nuclei reach a critical size $R_c$ (Fig. 2.3). Beyond the critical size, crystals grow steadily while the free energy of crystallization decreases.
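The shape of the free energy curve described above can be illustrated with classical nucleation theory. This is an assumption brought in for illustration, since the notes do not give a formula. For a spherical nucleus of radius $R$, a surface term $4\pi R^2\gamma$ competes with a bulk term $-(4/3)\pi R^3 \Delta g$, which gives a maximum at $R_c = 2\gamma/\Delta g$. The symbols `gamma` and `dg_v` and their values are hypothetical and in arbitrary units.

```python
import math


def delta_g(radius, gamma, dg_v):
    """Free energy of a spherical nucleus: surface cost minus bulk gain."""
    surface = 4.0 * math.pi * radius**2 * gamma
    bulk = (4.0 / 3.0) * math.pi * radius**3 * dg_v
    return surface - bulk


def critical_radius(gamma, dg_v):
    """Radius at which delta_g is maximal (its derivative is zero)."""
    return 2.0 * gamma / dg_v


# Hypothetical parameters in arbitrary units (assumptions, not measured values).
gamma = 1.0  # surface free energy per unit area
dg_v = 2.0   # bulk free energy gain per unit volume (positive when supersaturated)

r_c = critical_radius(gamma, dg_v)
barrier = delta_g(r_c, gamma, dg_v)

for radius in (0.5 * r_c, r_c, 1.5 * r_c):
    print(f"R = {radius:.2f}  delta_G = {delta_g(radius, gamma, dg_v):.3f}")
```

In this sketch the free energy rises up to `r_c` and falls beyond it, which matches the qualitative behavior attributed to Fig. 2.3.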

Fig. 2.3. Free energy of crystallization versus size of crystals.

2.2. Factors affecting crystallization

Factors affecting the growth of macromolecular crystals include (1) purity of macromolecules, (2) precipitants, (3) pH of solution, (4) buffer, (5) initial protein concentration, (6) organic solvents, (7) salts or ions, (8) detergents, and (9) temperature. Other factors such as gravity, volume of the crystallization sample, and vibration of the environment also affect crystallization of macromolecules, but are less controllable. Binding of ligands or substrates to proteins stabilizes the conformation of macromolecules and thus usually helps crystallization.

Three features are general for crystallization of macromolecules: (1) crystallization is an experience-based science and no theory can guarantee the growth of macromolecular crystals, (2) each protein or nucleic acid has a different crystallization profile, so individual experiments have to be set up for each sample, and (3) hundreds or thousands of trials may be needed to obtain crystals usable for X-ray diffraction because so many factors may affect crystal growth. Thus, crystallization of macromolecules is often very time-consuming and tedious, although knowledge and experience gained from crystallizing other samples may reduce the number of experiments substantially. We discuss below the general aspects of macromolecular crystallization as a guideline.

Although crystallization has been used as a technique to purify inorganic and organic molecules, a macromolecular sample needs a purity of 95% or better for crystallization trials. The reason is probably that large macromolecules interact easily with impurities, which disturbs crystallization. The second requirement for macromolecular crystallization is that the sample does not aggregate. When a macromolecule aggregates, species of different molecular weight, such as monomer, dimer, trimer, etc., may coexist in the sample solution. These oligomers behave like different molecules and thus cannot crystallize in the same crystal lattice. Last, it is difficult to crystallize a macromolecule that comprises several isoforms. The term isoform here refers to molecules having the same chemical sequence or composition but different molecular conformations. Isoforms of a macromolecule usually have different isoelectric points (pIs), and a mixture of isoforms is difficult to crystallize.

Precipitants are the first factor to be screened. Ammonium sulphate and polyethylene glycol (PEG) are two commonly used precipitants. Ammonium sulphate is ionic and precipitates macromolecules by the salting out effect. At high concentration and very basic pH, ammonium sulphate may decompose to release ammonia. In that case, lithium sulphate is often used as a replacement. Other ionic precipitants are potassium (or ammonium) phosphate, sodium chloride, ammonium formate, etc. Polyethylene glycol is a polar molecule with a high affinity for water. Its effect on crystallization is thought to arise from its interaction with water, which concentrates the macromolecules to saturation. However, it is also possible that polyethylene glycol interacts directly with macromolecules and so affects their crystallization. Polyethylene glycol is a linear polymer with a wide distribution of molecular weight and has the formula HO-(CH$_2$-CH$_2$-O)$_n$-H, where $n$ is the number of repeating units. Commonly used PEGs are PEG3350 (average molecular weight 3350), PEG8000, and sometimes PEG400, PEG1000, PEG6000, PEG10000, and PEG20000. In addition to ammonium sulphate and PEG, organic solvents at high concentration, such as 30% isopropanol or 2-methyl-2,4-pentanediol (MPD), can also serve as precipitants.

Crystallization of most proteins and nucleic acids is sensitive to pH in terms of both solubility and crystallization behavior. A protein has its lowest solubility at the pH equal to its pI. However, macromolecules can crystallize over a wide range of pH, and it is not necessary to crystallize a protein at its pI. In fact, crystallization at different pHs may sometimes yield different crystal forms. Thus, the best pH for crystallization is one of the most important factors to be determined from experimental trials.

A buffer system is used mainly to maintain a stable pH in the crystallization solution. From this point of view, it is better to use a buffer with a large buffering capacity at the pH of the crystallization solution, such as Tris for pH 7.5-9.3, phosphate for pH around 7, and maleic acid for acidic pHs. However, for some proteins and nucleic acids that are sensitive to pH and buffer components, one has to use a buffer outside its buffering range. For example, the enzyme chorismate mutase was crystallized in a buffer of 5 mM Tris.HCl, 1 mM 2-mercaptoethanol, 0.5 mM NaN$_3$, 0.1 mM EDTA, and 12% PEG3350 at pH 5.3 in microdialysis wells, but failed to crystallize in organic acid buffers such as maleic acid, which can produce a stable pH in the acidic range (Chook, Y. M., Gray, J. V., Ke, H., & Lipscomb, W. N. (1994), J. Mol. Biol., 240, 476-500). Usually, a 20 to 50 mM concentration of a buffer compound is enough to maintain a stable pH.

It is always preferable to use an initial protein concentration as close as possible to saturation. If the protein solution is too dilute, the protein may never precipitate. The choice of the initial concentration depends basically on the solubility of the macromolecule. Most proteins and nucleic acids have been crystallized at an initial concentration of 10 to 20 mg/ml. In some extreme cases, however, a very high or very low initial protein concentration has to be used for crystal growth. For example, 200 mg/ml of cytochrome C (Louie, G. V. & Brayer, G. D. (1990), J. Mol. Biol., 214, 527-555) and 2 mg/ml of the cell cycle inhibitor p18 successfully yielded crystals for X-ray diffraction.

Frequently used organic solvents are methanol, ethanol, isopropanol, dimethyl sulfoxide (DMSO), 2-methyl-2,4-pentanediol (MPD), and dioxane. Organic solvents help crystallization of macromolecules by improving their polar interactions.
The concentration of organic solvents used as crystallization additives ranges from 2 to 10% (v/v). At high concentrations, such as $>$30%, organic solvents may serve as precipitants. In some cases, organic solvents have a strong impact on crystallization. For example, 6.7% ethanol was found to enhance the yield of cyclophilin crystals tremendously (Ke, H. M., Zydowsky, L. D., Liu, J., & Walsh, C. T. (1991), Proc. Natl. Acad. Sci. USA, 88, 9483-9487), while 5% methanol or DMSO is absolutely required to obtain crystals of the HIV-1 GAG p24 protein.

Salts are used to generate a suitable ionic strength in a buffer system for crystallization of macromolecules. The ionic strength ($\mu$) is defined as

$$\mu = \frac{1}{2} \sum_i c_i z_i^2$$

where $z_i$ is the valence of ion $i$ of the salt and $c_i$ is its molar concentration. At low salt concentration, ions may improve the interactions between macromolecules and water, and therefore increase the solubility of macromolecules in water, as described by

$$\log S - \log S_0 = \frac{A z_+ z_- \sqrt{\mu}}{1 + a B \sqrt{\mu}}$$

where $S$ and $S_0$ are the solubilities in the presence and absence of electrolyte, $z_+$ and $z_-$ are the valences of the cation and anion, the constants $A$ and $B$ depend on the temperature and the dielectric constant, $a$ is the average size of the ions, and $\mu$ is the ionic strength. This phenomenon is known as the "salting in" effect. At high concentration, salts compete with macromolecules for water and thus concentrate the macromolecules. Therefore, salts at high concentration serve as precipitants, which is known as the "salting out" effect. The above two equations show that ions of high valence have a much stronger impact on solubility than ions of low valence. For some proteins that are sensitive to ions, it is important to find out which ion, the cation or the anion, is the controlling factor in crystallization. Furthermore, some ions such as Zn$^{2+}$ and Mg$^{2+}$, although they carry the same charge, may have a dramatically different impact on crystallization.
Attention also needs to be paid to crystallization in the presence of divalent metals because some divalent ions such as zinc may easily precipitate out of solution to form organic metal crystals.

Detergents are mostly used for crystallization of membrane proteins, but are also considered to improve crystallization of some water-soluble proteins ("Crystallization of Membrane Proteins", edited by Hartmut Michel, 1991, CRC Press, Boston). Frequently used detergents are Triton X-100, $\beta$-octyl glucoside ($\beta$-OG), lauryldimethylamine $N$-oxide (LDAO), heptane-1,2,3-triol, and benzamidine hydrochloride. Detergents help the hydrophobic interactions of macromolecules. The detergent concentrations that have successfully produced crystals are in the range of 0.1 to 1%.

Temperature affects the solubility and crystallization behavior of macromolecules. The dependence of solubility on temperature varies from protein to protein: the solubility of some macromolecules increases with temperature, while that of others decreases. It is convenient to crystallize a protein either at 4 $^\circ$C or at room temperature. Many proteins have been crystallized at room temperature. If a protein is not heat-stable, crystallization in a cold room is preferred. Temperature is usually not a sensitive control factor for crystallization of most macromolecules, but it needs to be tested for heat-sensitive macromolecules. In some special cases, however, temperature may be a key factor. For example, the HIV-1 GAG p24 protein crystallized at room temperature but precipitated as oil drops at 4 $^\circ$C under the same crystallization conditions.
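To make the valence dependence of the ionic strength in the salts discussion concrete, the following minimal sketch assumes the standard definition $\mu = \frac{1}{2}\sum_i c_i z_i^2$ with molar concentrations and complete dissociation of the salt. The helper `ionic_strength` and the example concentrations are chosen for illustration only.

```python
def ionic_strength(ions):
    """Ionic strength from (molar concentration, charge) pairs: 0.5 * sum(c * z**2)."""
    return 0.5 * sum(conc * charge**2 for conc, charge in ions)


# Each salt is assumed to dissociate completely (an idealization).
nacl = ionic_strength([(0.1, +1), (0.1, -1)])   # 0.1 M NaCl
mgcl2 = ionic_strength([(0.1, +2), (0.2, -1)])  # 0.1 M MgCl2
amso4 = ionic_strength([(0.2, +1), (0.1, -2)])  # 0.1 M (NH4)2SO4

for name, value in (("NaCl", nacl), ("MgCl2", mgcl2), ("(NH4)2SO4", amso4)):
    print(f"0.1 M {name}: ionic strength = {value:.2f} M")
```

At the same molar concentration, the salts containing a divalent ion give about three times the ionic strength of NaCl, which is the valence effect noted in the text.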

2.3. Methods for crystallization

Four methods have been used successfully to crystallize macromolecules: (1) batch crystallization, (2) vapor diffusion, (3) microdialysis, and (4) seeding.

Fig. 2.4. Crystallization by vapor diffusion, (a) hanging drop and (b) sitting drop.

Batch crystallization is a classic technique from the early crystallization of enzymes, but is still in use for some proteins such as cytochrome C and haemoglobin. The method is simple: a precipitant (salt, organic solvent or PEG) is added to a protein solution in a small crystallization tube or bottle. After the solution has stood for several days, crystals may form. This method allows the precipitant to be added gradually or step by step, but it often brings the protein solution to saturation and precipitation too quickly.

Vapor diffusion crystallizes proteins in a closed chamber in which the crystallization buffer containing precipitants is placed in a reservoir and a protein drop is placed on a cover slide facing the reservoir. After the chamber is sealed, vapor exchanges between the protein drop and the crystallization buffer. As a result, the protein solution is gradually brought to saturation and crystallizes. The frequently used setups for vapor diffusion are the hanging drop and the sitting drop (Fig. 2.4). The advantage of vapor diffusion is that it is a slow process, which gives the protein enough time to pack into a crystal lattice. In addition, it requires only a small amount of protein sample, such as 2 $\mu$l per drop.

Microdialysis is another major method for crystallizing macromolecules. The protein solution is filled into a microdialysis well and covered with a dialysis membrane (Fig. 2.5). The dialysis button is then immersed in a crystallization solution that contains precipitants such as ammonium sulphate or PEG. The protein, which is confined to the microdialysis button, crystallizes after the precipitant gradually dialyzes into the button.
Since all the conditions inside the dialysis button, such as pH, ionic strength, etc., except for the protein, are identical to those of the crystallization buffer once dialysis reaches equilibrium, the dialysis method controls the crystallization conditions precisely and has good reproducibility. The disadvantage of the dialysis method is that it requires a relatively large amount of material, at least 5 $\mu$l of protein solution for each trial. However, large crystals are often obtained from dialysis because a large volume of protein is used in the crystallization.

Crystallization by seeding is a technique in which crystal seeds are transferred into a protein solution and protein is allowed to deposit on the surface of the seeds to obtain large crystals. Two ways have been used to seed proteins: microseeding and macroseeding. In microseeding, crystal seeds are ground into tiny pieces, and a hair is streaked first through the solution containing the ground seeds and then through the protein solution. In this way, the crystal seeds are transferred into the protein solution. Macroseeding transfers one crystal into the protein solution as the seed and lets protein molecules deposit on its surface. For macroseeding, the surface of the seed crystal should be washed with a buffer to regenerate the growing surface. In both seeding techniques, the protein solution should be close to saturation, and the seeds should not dissolve completely after being transferred into the protein solution.

Fig. 2.5. Crystallization by microdialysis.
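The gradual concentration of the drop in vapor diffusion can be estimated with a simple mass balance. The sketch below rests on assumptions that are not stated in the notes: the drop is made by mixing protein solution with reservoir solution, only water moves through the vapor phase, the reservoir is large enough to stay unchanged, and equilibrium is reached when the precipitant concentration in the drop equals that in the reservoir. The function name and the numbers are hypothetical.

```python
def equilibrated_drop(protein_mg_ml, reservoir_precipitant, v_protein_ul, v_reservoir_ul):
    """Idealized end state of a vapor diffusion drop (water transfer only)."""
    if v_protein_ul <= 0 or v_reservoir_ul <= 0 or reservoir_precipitant <= 0:
        raise ValueError("volumes and precipitant concentration must be positive")
    v0 = v_protein_ul + v_reservoir_ul
    # Concentrations immediately after mixing the drop.
    precipitant0 = reservoir_precipitant * v_reservoir_ul / v0
    protein0 = protein_mg_ml * v_protein_ul / v0
    # Water leaves the drop until its precipitant matches the reservoir.
    factor = reservoir_precipitant / precipitant0
    return {
        "concentration_factor": factor,
        "final_volume_ul": v0 / factor,
        "final_protein_mg_ml": protein0 * factor,
    }


# Hypothetical setup: 1 ul of 10 mg/ml protein + 1 ul of 2.0 M ammonium sulphate reservoir.
result = equilibrated_drop(10.0, 2.0, 1.0, 1.0)
for key, value in result.items():
    print(f"{key}: {value:.2f}")
```

For an equal-volume mix the model predicts that the drop shrinks to half its volume, so the protein returns to its stock concentration while the precipitant rises from half to full reservoir strength. Real drops deviate from this because the protein and additives also affect the vapor pressure.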

2.4. Practical aspects of crystallization

When a large amount of protein or nucleic acid (say 5 mg or more) is ready for crystallization, the first thing to check is the purity of the material by denaturing SDS-PAGE. If only a single band is seen on the SDS gel with a 20 $\mu$g load, the material probably has a purity better than 95% and is suitable for crystallization. Secondly, one needs to check whether the protein has an aggregation problem. If a protein aggregates into a series of oligomers with different molecular weights, crystals are unlikely to be obtained. However, minor aggregation may be fixed by passing the protein sample through a molecular sieving column. An example is the binding protein of the cyclin-dependent kinase (CksHs2). The monomeric form of CksHs2 was successfully separated from the hexameric form by gel filtration chromatography and crystallized in the space group C2 (Parge, H. E., Arvai, A. S., Murtari, D. J., Reed, S. I., Tainer, J. A. (1993), Science, 262, 387-395). For some proteins, the aggregation problem may be solved by trimming the protein with proteases. One successful example is the tumor suppressor protein p53. Full-length p53 aggregates badly, but its core domain, which was identified by protease experiments, was crystallized and its structure was determined at 2.2 Å resolution (Cho, Y., Gorina, S., Jeffrey, P. D., & Pavletich, N. P. (1994), Science, 265, 346-355). Aggregation can be detected by several techniques, among which dynamic light scattering is a powerful analytical tool. Last, one needs to check for the coexistence of isoforms or multiple conformations of a protein by isoelectric focusing electrophoresis (IEF). A protein sample that shows multiple bands in the IEF gel has a low probability of crystallizing. Crystallization is a time-consuming job and requires patience.
If $n$ samplings are taken for each of $m$ factors, the total number of tests, $n^m$, is usually too large for a complete test because so many factors need to be tried experimentally. A quick approach to crystallization of macromolecules is the "factorial method", in which the cross interactions or mutual effects of the tested factors can be analyzed mathematically (Carter, Jr., C. W. & Carter, C. W. (1979), J. Biol. Chem., 254, 12219-12223). Thus, the number of experiments can be greatly reduced by sampling only part of the full factorial matrix. An example of the factorial approach is the "sparse matrix", which sets up 50 experiments for a preliminary crystallization screen (Jancarik, J. & Kim, S. H. (1991), J. Appl. Cryst., 24, 409-411), as listed in "Crystallization Research Tools" (Hampton Research, 5(1), 4).

Crystallization screen$^\dagger$

1. 30% MPD, 0.1 M Na acetate, pH 4.6, 0.02 M calcium chloride
2. 0.4 M K, Na tartrate
3. 0.4 M ammonium phosphate
4. 2.0 M ammonium sulphate, 0.1 M Tris.HCl, pH 8.5
5. 30% MPD, 0.1 M sodium HEPES, pH 7.5, 0.2 M sodium citrate
6. 30% PEG 4000, 0.1 M Tris.HCl, pH 8.5, 0.2 M Mg chloride
7. 1.4 M sodium acetate, 0.1 M sodium cacodylate, pH 6.5
8. 30% 2-propanol, 0.1 M sodium cacodylate, pH 6.5, 0.2 M Na citrate
9. 30% PEG 4000, 0.1 M Na citrate, pH 5.6, 0.2 M ammonium acetate
10. 30% PEG 4000, 0.1 M Na acetate, pH 4.6, 0.2 M ammonium acetate
11. 1.0 M ammonium phosphate, 0.1 M sodium citrate, pH 5.6
12. 30% 2-propanol, 0.1 M Na Hepes, pH 7.5, 0.2 M Mg chloride
13. 30% PEG 400, 0.1 M Tris.HCl, pH 8.5, 0.2 M Mg chloride
14. 28% PEG 400, 0.1 M Na Hepes, pH 7.5, 0.2 M Ca chloride
15. 30% PEG 8000, 0.1 M Na cacodylate, pH 6.5, 0.2 M ammonium sulphate
16. 1.5 M Li sulphate, 0.1 M Na Hepes, pH 7.5
17. 30% PEG 4000, 0.1 M Tris.HCl, pH 8.5, 0.2 M Li sulphate
18. 20% PEG 8000, 0.1 M Na cacodylate, pH 6.5, 0.2 M ammonium sulphate
19. 30% 2-propanol, 0.1 M Tris.HCl, pH 8.5, 0.2 M ammonium acetate
20. 25% PEG 4000, 0.1 M Na acetate, pH 4.6, 0.2 M ammonium sulphate
21. 30% MPD, 0.1 M sodium cacodylate, pH 6.5, 0.2 M Mg acetate
22. 30% PEG 4000, 0.1 M Tris.HCl, pH 8.5, 0.2 M Na acetate
23. 30% PEG 400, 0.1 M Na Hepes, pH 7.5, 0.2 M Mg chloride
24. 20% 2-propanol, 0.1 M Na acetate, pH 4.6, 0.2 M Ca chloride
25. 1.0 M Na acetate, 0.1 M imidazole, pH 6.5
26. 30% MPD, 0.1 M sodium citrate, pH 5.6, 0.2 M ammonium acetate
27. 20% 2-propanol, 0.1 M Na Hepes, pH 7.5, 0.2 M Na citrate
28. 30% PEG 8000, 0.1 M Na cacodylate, pH 6.5, 0.2 M Na acetate
29. 0.8 M K, Na tartrate, 0.1 M Na Hepes, pH 7.5
30. 30% PEG 8000, 0.2 M ammonium sulphate
31. 30% PEG 4000, 0.2 M ammonium sulphate
32. 2.0 M ammonium sulphate
33. 4.0 M Na formate
34. 2.0 M Na formate, 0.1 M Na acetate, pH 4.6
35. 1.6 M K, Na phosphate, 0.1 M Na Hepes, pH 7.5
36. 8% PEG 8000, 0.1 M Tris.HCl, pH 8.5
37. 8% PEG 4000, 0.1 M Na acetate, pH 4.6
38. 1.4 M Na citrate, 0.1 M Na Hepes, pH 7.5
39. 2% PEG 400, 2.0 M ammonium sulphate, 0.1 M Na Hepes, pH 7.5
40. 20% 2-propanol, 20% PEG 4000, 0.1 M Na citrate, pH 5.6
41. 10% 2-propanol, 20% PEG 4000, 0.1 M Na Hepes, pH 7.5
42. 20% PEG 8000, 0.05 M potassium phosphate
43. 30% PEG 1500
44. 0.2 M Mg formate
45. 18% PEG 8000, 0.1 M Na cacodylate, pH 6.5, 0.2 M Zn acetate
46. 18% PEG 8000, 0.1 M Na cacodylate, pH 6.5, 0.2 M Ca acetate
47. 2.0 M ammonium sulphate, 0.1 M Na acetate, pH 4.6
48. 2.0 M ammonium phosphate, 0.1 M Tris.HCl, pH 8.5
49. 2% PEG 8000, 1.0 M Li sulphate
50. 15% PEG 8000, 0.5 M Li sulphate

$^\dagger$ MPD = 2-methyl-2,4-pentanediol, PEG = polyethylene glycol.

For many proteins, this preliminary screening may yield crystals under several conditions. To obtain diffraction-quality crystals, the conditions need to be refined. In many cases, milder conditions, such as slightly less precipitant or the addition of organic solvents, lead to slower crystallization and thus yield bigger and better crystals.
Another useful way to reduce the number of experiments is a two-step crystallization: (1) first screen pH and precipitants, and (2) then evaluate other factors such as buffer and organic solvent at fixed pH and precipitant. The philosophy behind this design is to find the sensitive factors of crystallization, which are the obvious control factors. For many proteins, pH is a sensitive factor and has a strong impact on crystallization. The proper amount of precipitant also needs to be found in the very first experiments. Too much precipitant causes fast precipitation of the protein without crystallization, while too little precipitant cannot precipitate the protein at all. The following is an example of screening pH and two common precipitants (polyethylene glycol and ammonium sulphate) by dialysis.

Crystallization by dialysis

1. 20 mM Tris base, 6% PEG3350, pH 8.5
2. 20 mM Tris base, 6% PEG3350, pH 7.5
3. 20 mM maleic acid, 6% PEG3350, pH 6.5
4. 20 mM maleic acid, 6% PEG3350, pH 5.5
5. 20 mM maleic acid, 10% ammonium sulphate, pH 6.5
6. 20 mM maleic acid, 10% ammonium sulphate, pH 5.5
7. 20 mM sodium acetate, 10% ammonium sulphate, pH 4.5

Once the pH and the concentration of precipitant are determined, other factors such as buffer, organic solvent, and divalent ions can be tested to obtain crystals usable for X-ray diffraction.
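The size of a complete factorial screen, and the saving from testing only part of it, can be sketched as follows. The factor levels are hypothetical examples loosely based on the conditions in this section. The random subset is only a stand-in: it is not the balanced incomplete factorial design of Carter and Carter, nor the published sparse matrix.

```python
import itertools
import random

# Hypothetical factor levels, for illustration only.
factors = {
    "precipitant": ["6% PEG3350", "10% ammonium sulphate", "30% MPD"],
    "pH": [4.5, 5.5, 6.5, 7.5, 8.5],
    "salt": ["none", "0.2 M Mg chloride", "0.2 M Li sulphate"],
    "temperature": ["4 C", "room temperature"],
}

# Complete factorial design: every combination of every level.
full_grid = list(itertools.product(*factors.values()))
print(f"complete factorial: {len(full_grid)} experiments")

# With n levels for each of m factors the count is n**m,
# e.g. 5 levels for each of the 9 factors listed in section 2.2.
print(f"5 levels x 9 factors: {5**9} experiments")

# Test only a subset of the full grid (plain random sampling, fixed seed).
rng = random.Random(0)
subset = rng.sample(full_grid, 12)
for condition in subset:
    print(dict(zip(factors.keys(), condition)))
```

A real incomplete factorial design would additionally balance how often each level and each pair of levels appears, so that the effects of individual factors can still be separated in the analysis.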

3. Characterization of crystals

3.1. Crystal systems

A single crystal is a periodic arrangement of molecules in three dimensions. The repeating unit, or unit cell, is characterized by three vectors a, b and c, and by the three angles $\alpha$, $\beta$ and $\gamma$ between them. The unit cells are arranged throughout the crystal by lattice translations. A typical protein crystal with a size of 0.3 x 0.3 x 0.3 mm and a unit cell of 100 Å in each dimension is estimated to contain about 10$^{13}$ unit cells. There are seven fundamental types of unit cell, each of which defines a crystal system.

Fig. 3.1. Seven crystal systems and fourteen Bravais lattices

Triclinic: $a \neq b \neq c$ and $\alpha \neq \beta \neq \gamma$
Monoclinic: $a \neq b \neq c$, $\alpha = \gamma = 90^\circ, \beta \neq 90^\circ$
Orthorhombic: $a \neq b \neq c$, $\alpha = \beta = \gamma = 90^\circ$
Trigonal: $a = b \neq c$, $\alpha = \beta = 90^\circ, \gamma = 120^\circ$ (or Rhombohedral: $a = b = c$, $\alpha = \beta = \gamma \neq 90^\circ, < 120^\circ$)
Tetragonal: $a = b \neq c$, $\alpha = \beta = \gamma = 90^\circ$
Hexagonal: $a = b \neq c$, $\alpha = \beta = 90^\circ, \gamma = 120^\circ$
Cubic: $a = b = c$, $\alpha = \beta = \gamma = 90^\circ$

Each lattice point at the eight corners of the unit cell (Fig. 3.1) is shared by eight neighboring cells in the crystal. Thus, a unit cell that has lattice points only at its corners contains one lattice point per unit cell. This type of cell is primitive and has the Bravais lattice type P. Remember that a lattice point is a mathematical abstraction and may represent several molecules. Cells with more than one lattice point are called non-primitive cells. There are three non-primitive lattice types: I for cells that have an extra lattice point at the center of the unit cell in addition to the lattice points at the corners, C for cells with an extra lattice point at the center of each face of one pair of opposite faces, and F for cells with extra lattice points at the centers of all faces (Fig. 3.1).
It can be seen that there are two net lattice points in the I- and C-centered lattices and four in the F-centered lattice. Thus, the seven crystal systems can be further divided into fourteen Bravais lattices (Fig. 3.1). Trigonal crystals can be described in either the hexagonal (P) or the rhombohedral (R) system.
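The cell parameters above determine the unit cell volume, which is needed later to estimate the contents of the cell. The sketch below uses the general triclinic volume formula $V = abc\sqrt{1 - \cos^2\alpha - \cos^2\beta - \cos^2\gamma + 2\cos\alpha\cos\beta\cos\gamma}$, which is standard but not stated in these notes. It also repeats the estimate of the number of unit cells in a 0.3 mm crystal. The function name and example cells are illustrative.

```python
import math


def cell_volume(a, b, c, alpha=90.0, beta=90.0, gamma=90.0):
    """Unit cell volume in cubic angstroms; lengths in angstroms, angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(angle)) for angle in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1.0 - ca**2 - cb**2 - cg**2 + 2.0 * ca * cb * cg)


# Cubic cell with 100 angstrom edges, as in the estimate in the text.
v_cubic = cell_volume(100.0, 100.0, 100.0)

# Hexagonal-type cell: a = b, gamma = 120 degrees.
v_hex = cell_volume(100.0, 100.0, 100.0, gamma=120.0)

# Number of unit cells in a 0.3 x 0.3 x 0.3 mm crystal (1 mm = 1e7 angstroms).
edge_angstrom = 0.3 * 1.0e7
n_cells = edge_angstrom**3 / v_cubic

print(f"cubic cell volume:     {v_cubic:.3e} A^3")
print(f"hexagonal cell volume: {v_hex:.3e} A^3")
print(f"unit cells in crystal: {n_cells:.1e}")
```

The count comes out at a few times 10$^{13}$, consistent with the order of magnitude quoted above.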

3.2. Symmetry and space group

Most crystal systems, except for the triclinic cell, have crystallographic symmetries within their unit cells. A complete characterization of a crystal requires not only the cell parameters of the crystal system as listed above, but also the symmetry in the cell. The symmetry elements can be grouped into two categories: point symmetry and space symmetry. The point symmetry can be further divided into the following operations:

1. Rotation axes rotate an object around an axis by a certain angle. For example, a 2-fold axis along y rotates a lattice point at (x,y,z) by 180$^\circ$ to (-x,y,-z) (Fig. 3.2). The positions (x,y,z) and (-x,y,-z) are called equivalent positions related by a 2-fold rotation axis. Rotation axes are named by a number, such as 2 for a 2-fold axis, 3 for a 3-fold axis, and so on.

Fig. 3.2. A 2-fold rotation axis around y.

2. Mirror planes are designated by m. For example, a mirror in the XY plane reflects a point from (x,y,z) to (x,y,-z).
3. An inversion center is represented by i. It takes a point at (x,y,z) to (-x,-y,-z).
4. Rotation-reflection is a combined symmetry operation. For example, a 2-fold axis along y coupled with a mirror plane perpendicular to it (2/m) results in an inversion of an object: the 2-fold axis takes (x,y,z) to (-x,y,-z), and the mirror then takes (-x,y,-z) to (-x,-y,-z).
5. Rotation-inversion axes are designated by a number with an overbar. For example, $\bar{4}$ indicates that each 90$^\circ$ rotation of the 4-fold axis is accompanied by inversion through the origin.

The space symmetry operations involve translation of an object, and include:

1. Screw axes are combined operations of rotation and translation. For example, a 3-fold screw axis, $3_1$, has three symmetry operations, each of which is generated by a 120$^\circ$ rotation accompanied by a translation of 1/3 of the unit cell length. A $4_1$ screw axis represents 4 symmetry operations and has 4 equivalent positions, each of which is generated by a 90$^\circ$ rotation accompanied by a translation of 1/4 of the unit cell length.
2. Glide planes are translations accompanied by reflection. They do not exist in protein crystals.

The seven crystal systems can be further classified into 230 space groups if both point and space symmetry are considered, or into 32 point groups if only point symmetry is considered. Since a protein molecule is composed only of L-amino acids, the only symmetries possible in protein crystals are rotation and screw axes. Thus only 65 space groups are applicable to protein crystals.

Table 1. The 65 space groups for macromolecules

Figure 3.3 shows an example of the space group P222, where P represents a primitive cell and 222 indicates three 2-fold axes, one along each cell axis. The arrows and oval dots represent 2-fold axes. Each 2-fold axis relates two equivalent positions. However, the three 2-fold axes generate only four equivalent positions, $(x,y,z)$, $(\bar{x},y,\bar{z})$, $(x,\bar{y},\bar{z})$, and $(\bar{x},\bar{y},z)$, because one axis is dependent on the other two. The full definitions of all 230 space groups can be found in the International Tables, Vol. 1.

3.3 Matrix representation of symmetry

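The algebraic forms of equivalent positions in Fig. 3.3 and in the International Tables are useful for the representation of space groups. However, it is more convenient for computation to use a matrix representation. All the 32 point groups can be regarded as generalized rotations, each of which is represented by a 3 x 3 matrix. For a vector ${\bf r}$ at (x,y,z), its equivalent position ${\bf r'}$ can be generated by a symmetry operation C which is composed of a 3 x 3 rotation matrix R and a translation vector ${\bf t}$:

$$ {\bf r'} = C{\bf r} = R{\bf r} + {\bf t} $$

For example, the two equivalent positions (x,y,z) and (-x, 1/2+y, -z) in the space group P$2_1$ can be represented as follows.

For position 1:

$$ \begin{pmatrix} x' \\ y' \\ z' \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} $$

For position 2:

$$ \begin{pmatrix} x' \\ y' \\ z' \end{pmatrix} = \begin{pmatrix} -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & -1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} + \begin{pmatrix} 0 \\ 1/2 \\ 0 \end{pmatrix} $$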
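As a minimal sketch of the matrix representation, the following Python snippet applies a rotation matrix R and a translation vector t to a fractional position and wraps the result back into the unit cell. The function name `apply_symmetry`, the list `P21_OPERATIONS`, and the test atom are illustrative choices, not part of the original notes.

```python
from fractions import Fraction

def apply_symmetry(rotation, translation, position):
    """Return R*r + t for a fractional position, wrapped into [0, 1)."""
    new_position = []
    for row, shift in zip(rotation, translation):
        value = sum(r * p for r, p in zip(row, position)) + shift
        new_position.append(value % 1)
    return tuple(new_position)

# The two symmetry operations of P2_1 (screw axis along y):
# (x, y, z) and (-x, 1/2 + y, -z).
P21_OPERATIONS = [
    ([[1, 0, 0], [0, 1, 0], [0, 0, 1]], (0, 0, 0)),
    ([[-1, 0, 0], [0, 1, 0], [0, 0, -1]], (0, Fraction(1, 2), 0)),
]

# Hypothetical atom at (0.1, 0.2, 0.3), kept as exact fractions.
atom = (Fraction(1, 10), Fraction(1, 5), Fraction(3, 10))
equivalents = [apply_symmetry(R, t, atom) for R, t in P21_OPERATIONS]

# Applying the screw operation twice moves the atom by one full cell along y,
# which is the starting position again after wrapping.
R2, t2 = P21_OPERATIONS[1]
twice = apply_symmetry(R2, t2, apply_symmetry(R2, t2, atom))
```

Exact fractions are used so that translations such as 1/2 or 1/3 do not pick up floating-point rounding when positions are compared.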

3.4 Asymmetric unit

A crystallographic asymmetric unit is the symmetry-independent unit in a cell. For example, the space group P222 has four equivalent positions (note that one 2-fold axis is not independent) and its asymmetric unit is one-fourth of the unit cell, such as half of x, half of y, and a full z. On the other hand, the crystallographic asymmetric unit must contain at least one monomer of the protein, since a monomeric protein molecule or a polypeptide chain has no intramolecular symmetry because it is built only from L-amino acids. All that X-ray protein crystallography needs to determine is the coordinates in the asymmetric unit; the rest of the cell can be generated by the symmetry operations.

One step in the preliminary characterization of a protein crystal is to estimate how many molecules are in the asymmetric unit. This can be obtained from the crystal density. There are two components in a protein crystal: protein molecules and solvent. The density of a crystal is then split into the densities of the protein and solvent/water:

$$ \rho_{crystal} = \rho_{protein} + \rho_{solvent} = \frac{1.66 \, N \, W_p}{V} + \rho_{solvent} \qquad (4) $$

where $W_p$ is the molecular weight of the protein, $N$ is the number of macromolecules in the unit cell, $V$ is the volume of a unit cell, and 1.66 is the factor to convert dalton/Å$^3$ to gram/cm$^3$ ($(10^8)^3 / 6.022 \times 10^{23}$). If the densities of the whole crystal ($\rho_{crystal}$) and the solvent ($\rho_{solvent}$) are known, the number of molecules per unit cell and thus per asymmetric unit can be calculated. The density of the whole crystal ($\rho_{crystal}$) can be experimentally measured. However, the density of the solvent component ($\rho_{solvent}$) cannot be estimated unless the structure is solved. One approach to the problem is to use the average values from known crystal structures. Statistics show that the solvent component in a macromolecular crystal ranges from 30 to 75% of the unit cell and that the density of a protein crystal ($\rho_{crystal}$) is around 1.1 g/cm$^3$. Therefore, the density for the protein ($\rho_{protein}$ in the above equation) can be estimated to be in a range from 0.3 to 0.8 g/cm$^3$. Finally, the number of molecules can be estimated from equation 4.

Here is an example of estimating the number of molecules in the asymmetric unit. Cyclophilin is a monomeric protein with a molecular weight of about 18 kd and was crystallized in the space group $P2_12_12_1$ with the cell dimensions of $a=43.0$, $b=52.6$, $c=89.2$ Å. Plugging the cell parameters and molecular weight into the above equation, we have 1.66 x 18000/(43 x 52.6 x 89.2) = 0.148 g/cm$^3$ per molecule in the cell. Since the space group $P2_12_12_1$ has four equivalent positions and the molecule has no symmetry, the number of cyclophilin molecules in the cell must be 4, 8, or in general 4n, in order to meet the 4-fold crystallographic symmetry. The calculated density for the protein part is 0.59 g/cm$^3$ for 4 molecules per cell and 1.18 g/cm$^3$ for 8 molecules. Only the first value falls within the expected range of 0.3 to 0.8 g/cm$^3$. Therefore, four molecules per unit cell, or one molecule per asymmetric unit, is most likely to exist in the cyclophilin crystal. This was confirmed by the crystal structure after it was solved. An alternative approach to estimating the number of molecules per asymmetric unit in crystals is discussed by Matthews (1974, J. Mol. Biol. 82, 513).
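The cyclophilin estimate can be reproduced with a short Python sketch. It assumes an orthogonal cell (volume = a x b x c) and uses the 0.3 to 0.8 g/cm$^3$ window quoted in the text; the function names and the cutoff `max_per_asu` are illustrative assumptions.

```python
DALTON_PER_A3_TO_G_PER_CM3 = 1.66  # conversion factor quoted in the text

def protein_density(n_molecules, mol_weight, a, b, c):
    """Protein mass per unit-cell volume (g/cm^3) for an orthogonal cell."""
    volume = a * b * c  # in cubic Angstrom
    return DALTON_PER_A3_TO_G_PER_CM3 * n_molecules * mol_weight / volume

def plausible_molecules_per_asu(mol_weight, a, b, c, n_sym,
                                low=0.3, high=0.8, max_per_asu=6):
    """Numbers of molecules per asymmetric unit whose density fits the window."""
    return [n for n in range(1, max_per_asu + 1)
            if low <= protein_density(n * n_sym, mol_weight, a, b, c) <= high]

# Cyclophilin: about 18 kd, P2_12_12_1 (four equivalent positions).
density_4 = protein_density(4, 18000, 43.0, 52.6, 89.2)
density_8 = protein_density(8, 18000, 43.0, 52.6, 89.2)
candidates = plausible_molecules_per_asu(18000, 43.0, 52.6, 89.2, n_sym=4)
```

For a non-orthogonal cell the product a x b x c would have to be replaced by the true cell volume.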

3.5 Reciprocal unit cell

The crystal system and space group are characterized in an atomic space with the coordinate system of (${\bf a, b, c}$) and the atomic position of (x,y,z). The atomic space is also called direct or real space. On the other hand, diffraction from a crystal is characterized by the position (h,k,l) of a reflection in a space called reciprocal or diffraction space with the unit vectors (${\bf a^*, b^*, c^*}$). The real space for atomic positions is the Fourier transformation of the reciprocal space for the reflection positions, as discussed in the later chapters. Mathematically, the reciprocal unit cell is defined by three vectors (${\bf a^*, b^*, c^*}$) and three angles $\alpha^*, \beta^*, \gamma^*$ between the vectors. The reciprocal vectors (${\bf a^*, b^*, c^*}$) have the following relationships with the real space vectors (${\bf a, b, c}$):

$$ {\bf a \cdot a^*} = 1, \quad {\bf a \cdot b^*} = 0, \quad {\bf a \cdot c^*} = 0 $$
$$ {\bf b \cdot a^*} = 0, \quad {\bf b \cdot b^*} = 1, \quad {\bf b \cdot c^*} = 0 $$
$$ {\bf c \cdot a^*} = 0, \quad {\bf c \cdot b^*} = 0, \quad {\bf c \cdot c^*} = 1 $$

Note that the scalar product of two vectors gives a real number: ${\bf a \cdot a^*} = a a^* \cos \theta$, where $\theta$ is the angle between the two vectors.

Fig. 3.4. Schematic presentation of the relationship between the real and reciprocal unit cells in two dimensions.

For real unit cells with all angles equal to $90^\circ$, the real and reciprocal vectors are coaxial, i.e. ${\bf a} \parallel {\bf a^*}$, ${\bf b} \parallel {\bf b^*}$, and ${\bf c} \parallel {\bf c^*}$, and the magnitudes of the vectors have the relationships $a=\frac{1}{a^*}, b=\frac{1}{b^*}, c=\frac{1}{c^*}$ because $\theta=0$. The reciprocal vectors can be further defined as:

$$ {\bf a^*} = \frac{{\bf b \times c}}{V}, \quad {\bf b^*} = \frac{{\bf c \times a}}{V}, \quad {\bf c^*} = \frac{{\bf a \times b}}{V} $$

where ${\bf b \times c}$ means the vector product of ${\bf b}$ and ${\bf c}$, and $V = {\bf a \cdot (b \times c)}$ is the volume of the unit cell. The same holds vice versa, with the reciprocal cell volume $V^* = 1/V$:

$$ {\bf a} = \frac{{\bf b^* \times c^*}}{V^*}, \quad {\bf b} = \frac{{\bf c^* \times a^*}}{V^*}, \quad {\bf c} = \frac{{\bf a^* \times b^*}}{V^*} $$

The vector product results in a new vector which has a direction perpendicular to the two original vectors, i.e. ${\bf a^*} \perp {\bf b}$, ${\bf a^*} \perp {\bf c}$, and so on. When only one cell angle differs from $90^\circ$, as in the monoclinic system, the reciprocal angle is supplementary to the real angle, e.g. $\beta^* = 180^\circ - \beta$. Conventionally, the angles in the real space are defined as $\ge 90^\circ$. The relationship between the real and reciprocal unit cell of the triclinic system is given in Table 2. The relationships for the other crystal systems can be obtained by plugging the special values of the angles into the equations.

Table 2. The relationship between the axes and angles in the real and reciprocal lattice in a triclinic space group.
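The defining relations of the reciprocal cell (a.a* = 1, a.b* = 0, and a* = b x c / V) can be checked numerically. The sketch below is an illustration under stated assumptions: the Cartesian setting (a along x, b in the xy plane) and the monoclinic test cell with $\beta = 100^\circ$ are arbitrary choices, not taken from the notes.

```python
import math

def cell_vectors(a, b, c, alpha, beta, gamma):
    """Cartesian cell vectors with a along x and b in the xy plane (degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    sg = math.sin(math.radians(gamma))
    cx = c * cb
    cy = c * (ca - cb * cg) / sg
    cz = math.sqrt(c * c - cx * cx - cy * cy)
    return (a, 0.0, 0.0), (b * cg, b * sg, 0.0), (cx, cy, cz)

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def reciprocal_vectors(va, vb, vc):
    """Return a*, b*, c* from a* = (b x c)/V and cyclic permutations."""
    volume = dot(va, cross(vb, vc))
    scale = lambda v: tuple(x / volume for x in v)
    return scale(cross(vb, vc)), scale(cross(vc, va)), scale(cross(va, vb))

def angle_deg(u, v):
    cosine = dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))
    return math.degrees(math.acos(cosine))

# Hypothetical monoclinic cell: only beta differs from 90 degrees.
va, vb, vc = cell_vectors(50.0, 60.0, 70.0, 90.0, 100.0, 90.0)
a_star, b_star, c_star = reciprocal_vectors(va, vb, vc)
beta_star = angle_deg(a_star, c_star)
```

For this cell the computed `beta_star` comes out as the supplement of the real-space angle, and the length of b* is 1/b because b is perpendicular to both a and c.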

4. Principle of X-ray diffraction

This chapter first introduces the nature of X-rays and then proceeds step by step to X-ray scattering by an electron, an atom, and a molecule. X-ray diffraction is the summation of X-ray scattering by molecules in different unit cells, as represented by Bragg's law or Ewald's construction. The formulae in section 4.4 are basic and need to be memorized for a better understanding of later chapters.

4.1 X-ray source

X-rays are photons, i.e. electromagnetic waves, with wavelengths in the range of 0.1 to 100 Å. They are emitted by an electron transition from a high to a low energy shell, or by deceleration of an electron after it has been accelerated or knocked out of a low energy shell by a high energy device such as an X-ray generator or synchrotron. Deceleration of electrons emits radiation with a continuous range of wavelengths, called white X-rays, while the transitions from the L or M shell (high energy shells) to the K shell (low energy shell) emit radiation with characteristic wavelengths, called $K_{\alpha}$ and $K_{\beta}$, respectively. The wavelengths of $K_{\alpha}$ and $K_{\beta}$ can be calculated by

$$ \lambda_{K\alpha} = \frac{hc}{E_L - E_K}, \quad \lambda_{K\beta} = \frac{hc}{E_M - E_K} $$

where h is Planck's constant, c is the speed of light, and E is the energy of each energy shell.

Fig. 4.1. The spectrum from an X-ray tube with a copper anode. I is energy on an arbitrary scale.

In quantum chemistry, the L and M shells can be further divided into subshells or orbitals: s and p orbitals for the L shell, and s, p, and d orbitals for the M shell. Therefore, two radiations can be obtained from the L to K transition, and they have slightly different wavelengths. For example, a copper anode emits Cu K$_{\alpha 1}$ = 1.5405 Å and Cu K$_{\alpha 2}$ = 1.5443 Å. However, these two wavelengths are so close that it is difficult to separate them experimentally. In practice, the averaged wavelength of K$_{\alpha}$ = 1.5418 Å is usually used. The M to K transition emits only two radiations, K$_{\beta 1}$ and K$_{\beta 2}$, instead of three, because the d orbital is an escape level. K$_{\beta 2}$ is so weak that it can be ignored. The wavelength of Cu K$_{\beta}$ is 1.39217 Å. K$_{\alpha 1}$ is twice as intense as K$_{\alpha 2}$, and about three to six times as strong as K$_{\beta 1}$. The X-rays commonly used in structure determination are Cu K$_{\alpha}$ ($\lambda = 1.5418$ Å) for protein structures and Mo K$_{\alpha}$ ($\lambda = 0.71069$ Å) for small molecules. Variable wavelengths can be obtained from a synchrotron source for multiwavelength anomalous diffraction of macromolecular crystals.
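The relation between a transition energy and the emitted wavelength, wavelength = hc / (energy difference), can be sketched in Python as follows. The physical constants are the standard SI values; the copper energy difference of 8047.8 eV is an assumed illustrative figure, not a number given in these notes.

```python
PLANCK_J_S = 6.62607015e-34      # Planck's constant h, in J s
LIGHT_SPEED_M_S = 299792458.0    # speed of light c, in m/s
JOULE_PER_EV = 1.602176634e-19   # one electron volt in joule

def wavelength_angstrom(energy_ev):
    """Wavelength (Angstrom) of a photon carrying the given energy (eV)."""
    energy_joule = energy_ev * JOULE_PER_EV
    return PLANCK_J_S * LIGHT_SPEED_M_S / energy_joule * 1e10

def energy_ev(wavelength_a):
    """Photon energy (eV) for a given wavelength (Angstrom)."""
    return PLANCK_J_S * LIGHT_SPEED_M_S / (wavelength_a * 1e-10) / JOULE_PER_EV

# Assumed L-to-K energy difference for copper (illustrative value only).
CU_L_TO_K_EV = 8047.8
cu_wavelength = wavelength_angstrom(CU_L_TO_K_EV)
```

The same two functions convert between the energy and wavelength settings used when tuning a synchrotron beamline.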

4.2 X-ray scattering by an electron

X-ray scattering by one electron can be described by a general expression for the propagation of an electromagnetic wave:

$$ E({\bf r},t) = E_o \exp \left[ 2 \pi i \left( \frac{{\bf k \cdot r}}{\lambda} - \nu t \right) + i \delta \right] $$

where $E({\bf r},t)$ is the electric field strength at point ${\bf r}$ and time t, ${\bf k}$ is a unit vector along the direction of propagation, $\nu = c/\lambda$ is the frequency with c the speed of light, and $\delta$ is the phase at ${\bf r}=0$ and $t=0$. In short, the exponential portion is the phase and $E_o$ is the amplitude.

Fig. 4.2. Characteristics of electromagnetic waves. The electric field amplitude as a function of distance at time zero.

A typical X-ray scattering by an electron is shown in Fig. 4.3.

Fig. 4.3. X-ray scattering by a single electron. The angle of detection ($2 \theta$) between the source and the detector is the same in all three cases. (a) An electron at the origin, (b) an electron at position ${\bf r}$ relative to the origin, (c) the path difference between the two radiations scattered by the electrons at position ${\bf r}$ and at the origin. $\hat{\bf s}_0$ and $\hat{\bf s}$ are the incident and scattered radiations.

The scattering of a single electron at the origin can be computed by a proper consideration of quantum mechanics and will be represented here by an electron density function $\rho$. A single electron at position ${\bf r}$ has a phase shift relative to the electron at the origin (Fig. 4.3b). The path difference is $(\hat{\bf s} - \hat{\bf s}_0) \cdot {\bf r} = {\bf s \cdot r}$. Therefore, moving an electron from the origin to a position ${\bf r}$ causes a phase shift of ${\bf s \cdot r}$. In general, the radiation scattered by a single electron is expressed in an exponential form

$$ E = E({\bf r}) \, e^{2 \pi i {\bf s \cdot r}} $$

where $E({\bf r})$ is the amplitude and ${\bf s \cdot r}$ is the phase. The scattering of the electron at the origin (${\bf r} = 0$) is $E_o$.

4.3. X-ray scattering by an atom

The scattering by multiple electrons in an atom can be obtained by integration of the wave functions for the individual electrons. If an electron has its amplitude or density $E({\bf r})$ at the position ${\bf r}$ and the phase ${\bf s \cdot r}$, where ${\bf s}$ is the scattering vector of the wave, the atomic scattering factor $f$ is the summation or integration of the scattering from all electrons:

$$ f = \int E({\bf r}) \, e^{2 \pi i {\bf s \cdot r}} \, d{\bf r} $$

This equation assumes that the electrons are distributed continuously over the whole volume of an atom. For atomic scattering with many discrete electron sites, the integral is replaced by a summation:

$$ f = \sum_j E({\bf r}_j) \, e^{2 \pi i {\bf s \cdot r}_j} $$

Now consider in detail the case in which a single atom is located at the origin and the electrons are distributed with spherical symmetry. The integration in spherical polar coordinates becomes:

$$ f(s) = \int_0^{\infty} 4 \pi r^2 E(r) \frac{\sin (2 \pi s r)}{2 \pi s r} \, dr $$

If the electron density of a real atom has a Gaussian distribution, $E(r) = zNe^{-kr^2}$, where z is the number of electrons, N is the normalization factor, and k determines the width of the Gaussian, the integral becomes (with $N = (k/\pi)^{3/2}$, so that the density integrates to z):

$$ f(s) = z \, e^{-\pi^2 s^2 / k} $$

The atomic scattering factor for forward-scattered radiation (s=0 in the above equation) is simply the atomic number. Therefore, an atom with more electrons scatters more strongly. $f_0$ represents the scattering power of an atom and is treated as a basic unit in the later sections. When an atom is located at the position ${\bf r}$, instead of at the origin, the atomic scattering factor will have a phase shift of ${\bf s \cdot r}$, and can be expressed in a way similar to the case of a single electron at position ${\bf r}$ relative to the origin:

$$ f = f_0 \, e^{2 \pi i {\bf s \cdot r}} \qquad (15) $$

where $f_0$ is the atomic scattering factor for an atom located at the origin. It depends on the nature of the atom and its number of electrons. Equation 15 is essential and will be used frequently in later sections, so it needs to be memorized.

The above treatment assumes free movement of the electrons in the sphere of an atom. However, the electrons in a real atom are bound to the nucleus and are restrained by the interactions between the electrons and the nucleus. This binding effect makes the atomic scattering anomalously dispersed. In general, the atomic scattering factor of a real atom is a complex number:

$$ f = f^o + f^{\Delta} e^{i \delta} = f^o + \Delta f^{'} + i \Delta f^{''} $$

where $f^o$ is the normal scattering factor, $f^{\Delta}$ is the wavelength-dependent anomalous dispersion, $\delta$ is a phase shift, and $\Delta f^{'}$ and $\Delta f^{''}$ are the real and imaginary parts of the anomalous dispersion.
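For a spherically symmetric Gaussian atom the scattering factor can be checked numerically. The sketch below assumes a density $zNe^{-kr^2}$ normalized so that it integrates to z electrons, i.e. $N = (k/\pi)^{3/2}$; under that assumption the spherical integral has the closed form $z \exp(-\pi^2 s^2/k)$. The values z = 6 and k = 2 are arbitrary illustrative choices.

```python
import math

def gaussian_density(r, z, k):
    """Normalized Gaussian electron density: integrates to z electrons."""
    return z * (k / math.pi) ** 1.5 * math.exp(-k * r * r)

def scattering_factor_numeric(s, z, k, steps=20000):
    """Midpoint-rule integration of 4*pi*r^2 * rho(r) * sin(2*pi*s*r)/(2*pi*s*r)."""
    r_max = 10.0 / math.sqrt(k)  # the density is negligible beyond this radius
    dr = r_max / steps
    total = 0.0
    for i in range(steps):
        r = (i + 0.5) * dr
        x = 2.0 * math.pi * s * r
        sinc = 1.0 if x == 0.0 else math.sin(x) / x
        total += 4.0 * math.pi * r * r * gaussian_density(r, z, k) * sinc * dr
    return total

def scattering_factor_closed(s, z, k):
    """Closed form for the normalized Gaussian atom."""
    return z * math.exp(-math.pi ** 2 * s * s / k)

forward = scattering_factor_numeric(0.0, z=6, k=2.0)
at_angle = scattering_factor_numeric(0.3, z=6, k=2.0)
```

The forward value equals the number of electrons, and the scattering factor falls off as s grows, faster for a broader (smaller k) electron cloud.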

4.4. X-ray scattering by a molecule

The scattering of a molecule, often called the structure factor, is a summation of the atomic scattering of all atoms in the molecule. The final vector from the summation, the structure factor ${\bf F}$, is a complex number and can be split into the amplitude and phase, as shown schematically in figure 4.4 and as represented by the following equation:

$$ {\bf F}({\bf s}) = \sum_{j=1}^{N} f_j \, e^{2 \pi i {\bf s \cdot r}_j} = |F| \, e^{i \phi} $$

Fig. 4.4. A schematic presentation of a structure factor. The $f$s are atomic scattering factors.

If the electron density is treated as continuous, instead of as discrete atoms, the structure factor can be expressed as an integral:

$$ {\bf F}({\bf s}) = \int \rho({\bf r}) \, e^{2 \pi i {\bf s \cdot r}} \, d{\bf r} $$

where the atomic scattering factor is replaced by a continuous function of electron density $\rho({\bf r})$ for the integration. The exponential term in the above equation is a periodic function, $e^{2 \pi i {\bf s \cdot r}} = \cos (2 \pi {\bf s \cdot r}) + i \sin (2 \pi {\bf s \cdot r})$. When ${\bf s \cdot r}$ is an integer, the phase differences between atoms are $2n \pi$, and thus the waves scattered from different atoms have the same phase (n periods make the same phase). It is also obvious mathematically that when the phase difference is $2n \pi$, the sine term is zero, the cosine term is 1, and the scattering from different atoms is simply added together. Along each cell axis, the integer condition on ${\bf s \cdot r}$ can be represented by

$$ {\bf s \cdot a} = h, \quad {\bf s \cdot b} = k, \quad {\bf s \cdot c} = l $$

where ${\bf a, b, c}$ are the unit vectors along the three dimensions of the vector ${\bf r}$, and h, k, l are integers. When ${\bf s \cdot r}$ is not an integer, the scattering signals from the atoms will partially or wholly cancel one another, depending on the phase differences. As a result, only the scattering with ${\bf s \cdot r}$ = integer is measurable. ${\bf s \cdot a} = h$, ${\bf s \cdot b} = k$, and ${\bf s \cdot c} = l$ are called the von Laue conditions of diffraction. However, the arrangement of atoms in a real protein molecule does not normally meet Laue's conditions, and their scattering often cancels. This is one reason why a protein with a large number of atoms diffracts poorly. The scattering from the molecules in different unit cells is added together because their phase differences are $2n\pi$. Therefore, an X-ray diffraction experiment requires crystals of a certain size. A size of 0.2 to 0.5 mm in each dimension is optimal for diffraction experiments using a rotating anode X-ray generator as the source.

Since the position vector ${\bf r}$ is expressed as ${\bf a}x + {\bf b}y + {\bf c}z$, where x,y,z are the coordinates of an atom and ${\bf a, b, c}$ are the unit vectors,

$$ {\bf s \cdot r} = ({\bf s \cdot a})x + ({\bf s \cdot b})y + ({\bf s \cdot c})z = hx + ky + lz $$

where Laue's conditions are used in the equation. From now on, we will use the symbol ${\bf h}$ to represent the scattering vector ${\bf s}$, in order to comply with the convention in many textbooks. Thus, the structure factor can be rewritten in three-dimensional space as

$$ {\bf F}(hkl) = \sum_{j=1}^{N} f_j \, e^{2 \pi i (hx_j + ky_j + lz_j)} $$

where h,k,l are integers and are called the Miller indices of a reflection, $N$ is the number of atoms in the unit cell, and ${\bf h} = h{\bf a^*} + k{\bf b^*} + l{\bf c^*}$, where $({\bf a^*, b^*, c^*})$ are the unit vectors of the diffraction space. Thus ${\bf h \cdot r} = (h{\bf a^*} + k{\bf b^*} + l{\bf c^*}) \cdot ({\bf a}x + {\bf b}y + {\bf c}z) = hx + ky + lz$; remember that the product of the units ${\bf a^*}$ and ${\bf a}$ equals 1. The mathematical form of the above equation is a Fourier transformation. The physical meaning of a structure factor is a Fourier transformation of the electron density. According to the properties of a Fourier transformation, the electron density $\rho$ is the inverse Fourier transformation of ${\bf F}$:

$$ \rho({\bf r}) = \int {\bf F}({\bf h}) \, e^{-2 \pi i {\bf h \cdot r}} \, d{\bf h} $$

or in the form of a summation

$$ \rho(xyz) = \frac{1}{V} \sum_{hkl} {\bf F}(hkl) \, e^{-2 \pi i (hx + ky + lz)} $$

where V is the volume of the unit cell. Plugging eq. 18, in its amplitude and phase form ${\bf F} = |F| e^{i \phi}$, into the above equation gives

$$ \rho(xyz) = \frac{1}{V} \sum_{hkl} |F(hkl)| \, e^{-2 \pi i (hx + ky + lz) + i \phi(hkl)} $$
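A structure factor is straightforward to evaluate once fractional coordinates and scattering factors are known: sum $f_j \exp[2\pi i(hx_j + ky_j + lz_j)]$ over the atoms and read off amplitude and phase. The following is a minimal sketch; the atom list and the constant scattering factors (taken simply as electron counts, ignoring their fall-off with angle) are hypothetical.

```python
import cmath
import math

def structure_factor(hkl, atoms):
    """Sum f_j * exp(2*pi*i*(h*x + k*y + l*z)) over atoms given as (f, x, y, z)."""
    h, k, l = hkl
    total = 0j
    for f, x, y, z in atoms:
        total += f * cmath.exp(2j * math.pi * (h * x + k * y + l * z))
    return total

# Hypothetical three-atom cell: (scattering factor, x, y, z).
atoms = [(6.0, 0.10, 0.20, 0.30),
         (7.0, 0.40, 0.15, 0.70),
         (8.0, 0.25, 0.60, 0.05)]

F = structure_factor((1, 2, 3), atoms)
amplitude = abs(F)          # |F|
phase = cmath.phase(F)      # phi, in radians

# The forward reflection (0,0,0) is the plain sum of the scattering factors.
F000 = structure_factor((0, 0, 0), atoms)
```

Python's built-in complex type carries the amplitude and phase together, which mirrors the vector picture of Fig. 4.4.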

4.5. Center symmetry of intensity

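The intensity of a reflection can be experimentally measured at a scattering angle 2$\theta$, and can also be calculated from the product of a structure factor and its conjugate, because the structure factor is complex:

$$ I(hkl) = {\bf F}({\bf h}) \, {\bf F}^*({\bf h}) = |F({\bf h})|^2 $$

In detail,

$$ I({\bf h}) = \left[ \sum_j f_j \, e^{2 \pi i {\bf h \cdot r}_j} \right] \left[ \sum_k f_k^* \, e^{-2 \pi i {\bf h \cdot r}_k} \right] $$

When no anomalous scattering effect is considered, the atomic scattering factor $f$ is a real number, i.e. its conjugate is itself ($f = f^*$). Thus

$$ I({\bf h}) = \sum_j \sum_k f_j f_k \, e^{2 \pi i {\bf h \cdot} ({\bf r}_j - {\bf r}_k)} $$

A similar expression can be obtained for the intensity of a reflection with negative indices ($\bar{h} \bar{k} \bar{l}$):

$$ I({\bf -h}) = \sum_j \sum_k f_j f_k \, e^{-2 \pi i {\bf h \cdot} ({\bf r}_j - {\bf r}_k)} $$

which is the same double sum with the indices j and k exchanged. Therefore, the intensity of a reflection $(hkl)$ is equal to that of its center-symmetry related reflection $(\bar{h} \bar{k} \bar{l})$ when anomalous scattering is ignored. In other words, the diffraction space has an extra center of symmetry, in addition to the symmetries in the atomic space.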
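The center symmetry of the intensities can be checked numerically: with real scattering factors $I(hkl)$ and $I(\bar{h}\bar{k}\bar{l})$ agree, while a complex (anomalous) scattering factor breaks the equality. The two-atom model and the value 20 + 3i used for the anomalous scatterer are hypothetical illustration values.

```python
import cmath
import math

def structure_factor(hkl, atoms):
    """Sum f_j * exp(2*pi*i*(h*x + k*y + l*z)) over atoms given as (f, x, y, z)."""
    h, k, l = hkl
    return sum(f * cmath.exp(2j * math.pi * (h * x + k * y + l * z))
               for f, x, y, z in atoms)

def intensity(hkl, atoms):
    """I = F * conj(F) = |F|^2."""
    F = structure_factor(hkl, atoms)
    return (F * F.conjugate()).real

hkl = (1, 2, 3)
minus_hkl = (-1, -2, -3)

# Real scattering factors: no anomalous scattering.
normal_atoms = [(6.0, 0.10, 0.20, 0.30), (20.0, 0.40, 0.15, 0.70)]
normal_difference = intensity(hkl, normal_atoms) - intensity(minus_hkl, normal_atoms)

# Second atom given an imaginary (anomalous) component.
anomalous_atoms = [(6.0, 0.10, 0.20, 0.30), (20.0 + 3.0j, 0.40, 0.15, 0.70)]
anomalous_difference = (intensity(hkl, anomalous_atoms)
                        - intensity(minus_hkl, anomalous_atoms))
```

The nonzero difference in the second case is the signal exploited when anomalous scattering data are collected.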

4.6. Bragg's law of diffraction

Consider the case in which an X-ray beam is reflected by lattice planes with a spacing d, at the angle $\theta$ (Fig. 4.5). The difference in path length between two X-ray beams reflected by two adjacent lattice planes is $2d \sin \theta$. Bragg's condition for observable diffraction is that the path difference between the beams reflected from adjacent lattice planes is an integral number of wavelengths:

$$ 2d \sin \theta = n \lambda $$

The first order of diffraction has n = 1, the second order of diffraction has n = 2, and so on. In practice, an nth-order reflection from planes of spacing d is treated as a first-order reflection from planes of spacing d/n. Therefore, Bragg's equation can be simplified to $2d \sin \theta = \lambda$.

Fig. 4.5. Diffraction of X-rays by two adjacent lattice planes and derivation of Bragg's law.
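Bragg's law in its simplified form, $2d \sin\theta = \lambda$, converts directly between the scattering angle and the lattice spacing. The sketch below uses the Cu K$_\alpha$ wavelength quoted in section 4.1; the function names are illustrative.

```python
import math

CU_K_ALPHA = 1.5418  # Angstrom, averaged Cu K-alpha wavelength from section 4.1

def bragg_spacing(theta_deg, wavelength=CU_K_ALPHA):
    """Lattice spacing d (Angstrom) diffracting at the Bragg angle theta."""
    return wavelength / (2.0 * math.sin(math.radians(theta_deg)))

def bragg_angle(d_spacing, wavelength=CU_K_ALPHA):
    """Bragg angle theta (degrees) for a given lattice spacing d (Angstrom)."""
    return math.degrees(math.asin(wavelength / (2.0 * d_spacing)))

# At theta = 30 degrees, sin(theta) = 1/2 and d equals the wavelength.
d_at_30 = bragg_spacing(30.0)

# Detector angle 2*theta needed to record 2.0 Angstrom data with Cu K-alpha.
two_theta_for_2A = 2.0 * bragg_angle(2.0)
```

Note that the detector angle is $2\theta$, twice the Bragg angle, and that no diffraction is possible for spacings smaller than $\lambda/2$.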

4.7. Ewald reflection sphere

Bragg's law of diffraction can be geometrically represented by the construction of an Ewald sphere (Fig. 4.6).

Fig. 4.6. Ewald construction of X-ray diffraction. The incident and scattered X-ray beams are ${\bf s}_0$ and ${\bf s}$, and the radius of the sphere is 1/$\lambda$.

From the geometrical relationship, the length of the vector OP can be calculated as OP/2 = $\sin \theta / \lambda$. This formula is equivalent to Bragg's equation on assigning OP = 1/d. In the Ewald representation of diffraction, the crystal sits at the center of the Ewald sphere (M), and the origin of the reciprocal lattice or diffraction space is at O (Fig. 4.6). OP is a vector in the reciprocal space, representing a reflection which is usually denoted by the symbol ${\bf h}$ or the Miller index (h,k,l):

$$ {\bf h} = h{\bf a^*} + k{\bf b^*} + l{\bf c^*} $$

where ${\bf a^*, b^*, c^*}$ are the unit vectors of the reciprocal space, and the Miller index (h,k,l) represents a reciprocal lattice point. The Ewald construction shows that the reciprocal lattice points which lie on the surface of the sphere meet Bragg's law and diffract. Thus the Ewald sphere is also called the reflection sphere. When a crystal oscillates, or in other words, when the reciprocal lattice oscillates (Fig. 4.6), the reciprocal lattice points within the sphere of radius $|{\bf h}|$ have a chance to pass through the surface of the reflection sphere and diffract. The sphere with radius $|{\bf h}|$ is called the resolution sphere. $|{\bf h}|$ has a reciprocal relationship with d: $|{\bf h}| = 1/d$. Oscillation or rotation of a crystal is the experimental method used to collect the reflections of a crystal. The total number of reflections (or reciprocal lattice points) within a certain resolution limit can be calculated from the ratio of the volume of the resolution sphere to that of the reciprocal unit cell (Fig. 4.7).

Fig. 4.7. Resolution sphere. When the crystal rotates, the lattice points within the resolution sphere will pass the Ewald sphere and diffract.

The radius of the resolution sphere $|{\bf h}|$, or the diffraction power of a crystal, is determined by the quality of the crystal and the X-ray source.
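The ratio of volumes described above gives a quick estimate of the number of reciprocal lattice points inside the resolution sphere: a sphere of radius 1/d divided by the reciprocal cell volume 1/V. The sketch below applies it to the cyclophilin cell from section 3.4 at an assumed resolution limit of 2.0 Å; this count includes all symmetry-related and center-symmetry related reflections, so the number of unique reflections is smaller.

```python
import math

def reflections_within_resolution(cell_volume, d_min):
    """Approximate number of reciprocal lattice points with spacing >= d_min.

    cell_volume is the real-space cell volume in cubic Angstrom, so the
    reciprocal cell volume is 1 / cell_volume.
    """
    sphere_volume = 4.0 / 3.0 * math.pi * (1.0 / d_min) ** 3
    reciprocal_cell_volume = 1.0 / cell_volume
    return sphere_volume / reciprocal_cell_volume

# Cyclophilin cell (orthorhombic): a = 43.0, b = 52.6, c = 89.2 Angstrom.
cell_volume = 43.0 * 52.6 * 89.2
n_total = reflections_within_resolution(cell_volume, 2.0)   # assumed 2.0 A limit
n_low = reflections_within_resolution(cell_volume, 4.0)
```

Because the count scales with $1/d^3$, halving the resolution limit from 4 to 2 Å multiplies the number of reflections by eight.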

4.8. Symmetry in the diffraction space

The symmetry in the reciprocal or diffraction space is determined by the symmetry in the atomic space, i.e. the symmetry of the crystal. If a crystal has an m-fold symmetry, the structure factor can be expressed in terms of the atoms in the crystallographic asymmetric unit and the crystallographic symmetry:

$$ {\bf F}({\bf h}) = \sum_{C=1}^{m} \sum_{j=1}^{N/m} f_j \, e^{2 \pi i {\bf h \cdot} ([R]{\bf r}_j + {\bf t})} = \sum_{C=1}^{m} \sum_{j=1}^{N/m} f_j \, e^{2 \pi i ({\bf h \cdot} [R]{\bf r}_j + {\bf h \cdot t})} $$

where $C$ runs over the m symmetry operations, each of which is composed of a rotation matrix $[R]$ and a translation vector ${\bf t}$, and $N/m$ is the number of atoms in the asymmetric unit. Now let us examine the phase $2\pi ({\bf h \cdot} [R]{\bf r} + {\bf h \cdot t})$. Since ${\bf h \cdot t}$ is independent of ${\bf r}$, a translation ${\bf t}$ will cause a phase shift of $2 \pi {\bf h \cdot t}$ for a certain reflection ${\bf h}$, but has no effect on the amplitude of the structure factor. For example, a translation of half a unit cell along y, or a $2_1$ screw axis along y, will shift the phase of a reflection ${\bf h}$ by $k\pi$:

$$ 2 \pi {\bf h \cdot t} = 2 \pi (h \cdot 0 + k \cdot \frac{1}{2} + l \cdot 0) = k \pi $$

This formula shows that the phase shift is $\pi$ for reflections with odd k and 0 for reflections with even k. Now let us look at the rotation matrix. Mathematically, a rotation matrix acting on the vector to its right is equivalent to its transposed matrix acting on the vector to its left:

$$ {\bf h \cdot} ([R]{\bf r}) = ([R]^T {\bf h}) {\bf \cdot r} $$

It means that the rotation symmetry in the atomic space is conserved in the reciprocal space. Since the translations in the atomic space cause phase shifts in the diffraction space, but have no effect on the magnitude of the structure factors, a screw axis in the atomic space is equivalent to the corresponding rotation axis without translation. For example, a $3_1$ screw axis along z in the atomic space appears as a 3-fold rotation axis in the diffraction space. Thus, the three equivalent positions in the atomic space, $(x,y,z)$, $(-y,x-y,\frac{1}{3}+z)$, $(y-x,-x,\frac{2}{3}+z)$, correspond to three equivalent reflections in the reciprocal space, $(h,k,l)$, $(k,-h-k,l)$, $(-h-k,h,l)$. The symmetries for the atomic space can be represented in the matrix form:

$$ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad \begin{pmatrix} 0 & -1 & 0 \\ 1 & -1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad \begin{pmatrix} -1 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} $$

Note that the symmetry matrix for the diffraction space is the transpose of the symmetry matrix for the atomic space, and that transposing a matrix exchanges the lower left with the upper right around the diagonal. Thus the symmetry matrices for the reciprocal space are

$$ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad \begin{pmatrix} 0 & 1 & 0 \\ -1 & -1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad \begin{pmatrix} -1 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} $$

In conclusion, (1) the symmetry in the reciprocal space is the point symmetry of the real space plus the center of symmetry, and (2) a translation or screw axis symmetry in the atomic space causes a phase shift in the reciprocal space, but has no effect on the amplitude of the structure factors.
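The two conclusions of this section can be sketched in code: the equivalent reflections follow from the transposed rotation matrices, and the translation part only contributes a phase shift of $2\pi\,{\bf h \cdot t}$. The helper names are illustrative; the matrices are those of the $3_1$ screw axis along z with equivalent positions $(x,y,z)$, $(-y,x-y,\frac{1}{3}+z)$, $(y-x,-x,\frac{2}{3}+z)$.

```python
from fractions import Fraction

def transpose(matrix):
    """Exchange rows and columns of a 3 x 3 matrix."""
    return [list(column) for column in zip(*matrix)]

def equivalent_reflection(rotation, hkl):
    """Apply the transposed real-space rotation matrix to the indices (h, k, l)."""
    rt = transpose(rotation)
    return tuple(sum(rt[i][j] * hkl[j] for j in range(3)) for i in range(3))

def phase_shift_in_cycles(hkl, translation):
    """h.t modulo 1; multiply by 2*pi to obtain the phase shift in radians."""
    return sum(h * t for h, t in zip(hkl, translation)) % 1

# Rotation and translation parts of the 3_1 screw axis along z.
R1, T1 = [[0, -1, 0], [1, -1, 0], [0, 0, 1]], (0, 0, Fraction(1, 3))
R2, T2 = [[-1, 1, 0], [-1, 0, 0], [0, 0, 1]], (0, 0, Fraction(2, 3))

hkl = (1, 2, 3)
equivalents = [hkl, equivalent_reflection(R1, hkl), equivalent_reflection(R2, hkl)]

# Half-cell translation along y (2_1 screw axis): phase shift of k*pi.
shift_k_odd = phase_shift_in_cycles((0, 1, 0), (0, Fraction(1, 2), 0))
shift_k_even = phase_shift_in_cycles((0, 2, 0), (0, Fraction(1, 2), 0))
```

A shift of 1/2 cycle corresponds to $\pi$, which is the case for odd k, while even k gives no shift.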

5. Phase Problem

The amplitude of a structure factor ($|F|$) can be obtained from the experimentally measured intensity (eq. 27), but the phase of a structure factor ($\phi$ in eq. 18) is not directly measurable. The phase problem is the central problem in X-ray crystallography. The following methods have been used to phase a macromolecular structure: (1) Patterson function, (2) multiple isomorphous replacement (MIR), (3) multiwavelength anomalous diffraction (MAD), (4) molecular replacement (MR), (5) direct methods and maximum entropy. MIR, MAD, and MR are the three methods which can independently solve a macromolecular structure, while direct methods and maximum entropy are still in development. The Patterson function was used to solve small molecule structures, but its use in the determination of a macromolecular structure is mainly to locate the heavy atom positions. Therefore the Patterson function is the first step of MIR phasing. MIR, which was first introduced about 40 years ago, is the oldest but still a major method for ab initio phasing of macromolecular structures. The theory of MAD was developed as early as that of its twin sister MIR, but its practical independent application in structure determination of macromolecules began only in the last decade. Molecular replacement solves an unknown structure by rotating and translating a known model into a new crystallographic system, and is therefore not an ab initio phasing method. Direct methods have shown great success in the determination of small molecule structures, but for macromolecules they have only been used as auxiliary methods to improve the MIR phases or to extend low resolution phases to high resolution. Applications of direct methods and maximum entropy to the determination of macromolecular structures are in development.

5.1. Patterson Function

5.1.1. Patterson Function

The Patterson function $P({\bf u})$ or $P(u,v,w)$ is a Fourier convolution of the electron density at two positions, $\rho({\bf r})$ and $\rho({\bf r + u})$, where ${\bf r}$ is the vector in atomic space $(x,y,z)$, and ${\bf u}$ is the vector in Patterson space $(u,v,w)$ representing the relative difference of the coordinates between the two atoms:

$$ P({\bf u}) = \int \rho({\bf r}) \, \rho({\bf r + u}) \, d{\bf r} $$

We know that

$$ \rho({\bf r}) = \frac{1}{V} \sum_{\bf h} {\bf F}({\bf h}) \, e^{-2 \pi i {\bf h \cdot r}} $$

Here, we use ${\bf h'}$ to distinguish it from ${\bf h}$ between the two equations. Thus

$$ \rho({\bf r + u}) = \frac{1}{V} \sum_{\bf h'} {\bf F}({\bf h'}) \, e^{-2 \pi i {\bf h' \cdot} ({\bf r + u})} $$

and the Patterson function becomes

$$ P({\bf u}) = \frac{1}{V^2} \sum_{\bf h} \sum_{\bf h'} {\bf F}({\bf h}) \, {\bf F}({\bf h'}) \, e^{-2 \pi i {\bf h' \cdot u}} \int e^{-2 \pi i ({\bf h + h'}) {\bf \cdot r}} \, d{\bf r} $$

We moved the ${\bf F}$ terms and $e^{-2 \pi i {\bf h' \cdot u}}$ out of the integration because they are functions of ${\bf h}$, instead of ${\bf r}$. Now let us examine the integration. For ${\bf h} = {\bf -h'}$,

$$ \int e^{-2 \pi i ({\bf h + h'}) {\bf \cdot r}} \, d{\bf r} = \int d{\bf r} = V $$

For all other terms the integral vanishes. Remember that the summation runs from ${\bf -h}$ to ${\bf h}$, and the sine terms will cancel one another because $\sin (({\bf h+h'}) {\bf \cdot r}) = - \sin (-({\bf h+h'}) {\bf \cdot r})$. The cosine terms will also cancel one another if the vector ${\bf r}$ is distributed evenly all over the sphere. Therefore, the Patterson function becomes

$$ P({\bf u}) = \frac{1}{V} \sum_{\bf h} |F({\bf h})|^2 \, e^{2 \pi i {\bf h \cdot u}} = \frac{1}{V} \sum_{\bf h} |F({\bf h})|^2 \cos (2 \pi {\bf h \cdot u}) $$

The above equation shows that the Patterson function is a Fourier summation with the squares of the structure factor magnitudes as coefficients and does not need phase information. Therefore, the Patterson function can be directly calculated from the experimentally measured intensities. The physical meaning of the Patterson function is the set of vectors between atoms. For example, three atoms in a two-dimensional unit cell have six vectors between different atoms ($1 \rightarrow 2, 2 \rightarrow 1, 1 \rightarrow 3, 3 \rightarrow 1, 2 \rightarrow 3, 3 \rightarrow 2$) and three vectors between themselves ($1 \Leftrightarrow 1, 2 \Leftrightarrow 2, 3 \Leftrightarrow 3$) (Fig. 5.1).

Fig. 5.1. (a) A two-dimensional unit cell with three atoms, (b) the corresponding Patterson map. The $N$ self-peaks overlap at the origin. Half of the $N(N-1)$ non-origin peaks are independent because of the centrosymmetry in the Patterson map, labeled as $1 \rightarrow 2$, $1 \rightarrow 3$, $2 \rightarrow 3$.

The length of a self-vector is zero, so it is located at the origin of the Patterson map. In general, there are $N^2$ peaks in the Patterson map for a unit cell with $N$ atoms. Since the $N$ self-vectors are located at the origin, the number of individual peaks in the Patterson map equals $N(N-1)$, half of which ($N(N-1)/2$) are unique because of the centrosymmetry of the Patterson function. A small macromolecular structure with 1000 non-hydrogen atoms in a unit cell will have $10^6$ Patterson peaks. These are too many to be resolved from one another. Therefore, the Patterson function is not directly usable to locate atomic positions in macromolecular structure determination.

However, the difference Patterson between the heavy atom derivative and the native crystal is very useful in protein structure determination. The difference Patterson takes $(|F_{pq}| - |F_p|)^2$ as the coefficient of the Fourier synthesis, where pq and p represent the heavy atom derivative and the native protein, respectively. Assuming the heavy atom derivative is isomorphous to the native crystal, the positions of the protein atoms in both native and derivative will be identical and the Patterson peaks for the protein atoms will be cancelled in the difference Patterson map. The only remaining peaks are those for the vectors between heavy atoms or between heavy atoms and protein atoms. Usually, the peaks between heavy atoms and protein atoms are much weaker than the peaks between the heavy atoms themselves, and only a few sites on a protein can bind the heavy atom compounds. Thus, only a small number of peaks will be observed in the difference Patterson map. This makes the difference Patterson a very powerful technique for locating the heavy atom positions.
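The interpretation of the Patterson function as a map of interatomic vectors can be illustrated by listing those vectors directly. This sketch uses three hypothetical atoms in a two-dimensional cell, mirroring the example of Fig. 5.1, and checks the peak count $N(N-1)$ and the centrosymmetry of the vector set.

```python
from fractions import Fraction
from itertools import permutations

def patterson_vectors(atoms):
    """All vectors u = r_j - r_i between different atoms, wrapped into [0, 1)."""
    return [tuple((b - a) % 1 for a, b in zip(r_i, r_j))
            for r_i, r_j in permutations(atoms, 2)]

def inverse(vector):
    """The centrosymmetric mate -u, wrapped into [0, 1)."""
    return tuple((-component) % 1 for component in vector)

# Three hypothetical atoms in a two-dimensional unit cell (fractional coordinates).
atoms = [(Fraction(1, 10), Fraction(1, 5)),
         (Fraction(2, 5), Fraction(7, 10)),
         (Fraction(4, 5), Fraction(3, 10))]

vectors = patterson_vectors(atoms)
n_atoms = len(atoms)
is_centrosymmetric = all(inverse(u) in vectors for u in vectors)

def non_origin_peaks(n):
    """Number of Patterson peaks away from the origin for n atoms."""
    return n * (n - 1)
```

For 1000 atoms the same count gives 999,000 non-origin peaks, which is why the native Patterson map of a protein cannot be interpreted atom by atom.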

5.1.2. Interpretation of Patterson map

Among the Patterson peaks, some special peaks called Harker peaks are very useful for the interpretation of the heavy atom positions. The Harker peaks represent the vectors between the symmetry-related atoms in a unit cell. If the unit cell is sliced into sections, the sections containing the Harker peaks are called the Harker sections. For example, the space group $P2_1$ has two equivalent positions, (x,y,z) and (-x,0.5+y,-z), and the Harker peaks are $\pm$((x,y,z) - (-x,0.5+y,-z)) = $\pm$(2x,-0.5,2z). Thus, v=0.5 is the Harker section of the space group $P2_1$, in which the positions (x,z) can be located from 2x=u and 2z=w.

The following is a practical example of the interpretation of the difference Patterson map of the cyclophilin structure.

Fig. 5.2. The difference Patterson maps of the cyclophilin crystal and its iridium derivative. The whole unit cell was sliced into 90 sections along W.

The native cyclophilin crystallized in the space group $P2_12_12_1$ with cell dimensions of $a=43.0$, $b=52.6$ and $c=89.2$ Å. There are four equivalent positions for the space group $P2_12_12_1$: (x,y,z), (0.5-x,-y,0.5+z), (0.5+x,0.5-y,-z), and (-x,0.5+y,0.5-z). The first step in obtaining the heavy atom positions from a difference Patterson map is to make a table listing all the Harker peaks. To do so, place the equivalent positions in the first row and the first column of the table, and subtract them from one another (Table 3).

\begin{tabular}{|c|c|c|c|c|}
\hline
\multicolumn{5}{|c|}{Table 3. Harker peaks for $P2_12_12_1$ (row position minus column position)} \\ \hline
 & $(x,y,z)$ & $(\frac{1}{2}-x,-y,\frac{1}{2}+z)$ & $(\frac{1}{2}+x,\frac{1}{2}-y,-z)$ & $(-x,\frac{1}{2}+y,\frac{1}{2}-z)$ \\ \hline
$(x,y,z)$ & 0 & $(2x-\frac{1}{2},2y,-\frac{1}{2})$ & $(-\frac{1}{2},2y-\frac{1}{2},2z)$ & $(2x,-\frac{1}{2},2z-\frac{1}{2})$ \\ \hline
$(\frac{1}{2}-x,-y,\frac{1}{2}+z)$ & $(\frac{1}{2}-2x,-2y,\frac{1}{2})$ & 0 & $(-2x,-\frac{1}{2},\frac{1}{2}+2z)$ & $(\frac{1}{2},-\frac{1}{2}-2y,2z)$ \\ \hline
$(\frac{1}{2}+x,\frac{1}{2}-y,-z)$ & $(\frac{1}{2},\frac{1}{2}-2y,-2z)$ & $(2x,\frac{1}{2},-\frac{1}{2}-2z)$ & 0 & $(\frac{1}{2}+2x,-2y,-\frac{1}{2})$ \\ \hline
$(-x,\frac{1}{2}+y,\frac{1}{2}-z)$ & $(-2x,\frac{1}{2},\frac{1}{2}-2z)$ & $(-\frac{1}{2},\frac{1}{2}+2y,-2z)$ & $(-\frac{1}{2}-2x,2y,\frac{1}{2})$ & 0 \\ \hline
\end{tabular}

The table clearly shows three Harker sections located at u=0.5, v=0.5, and w=0.5. The diagonal of the table splits the Harker peaks into two portions which are related by the center of symmetry. The six Harker peaks can be further reduced to three because of the mirror symmetry specific to the orthorhombic crystal system. The heavy atom position (x,y,z) can be deduced from a comparison of any three independent relationships in the table with the real difference Patterson map (Fig. 5.2).

For example, the Harker section at w=0.5 (Fig. 5.2) for the iridium derivative of cyclophilin shows eight peaks, two of which are symmetry-independent. You can pick any peak in one quarter to calculate the heavy atom positions. Selecting the peaks in the other quarters only makes a different choice among the eight allowed origins. Now let us interpret the peak P1 in the upper-left quarter of section 45, which has the coordinates u=0.3 and v=0.4 (Fig. 5.2). We have -0.5+2x = 0.3 and 2y=0.4, and thus x=0.4, y=0.2. Remember that it is also allowed to move the Patterson lattice by one unit cell in each dimension, i.e. -0.5+2x = 0.3 + 1 and x=0.9, and similarly y=0.7. Therefore, the four solutions (0.4,0.2), (0.9,0.2), (0.4,0.7), and (0.9,0.7) are equivalent and just reflect a different choice of the allowed origins. The z value is determined from the peaks for the same site at u=0.5 or at v=0.5. In section 27 (of 90 in total, i.e. w=0.3), the peak at v=0.5 has 2x=0.8 or 0.2, which gives x=0.4 or 0.1. The value x=0.4 is equal to the x value of P1 at w=0.5, indicating that the peaks come from the same site. Therefore, use of 0.5 - 2z = 0.3 yields z=0.1. Remember that z=0.6 is also allowed. In summary, the eight solutions from the difference Patterson, (0.4,0.2,0.1), (0.4,0.2,0.6), (0.4,0.7,0.1), (0.4,0.7,0.6), (0.9,0.2,0.1), (0.9,0.2,0.6), (0.9,0.7,0.1), (0.9,0.7,0.6), are equally allowed for the peak P1 and reflect a different choice of the possible origins. The peaks in section 18 are redundant and should confirm the above solution.

On the other hand, the eight solutions with the same (x,y,z) but negative signs, such as (-0.4,-0.2,-0.1), are also possible because of the center of symmetry of the Patterson function. The pair (x,y,z) and (-x,-y,-z) reflects the absolute configuration of a structure, and only one of them is the true solution. Differentiation between (x,y,z) and (-x,-y,-z), or determination of the handedness of the coordinate system, normally requires anomalous scattering data, and can be done by a calculation of the anomalous difference Patterson. The interpretation of the anomalous difference Patterson is similar to that of the difference Patterson. In the absence of anomalous scattering data, the handedness can be determined by comparing the quality of the two $F_o$ maps calculated from the phases of (x,y,z) and (-x,-y,-z). The correct handedness will give an electron density of superior quality. In the case of cyclophilin, the solution (-0.4,-0.2,-0.1) was finally selected based on the quality of its $F_o$ map.

Now we deal with the peak P2 in the section w=0.5 (Fig. 5.2). Theoretically, any site in the orthorhombic system should have eight possible solutions reflecting the eight allowed origins. However, since the origin has been fixed by the first site, the origin for the second site must be consistent with the first site. Therefore, only one solution, instead of eight, is the true solution for the second site. The second site can be determined by (1) the positions of the cross peaks and (2) the difference Fourier synthesis using the phases from the first site. The difference Fourier is the most commonly used technique for determining the second heavy atom site. In the case of cyclophilin, examination of all the sections revealed no other peaks at u=0.5 or v=0.5 corresponding to the peak P2 at w=0.5. Thus, the second peak at w=0.5 must have coordinates (x,y,z) close to those of the peak P1 or result from disordered binding of the heavy atom. This can be seen from the dispersed shape of the peaks in section 18, and was confirmed by the difference Fourier synthesis.
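The arithmetic of the cyclophilin example can be written out as a short sketch. It takes the peak coordinates quoted in the text (u=0.3, v=0.4 in the section w=0.5, and w=0.3 in the section v=0.5), solves each relation of the form 2q = value modulo one cell, and lists the eight origin-related solutions. The Harker relations used, (2x-1/2, 2y, 1/2) and (-2x, 1/2, 1/2-2z), are the differences between the corresponding $P2_12_12_1$ equivalent positions; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)

def half_cell_solutions(value):
    """Solve 2*q = value (mod 1): two solutions, half a cell apart."""
    first = (value % 1) / 2
    return (first, first + HALF)

# Peak P1 in the Harker section w = 1/2: (u, v) = (2x - 1/2, 2y).
u, v = Fraction(3, 10), Fraction(2, 5)
# Matching peak in the Harker section v = 1/2: w = 1/2 - 2z.
w = Fraction(3, 10)

x_values = half_cell_solutions(u + HALF)   # from 2x = u + 1/2
y_values = half_cell_solutions(v)          # from 2y = v
z_values = half_cell_solutions(HALF - w)   # from 2z = 1/2 - w
solutions = sorted(product(x_values, y_values, z_values))

def harker_peak_w_half(x, y):
    """Harker vector between (x,y,z) and (1/2-x,-y,1/2+z), wrapped into [0, 1)."""
    return ((2 * x - HALF) % 1, (2 * y) % 1, HALF)

consistent = all(harker_peak_w_half(x, y) == (u, v, HALF) for x, y, _ in solutions)
```

All eight solutions reproduce the same Harker peak, which is why the Patterson map alone cannot distinguish between the allowed origins.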

5.2. Principle of isomorphous replacement

Assuming that binding of heavy atoms to the parent protein does not change the conformation of the protein, the structure factors of the heavy atom derivative (pq), the native protein (p), and the heavy atom (q) have the following relationship:

$${\bf F}_{pq} = {\bf F}_p + {\bf F}_q$$

or

$$|F_{pq}| \exp(i\alpha_{pq}) = |F_p| \exp(i\alpha_p) + |F_q| \exp(i\alpha_q)$$

If the amplitudes $|F_{pq}|$ and $|F_p|$ are measured experimentally, and $|F_q|$ and $\alpha_q$ are obtained from the heavy atom model, the above vector equation for single isomorphous replacement (SIR) gives two possible solutions for the phase of ${\bf F}_p$, one of which is the true solution, as shown in the geometrical representation of the equation (Fig. 5.3).

Fig. 5.3. (a) Phase diagram for the single isomorphous replacement and (b) the Harker construction for the double isomorphous replacement. The symbols $F_{h}$ and $F_{ph}$ in figure (b) correspond to $F_{q}$ and $F_{pq}$ in the equations, respectively.

When a second heavy atom derivative is prepared whose binding differs significantly from that of the first derivative, two vector equations are obtained. The two equations give a unique solution for the protein phase (Fig. 5.3b). However, this is an oversimplified principle of isomorphous replacement. In practice, more than two heavy atom derivatives are desired because of low occupancy, non-isomorphism, and other reasons. Practical MIR phasing is complicated and can be divided into two steps, phase calculation and heavy atom refinement, as discussed below.
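The two-fold ambiguity of SIR follows from the cosine rule applied to the phase triangle: $|F_{pq}|^2 = |F_p|^2 + |F_q|^2 + 2|F_p||F_q|\cos(\alpha_p - \alpha_q)$, so that $\alpha_p = \alpha_q \pm \Delta$. The following minimal sketch illustrates this with synthetic, error-free numbers; the function name `sir_phases` and the test values are illustrative assumptions, not part of any phasing program.

```python
import cmath
import math

def sir_phases(fp_amp, fpq_amp, fq_amp, alpha_q):
    """Return the two phases of F_p (radians) allowed by one derivative."""
    cos_delta = (fpq_amp**2 - fp_amp**2 - fq_amp**2) / (2.0 * fp_amp * fq_amp)
    # With real data the triangle may not close; clamp to keep acos defined.
    cos_delta = max(-1.0, min(1.0, cos_delta))
    delta = math.acos(cos_delta)
    two_pi = 2.0 * math.pi
    return (alpha_q + delta) % two_pi, (alpha_q - delta) % two_pi

# Synthetic example: native phase 60 degrees, heavy atom phase 10 degrees.
f_p = cmath.rect(100.0, math.radians(60.0))
f_q = cmath.rect(30.0, math.radians(10.0))
f_pq = f_p + f_q

a1, a2 = sir_phases(abs(f_p), abs(f_pq), abs(f_q), math.radians(10.0))
print(math.degrees(a1), math.degrees(a2))
```

The two returned phases lie symmetrically about $\alpha_q$. Only one of them is the true native phase, which is why a second derivative with a different $\alpha_q$ is needed to resolve the ambiguity.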

5.2.1. Phase calculation

The phase triangle in figure 5.4 is usually not closed when experimental data are used, because of errors in the measurement of the structure amplitudes and non-isomorphism between the native and the derivative. A statistical treatment was introduced to handle the lack-of-closure errors (Fig. 5.4).

Fig. 5.4. Effect of errors on phase determination. (a) ideal phase triangle, (b) errors on both heavy atom and heavy atom derivative, and (c) and (d) are two ways to combine the errors.

Assuming a Gaussian distribution of the measurement errors, the phase probability for each derivative is expressed as a normal distribution whose width is the standard deviation:

$$P(\alpha) = N \exp\left(-\frac{\varepsilon^2(\alpha)}{2E^2}\right)$$

where $N$ is a normalization factor, $\varepsilon(\alpha)$ is the lack of closure of the triangle, and $E$ is the standard deviation, which in practice is estimated from the root-mean-square lack of closure of the centric reflections. Once the lack of closure $\varepsilon(\alpha)$ is explicitly defined, as discussed in the later sections, the probability for the phase of each reflection can be calculated. Two peaks of equally high probability are obtained for each heavy atom derivative, corresponding to the two possible solutions in Fig. 5.3a. In the case of MIR, the joint probability for multiple derivatives is equal to the product of the individual probabilities. This joint probability distribution yields one maximum corresponding to the solution common to all derivatives. Two phase sets can be obtained from the above equation: the "most probable" phase, which is taken at the point of maximum probability, and the "best" phase, which is calculated at the centroid of the probability distribution, i.e. from the "best" Fourier synthesis of the Gaussian distribution:

$${\bf F}_{best} = \frac{|F_p| \int_0^{2\pi} P(\alpha) \exp(i\alpha)\, d\alpha}{\int_0^{2\pi} P(\alpha)\, d\alpha} = m\, |F_p| \exp(i\alpha_{best})$$

where m is the figure of merit.
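The phase probability calculation can be sketched numerically for a single reflection. The example below is an illustration under stated assumptions rather than the procedure of any particular program: the lack of closure is taken as $\varepsilon(\alpha) = |F_{pq}|_{obs} - \left| |F_p|\exp(i\alpha) + {\bf F}_q \right|$, the phase circle is sampled every 5 degrees, and the amplitudes, heavy atom vectors, and the value of $E$ are invented test numbers.

```python
import numpy as np

def phase_probability(fp_amp, fpq_obs, fq, E, alphas):
    """Gaussian phase probability P(alpha) for one derivative (normalized)."""
    fpq_calc = np.abs(fp_amp * np.exp(1j * alphas) + fq)
    eps = fpq_obs - fpq_calc                    # lack of closure (assumed form)
    p = np.exp(-eps**2 / (2.0 * E**2))
    return p / p.sum()

def combine(probabilities):
    """Joint probability: product over derivatives, renormalized."""
    joint = np.prod(probabilities, axis=0)
    return joint / joint.sum()

def phase_estimates(joint, alphas):
    """Return (most probable phase, best phase, figure of merit m)."""
    most_probable = alphas[np.argmax(joint)]
    centroid = np.sum(joint * np.exp(1j * alphas))
    return most_probable, np.angle(centroid) % (2.0 * np.pi), np.abs(centroid)

alphas = np.deg2rad(np.arange(0, 360, 5))
true_alpha = np.deg2rad(60.0)                   # synthetic "true" native phase
fp_amp = 100.0
fq1 = 30.0 * np.exp(1j * np.deg2rad(10.0))      # heavy atom vector, derivative 1
fq2 = 25.0 * np.exp(1j * np.deg2rad(200.0))     # heavy atom vector, derivative 2
fpq1 = abs(fp_amp * np.exp(1j * true_alpha) + fq1)
fpq2 = abs(fp_amp * np.exp(1j * true_alpha) + fq2)

p1 = phase_probability(fp_amp, fpq1, fq1, 5.0, alphas)
p2 = phase_probability(fp_amp, fpq2, fq2, 5.0, alphas)
joint = combine([p1, p2])
most_probable, best, fom = phase_estimates(joint, alphas)
print(np.rad2deg(most_probable), np.rad2deg(best), fom)
```

Each single-derivative distribution (`p1`, `p2`) is bimodal, while their product has one dominant peak. The figure of merit is the length of the centroid vector and therefore falls below 1 as the joint distribution broadens.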

5.2.2. Heavy atom refinement

To obtain accurate phases using the above probability formulae, the parameters associated with the heavy atom derivatives need to be refined, including (a) the scale and temperature factors between F$_{pq}$ and F$_{p}$, which bring the derivative data to the level of the native, and (b) the position, occupancy, and isotropic or anisotropic temperature factor of each site in the heavy atom model. A common method to optimize the parameters is least-squares refinement, which minimizes the residual, or lack of closure, between the observed and calculated amplitudes of the derivatives. Assuming that all errors are associated with the heavy atom derivative F$_{pq}$, the lack of closure $\varepsilon(\alpha)$ in eq. 52 is

$$\varepsilon(\alpha) = |F_{pq}|_{obs} - |F_{pq}|_{calc}, \qquad |F_{pq}|_{calc} = \left| |F_p| \exp(i\alpha) + {\bf F}_q \right|$$

The least-squares refinement minimizes the residual

$$\sum_{\bf h} w_h \left( |F_{pq}|_{obs} - |F_{pq}|_{calc} \right)^2$$

where $w_h$ is the weight and h represents a reflection. The observed diffraction data of the derivative and the native are scaled by applying the relative scale (S$_{rel}$) and temperature factor ($B_s$) to the derivative data. The parameters of the heavy atom model enter through

$${\bf F}_q({\bf h}) = \sum_j q_j f_j \exp\left(-B_j \frac{\sin^2\theta}{\lambda^2}\right) \exp(-2\pi i\, {\bf h}\cdot{\bf r}_j)$$

where f is the atomic scattering factor, q is the occupancy, ${\bf r}$ is the position vector (x,y,z) of each heavy atom site, and B is the isotropic or anisotropic temperature factor of each heavy atom site.

6. Preparation of heavy atom derivatives

6.1. Introduction

The isomorphous replacement method requires two kinds of crystals: the native macromolecular crystal and its heavy atom derivatives, which have the same molecular conformation as the native macromolecule but with additional bound heavy metal compounds. Heavy atom derivatives of a macromolecular crystal are prepared by soaking the native crystals in a buffer containing a heavy metal compound or by cocrystallizing the protein with a heavy metal compound. If soaking or cocrystallization does not change the molecular conformation of the native protein, the heavy atom derivative crystals are isomorphous to the native. Isomorphism of the derivatives to the native is required to phase a macromolecular crystal structure. However, non-isomorphism is often a problem in heavy atom derivative preparation. Binding of heavy atoms may disrupt the conformation of the macromolecule or even the crystal lattice. In practice, a heavy atom derivative with small changes in the local molecular conformation, such as around its binding site, may still be useful for phasing a macromolecular structure, but has reduced phasing power.

In theory, a heavy metal compound with more electrons gives a stronger signal of intensity changes, and thus a better derivative. From this point of view, elements in the 6th period of the periodic table are suitable candidates for heavy atom derivatives. However, the actual preparation of heavy atom derivatives is limited by many factors, such as the solubility of the heavy metal compounds in water. The most commonly used heavy atom compounds include mercury compounds and platinum compounds. In addition, selenium is often used to prepare a selenomethionine-substituted protein for anomalous scattering diffraction experiments. Binding of a heavy atom to a protein cannot be predicted theoretically. Practical preparation of a heavy atom derivative requires a trial-and-error search for optimal soaking conditions, such as the concentration of the heavy atom compound, pH, etc.

6.2 Factors affecting heavy metal binding

6.2.1. Effect of pH

pH significantly affects the binding of heavy atoms to proteins, and may be a determining factor in some cases. The effect of pH on binding has two aspects: the protonation state of the amino acid side chains and the binding capability of the heavy metal. The carboxyl groups of aspartic acid and glutamic acid have pK$_a$ values of 3.8 and 4.2, respectively, and are thus largely protonated at pH below 4. Since deprotonation of the carboxyl groups favors the binding of these residues to cations such as Sm$^{3+}$ and Pb$^{2+}$, a pH above 4 is required to prepare such derivatives. Binding of mercury to cysteine is strongest at basic pH, but is dramatically reduced at acidic pH. Here pH acts on the binding capacity of mercury rather than on the protonation state of cysteine. In general, however, pH is not used as a control factor. In practice, the pH of the crystallization buffer is usually chosen for the heavy metal soaking solution, because (1) most heavy atom compounds can bind to proteins over a wide range of pH, even though they may have the tightest association at a certain pH, and (2) use of a pH different from that of the crystallization buffer risks changing the protein conformation or destroying the crystal lattice.
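The statement that a pH above 4 is needed for carboxylate binding can be made quantitative with the Henderson-Hasselbalch relation, which gives the deprotonated fraction of an acidic group as $1/(1 + 10^{pK_a - pH})$. The short sketch below uses the pK$_a$ values quoted above; the function name is hypothetical, and real side-chain pK$_a$ values shift with the local protein environment.

```python
def fraction_deprotonated(pH, pKa):
    """Henderson-Hasselbalch: fraction of an acidic group in the charged form."""
    return 1.0 / (1.0 + 10.0 ** (pKa - pH))

# pKa values quoted in the text for the side-chain carboxyl groups.
PKA = {"Asp": 3.8, "Glu": 4.2}

for pH in (3.0, 4.0, 5.0, 6.0):
    row = ", ".join(
        f"{name}: {fraction_deprotonated(pH, pka):.2f}" for name, pka in PKA.items()
    )
    print(f"pH {pH:.1f} -> {row}")
```

At a pH equal to the pK$_a$ exactly half of the groups carry the negative charge needed for cation binding, and two pH units above it the fraction exceeds 99%.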

6.2.2. Buffer

Although the buffer system and its ionic strength affect the binding of heavy metals to proteins, the crystallization buffer is often used for soaking heavy atom derivatives in order to avoid possible non-isomorphism. However, attention needs to be paid to two factors: (1) the solubility and (2) the stability of the heavy metal compound in the buffer. Phosphate ($PO_4^{3-}$) is poor for the solubility of most heavy metals, such as the cations of mercury, lead, thallium, and uranyl. Too much chloride ($Cl^-$) may cause precipitation of mercury and lead. Sulfate is poor for the solubility of mercury and lead, while acetate is good for most heavy metals. A full list of the solubilities of heavy metals in various buffers may be found in the "Handbook of Chemistry and Physics".

Ammonium sulfate is the most common reagent used in crystallization. However, ammonia derived from ammonium can react with many heavy metals, such as platinum, palladium, silver, gold, and mercury. At basic pH, ammonium ions are deprotonated and release free ammonia. Ammonia from an excess of ammonium sulfate may replace the chloride ligands of $PtCl_4^{2-}$ to yield the less reactive $Pt(NH_3)_4^{2+}$. Imidazole buffer needs to be avoided in the preparation of mercury derivatives because mercury is highly reactive with imidazole. Imidazole is also reported to react with $PtCl_4^{2-}$. For the preparation of platinum derivatives, bicine works better as a buffer than Tris. In general, Tris is often used as a basic buffer and acetate as an acidic buffer for soaking heavy metals.

6.2.3. Precipitant

When protein crystals are transferred from the crystallization buffer to a heavy metal soaking buffer, they may dissolve because the protein concentration is zero or diluted. A high concentration of polyethylene glycol (15-40%) or ammonium sulfate (2 M, for example), depending on the original crystallization conditions, is needed to prevent the crystals from dissolving. PEG is preferred over ammonium sulfate for its low ionic strength and small effect on the binding of heavy metals. However, a high concentration of PEG may slightly dehydrate and shrink the crystal lattice because of its high affinity for water, and thus cause non-isomorphism. Presoaking the native crystals in the same buffer that is used for heavy metal soaking, but without the heavy metal compound, should solve this problem.

6.2.4. Others

Other factors affecting the preparation of heavy atom derivatives of proteins include the concentration of the heavy metal compound, soaking time, soaking temperature, etc. The concentration of the heavy metal is an important factor. Too much heavy metal may cause several problems, such as non-isomorphism, loss of diffraction, and dissolution of the protein crystals. A suitable range of heavy metal concentration needs to be found experimentally, but experience may help in the quick preparation of heavy atom derivatives, as discussed in the next section, "Survey of heavy metals". Soaking time is a less sensitive factor, because most reactions happen very fast. A few days of soaking should be long enough for most heavy metals to reach binding equilibrium, but the reaction of mercury appears to be faster: several hours or even a few tens of minutes were reported to be long enough to prepare a mercury derivative usable for structure determination. If soaking in a heavy metal causes loss of diffraction of the protein crystals, it is advisable to lower the concentration of the heavy metal compound rather than to shorten the soaking time. As for most enzymatic reactions, an increase of 10$^o$C may raise the reaction rate of heavy metal binding by as much as ten-fold. However, the reaction rate is not a key issue in the preparation of heavy metal derivatives, because the actual soaking time in the heavy metal solution is often longer than the time needed to reach equilibrium. Therefore, it is not necessary to test the soaking temperature; room temperature is conveniently chosen for most proteins, or 4$^o$C for some special cases.

6.3. Survey of heavy metals

6.3.1 Mercury

Mercury is the most common heavy metal reagent used in the preparation of heavy atom derivatives of proteins. It has 80 electrons and can cause changes in diffraction intensity as large as 40%.\footnote{"Protein Crystallography" edited by Blundell and Johnson (1976), 183-239.} Mercury binds covalently to the sulphur atom of cysteine and occasionally binds noncovalently to histidine, arginine, and other residues. In zinc proteins, mercury may replace Zn. For example, soaking of carboxypeptidase in 5 mM mercuric chloride results in substitution of mercury at the zinc site.\footnote{Lipscomb, W. N., Coppola, J. C., Hartsuck, J. A., Ludwig, M. L., Muirhead, H., Searl, J., & Steitz, T. A. (1966), J. Mol. Biol., 19, 423.}

Many compounds of mercury are good candidates for the preparation of mercury-protein derivatives. The most common are compounds with mercury in the +2 valence state, such as mercuric nitrate ($Hg(NO_3)_2$), methyl-mercurichloride ($CH_3HgCl$), ethyl-mercurichloride ($CH_3CH_2HgCl$), $p$-chloromercuribenzoate (PCMB), etc. A more complete list can be found in "Protein Crystallography" edited by Blundell & Johnson, 1976, pp 183-240, Academic Press. Organic substituents such as the methyl and ethyl groups are covalently bonded to mercury and do not dissociate during soaking, while inorganic substituents such as nitrate and chloride form ionic bonds with mercury and can ionize to yield a positively charged mercury group. Thus, the mechanism for the formation of the covalent bond between mercury and cysteine is considered to involve the positively charged mercury ion and the negatively charged sulphur ion $S^-$ of cysteine, which may be generated in certain environments. A very acidic pH therefore disfavors formation of the mercury-sulphur covalent bond, because the unstable sulphur ion $S^-$ of cysteine is protonated at low pH. Nevertheless, mercury reacts strongly with cysteine over a pH range of 4 to 10, especially at basic pH such as pH 8.0.
In addition, mercury also reacts strongly with histidine (pK$_a$ = 6.5) at pH above 6.0. Since the mercury reaction involves a positively charged group, an excess of negatively charged ions such as chloride may weaken the reaction. The binding of mercury compounds to proteins depends on their size, shape, and substituents. Mercury compounds with large substituents generally have mild reactivity. In contrast, mercury compounds with a small substituent, such as $CH_3HgCl$, may not only react with amino acids on the surface of the protein molecule, but are also able to penetrate into the hydrophobic core of the protein to link covalently to cysteine. In short, preparation of a mercury derivative needs several trial-and-error experiments. The concentration of mercury compounds was reported to be 0.1 mM to 1 mM for many proteins, and the soaking time ranges from 15 min to a few days, depending on the individual mercury compound.

Occasionally, an anionic form of mercury, such as $HgCl_4^{2-}$, $HgBr_4^{2-}$, $HgI_4^{2-}$, or $Hg(SCN)_4^{2-}$, is used for heavy atom derivatives. The negatively charged mercury ions bind to cysteine and histidine, as the positively charged species do. This is probably due to dissociation of the mercuric complex ions. For example, $HgI_4^{2-}$ may decompose to $HgI_3^-$, $HgI_2$, $HgI^+$, and $I^-$, and the reaction may proceed through the neutral or positively charged species.

6.3.2 Platinum and analogous reagents

Platinum compounds are another family widely used in the preparation of heavy atom derivatives of proteins. Platinum forms stable complexes with cysteine, histidine, lysine, methionine, and arginine. The common platinum reagents are $PtCl_4^{2-}$, $Pt(NO_2)_4^{2-}$, $Pt(CN)_4^{2-}$, $PtBr_4^{2-}$, etc. Their reactivity is in the following order. The platinum compounds are relatively unstable. For example, $PtCl_4^{2-}$ in an excess of ammonium sulphate may convert to $Pt(NH_3)_4^{2+}$ in a few days. If $PtCl_4^{2-}$ is stored in aqueous buffer for a long time, it may also turn into $Pt(H_2O)_4^{2+}$ or its intermediates. Since $Pt(NH_3)_4^{2+}$ and $Pt(H_2O)_4^{2+}$ are much less reactive, it is advisable to use a freshly prepared solution of $PtCl_4^{2-}$. That may be the reason why only freshly made $PtCl_4^{2-}$ gave intensity changes in the X-ray diffraction of ribonuclease S crystals.\footnote{Wyckoff, H. W., Hardman, K. D., Allewell, N. M., Inagami, T., Tsernoglou, D., Johnson, L. N., & Richards, F. M. (1967), J. Biol. Chem., 242, 3984.} In addition, $PtCl_4^{2-}$ is capable of reacting with Tris. If the crystals are grown in Tris, bicine may be used to replace Tris as the buffer. The concentration of platinum compounds is in the range of 0.1 to 1 mM, and soaking for 2 to 4 days should be long enough to reach equilibrium.

The above compounds contain platinum in the 2+ valence state. Platinum compounds with $Pt(IV)$, such as $K_2PtCl_6$, are sometimes used for heavy metal derivative preparation. The negatively charged ion $PtCl_6^{2-}$ binds to arginine and lysine in a way similar to $PtCl_4^{2-}$, probably because $PtCl_6^{2-}$ is reduced to $PtCl_4^{2-}$ in the reaction system. In addition, $PdCl_4^{2-}$ and $AuCl_4^{-}$ have an electron configuration similar to that of $PtCl_4^{2-}$ and form square planar complexes. The reaction rates are in the order of

6.3.3 Iridium and Osmium

Iridium and osmium are transition metals belonging to the same group as platinum (VIII B in the periodic table), and thus share similar properties with platinum. They coordinate with the imidazole and amino groups of histidine, arginine, lysine, and tryptophan. The commonly used compounds are $IrCl_6^{3-}$ and $OsCl_6^{2-}$, which have an octahedral configuration. In addition, the cationic complex $Ir(NH_3)_6I_3$ was found to be useful for heavy atom derivatives of ferricytochrome c. The compounds of $Ir(III)$ and $Os(IV)$ are stable and can be stored in aqueous solution for a long time. For practical preparation of an iridium derivative, the concentration of $K_3IrCl_6$ can be as high as 10 mM without destroying the protein crystals. Soaking for 2-4 days should be long enough to see an intense orange color in the crystals if iridium binds to the protein.

6.3.4. Uranate and cation reagents

Uranium and lead are cation reagents. They dissociate to positively charged heavy metal ions and form charge-charge interactions with negatively charged amino acids such as glutamic acid and aspartic acid. Many other heavy metals can generate cations for that purpose, but lanthanide and actinide ions, in addition to uranium and lead, are the most common. The reactivity of heavy metal cations with the negatively charged residues is stronger at basic pH and in a buffer of low ionic strength. However, it has been reported that useful heavy atom derivatives were prepared at high salt concentrations such as 3 M ammonium sulfate (see "Protein Crystallography" edited by Blundell and Johnson (1976), pp 200-204). Some buffer components, such as citrate, may strongly chelate heavy metal cations. Acetate and Tris are suitable buffers for cation reagents.

The most commonly used uranium species is the uranyl ion, O=U=O$^{2+}$. It is a linear ion in which uranium is double-bonded to two oxygen atoms. Uranium has 92 electrons and is the heaviest metal used in the preparation of heavy atom derivatives of proteins. Examples of uranyl compounds used in the practical preparation of heavy atom derivatives are $UO_2Ac_2$, $UO_2(NO_3)_2$, and $K_3UO_2F_5$. The concentration of uranyl compounds varies from 0.1 mM to 100 mM for practical heavy atom derivatives.

Lanthanides also bind to carboxylate side chains. The most stable valence state of the lanthanide elements is three plus (III). Several members of the lanthanide group have been used in practice for heavy atom derivatives of proteins, such as $SmCl_3$, $TbCl_3$, $GdAc_3$, $LaCl_3$, $EuCl_3$, etc. The concentration of lanthanide ions ranges from 1 to 40 mM. Unlike uranium and the lanthanides, thallium ($Tl^+$) and lead ($Pb^{2+}$) are two non-transition cations which react with carboxylate side chains.
The solubility of $TlNO_3$ and $Pb(NO_3)_2$, as well as of their acetate salts, is high, but lead may easily precipitate in a phosphate buffer, which prevents soaking at high concentration. A novel lead compound, trimethyllead acetate, has been highly recommended as a choice of heavy metal.\footnote{Holden & Rayment, (1991), Archives Biochem. Biophys. 291, 187-194.} This compound binds specifically to aspartate or glutamate near a hydrophobic environment. The advantage is that this compound gives a heavy atom derivative isomorphous to the native. The reaction rate of trimethyllead acetate is slow: it requires a high concentration (20-40 mM) and about two weeks of soaking.

7.1. Introduction

We have learned how multiple isomorphous replacement is used to solve crystal structures of macromolecules. This method requires the preparation of heavy atom derivatives, which is often difficult because the heavy metal derivatives may not be isomorphous to the native crystals, or because no suitable heavy metal compound can be found that binds to the macromolecule. Molecular replacement is an alternative method to phase crystal structures without the preparation of heavy atom derivatives. Macromolecules from different sources, such as human and bacteria, may have different chemical components and different sequences, but they usually have similar three-dimensional molecular structures. Thus, a known macromolecular structure from one species can be used to solve the unknown structures from other species, because the molecular structures are expected to be the same even though the sequence homology may be as low as 30%. In addition, some proteins can crystallize in several space groups under the same or different crystallization conditions. The same molecule in different space groups usually has the same molecular conformation but different lattice packing or intermolecular interactions. To determine these types of crystal structures, one can use a known structure as the initial model and find its orientation and position in the new crystal system. This method is called molecular replacement, and is another major method for the determination of macromolecular structures. In fact, the majority of structures in the Brookhaven Protein Data Bank were determined by molecular replacement. The advantage of molecular replacement is that it only needs a native diffraction data set, without the preparation of heavy atom derivatives. The second use of the molecular replacement method is to find the molecular symmetry in the asymmetric unit and to use it to improve the MIR phases.
If the crystallographic symmetry-independent unit, or asymmetric unit, contains more than one molecule, these molecules may be related by a 2-fold, 3-fold, or other rotational axis. The symmetry within the asymmetric unit is called molecular symmetry or local symmetry, and is useful for improving phases. This chapter starts with an introduction to the transformation of coordinate systems, and then continues with the rotation and translation functions and local symmetry.

7.2. Coordinate transformation

Crystal structures are often stored in a three-dimensional coordinate system such as the Cartesian system. A set of coordinates will have different numerical values of (x,y,z) when it is expressed in a different coordinate system or when the coordinate system is rotated. For example, if a two-dimensional Cartesian system is rotated by an angle $\alpha$ (Fig. 7.1a), the new coordinates $(x',y')$ after the rotation can be expressed as the application of a rotation matrix to the original coordinates $(x,y)$:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}$$

Among the seven crystal systems, the hexagonal, triclinic, and monoclinic systems have unit cell angles not equal to 90$^o$. The coordinates stored in these crystallographic systems are different from those stored in an orthogonal coordinate system. The coordinates in different coordinate systems can be transformed into one another. For example, with the orthogonal X axis placed along the crystallographic $a$ axis, the transformation from fractional coordinates (x,y,z) in the hexagonal crystal system ($a = b$, $\gamma = 120^o$) to an orthogonal system (Fig. 7.1b) is

$$\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = \begin{pmatrix} a & a\cos 120^o & 0 \\ 0 & a\sin 120^o & 0 \\ 0 & 0 & c \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix}$$

Fig. 7.1. Coordinate transformation, (a) rotation of a two-dimensional Cartesian system by the angle $\alpha$, and (b) transformation from the hexagonal crystallographic system to the orthogonal system.

Fig. 7.2. (a) Eulerian angles $\alpha, \beta, \gamma$, and (b) polar angles $\chi, \omega, \phi$.
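The two transformations of this section can be written as small matrix operations. The sketch below is illustrative only: it assumes that the 2D case describes a rotation of the axes (so the coordinates of a fixed point change), that hexagonal coordinates are fractional, and that the orthogonal X axis lies along $a$. The function names and cell dimensions are hypothetical.

```python
import numpy as np

def rotate_axes_2d(xy, alpha_deg):
    """Coordinates of a fixed point after the axes are rotated by alpha."""
    a = np.deg2rad(alpha_deg)
    m = np.array([[np.cos(a), np.sin(a)],
                  [-np.sin(a), np.cos(a)]])
    return m @ np.asarray(xy, dtype=float)

def hexagonal_to_orthogonal(frac, a, c):
    """Fractional hexagonal coordinates -> orthogonal coordinates (X along a)."""
    g = np.deg2rad(120.0)
    m = np.array([[a, a * np.cos(g), 0.0],
                  [0.0, a * np.sin(g), 0.0],
                  [0.0, 0.0, c]])
    return m @ np.asarray(frac, dtype=float)

# A point on the old x axis lies on the negative new y axis after a 90 degree turn.
print(rotate_axes_2d([1.0, 0.0], 90.0))

# The lattice vector a + b of a hexagonal cell has the same length as a itself.
ab = hexagonal_to_orthogonal([1.0, 1.0, 0.0], a=10.0, c=20.0)
print(ab, np.linalg.norm(ab))
```

The inverse transformations are obtained by inverting the matrices, for example with `np.linalg.inv`.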

7.3. Eulerian and Polar angles

It is convenient to perform rotations and to compute the rotation function in terms of the Eulerian angles $(\alpha, \beta, \gamma)$. The following definition follows the convention of the program suite CCP4, which is based on the mathematical method of Crowther.\footnote{Crowther, R. A., in "The molecular replacement method" (ed. M. G. Rossmann), 173-178, Gordon & Breach, 1972.} In the orthogonal coordinate system (x,y,z), z is placed along the crystallographic axis of highest symmetry. Thus, in most crystal systems except the monoclinic, the crystallographic $c$ axis coincides with z, and the crystallographic $a$ axis is along the direction of x. In the monoclinic system, the crystallographic $b$ axis is along z, and the crystallographic $c$ axis coincides with x. The Eulerian rotation in the orthogonal system is carried out by three operations (Fig. 7.2): (1) a rotation of the coordinate system by $\alpha$ about z, (2) a rotation by $\beta$ about the new y axis, and (3) a rotation by $\gamma$ about the new z axis. The coordinate transformation or rotation can be expressed in the following general equation:

$${\bf X}' = [R]\,{\bf X}$$

The rotation matrix of the Eulerian system is

$$[R] = \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{pmatrix}$$

where

$$\begin{array}{lll} r_{11} = \cos\alpha\cos\beta\cos\gamma - \sin\alpha\sin\gamma, & r_{12} = -\cos\alpha\cos\beta\sin\gamma - \sin\alpha\cos\gamma, & r_{13} = \cos\alpha\sin\beta, \\ r_{21} = \sin\alpha\cos\beta\cos\gamma + \cos\alpha\sin\gamma, & r_{22} = -\sin\alpha\cos\beta\sin\gamma + \cos\alpha\cos\gamma, & r_{23} = \sin\alpha\sin\beta, \\ r_{31} = -\sin\beta\cos\gamma, & r_{32} = \sin\beta\sin\gamma, & r_{33} = \cos\beta \end{array}$$

The Eulerian system is convenient for comparison with the crystallographic symmetry. However, its disadvantage is that the rotations $\alpha$ and $\gamma$ are not independent: only the combination ($\alpha + \gamma$) or ($\alpha - \gamma$) matters when $\beta$ is 0 or $\pi$, respectively. Another widely used coordinate system is that of the spherical polar angles $(\chi, \omega, \phi)$, as defined in Fig. 7.2b. Here, the polar axis is z, $\omega$ and $\phi$ define the direction of the rotation axis in terms of longitude and latitude, and $\chi$ is the rotation angle about that axis. The advantage of the polar system is that the $\chi$ angle directly represents the molecular symmetry. For instance, $\chi = 180^o$ is a molecular 2-fold axis, $\chi = 120^o$ is a molecular 3-fold axis, and so on.
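The behaviour of the two angle systems can be checked numerically. The sketch below builds the Eulerian matrix as the product $R_z(\alpha) R_y(\beta) R_z(\gamma)$ and the polar-angle matrix with the Rodrigues formula. Both constructions are assumptions made for illustration: the axis direction is taken as $(\sin\omega\cos\phi, \sin\omega\sin\phi, \cos\omega)$, and a given program may use a different sign or ordering convention, so its documentation should be consulted.

```python
import numpy as np

def rz(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def ry(b):
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def euler_zyz(alpha, beta, gamma):
    """Eulerian rotation as Rz(alpha) Ry(beta) Rz(gamma); angles in radians."""
    return rz(alpha) @ ry(beta) @ rz(gamma)

def polar_matrix(omega, phi, chi):
    """Rotation by chi about the axis with polar angles (omega, phi)."""
    n = np.array([np.sin(omega) * np.cos(phi),
                  np.sin(omega) * np.sin(phi),
                  np.cos(omega)])
    k = np.array([[0.0, -n[2], n[1]],
                  [n[2], 0.0, -n[0]],
                  [-n[1], n[0], 0.0]])
    return np.eye(3) + np.sin(chi) * k + (1.0 - np.cos(chi)) * (k @ k)

def rotation_angle(r):
    """Recover chi from any rotation matrix via its trace."""
    return np.arccos(np.clip((np.trace(r) - 1.0) / 2.0, -1.0, 1.0))

a, g = 0.4, 1.1
# beta = 0: only the sum alpha + gamma matters.
print(np.allclose(euler_zyz(a, 0.0, g), rz(a + g)))
# beta = pi: only the difference alpha - gamma matters.
print(np.allclose(euler_zyz(a, np.pi, g), euler_zyz(a - g, np.pi, 0.0)))
# A 3-fold axis in an arbitrary direction has chi = 120 degrees.
print(np.rad2deg(rotation_angle(polar_matrix(0.3, 1.1, 2.0 * np.pi / 3.0))))
```

The trace relation $\cos\chi = (\mathrm{tr}[R] - 1)/2$ is independent of the angle convention, which makes it a convenient way to test whether a rotation function peak corresponds to a molecular 2-fold or 3-fold axis.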

7.4. Principle of molecular replacement

If a molecule has a coordinate set ${\bf X}_1$ in one crystal lattice (e.g. the known crystal structure) and ${\bf X}_2$ in another crystal lattice (e.g. the structure to be determined), the two coordinate sets ${\bf X}_1$ and ${\bf X}_2$ are related by

$${\bf X}_2 = [R]\,{\bf X}_1 + {\bf t}$$

where ${\bf X}$ is the coordinate vector (x,y,z), $[R]$ is the 3x3 rotation matrix, and ${\bf t}$ is the translation vector in three-dimensional space (x,y,z). When a structure and its coordinate set ${\bf X}_1$ are known in one crystal lattice, the coordinates ${\bf X}_2$ in the second crystal lattice can be obtained by finding the rotation matrix and the translation vector. In general, an unknown structure can be determined by using a known structure as the initial model, as long as the molecules have significant sequence homology (e.g. $>$30%) or are believed to have the same molecular conformation. Structure determination by molecular replacement usually includes two steps: the rotation function and the translation function.

7.4.1 Rotation function

The rotation matrix relating one crystal lattice to another can be found in several ways, among which the Patterson function is the most common. When two Patterson maps are compared, one calculated from the observed structure factors $F_{obs}$ of the unknown structure and the other calculated from a known structure, the patterns of the two maps will be similar if the orientation of the known structure is the same as that in the unknown crystal. In other words, the overlap of the two Patterson functions has a maximum at the rotation angles $(\alpha, \beta, \gamma)$ that bring the two structures into the same orientation. Thus, the rotation function is defined by

$$R(\alpha, \beta, \gamma) = \int_U P_1({\bf u})\, P_2({\bf u}_r)\, d{\bf u}, \qquad {\bf u}_r = [C]\,{\bf u}$$

where $U$ is the volume of the vector space over which the integral is taken and $[C]$ is the rotation matrix. The Patterson function $P_1$ is calculated from the observed structure factors $F_{obs}$ of the unknown structure:

$$P_1({\bf u}) = \frac{1}{V_1} \sum_{\bf h} |F_{obs}({\bf h})|^2 \exp(-2\pi i\, {\bf h}\cdot{\bf u})$$

The rotated Patterson $P_2$ is calculated from the known model at all possible rotation angles:

$$P_2({\bf u}_r) = \frac{1}{V_2} \sum_{\bf k} |F_{calc}({\bf k})|^2 \exp(-2\pi i\, {\bf k}\cdot[C]{\bf u}) = \frac{1}{V_2} \sum_{\bf k} |F_{calc}({\bf k})|^2 \exp(-2\pi i\, [\tilde{C}]{\bf k}\cdot{\bf u})$$

where $[\tilde{C}]$ is the transpose of the matrix $[C]$, i.e. the matrix flipped along its diagonal. If ${\bf h}'$ is used to represent $[\tilde{C}]{\bf k}$, the second Patterson becomes

$$P_2({\bf u}_r) = \frac{1}{V_2} \sum_{\bf k} |F_{calc}({\bf k})|^2 \exp(-2\pi i\, {\bf h}'\cdot{\bf u})$$

Remember that ${\bf h}$ or ${\bf k}$ is the diffraction index (h,k,l) in reciprocal space and is an integer vector. However, the elements of the transformation matrix $[\tilde{C}]$ are not integers in general, and thus ${\bf h}'$ is non-integral. Substituting the above Patterson functions into the rotation function gives

$$R(\alpha, \beta, \gamma) = \frac{1}{V_1 V_2} \sum_{\bf h} \sum_{\bf k} |F_{obs}({\bf h})|^2\, |F_{calc}({\bf k})|^2 \int_U \exp(-2\pi i\, ({\bf h} + {\bf h}')\cdot{\bf u})\, d{\bf u}$$

The integral is an interference function and is complicated. Rossmann and Blow\footnote{Rossmann, M. G. & Blow, D. M. (1962), Acta Cryst. 15, 24.} computed the rotation function by transforming the crystallographic system into a Cartesian coordinate system and then converting this into a system of Eulerian angles, as in the program GLRF.\footnote{Tong, L. & Rossmann, M. G. (1990) Acta Cryst. A46, 783-792.} This computation is very time consuming.
Crowther expanded the Patterson density in terms of spherical harmonics rather than the Cartesian Fourier components $F_{h}$ and showed that the computation can be 100 times faster.\footnote{Crowther, R. A., in "The molecular replacement method" (ed. M. G. Rossmann), 173-178, Gordon & Breach, 1972.} Crowther's method is often called the "fast rotation function" and is implemented in many programs such as AMoRe.\footnote{Navaza, J. (1994) Acta Cryst. A50, 157-163.}

Qualitatively, two aspects of the rotation function need to be emphasized. First, the rotation function is of the order of $F^4$ and is therefore dominated by strong reflections. Small $F$s, i.e. weak reflections, contribute little to the rotation function and are therefore often omitted to accelerate the calculation. Second, the Patterson terms can be grouped into two sets of vectors, $S_a$ and $S_c$. Vectors between atoms within the same molecule ($S_a$) are called self-vectors, while vectors between atoms of molecules in different asymmetric units ($S_c$) are called cross-vectors. The self-vectors $S_a$ are the same in different crystal lattices, assuming that the molecular conformation remains unchanged. However, the cross-vectors $S_c$ differ between crystal lattices. When the correct orientation is found, the self-vectors $S_a$ give a maximum of the rotation function, while the cross-vectors $S_c$ differ and generate noise. In the practical rotation of a known model into an unknown crystal system, a large unit cell, for example twice the size of the rotated object, is used to host the known model in order to eliminate the cross-vectors.
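The role of the self-vectors can be illustrated with a two-dimensional toy problem in real space. This is not the reciprocal-space Rossmann-Blow or Crowther algorithm: point atoms replace the electron density, the overlap of the two self-vector sets is scored with a Gaussian kernel of arbitrary width, and all names and coordinates are invented for the example.

```python
import numpy as np

def self_vectors(xy):
    """All interatomic vectors r_i - r_j (i != j) within one molecule."""
    d = xy[:, None, :] - xy[None, :, :]
    return d[~np.eye(len(xy), dtype=bool)]

def rot2d(angle_deg):
    a = np.deg2rad(angle_deg)
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

def overlap(v1, v2, sigma=0.5):
    """Overlap of two Gaussian-smeared vector sets (a crude Patterson product)."""
    d2 = ((v1[:, None, :] - v2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma**2)).sum()

def rotation_search(target_xy, model_xy, step=5):
    """Score the model self-vectors at each trial angle against the target."""
    p1 = self_vectors(target_xy)
    p2 = self_vectors(model_xy)
    # In 2D the vector set is centrosymmetric, so 0-180 degrees is sufficient.
    angles = np.arange(0, 180, step)
    scores = np.array([overlap(p1, p2 @ rot2d(a).T) for a in angles])
    return int(angles[np.argmax(scores)]), scores

rng = np.random.default_rng(1)
model = rng.uniform(0.0, 10.0, size=(6, 2))            # known model
target = model @ rot2d(40.0).T + np.array([3.0, -2.0])  # "unknown" crystal copy

best_angle, scores = rotation_search(target, model)
print(best_angle)
```

Because self-vectors do not depend on the position of the molecule, the translation applied to `target` has no effect on the search. This is why the orientation can be determined before the translation.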

7.4.2 Translation function

When the rotation matrix is established, the next parameter need to be determined is translation, or the relative position of a molecule to the origin of the crystal system. In the seven crystal systems, the origin is fixed to comply with the crystal symmetry. For the space group P1, origin can be arbitrarily defined because of no symmetry in P1 and thus there is no need to search for the translation. The origin for the space groups of the monoclinic system is arbitrary in the direction y but is fixed in the x and z directions. The origins for all other crystal systems are fixed by the symmetry. The translation can be determined in several ways, either in real space or reciprocal space. Early approaches to the translation problem use Patterson function. We describe here the Q" function and cross-Patterson. Assume that a crystal system has a symmetry operation $T$ at position $-{t}$ relative to the origin of the crystal lattice and a known structure has a coordinate position ${r}$ to the same origin, the symmetry-related molecule will have the coordinates at $T({r}_j + {t}) - {t}$. The interatomic vectors between molecules related by the symmetry is given by ${r}_j + {t} - T({r}_{j'} + {t})$. The Q" function is defined as the Patterson function of the symmetry-related cross-vectors. footnote{Tollin, P. (1966) {Acta Cryst.} {21}, 613.} The Q" function is calculated at various {t} translations and has a maximum at {t} where the trial cross vectors are properly positined relative to the symmetry axis. This {t} value is the true translation. Let us look at an example of 2-fold axis in the space group P2. The monoclinic space group has an arbitrary origin for the y direction so that the relative position of {t} is $(X_0,Z_0)$. The 2-fold axis generates an equivalent position of (-x,y,-z) from (x,y,z). 
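Thus, ${\bf r}_j + {\bf t} - T({\bf r}_{j'} + {\bf t}) = (x+x'+2X_0,\ y-y',\ z+z'+2Z_0)$, and the ``Q'' function becomes a function of the two unknowns $X_0$ and $Z_0$ only.

The second approach to the translation problem is the cross Patterson, which is defined by\footnote{Crowther, R. A. & Blow, D. M. (1967) {\em Acta Cryst.} {\bf 23}, 544.}
\[ C({\bf t}) = \int_V P({\bf u})\, P({\bf u},{\bf t})\, d{\bf u} \]
where $P({\bf u})$ is the Patterson function of the observed diffraction of the unknown structure and $P({\bf u},{\bf t})$ is the cross Patterson function of the known model, which is calculated by
\[ P({\bf u},{\bf t}) = \int_V \rho_1({\bf x})\, \rho_2({\bf x}+{\bf u}-{\bf t})\, d{\bf x} = \frac{1}{V} \sum_{\bf h} F_1({\bf h})\, F_2^*({\bf h}) \exp[-2\pi i\, {\bf h}\cdot({\bf u}-{\bf t})] \]
where $\rho_1$ is the electron density of the model molecule, $\rho_2$ is the electron density of the symmetry-related molecule, and ${\bf t}$ is the translation between molecules 1 and 2. $F_1$ is the structure factor of the model and $F_2^*$ is the complex conjugate of the structure factor of the model after application of the symmetry operation. Thus, the cross Patterson becomes
\[ C({\bf t}) = \frac{1}{V} \sum_{\bf h} |F_{obs}({\bf h})|^2\, F_1({\bf h})\, F_2^*({\bf h}) \exp(2\pi i\, {\bf h}\cdot{\bf t}) \]

The cross Patterson and the ``Q'' function both use the Patterson function and are similar in many ways, but both are complicated to compute. Nowadays, fast computers allow the translation function to be calculated in a simpler way: the Fourier transformation is performed at various translations, and the crystallographic R-factor and the correlation coefficient are computed as functions of the translational position of the subunit in the unit cell. The solution with a low R-factor and a high correlation coefficient is most likely to be the correct translation position. The crystallographic R-factor is defined as
\[ R = \frac{\sum_{hkl} \left| |F_{obs}| - k|F_{calc}| \right|}{\sum_{hkl} |F_{obs}|} \]
where $k$ is a scale factor, and the correlation coefficient as
\[ CC = \frac{\sum_{hkl} (|F_{obs}| - \langle |F_{obs}| \rangle)(|F_{calc}| - \langle |F_{calc}| \rangle)}{\sqrt{\sum_{hkl} (|F_{obs}| - \langle |F_{obs}| \rangle)^2 \sum_{hkl} (|F_{calc}| - \langle |F_{calc}| \rangle)^2}} \]

The following is the algorithm for the translation search used in the CCP4 program SEARCH written by Dodson.\footnote{Dodson, E. J. (1985) ``Molecular Replacement'', Proceedings of the Daresbury study weekend, p33-38, Daresbury Laboratory.} The calculated structure factor ${\bf F}_{calc}$ for the model molecules in the whole unit cell is the sum of the structure factors of the molecule in the asymmetric unit (${\bf F}_{c1}$) and of its symmetry-related molecules (${\bf F}_{c2}, \ldots$):
\[ {\bf F}_{calc}({\bf h}) = {\bf F}_{c1}({\bf h}) + {\bf F}_{c2}({\bf h}) + \cdots = \sum_{i=1}^{m} {\bf F}_{ci}({\bf h}) \]
where $m$ is the number of symmetry operations. If the molecule in the asymmetric unit has the coordinates ${\bf r}$ and a translation ${\bf t}$, then
\[ {\bf F}_{c1}({\bf h}) = \sum_j^n F_j \exp[-2\pi i\, {\bf h}\cdot({\bf r}_j + {\bf t})] = \exp(-2\pi i\, {\bf h}\cdot{\bf t}) \sum_j^n F_j \exp(-2\pi i\, {\bf h}\cdot{\bf r}_j) \]
where $n$ is the number of atoms in the asymmetric unit.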
The summation in the above equation, $\sum_j^n F_j \exp(-2\pi i\, {\bf h}\cdot{\bf r}_j)$, represents the structure factor of the molecule in the asymmetric unit at the translation ${\bf t} = 0$, written ${\bf F}_{c1}^o({\bf h})$. Thus, ${\bf F}_{c1}({\bf h})$ at the translation ${\bf t}$ equals the product of ${\bf F}_{c1}^o({\bf h})$ and a phase shift of $\exp(-2\pi i\, {\bf h}\cdot{\bf t})$. Similarly, the symmetry-related molecules have the coordinates $[C_i]({\bf r}+{\bf t})$, where $[C_i]$ is the $i$th symmetry operation. Substituting these coordinates into the definition of the structure factor gives
\[ {\bf F}_{calc}({\bf h}) = \sum_{i=1}^{m} \sum_j^n F_j \exp\{-2\pi i\, {\bf h}\cdot [C_i]({\bf r}_j + {\bf t})\} \]
Therefore, once ${\bf F}_{c1}^o({\bf h})$ has been calculated, the structure factor at any translation is obtained simply by summing the symmetry-related terms with the appropriate phase modifications. The translation is then determined by comparing the R-factors and correlation coefficients.
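The phase-shift idea can be tried out on a toy problem. The following is a minimal sketch, not the SEARCH program itself: it assumes the space group P2 with the 2-fold axis along y, point atoms with constant scattering factors, a hypothetical four-atom model, and noise-free ``observed'' amplitudes generated from a known translation. All function names (`model_structure_factors`, `f_calc_p2`, `translation_search`, and so on) are illustrative only. The model structure factors ${\bf F}_{c1}^o$ are computed once, and every trial translation only applies phase shifts before the R-factor and correlation coefficient are evaluated.

```python
import cmath
import math

TWO_PI = 2.0 * math.pi

def model_structure_factors(atoms, hkls):
    """F_c1^o(h): structure factor of the model at t = 0 for every index h."""
    return {
        (h, k, l): sum(f * cmath.exp(-1j * TWO_PI * (h * x + k * y + l * z))
                       for f, (x, y, z) in atoms)
        for (h, k, l) in hkls
    }

def f_calc_p2(hkl, f0, t):
    """F_calc(h) in P2 from F_c1^o and phase shifts only, with t = (X0, Z0)."""
    h, k, l = hkl
    x0, z0 = t
    shift = cmath.exp(-1j * TWO_PI * (h * x0 + l * z0))
    # The 2-fold along y maps (h, k, l) to (-h, k, -l); its phase shift is the conjugate.
    return f0[(h, k, l)] * shift + f0[(-h, k, -l)] * shift.conjugate()

def f_direct_p2(hkl, atoms, t):
    """The same quantity by explicit summation over both molecules (for checking)."""
    h, k, l = hkl
    x0, z0 = t
    total = 0j
    for f, (x, y, z) in atoms:
        xs, zs = x + x0, z + z0
        total += f * cmath.exp(-1j * TWO_PI * (h * xs + k * y + l * zs))    # (x, y, z)
        total += f * cmath.exp(-1j * TWO_PI * (-h * xs + k * y - l * zs))   # (-x, y, -z)
    return total

def r_factor(f_obs, f_calc):
    """Crystallographic R-factor; the scale k = sum(Fo)/sum(Fc) is an assumed simple choice."""
    k = sum(f_obs) / sum(f_calc)
    return sum(abs(o - k * c) for o, c in zip(f_obs, f_calc)) / sum(f_obs)

def correlation(f_obs, f_calc):
    """Linear correlation coefficient between observed and calculated amplitudes."""
    n = len(f_obs)
    mo, mc = sum(f_obs) / n, sum(f_calc) / n
    num = sum((o - mo) * (c - mc) for o, c in zip(f_obs, f_calc))
    den = math.sqrt(sum((o - mo) ** 2 for o in f_obs) * sum((c - mc) ** 2 for c in f_calc))
    return num / den

def translation_search(hkls, f0, f_obs, steps=10):
    """Scan X0 and Z0 over [0, 0.5) and return (R, CC, t) for the lowest R-factor."""
    results = []
    for i in range(steps):
        for j in range(steps):
            t = (0.5 * i / steps, 0.5 * j / steps)
            f_c = [abs(f_calc_p2(hkl, f0, t)) for hkl in hkls]
            results.append((r_factor(f_obs, f_c), correlation(f_obs, f_c), t))
    return min(results)

# Hypothetical model: (scattering factor, fractional coordinates).
atoms = [(6.0, (0.11, 0.20, 0.07)), (7.0, (0.23, 0.31, 0.15)),
         (8.0, (0.05, 0.42, 0.28)), (16.0, (0.31, 0.12, 0.19))]
hkls = [(h, k, l) for h in range(-3, 4) for k in range(-3, 4) for l in range(-3, 4)
        if (h, k, l) != (0, 0, 0)]
f0 = model_structure_factors(atoms, hkls)
t_true = (0.10, 0.35)
f_obs = [abs(f_direct_p2(hkl, atoms, t_true)) for hkl in hkls]
best_r, best_cc, best_t = translation_search(hkls, f0, f_obs)
print(best_r, best_cc, best_t)
```

The search range of 0 to 0.5 in $X_0$ and $Z_0$ reflects the fact that shifting the molecule by half a cell edge along x or z only moves it to an equivalent origin of P2. With real data, the R-factor at the correct position is of course far from zero.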

7.4.3 Molecular symmetry and self-rotation function

If the crystallographic asymmetric unit contains more than one molecule or subunit, those molecules or subunits are related to one another by a non-crystallographic symmetry, often called molecular symmetry or local symmetry. An example is the tetrameric molecule of fructose-1,6-bisphosphatase (FBPase). Its four subunits are related by three 2-fold axes, so that the molecule of FBPase has 222 symmetry (Fig. 7.3). The horizontal 2-fold axis {\bf r} and the 2-fold axis perpendicular to the paper plane are coaxial with the crystallographic 2-fold axes. The crystallographic asymmetric unit is a dimer of the C1 and C2 chains, which are related by the molecular 2-fold axis {\bf p}. Molecules or subunits in the crystallographic asymmetric unit can be related by 2-fold, 3-fold, 4-fold, 5-fold, 6-fold, 7-fold, and 11-fold axes. The 5-fold, 7-fold, and 11-fold axes occur only as molecular symmetry and never as crystallographic symmetry. When two molecules or subunits are related by a rotation of exactly 180$^o$, the symmetry is called a {\bf\em proper} molecular 2-fold axis, such as the axis {\bf p} of FBPase. When the rotation is close to but not exactly 180$^o$, for example a few degrees off, the 2-fold axis is called an {\bf\em improper} molecular 2-fold axis. Similar terms are used to define the other kinds of molecular symmetry.

Fig. 7.3. Schematic presentation of fructose-1,6-bisphosphatase. The tetrameric molecule of FBPase has four identical subunits, which are related by three 2-fold rotational axes marked by the letters {\bf r} and {\bf p} and by the oval dot at the center of the molecule.

Molecular symmetry can be identified by (1) the Patterson function, (2) heavy atom positions, and (3) the rotation function. Molecular symmetry in the atomic space generates a related symmetry in the Patterson map. A Patterson section that is perpendicular to a molecular symmetry axis may reveal the molecular symmetry.
However, this is limited to some special cases in practice because the location of the molecular symmetry is unknown at the beginning of the structure determination. When the molecular symmetry lies close to the crystallographic axes, it can be seen directly in the Patterson map. For example, the molecular 2-fold axes of hexagonal insulin can be clearly seen in the Patterson map (Fig. 7.4). Heavy atom positions can also be used to locate molecular symmetry.\footnote{Ke, H. M., Thorpe, C. M., Seaton, B. A., Marcus, F., and Lipscomb, W. N. (1990) {\em J. Mol. Biol.}, {\bf 212}, 513-539; {\bf 214}, 950.} However, it is somewhat complicated to find out how the heavy atom positions in a unit cell are grouped into asymmetric units.

The rotation function is the most powerful method to locate molecular symmetry. Similar to the cross-rotation function, the self-rotation function is defined as
\[ R = \int_U P_1({\bf u})\, P_2({\bf u}')\, d{\bf u} \]
where the first Patterson $P_1$ is calculated from the observed diffraction data, the second Patterson $P_2$ is the same Patterson function after rotation, ${\bf u}'$ is the rotated position of ${\bf u}$, and $U$ is the volume of integration. The self-rotation function is at a maximum when the molecules have the same orientation, that is, when they overlap one another.

Fig. 7.4. (a) The precession photo and (b) the different Patterson function of hexagonal insulin. The crystallographic axes are marked by a and b, while the molecular 2-fold axes are represented by arrows. (This figure is copied from ``Protein Crystallography'', ed. Blundell and Johnson, p. 455, 1976, Academic Press, London.)

8. Electron density interpretation and model building

Once the phases are obtained, an electron density map can be calculated using equation 25. The locations of electron density in the map indicate the positions of atoms, and the height of the density is proportional to the number of electrons of the atom. Theoretically, the electron density should be discontinuous, with a separate peak for each individual atom. However, this requires diffraction data at atomic resolution. The term resolution means the d value in Bragg's equation (eq. 31); a smaller d value represents a higher resolution. The highest attainable resolution is limited by the wavelength of the X-ray source. For example, the minimum d value is 0.77 \AA~ for the Cu K$\alpha$ radiation ($\lambda$ = 1.54 \AA), because $d = \lambda/(2\sin\theta)$ and $\sin\theta$ cannot exceed 1. In practice, however, the resolution is determined by the quality of a crystal and its diffraction ability. Crystals of small inorganic or organic molecules normally diffract to atomic resolution, that is, to d values of about 1 \AA~ or smaller. At that resolution, individual and discontinuous electron density can be clearly resolved for most kinds of atoms except hydrogen.

A resolution in the range of 2 to 3 \AA~ is typical for a macromolecular crystal. An atomic model can be built from a map in this resolution range, but the shape of the electron density looks different from that at atomic resolution. The essential feature of a macromolecular map is that the electron density appears continuous and connected (Fig. 8.1). The shape of the electron density depends on the resolution.
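The relation between wavelength, diffraction angle, and resolution can be checked numerically. The short sketch below assumes first-order diffraction ($n = 1$) in Bragg's equation; the function names `d_spacing` and `limiting_resolution` are illustrative only.

```python
import math

def d_spacing(wavelength, theta_deg):
    """Bragg's law with n = 1: d = lambda / (2 sin(theta)), theta in degrees."""
    return wavelength / (2.0 * math.sin(math.radians(theta_deg)))

def limiting_resolution(wavelength):
    """Smallest d spacing for a given wavelength, reached at theta = 90 degrees."""
    return wavelength / 2.0

cu_k_alpha = 1.54  # wavelength in Angstrom, the value quoted in the text

print(limiting_resolution(cu_k_alpha))   # 0.77
print(d_spacing(cu_k_alpha, 15.0))       # d spacing for a reflection at theta = 15 degrees
```

A reflection recorded at a larger Bragg angle $\theta$ thus corresponds to a smaller d value, that is, to a higher resolution.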

Fig. 8.1. The electron density map (Fo) at 3.5 \AA~ resolution. It was projected from six sections along the z axis (about one tenth of the unit cell). The map was calculated from the MIR phases of fructose-1,6-bisphosphatase and is contoured at 1 $\sigma$. The connectivity of the electron density is obvious. The piece of density at the middle bottom can be recognized as an $\alpha$-helix.

1. At 6 \AA~ resolution, the overall shape of a molecule can be recognized and the electron density of the macromolecule is clearly distinguished from the solvent region. At this resolution, the quaternary structure, a term representing the domain or subunit arrangement of a macromolecule, can be resolved. In addition, the secondary structure (Fig. 8.2), a term describing the $\alpha$-helices and $\beta$-sheets of a macromolecule, is sometimes visible.

Fig. 8.2. Ribbon presentation of the trimeric structure of chorismate mutase. The helices are drawn as columns and the $\beta$-strands as arrows.

2. At 4 \AA~ resolution, the secondary structure is clearly resolved, and many C$\alpha$ atoms are recognizable from the branched density of the amino acid side chains. At this resolution, only a few individual amino acids can be recognized from the shape of the electron density of their side chains.

3. At 3 to 2.5 \AA~ resolution, most C$\alpha$ atoms and side chains of residues are recognizable. Some of the carbonyl oxygens can also be recognized, especially in the $\alpha$-helix and $\beta$-strand regions. Therefore, the three-dimensional structure and molecular conformation, sometimes called the tertiary structure of a macromolecule, can be determined in this resolution range, although it is still easy to mistrace a structure.

4. At 2.5 to 2 \AA~ resolution, most carbonyl oxygen positions can be uniquely located so that the backbone conformation of individual amino acids ($\phi, \psi$ and $\omega$) can be determined. In addition, the conformations of the amino acid side chains are recognizable for many residues, and some bound water molecules or solvents are visible (Fig. 8.3).

Fig. 8.3. A stereo view of the ($F_o - F_c$) electron density for the model of fructose-6-phosphate (F6P) in the complex structure of fructose-1,6-bisphosphatase (protein) and F6P (ligand). The map was calculated at 2.1 \AA~ resolution with F6P omitted and was contoured at 5 $\sigma$. F6P has two conformations at equilibrium in solution, $\alpha$-D-F6P and $\beta$-D-F6P. The electron density suggests that the $\beta$-D-F6P conformer binds to the enzyme, but this needs to be confirmed at higher resolution.

5. At 2 \AA~ or higher resolution, most side chain conformations are resolvable (Fig. 8.4), and bound waters can be placed. However, noise peaks in the electron density map may sometimes be misinterpreted as bound waters, because the binding strength of water to a protein varies with the binding environment and also because of the errors in the amplitudes and phases. Therefore, the strongly bound waters are the most meaningful, while waters with a low occupancy and a high B-factor are questionable.

Model building of a macromolecule now uses a computer graphics system. Programs such as FRODO and O have been used effectively for this purpose. Model building requires knowledge of structural chemistry and skill in running the program. The best way to learn it is to actually build a model in an X-ray crystallographic laboratory.

Fig. 8.4. Stereo picture of the dipeptide Ala-Pro in the structure of cyclophilin complexed with Ala-Pro. The electron density comes from the $(2F_o - F_c)$ map at 1.65 \AA~ resolution and is contoured at 1.2 $\sigma$. The dipeptide in free solution has two conformations for the peptide bond: the $cis$ conformation with $\omega = 0^o$ or the $trans$ conformation with $\omega = 180^o$. The electron density at 1.65 \AA~ resolution uniquely revealed the $cis$ conformation of Ala-Pro bound to the protein cyclophilin. The torsion angles $\phi$ and $\psi$ can be clearly defined at this resolution.


9. Structure refinement

9.1. Standards to judge the correctness of a structure

A raw model built from an electron density map is not accurate and may sometimes even include a partially wrong structure. An iterative process of structure refinement and model revision is necessary to assure the accuracy of the structure and to correct misinterpretations of the electron density map. The question is how one knows whether the refined structure is correct or not.

The first standard is the crystallographic R-factor, which is defined as the residual between the observed (F$_o$) and calculated (F$_c$) structure amplitudes,
\[ R = \frac{\sum_{hkl} \left| |F_o| - k|F_c| \right|}{\sum_{hkl} |F_o|} \]
where k is a scale factor that brings the two amplitudes to the same level. A raw macromolecular model built from an MIR or other electron density map typically has an initial R-factor of around 40 to 55\%, while a well refined structure usually has an R-factor below 20\%. A low R-factor ($<$ 20\%) is an essential indicator of the correctness of the structure, but the R-factor is not an absolute standard. It has been reported that some partially wrong structures have an R-factor of less than 20\%. A more sophisticated use of the R-factor is the calculation of R$_{free}$, in which a fraction of the reflection data is omitted during structure refinement and R$_{free}$ is finally calculated using only the omitted reflections. Obviously, R$_{free}$ will be larger than, but should be close to, the R-factor (say R$_{free}=$0.25 when R=0.2). A difference between R$_{free}$ and the R-factor larger than 10\% often indicates that a portion of the structure is problematic.

The second standard is how far the molecular geometry of the model deviates from the ideal values. A good model should have root-mean-square (RMS) deviations of $<$ 0.02 \AA~ for the bond lengths and $< 5^o$ for the bond angles. The R-factor and the molecular geometry may compensate one another during refinement: sacrificing the molecular geometry may yield a better R-factor and vice versa.

The third standard is the resolution and the number of reflections. The higher the resolution and the more reflections are used in the structure determination, the more reliable the structure will be.
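The two residuals can be illustrated with a few lines of code. This is a minimal sketch on made-up amplitudes, not the procedure of any particular refinement program: the function names and the boolean list `free_flags`, which marks the reflections set aside from refinement, are hypothetical, and the scale factor k is simply passed in.

```python
def r_factor(f_obs, f_calc, k=1.0):
    """R = sum | |Fo| - k |Fc| | / sum |Fo| over the given reflections."""
    num = sum(abs(fo - k * fc) for fo, fc in zip(f_obs, f_calc))
    return num / sum(f_obs)

def r_work_and_free(f_obs, f_calc, free_flags, k=1.0):
    """R-factor of the refined (working) reflections and of the omitted (free) ones."""
    work_obs = [fo for fo, free in zip(f_obs, free_flags) if not free]
    work_calc = [fc for fc, free in zip(f_calc, free_flags) if not free]
    free_obs = [fo for fo, free in zip(f_obs, free_flags) if free]
    free_calc = [fc for fc, free in zip(f_calc, free_flags) if free]
    return r_factor(work_obs, work_calc, k), r_factor(free_obs, free_calc, k)

# Made-up amplitudes for four reflections; the last one is kept out of refinement.
f_obs = [10.0, 20.0, 30.0, 40.0]
f_calc = [11.0, 19.0, 33.0, 36.0]
free_flags = [False, False, False, True]

r_all = r_factor(f_obs, f_calc)
r_work, r_free = r_work_and_free(f_obs, f_calc, free_flags)
print(r_all, r_work, r_free)
```

In a real refinement the free set contains a fixed fraction of the reflections chosen before refinement starts, and only the working reflections enter the minimization.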

9.2. Least-square refinement

Structure refinement is time-consuming. The convergence of a refinement depends on the quality of the starting model and on personal experience and skill. Normally, one first refines the structure of a macromolecule at low resolution such as 3 \AA~ and gradually advances to higher resolution such as 2.5 \AA, 2 \AA, and so on. After a good geometry of the macromolecule is obtained, water molecules may be picked from the ($F_o - F_c$) map and gradually added to the refinement.

Least-square and molecular dynamics are two widely used methods to refine a macromolecular structure. These refinements use the structure amplitudes and are thus refinements in reciprocal space. Refinement in real space has also been shown to be useful for optimizing a macromolecular structure. Other methods such as the diffusion equation method\footnote{Kostrowicki, et al., 1991, {\em J. Phys. Chem.}, {\bf 95}, 4113; Kostrowicki and Scheraga, 1992, {\em J. Phys. Chem.}, {\bf 96}, 7442} are still in development. The following discussion covers only the basic theory of the reciprocal space refinement.

The least-square refinement minimizes the residuals between the observed and calculated structure amplitudes,
\[ Q = \sum_{hkl} w(hkl)\, (|F_{obs}| - kF_c)^2 \]
where $w(hkl)$ is the weight of a reflection and the summation is over all symmetry-independent reflections. The principle of the least-square method is to find the minimum $Q$ value by setting the first derivatives of $Q$ equal to zero,
\[ \frac{\partial Q}{\partial p_j} = 0 \]
where $p_j$ are the variables to be refined. A refinement of $n$ variables generates $n$ normal equations, which can be solved by linear algebra.

The following are the details of the least-square method. (If your interest is not in X-ray crystallography, skip to section 9.3. The following discussion may be too much mathematics for you.) The calculated amplitudes can be obtained from
\[ F_c(hkl) = \left| \sum_j f_j\, e^{-B_j (\frac{\sin \theta}{\lambda})^2} \exp[-2\pi i (hx_j + ky_j + lz_j)] \right| \]
where $f_j$ is the scattering factor of atom $j$, B is the temperature factor, the term $e^{-B (\frac{\sin \theta}{\lambda})^2}$ accounts for the thermal movement of an atom, and the summation is over the total number of atoms.
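However, the exponential form of the structure amplitudes does not fit the least-square method and has to be transformed into a linear form. This can be done by expanding the structure factor in a Taylor series. The Taylor series has the general form
\[ f(p_1, \ldots, p_n) = f(p_o) + \sum_{i=1}^{n} \frac{\partial f}{\partial p_i} \Delta p_i + \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\partial^2 f}{\partial p_i \partial p_j} \Delta p_i \Delta p_j + \cdots \]
where $(p_1,...p_n)$ are the $n$ variables to be determined, $p_o$ are their initial values before the refinement, at which the derivatives are evaluated, and $\Delta p = p_{new} - p_{old}$ is the shift. As a good approximation, the terms of second and higher order can be neglected. If the atomic position parameters (x,y,z) and the individual B-factors are refined, the calculated amplitude $F_c$ can be linearly expressed as
\[ F_c = F_c(p_o) + \sum_j \left( \frac{\partial F_c}{\partial x_j} \Delta x_j + \frac{\partial F_c}{\partial y_j} \Delta y_j + \frac{\partial F_c}{\partial z_j} \Delta z_j + \frac{\partial F_c}{\partial B_j} \Delta B_j \right) \]
where the summation is over the total number of atoms, or
\[ F_c = F_c(p_o) + \sum_{i=1}^{n} \frac{\partial F_c}{\partial p_i} \Delta p_i \]
where the number of variables $n$ equals the product of the total number of atoms and the number of parameters for each atom, such as (x,y,z) and the thermal parameter (B). In the above equation, the refined variables are the shifts $\Delta p_i$. Substituting this expression into eq. 83 and setting the first derivative to zero (the least-square condition) gives
\[ \frac{\partial Q}{\partial p_j} = -2 \sum_{hkl} w(hkl) \left( |F_{obs}| - kF_c(p_o) - k \sum_{i} \frac{\partial F_c}{\partial p_i} \Delta p_i \right) k \frac{\partial F_c}{\partial p_j} = 0 \]
Rearranging the equation,
\[ \sum_{hkl} w(hkl) (|F_{obs}| - kF_c) \frac{\partial F_c}{\partial p_j} - \sum_{i} \Delta p_i \sum_{hkl} k w(hkl) \frac{\partial F_c}{\partial p_i} \frac{\partial F_c}{\partial p_j} = 0 \]
where $F_c$ and its derivatives are evaluated at $p_o$, so that the first summation and the inner summation of the second term are constants, and the variables to be refined are the $\Delta p_i$. Let $b_j = \sum_{hkl} w(hkl) (|F_{obs}| - kF_c) \frac{\partial F_c} {\partial p_j}$, $a_{ij} = \sum_{hkl} kw(hkl) \frac{\partial F_c} {\partial p_i} \frac{\partial F_c} {\partial p_j}$, and $x_i = \Delta p_i$; the equation is then abbreviated as
\[ \sum_{i=1}^{n} a_{ij} x_i = b_j \qquad (j = 1, \ldots, n) \]
In more detail, the above equations are the normal equations
\[ \begin{array}{c} a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n = b_1 \\ a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n = b_2 \\ \vdots \\ a_{n1} x_1 + a_{n2} x_2 + \cdots + a_{nn} x_n = b_n \end{array} \]
or in a matrix form
\[ \left( \begin{array}{cccc} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{array} \right) \left( \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_n \end{array} \right) = \left( \begin{array}{c} b_1 \\ b_2 \\ \vdots \\ b_n \end{array} \right) \]
The matrix has $n$ x $n$ dimensions. The line from the upper left corner to the lower right corner of the matrix is called the {\em principal diagonal} and has the elements $a_{ii}$ lying on it. The matrix is symmetric, i.e. the elements $a_{ij} = a_{ji}$.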
In a more compact notation, a matrix {\bf A} acting on a vector {\bf x} gives the vector {\bf B}:
\[ {\bf A}{\bf x} = {\bf B} \]
This normal equation is solved by multiplying both sides of the equation from the left by the {\em inverse matrix} ${\bf A}^{-1}$. Note that ${\bf A}^{-1}{\bf A} = 1$, so that
\[ {\bf x} = {\bf A}^{-1}{\bf B} \]
These equations give a set of shifts. Applying the shifts to the initial values gives a new set of parameters. However, the shifts from the first solution of the equations may not be accurate because of the omission of the higher order terms in the Taylor series and the errors in the measured structure factors. To improve the accuracy, the least-square method is normally run as an iterative procedure in which the new parameters are fed back into the equations until the process converges.

Mathematically, the solution of the normal equations with $n$ variables requires at least $n$ independent observations. If the number of observations (reflections) is larger than the number of variables, the redundancy of the observations results in more accurate results. When the number of reflections is smaller than the number of variables, the $n$ normal equations are not completely independent and the problem is underdetermined.
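The iterative solution of the normal equations can be demonstrated on a problem small enough to solve by hand. The sketch below is not a crystallographic refinement: as a stand-in for the structure amplitude it assumes a hypothetical two-parameter model $y = a\, e^{-bx}$, noise-free data, and unit weights, and the names `gauss_newton_step` and `refine` are illustrative only. Each cycle builds the elements $a_{ij}$ and $b_j$ from the first derivatives, solves the 2 x 2 normal equations for the shifts, and applies them.

```python
import math

def model(x, a, b):
    """Hypothetical nonlinear model standing in for the calculated amplitude."""
    return a * math.exp(-b * x)

def gauss_newton_step(xs, ys, ws, a, b):
    """Build and solve the 2 x 2 normal equations for the shifts (da, db)."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for x, y, w in zip(xs, ys, ws):
        e = math.exp(-b * x)
        d_a = e               # derivative of the model with respect to a
        d_b = -a * x * e      # derivative of the model with respect to b
        resid = y - a * e     # observed minus calculated
        a11 += w * d_a * d_a
        a12 += w * d_a * d_b
        a22 += w * d_b * d_b
        b1 += w * resid * d_a
        b2 += w * resid * d_b
    det = a11 * a22 - a12 * a12          # the matrix is symmetric: a21 = a12
    da = (b1 * a22 - a12 * b2) / det     # Cramer's rule for the 2 x 2 system
    db = (a11 * b2 - a12 * b1) / det
    return da, db

def refine(xs, ys, ws, a, b, max_cycles=50, tol=1e-12):
    """Apply the shifts and repeat until they become negligible."""
    for _ in range(max_cycles):
        da, db = gauss_newton_step(xs, ys, ws, a, b)
        a, b = a + da, b + db
        if abs(da) < tol and abs(db) < tol:
            break
    return a, b

xs = [0.5 * i for i in range(9)]          # x = 0.0, 0.5, ..., 4.0
ys = [model(x, 2.0, 0.5) for x in xs]     # noise-free "observations"
ws = [1.0] * len(xs)                      # unit weights
a_fit, b_fit = refine(xs, ys, ws, a=1.5, b=0.3)
print(a_fit, b_fit)
```

Nine observations for two variables make the problem well overdetermined. For a macromolecule the same scheme involves thousands of variables, so the matrix is far too large for Cramer's rule and is usually approximated or solved iteratively.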

9.3. Stereochemically restrained least-square refinement.

In macromolecular structures, the basic units, amino acids or nucleotides, have a stereochemical geometry similar to that of their unpolymerized forms, whose stereochemistry can be determined very accurately from small-molecule structures. This ideal stereochemistry is used as restraints in the refinement of macromolecular structures to enhance the power of the least-square refinement. The function minimized in the stereochemically restrained refinement, as implemented in the program PROLSQ (Hendrickson, 1985, {\em Methods in Enzym.}, {\bf 115}, 252), is a weighted sum of the crystallographic term and the stereochemical terms, where the Ws are the weights for the different terms.

The first term is the X-ray term, while all the others are stereochemical restraint terms. The second term restrains the distances between atoms, including bond lengths, bond angles, and dihedral angles. The third term restrains the planarity of groups of atoms, such as the peptide amide bonds, the aromatic rings, and the guanidyl part of arginine. The fourth term restrains the configuration to the correct enantiomer at the chiral centers, namely the C$\alpha$ atoms and the C$_{\beta}$ atoms of threonine and isoleucine. This term uses a chiral volume, which is defined as the triple scalar product of the vectors from a central atom to three attached atoms. For example, the chiral volume of the C$\alpha$ atom is
\[ V_{C\alpha} = ({\bf r}_{N} - {\bf r}_{C\alpha}) \cdot [({\bf r}_{C} - {\bf r}_{C\alpha}) \times ({\bf r}_{C\beta} - {\bf r}_{C\alpha})] \]
The fifth term restrains the nonbonded distances or van der Waals contacts to prevent close approaches of nonbonded atoms. The last term is for the torsion angles.

Another program, TNT, uses a similar principle of stereochemically restrained refinement but employs the fast Fourier transformation algorithm, so that it is faster than PROLSQ.
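The chiral volume is easy to evaluate directly from coordinates. The following sketch implements the triple scalar product for arbitrary Cartesian coordinates; the function names and the order of the three attached atoms are illustrative choices, and the test coordinates are unit vectors rather than real atoms.

```python
def sub(a, b):
    """Vector a - b."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    """Scalar product."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    """Vector product."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def chiral_volume(center, atom1, atom2, atom3):
    """Triple scalar product of the vectors from the central atom to three attached atoms."""
    v1, v2, v3 = sub(atom1, center), sub(atom2, center), sub(atom3, center)
    return dot(v1, cross(v2, v3))

center = (0.0, 0.0, 0.0)
a1, a2, a3 = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)

print(chiral_volume(center, a1, a2, a3))   # 1.0
print(chiral_volume(center, a1, a3, a2))   # -1.0, two attached atoms exchanged
```

Exchanging any two of the attached atoms changes the sign of the volume, which is why this quantity distinguishes one enantiomer from its mirror image while the distance restraints alone cannot.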

9.4. Energy refinement

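The energy refinement uses a logic similar to that of the stereochemically restrained refinement. The function minimized is
\[ E_{total} = E + w_x Q \]
where $E$ is the potential energy, $Q$ is the X-ray residual of section 9.2, and $w_x$ is the weight for the relative contribution of the energy and X-ray terms. The potential energy terms for the stereochemical restraints are
\[ E = \sum k_b (b - b_o)^2 + \sum k_{\tau} (\tau - \tau_o)^2 + \sum k_{\xi} (\xi - \xi_o)^2 + \sum k_{\theta} [1 + \cos(m\theta + \delta)] \]
where the $k$s are force constants, the subscript o represents the ideal values, $b, \tau$, and $\xi$ are the distances between the involved atoms, $\theta$ is the rotation angle, $\delta$ is a phase angle determining the zero point of the rotation, and $m$ is the rotation frequency, e.g. 3 for the $C-C$ bond in ethane.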

Problems

1. Explain why the factorial method can reduce the number of tests to crystallize a protein. Please read the paper by Carter, Jr., C. W. & Carter, C. W, (1979), {J. Biol. Chem.} {254}, 12219-12223, and write your comments.

2. In the space group P222, there are three 2-fold rotation axes. However, only two 2-fold axes are independent and make a total of 4 equivalent positions. Prove the above statement by using the symmetry operations.

3. Draw the real and reciprocal unit cells of the hexagonal crystal system. Represent the equivalent positions of the space group $P3_221$ in matrix notation, and also write the matrices for the symmetries in the diffraction space. The equivalent positions are: (x,y,z; -y,x-y,2/3+z; y-x,-x,1/3+z; y,x,-z; -x,y-x,2/3-z; x-y,-y,1/3-z).

4. Calculate the number of symmetry-independent reflections at 2 \AA~ resolution for cyclophilin in the space group $P2_12_12_1$ with a = 43, b = 52.6, and c = 89.2 \AA.

5. In an image plate system, the detector has dimensions of 20 x 20 cm and a crystal was placed 10 cm away from the detector center, as shown in the following figure. What is the maximum resolution that can be collected with this setting when a wavelength of 1.54 \AA~ is used?

6. Make a table listing the Harker peaks for the space group $P3_221$. (The equivalent positions are (x,y,z), (y,x,-z), (-y,x-y,2/3+z), (-x,y-x,2/3-z), (y-x,-x,1/3+z), (x-y,-y,1/3-z).) How many Harker sections are there, and where are they?

7. The coordinates of crystal structures can be represented in an orthogonal system such as the Brookhaven Data Bank format (PDB format) or in a crystal system such as the Konnert-Hendrickson format. What is the rotational matrix for the transformation from an orthogonal system to a hexagonal crystal system?

8. A scaling factor k and a temperature factor B are often used to scale the heavy atom derivative data (Fpq) to the native data (Fp). Use the least-square method to obtain the best k and B.