
Copyright 2007 American Meteorological Society. Further electronic distribution is not allowed.

Time series of winter-averaged LSAT for all model ensemble members and observations (thick yellow line) over the Arctic land area (60°–90°N) are shown in Fig. 2. To be consistent among the models, and to avoid the impact of late-twentieth-century warming, all anomalies are calculated relative to the 1901–80 period mean in each realization. It is apparent that almost all model realizations are able to reproduce positive temperature anomalies in the last two decades. Some of the realizations have relatively large interannual to decadal variability, while the others are less variable. All three runs from the Flexible Global Ocean–Atmosphere–Land System Model gridpoint version 1.0 (FGOALS-g1.0; magenta thin lines with open triangles in Fig. 2) started from relatively warm states, contrary to simulations from other models and to observations. The sea ice simulation by this model apparently shows inappropriate initialization for simulating the climate of the twentieth century (Zhang and Walsh 2006). Another explanation is that the model is still in a nonequilibrium state (Y. Yu 2005, IPCC workshop, personal communication). Because of this, the results from FGOALS-g1.0 are excluded from the statistics and discussions in the next sections.

**FIG. 2. Time series of LSAT anomalies over the Arctic (60°–90°N)
based on 63 realizations from 20 models investigated in their 20C3M simulations.
The observed series based on CRUTS2.0 is shown by a thick orange line. The
anomalies are relative to the mean of 1901–80. All curves are smoothed
with 5-yr running mean, in units of °C.**
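The anomaly and smoothing steps described for Fig. 2 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the centered form of the 5-yr window, and the use of `None` for incomplete windows are all assumptions.

```python
def anomalies(years, values, base_start=1901, base_end=1980):
    """Subtract the base-period mean (inclusive years) from every value."""
    base = [v for y, v in zip(years, values) if base_start <= y <= base_end]
    if not base:
        raise ValueError("no data fall inside the base period")
    base_mean = sum(base) / len(base)
    return [v - base_mean for v in values]


def running_mean(values, window=5):
    """Centered running mean (assumed form); None where the window is incomplete."""
    if window % 2 == 0:
        raise ValueError("this sketch only handles odd window lengths")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        if i < half or i + half >= len(values):
            smoothed.append(None)  # not enough neighbours at the series ends
        else:
            smoothed.append(sum(values[i - half:i + half + 1]) / window)
    return smoothed


# Synthetic example: a flat series that steps up by 1 degree after 1980.
years = list(range(1901, 1991))
values = [1.0] * 80 + [2.0] * 10
anom = anomalies(years, values)   # 0.0 through 1980, 1.0 afterwards
smooth = running_mean(anom)       # 5-yr smoothing of the anomalies
```

Computing the base-period mean separately for each realization, as the text describes, removes each model's own climatological offset before the series are compared.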

Stott et al. (2000) demonstrate that global mean surface air temperature changes since 1979 have contributions from both natural and anthropogenic factors based on their Third Hadley Centre Coupled Ocean–Atmosphere General Circulation Model (HadCM3) simulations. Over the Arctic we see that the majority of the ensemble members show warm anomalies during the last two decades, which are comparable with the observed (Fig. 2) in their 20C3M simulations. Figure 3 (top) displays the averaged LSAT anomalies for the 1979–99 period for all ensemble members from 19 models. The period 1979–99 is chosen because nearly half of the 20C3M simulations (27 runs out of 63) ended in December 1999 (Table 1). Although there are differences among the models and each of their realizations, all ensemble members from all models show positive anomalies for the last two decades to various degrees. The smallest amplitudes are from ensemble members of three models: the Geophysical Fluid Dynamics Laboratory Climate Model version 2.1 (GFDL-CM2.1; bars 22–24), the GISS-EH (bars 27–31), and the GISS-ER (bars 32–40). In addition, there are four realizations [the ECHAM5/Max Planck Institute Ocean Model (MPI-OM): run 2, the GFDL-CM2.0: run 3, the ECHAM5 Hamburg Ocean Primitive Equation (ECHO-G): run 2, and the single run from the Met Office (UKMO-HadCM3)] in which amplitudes of averaged anomalies in these two decades are less than two-thirds of the observed value. On the other hand, there are 10 realizations in which the warm anomalies are one-third larger than observed. Among these, three are from the Community Climate System Model version 3 (CCSM3: bars 3, 5, and 8), two from the Meteorological Research Institute Coupled GCM version 2.3.2 (MRI-CGCM2.3.2: bars 53 and 54), and one realization each from the Commonwealth Scientific and Industrial Research Organisation Mark version 3.0 (CSIRO-Mk3.0: bar 14), and the Parallel Climate Model (PCM: bar 57).
The remaining three are the single realizations provided by the Canadian Centre for Climate Modelling and Analysis Coupled GCM version 3.1 (CCCma-CGCM3.1-T47: bar 10), the Centre National de Recherches Météorologiques Coupled Global Climate Model version 3 (CNRM-CM3: bar 12), and MIROC3.2(hires) (bar 43).

**FIG. 3. Mean winter arctic LSAT anomalies for the 1979–99 period
from observation (first bar in each panel, light shaded) and model simulations
in the 20C3M scenario. (top) Individual realizations of each model (bars
2–61), and (bottom) the ensemble mean for models when more than one
realization is provided, or the only realization available. The confidence
limits are two standard deviations derived from the detrended control run
time series. Due to an abrupt change in the GISS-EH control run, the confidence
limit is not shown. The last bar in the bottom panel shows the ensemble mean
from all runs of all models.**

The bottom panel in Fig. 3 shows the model ensemble means of the anomalies for the last two decades. Confidence limits are estimated as ± two standard deviations of the detrended time series from the corresponding control run (PIcntrl) of each model. That all but two ensemble mean anomalies are different from zero, with the lower bounds of the confidence limits being above the zero line, suggests that the warm anomalies in these two decades are beyond the range of natural variability. In other words, because differences caused by intrinsic variability essentially cancel each other out in the ensemble means, the late-twentieth-century warm anomalies could be a consequence of long-term change in external forcing. Nine models show anomalies at least as large as the observed amplitude [CCSM3, CGCM3.1-T47, CGCM3.1-T63, CNRM-CM3, CSIRO-Mk3.0, the Institute of Numerical Mathematics Coupled Model version 3.0 (INM-CM3.0), MIROC3.2(hires), MRI-CGCM2.3.2, and PCM], and another seven models show ensemble means that reach at least two-thirds of the observed: ECHAM5/MPI-OM, GFDL-CM2.0, the Goddard Institute for Space Studies Atmosphere–Ocean Model (GISS-AOM), L’Institut Pierre-Simon Laplace Coupled Model version 4 (IPSL-CM4), the Model for Interdisciplinary Research on Climate 3.2, medium-resolution version [MIROC3.2(medres)], ECHO-G, and UKMO-HadCM3. Ensemble means from the three models (GFDL-CM2.1, GISS-EH, and GISS-ER) that have small amplitudes of warm anomalies in every single realization are less than two-thirds of the observed. As a group, the multimodel mean of averaged winter Arctic LSAT anomalies is 0.62°C (rightmost bar in bottom panel in Fig. 3) for 1979–99, which is close to the observed value of 0.64°C (leftmost bar) from CRUTS2.0. This is encouraging. However, between-model differences are not small.
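The confidence-limit check used for the bottom panel of Fig. 3 can be illustrated with a short sketch. It assumes a least-squares linear detrend and the sample standard deviation; the text does not say which estimator or detrending method was used, and the function names and the synthetic control series are hypothetical.

```python
import statistics


def detrend(values):
    """Remove the least-squares linear trend from an evenly spaced series."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    slope = sxy / sxx
    return [v - v_mean - slope * (t - t_mean) for t, v in enumerate(values)]


def two_sigma_limits(mean_anomaly, control_series):
    """Return (lower, upper) = anomaly -/+ 2 standard deviations of the detrended control run."""
    sigma = statistics.stdev(detrend(control_series))  # sample std dev (assumption)
    return mean_anomaly - 2 * sigma, mean_anomaly + 2 * sigma


def exceeds_natural_variability(mean_anomaly, control_series):
    """True when the lower confidence bound lies above the zero line."""
    lower, _ = two_sigma_limits(mean_anomaly, control_series)
    return lower > 0


# Synthetic control run: a drift of 0.1 per year plus an alternating +/-0.2 signal.
control = [0.1 * t + (0.2 if t % 2 else -0.2) for t in range(100)]
strong = exceeds_natural_variability(0.62, control)  # anomaly well above 2 sigma
weak = exceeds_natural_variability(0.30, control)    # anomaly inside 2 sigma
```

Detrending first keeps any drift in the control run from inflating the variability estimate.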

As discussed in section 2, there were prolonged warm anomalies of more than
0.7°C in the Arctic in the midcentury, from the late 1920s to the 1940s.
The decadal means of individual realizations from the IPCC models for 1939–49
display large variability in magnitude and sign (Fig. 4): among 60 realizations,
30 of the decadal mean SAT anomalies are positive, 21 are negative, and another
9 are near zero. None of the decadal mean anomalies from any model is greater
than the observed value. In contrast to the warm anomalies simulated by models
for the last two decades (Fig. 3), the large discrepancies with observations
for the 1940s among the models *and among their ensemble members* indicate the
potential for large internal variability within the models. It is interesting
to note that, when multiple realizations are provided for a single model, at
least one realization has a decadal mean anomaly of opposite sign to the other
ensemble members (except for CSIRO-Mk3.0, which generated negative anomalies
for all three realizations in this decade).

**FIG. 4. Decadal mean winter LSAT anomalies for the 1939–49 period
based on individual realizations from each model over the region of 60°–90°N.
The first bar on the left is the observed mean (CRUTS2.0). Units are in °C.**

Based on Fig. 4, our hypothesis is that intrinsic natural variability is the main cause behind the large anomalies in the early/midpart of the century. Thus, the models should not necessarily replicate the year-to-year changes in the observations, but should produce events with the same type of multiyear behavior as the observations. Separate panels in Fig. 5 display the model-simulated and observed (thick black solid line) winter LSAT over the Arctic from the late nineteenth century through the twentieth century. Each panel shows the ensemble members from one model, and all time series are presented with a 5-yr running mean. All realizations are different. For example, four of eight runs from CCSM3 (top-left panel of Fig. 5) have relatively sizable amplitudes of anomalies during the midcentury (red, blue, yellow, and black lines), with run 1 (thin red line) matching the observations in both amplitude and timing. One realization from GFDL-CM2.0 (blue line) matches the timing and amplitude of the observed time series, while the other two (red and green lines) have an amplitude similar to the warm anomaly in the 1950s. The amplitudes of anomalies from GFDL-CM2.1 are slightly weaker than the observed, but one realization has a long duration (red line). The two Canadian models CGCM3.1-T47 and CGCM3.1-T63 present similar results: both show twin-peak warm anomalies in the midcentury with amplitude weaker than observations. Two realizations from CSIRO-Mk3.0 (red and blue lines) produce warm anomalies in the 1930s, which last about 10 years. ECHAM5/MPI-OM has one realization (red line) in which the warm anomalies are close to those observed around the 1940s, while other realizations show weaker amplitudes at later times. Similar situations are seen in MIROC3.2(medres). The warm anomalies from all PCM and ECHO-G realizations show comparable amplitude, but are not synchronized with observations. This is also true for the single realizations from CNRM-CM3 (late), INM-CM3.0 (late), and UKMO-HadCM3 (early).
Two GISS models (GISS-EH and GISS-ER) have little variability through their entire runs, even at the end of the twentieth century. Another two models (GISS-AOM and IPSL-CM4) also have a rather flat curve for the first 100 years until the end of the twentieth century. Large warm anomalies are simulated by the high-resolution model developed by Japan [MIROC3.2(hires)] at the end of the twentieth century, but warm anomalies in the midcentury are weak. In many cases the warm anomalies simulated by models have comparable amplitude to the observed midcentury warm events, but with a shorter duration.

**FIG. 5. Winter LSAT anomalies over the Arctic for individual realizations
of each model. Thick solid black line is the observed time series based on
CRUTS2.0. (left), (middle) the models with natural forcing included, (right)
the models without natural forcing in their simulations. All the time series
are smoothed with 5-yr running mean. Units are in °C.**

To provide a quantified estimate of model performance, we assess the models’ ability to reproduce a midcentury-type warm anomaly by applying the following criterion, labeled 2/3CRU. A 5-yr running window is applied to the simulated winter LSAT time series. A decadal mean is calculated around the maximum value found in the models during any portion of a 50-yr period (1911–60). If the decadal averaged anomaly equals or exceeds two-thirds (2/3) of the observed decadal mean, that is, 0.36°C, then it is considered to be a comparable simulation. Although the 2/3CRU criterion is an arbitrary selection, it does provide a quantitative measure. Because we are interested in decadal and longer phenomena, a 2/3CRU criterion of continuous positive temperature anomalies for 10 years is a minimum requirement for examining warm events. The decadal mean of the SAT anomalies for each realization is shown in Fig. 6. Compared with Fig. 3, where 21 realizations are found to be at least as large as the observed at the end of the twentieth century, only 3 realizations (one each from three models: CCSM3, ECHO-G, and PCM) produced warm anomalies larger than the observed in the midcentury. Another 14 realizations from 8 models produced warm anomaly amplitudes within two-thirds of the observed value: CCSM3, CSIRO-Mk3.0, ECHAM5/MPI-OM, GFDL-CM2.0, GFDL-CM2.1, INM-CM3.0, ECHO-G, and PCM. Over 60% of the realizations (37 out of 60) do not produce midcentury warm anomalies greater than half of the observed CRUTS2.0 value (0.27°C). One run [run 2 from MIROC3.2(medres), bar 45] missed the 2/3CRU cutoff line by a small fraction. A summary of the success rate of the twentieth-century simulations (20C3M) under this criterion is provided in Table 2 (third column).

**FIG. 6. LSAT anomalies averaged over a decade that is centered on the peak
value detected during the 1910–60 period in the 20C3M simulation. The
thin black line indicates a value that is two-thirds of the observed amplitude.
The first gray bar is based on CRUTS2.0. Units are in °C.**
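The 2/3CRU criterion can be sketched in a few lines. The text leaves several details open, so the following are assumptions made only for illustration: the 5-yr window is centered, the decadal mean is taken from the unsmoothed anomalies over the 10 years from five years before the peak to four years after it, and ties for the maximum go to the earliest year. The function names are hypothetical; only the 0.36°C threshold and the 1911–60 search period come from the text.

```python
THRESHOLD_C = 0.36  # two-thirds of the observed decadal mean, as quoted in the text


def running_mean_at(values, i, window=5):
    """Centered running mean at index i, or None if the window is incomplete."""
    half = window // 2
    if i - half < 0 or i + half >= len(values):
        return None
    return sum(values[i - half:i + half + 1]) / window


def peak_decadal_mean(years, anoms, search=(1911, 1960)):
    """Locate the smoothed maximum in the search period; return (peak_year, decadal_mean)."""
    best_i, best_val = None, None
    for i, year in enumerate(years):
        if not (search[0] <= year <= search[1]):
            continue
        smoothed = running_mean_at(anoms, i)
        if smoothed is not None and (best_val is None or smoothed > best_val):
            best_i, best_val = i, smoothed
    if best_i is None:
        raise ValueError("no smoothed values inside the search period")
    lo, hi = best_i - 5, best_i + 5  # peak-5 .. peak+4 (assumed centering)
    if lo < 0 or hi > len(anoms):
        raise ValueError("series too short around the peak")
    return years[best_i], sum(anoms[lo:hi]) / 10


def passes_two_thirds_cru(years, anoms, threshold=THRESHOLD_C):
    """True when the decadal mean around the peak reaches the threshold."""
    _, decadal_mean = peak_decadal_mean(years, anoms)
    return decadal_mean >= threshold


# Synthetic realization: a triangular warm event peaking at 1.0 degC in 1940.
years = list(range(1901, 2000))
warm = [max(0.0, 1.0 - 0.1 * abs(y - 1940)) for y in years]
peak_year, decadal = peak_decadal_mean(years, warm)
```

Because the search looks for the maximum anywhere in 1911–60, a realization can pass even if its warm event is not synchronized with the observed one, which matches the text's emphasis on event type over timing.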

**Table 2.**

A second test is to compare the variance of the control runs with the variance
from observations. While there is almost no “error” in estimating
the control run variance because of the length of the runs, one can consider estimated
confidence limits of the standard deviation from observations. The standard
deviation of CRUTS2.0 on an interannual time scale is computed from the detrended
time series for 1902–59. The decadal and multidecadal scales are represented
by the detrended time series with a 5-yr and 15-yr running mean. A simple test
is whether the model standard deviations are less than the value of the observed
standard deviation minus the 90% confidence interval based on χ² estimates.
The ratios of the model/observed standard deviations on time scales from interannual
to multidecadal are shown in Fig. 7. The effective sample size is estimated
based on a formula by Santer et al. (2000). As a result, the 90% normalized
confidence limits are (0.83, 1.27), (0.58, 4.42), and (0.51, 15.95) for the
three time scales. Nine models [ECHAM5/MPI-OM, GFDL-CM2.0, GFDL-CM2.1, GISS-AOM,
GISS-ER, IPSL-CM4, MIROC3.2(hires), MIROC3.2(medres), and MRI-CGCM2.3.2] lie
outside the range of the observed variability on decadal to interdecadal time
scales (Figs. 7b and 7c). The GISS-EH model is excluded due to a large abrupt
change in the time series of its control run. An autocorrelation analysis further
revealed that there is no preferred time scale in any of the model control runs.

**FIG. 7. The ratio of standard deviation of model control runs to the observed
(CRUTS2.0) on (a) interannual, (b) decadal, and (c) interdecadal time scales.
GISS-EH is excluded from the figure due to a large abrupt change found in
its control run. All standard deviations are calculated after the time series
is detrended and a (b) 5-yr and (c) 15-yr running mean applied, respectively.
The dashed line indicates the lower range of the 90% confidence limit on
the standard deviation normalized by CRUTS2.0.**
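The variance test can be sketched as below. Three things here are assumptions rather than the authors' procedure. The effective sample size uses the lag-1 autocorrelation form n(1 − r1)/(1 + r1) commonly attributed to Santer et al. (2000). The χ² quantiles come from the Wilson–Hilferty approximation, swapped in to keep the example free of third-party libraries; it becomes inaccurate for very few degrees of freedom. The degrees of freedom are passed in directly, since the text does not state them.

```python
import math
from statistics import NormalDist


def effective_sample_size(n, r1):
    """Effective sample size from the lag-1 autocorrelation r1 (assumed formula)."""
    return n * (1 - r1) / (1 + r1)


def chi2_quantile_wh(p, dof):
    """Wilson-Hilferty approximation to the chi-square quantile (not exact)."""
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * dof)
    return dof * (1 - a + z * math.sqrt(a)) ** 3


def normalized_sd_limits(dof, level=0.90):
    """Confidence limits on the standard deviation, divided by the sample value."""
    tail = (1 - level) / 2
    lower = math.sqrt(dof / chi2_quantile_wh(1 - tail, dof))
    upper = math.sqrt(dof / chi2_quantile_wh(tail, dof))
    return lower, upper


def passes_variance_test(model_sd, observed_sd, dof, level=0.90):
    """True when the model/observed ratio is not below the lower confidence limit."""
    lower, _ = normalized_sd_limits(dof, level)
    return model_sd / observed_sd >= lower


# Illustrative value only: 30 degrees of freedom is not stated in the text.
limits_30 = normalized_sd_limits(30)
```

With the illustrative choice of 30 degrees of freedom, the sketch gives limits of roughly (0.83, 1.27), the same as the interannual limits quoted above. The wider limits quoted for the 5-yr and 15-yr smoothed series are consistent with far fewer effective degrees of freedom, where an exact χ² quantile function would be preferable to this approximation.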

Model standard deviations from their control runs are listed in Table 2 (last three columns), with values within the confidence limit range of the observations shown in bold. Five models (CCSM3, CSIRO-Mk3.0, INM-CM3.0, ECHO-G, and PCM) passed both the variance test in control runs and the 2/3CRU criterion in 20C3M simulations (highlighted in yellow). Another four models (CGCM3.1-T47, CGCM3.1-T63, CNRM-CM3, and UKMO-HadCM3) also passed the 90% confidence limit in their control run, indicating that these models may have enough intrinsic variability from the interannual to interdecadal time scale, yet they fail the 2/3CRU criterion (highlighted in blue) in their single realization of the 20C3M simulations. It is therefore important to have multiple ensemble runs to evaluate a model’s performance. Three models (ECHAM5/MPI-OM, GFDL-CM2.0, and GFDL-CM2.1) show quite reasonable amplitude and duration of the midcentury warm events but do not have enough variance in their control runs based on the variance test. The reason behind this is unclear.

The warm event criterion (2/3CRU) was also applied to the control runs based on 100-yr segments. As the length of the control runs of each model ranges from 100 to 500 years, the number of truncated time series differs among the models. The “yes” in column 4 of Table 2 indicates that at least one of the truncated control run time series passes the 2/3CRU criterion. All of the nine models that passed the variance test for decadal and interdecadal time scales also passed the 2/3CRU criterion. The 2/3CRU criterion has good correspondence between the control runs and the 20C3M simulations, with the exception of the single-run simulations. The MIROC3.2(medres) model fails to reproduce the midcentury warm events in all three 20C3M simulations, even though it passed the 2/3CRU criterion in its control runs. However, this model shows enough variance only on the interannual time scale, not on longer time scales. Two more models (ECHAM5/MPI-OM and GFDL-CM2.0) passed the 2/3CRU criterion in the control run without passing the variance test on any time scale.
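Applying a warm-event criterion to 100-yr segments of a control run can be sketched as follows. The segmentation is assumed to be non-overlapping with any short remainder dropped; the text does not say how the runs were truncated. `criterion` is a hypothetical callable standing in for the 2/3CRU check, and the stand-in used in the example is deliberately simplistic.

```python
def split_segments(series, length=100):
    """Non-overlapping segments of `length` values; a shorter tail is dropped."""
    return [series[i:i + length]
            for i in range(0, len(series) - length + 1, length)]


def any_segment_passes(series, criterion, length=100):
    """True if at least one segment satisfies `criterion` (a 'yes' in the table sense)."""
    return any(criterion(segment) for segment in split_segments(series, length))


# Stand-in criterion for illustration only: the segment maximum reaches 0.36.
def simple_criterion(segment):
    return max(segment) >= 0.36


# Synthetic 250-yr control run with a single warm year in the second century.
control = [0.0] * 250
control[120] = 1.0
segments = split_segments(control)
result = any_segment_passes(control, simple_criterion)
```

Under this assumption a 500-yr control run yields five segments and a 100-yr run only one, so longer runs get more chances to pass.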

In summary, seven models neither have enough variance nor produce anomalies of a magnitude comparable to the midcentury warm event. These are GISS-AOM, GISS-EH, GISS-ER, IPSL-CM4, MIROC3.2(hires), MIROC3.2(medres), and MRI-CGCM2.3.2. The FGOALS-g1.0 model has an unrealistic initial condition in its 20C3M simulations and also has a large abrupt change in its control run, and is therefore excluded.

Rather than calculating across-model means when assessing projections for future climate, one should concentrate on those models that simulate reasonable results in the past. Based on the present study, we suggest a subgroup of 12 models, for further review, that passed either the criterion based on their 20C3M simulations or the variance test in their control runs. Five models are of special note, passing both criteria: CCSM3, CSIRO-Mk3.0, INM-CM3.0, ECHO-G, and PCM. Seven other models have, at best, limited applicability for projections of change relative to natural variability: CGCM3.1-T47, CGCM3.1-T63, CNRM-CM3, UKMO-HadCM3 (passed the standard deviation test), ECHAM5/MPI-OM, and GFDL-CM2.0 and GFDL-CM2.1 (passed the 2/3CRU test). Figure 8 (top panel) shows the time series from sixteen 20C3M realizations from eight models that replicated a reasonable magnitude compared to the observed midcentury warm event. Almost all the realizations that replicate the midcentury warm anomaly amplitudes at random timing also produce a reasonable magnitude of warm anomalies at the end of the twentieth century. The bottom panel in Fig. 8 shows the truncated 100-yr time series from control runs from the nine models that passed the variance test. The maximum anomaly of these control runs is lined up at year 1937 with the CRUTS2.0 analysis. The bottom panel of Fig. 8 shows that the midcentury warm anomalies in the models can be reproduced under no-external-forcing conditions, whereas the late-twentieth-century warming cannot.

**FIG. 8. (top) Winter LSAT anomalies averaged over the Arctic based on model
ensemble runs that passed the proposed 2/3CRU criterion in their 20C3M simulations.
(bottom) The truncated 100-yr-long time series from control runs of the nine
models that pass the variance test. All time series are smoothed with a 5-yr
running mean. Models with natural forcing are shown in solid lines, while
those without are shown in dashed lines. All models show upward warming trend
in the Arctic for the last two decades in 20C3M scenario, while none is shown
in the control runs. The range of variation during 1911–60 is about
the same in the 20C3M simulations as well as in the control runs.**

Because external forcing (either natural or anthropogenic) is not imposed on the control runs, we consider variability in the bottom panel of Fig. 8 to be representative of intrinsic climate variability, including internal feedback processes from atmosphere–sea ice–ocean interactions. Although more than half of the twentieth-century simulations have natural forcing, that is, solar and volcanic aerosols (as shown in the last column of Table 1), comparison of the magnitude of SAT anomalies in the Arctic for twentieth-century simulations before 1980 with those of the control runs, and the random timing of midcentury warm events in the 20C3M simulations, support the conclusion that intrinsic variability is a first-order effect in arctic climate. The similarity of midcentury events in the 20C3M simulations compared to the control runs and the qualitatively different behavior of the time series at the end of the twentieth century are evidence that the midcentury Arctic warming event in the observational data was due to different causes from those of the late-twentieth-century warming.
