Alphabet Soup: GHCND

As scientists, we tend to get carried away with our acronyms. Sometimes they are clever (such as the Verification of the Origins of Rotation in Tornadoes Experiment, or VORTEX), sometimes there’s an acronym within an acronym (TOVS is the TIROS Operational Vertical Sounder, and TIROS in turn stands for Television Infrared Observation Satellite), and others seem confusing at first (CoCoRaHS, or the Community Collaborative Rain, Hail and Snow Network). Well, have no fear: I’m here to break down some of the acronyms used to describe weather and climate data sets, in a series I’m calling Alphabet Soup.

The very first data set we will cover is one of the most popular products that NOAA’s NCEI produces (that’s the National Oceanic and Atmospheric Administration’s National Centers for Environmental Information… say that five times fast). It’s known as GHCN, or the Global Historical Climatology Network. It’s a multi-network, multi-element dataset that spans the entire world. GHCN began as a monthly product back in 1992. Today, the monthly product is in its third version (soon to be fourth) and is one of the datasets that feed NCEI’s monthly monitoring reports.

As we entered the 21st century, however, demand grew for digitized daily data, especially since observers have been writing down daily reports since the 1800s. Thanks to faster computing power and a few expert data scientists, a daily version of the product was produced, and it is now one of the most popular datasets, used across many different sectors. GHCN-Daily comprises over 106,000 stations around the world. The major weather variables are temperature (daily maximum, minimum, and average), precipitation, and snowfall (including snow depth). Most stations report precipitation (104,000), with snowfall second (63,000) and temperature third (35,000).

Location of temperature stations in the GHCN-Daily dataset as of September 7th, 2018. This image was provided by the International Surface Temperature Initiative.

The above map shows the location of temperature stations around the globe. Most of the stations in GHCN-Daily are in the United States (58,000, of which about 19,000 report temperature). Other areas with good spatial coverage are Europe and parts of Australia. However, even in the 21st century, there are areas of the world, such as South America and Africa, where we are lacking digitized data. Many organizations are trying to remedy this, including the International Surface Temperature Initiative (ISTI), the Atmospheric Circulation Reconstructions over the Earth (ACRE), and the International Environmental Data Rescue Organization (IEDRO)... see, I told you we love our acronyms.

Just because the United States has the most stations does not mean it has the longest records. Only a couple of stations in the US have more than 200 years of data (black dots above). Many more exist on the European continent. In fact, the longest record in GHCN-D belongs to Milan, Italy, which has data as far back as 1763, 13 years before the Declaration of Independence was signed.

So far I have only mentioned the three major variables (snowfall, temperature, and precipitation); however, GHCN-Daily has a whopping 137 variables. Many of them are related, such as the water equivalent of snowfall and multi-day precipitation totals, but some are unique, such as wind and soil moisture. One reason there are so many variables is that many datasets go into GHCN-D (29 in total). Some of them are listed below (stay tuned for upcoming Alphabet Soup posts on these):

  • COOP (Cooperative Observer Program)

  • ASOS (Automated Surface Observing System)

  • CoCoRaHS (Community Collaborative Rain, Hail and Snow Network)

  • USCRN (United States Climate Reference Network)

With so many different datasets and elements, it’s important to have some checks in place to catch bad data. GHCN-D has 14 different quality control checks, ranging from simple (daily minimum temperature cannot exceed the maximum) to complex (a spatial consistency check using neighbor comparisons and z-scores). In the end, data that fail a check get a flag placed next to the value, but the value itself is not removed. This is where a data user has to be very careful: anyone who does not know this can end up with flagged values in their analysis.
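To make the "flag it, don’t delete it" idea concrete, here is a minimal Python sketch. It is not NCEI’s actual quality control code. It assumes each observation carries a one-character quality flag that is blank when the value passed every check; the `Obs` record, the sample temperatures, and the flag letter "I" are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Obs:
    """One hypothetical daily temperature observation (degrees Celsius)."""
    day: int
    tmax: float
    tmin: float
    qflag: str = " "  # blank means "passed every check" in this sketch


def flag_inconsistent(observations):
    """Simple check: the daily minimum cannot exceed the daily maximum.

    A failing observation is flagged, not removed, mirroring the behavior
    described in the text.
    """
    for obs in observations:
        if obs.tmin > obs.tmax:
            obs.qflag = "I"  # illustrative flag letter, an assumption
    return observations


def usable(observations):
    """Return only the observations whose quality flag is blank."""
    return [obs for obs in observations if obs.qflag.strip() == ""]


# Made-up values; day 2 has a minimum above its maximum.
observations = [
    Obs(day=1, tmax=29.4, tmin=18.3),
    Obs(day=2, tmax=17.0, tmin=21.5),
    Obs(day=3, tmax=30.1, tmin=19.0),
]

flag_inconsistent(observations)
clean = usable(observations)

print(len(observations), "observations stored,", len(clean), "usable")
```

The flagged day is still sitting in `observations`, which is exactly the trap: an analysis that skips the `usable` step would quietly include it.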

Speaking of which, if one were adventurous and grabbed the data from the NCEI website, one would stumble on the following:

[Image: excerpt of raw GHCN-Daily data]

Looks a little confusing, right? Almost like the computer code in the Matrix trilogy. Well, a little digging into the dataset’s readme file provides the key to deciphering it. I can tell you that you are looking at data from Asheville Regional Airport in North Carolina for July and August 2018 (also, the morning low temperature on July 4th was 68 degrees Fahrenheit, but I digress). This is why it’s very important to understand what one is looking at before analyzing the data. Little nuances, such as missing data, flagged data, and the source each value came from, are important to know, and one cannot just load this data into Microsoft Excel and plot it (in fact, it will most likely crash the program). With a little practice, one can take the precipitation line above (the second line, labeled PRCP) and turn it into this:
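For the curious, here is a rough Python sketch of that deciphering step. It assumes the fixed-width layout described in the GHCN-Daily readme as I understand it: an 11-character station ID, a 4-digit year, a 2-digit month, a 4-character element code, then 31 eight-character groups (a 5-character value followed by measurement, quality, and source flags), with -9999 marking a missing day and PRCP stored in tenths of a millimeter. Check the readme before relying on these offsets. The station ID and every value in the sample line are synthetic, not real Asheville data, and the function names are my own.

```python
MISSING = -9999  # sentinel assumed to mean "no observation for this day"


def parse_dly_line(line):
    """Split one fixed-width record into its header and 31 daily entries."""
    record = {
        "station": line[0:11],
        "year": int(line[11:15]),
        "month": int(line[15:17]),
        "element": line[17:21],
        "days": [],
    }
    for day in range(1, 32):
        start = 21 + (day - 1) * 8
        field = line[start:start + 8].ljust(8)  # pad if the line was trimmed
        raw_value = field[0:5].strip()
        record["days"].append({
            "day": day,
            "value": int(raw_value) if raw_value else MISSING,
            "mflag": field[5],  # measurement flag
            "qflag": field[6],  # quality flag, blank if no check failed
            "sflag": field[7],  # source flag
        })
    return record


def precip_mm(record):
    """Keep days that are neither missing nor flagged; convert to millimeters."""
    return {
        d["day"]: d["value"] / 10
        for d in record["days"]
        if d["value"] != MISSING and d["qflag"] == " "
    }


def _field(value, mflag=" ", qflag=" ", sflag=" "):
    """Build one 8-character value-plus-flags group for the synthetic sample."""
    return f"{value:5d}{mflag}{qflag}{sflag}"


# Synthetic July 2018 PRCP record: three reported days, the third flagged.
fields = [
    _field(0, sflag="W"),
    _field(127, sflag="W"),
    _field(51, qflag="X", sflag="W"),
] + [_field(MISSING)] * 28
sample_line = "XXX00000001" + "2018" + "07" + "PRCP" + "".join(fields)

record = parse_dly_line(sample_line)
rain = precip_mm(record)
print(record["station"], record["year"], record["month"], record["element"])
print(rain)
```

Only days 1 and 2 survive in `rain`: day 3 carries a quality flag and the remaining days are missing. From there, a plot of daily rainfall is one charting call away.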


More info on GHCN-D can be found on the NCEI site. If you want to dive right into the data and have some fun, check it out here.