Random forests to save human lives
Flash flood prediction using machine learning has proven effective in the U.S. and Europe; we're now bringing it to East Africa.
Flash floods are the second-deadliest extreme weather event in the world (after heat), but many of these deaths could be avoided with better prediction. Unfortunately, flood prediction is notoriously hard: the small space-time window of flash floods means that forecasts from atmospheric models must place precipitation very accurately in space and time and capture its extreme magnitude. But those most vulnerable live in the poorest countries, where forecasters must rely on global models with grid cells covering between 100 and 800 km². Extreme localized rainfall (flash floods are commonly defined to occur within six hours of causative rainfall and over basins smaller than 1,000 km²) tends to be averaged out in these models. While the global models do a poor job of capturing localized maxima, they are more than capable of identifying environments supportive of extreme rainfall. Case studies of past flash flood rainfall events have found these environments to contain high values of precipitable water and slow storm motions.
We started out building better forecast tools for the U.S. National Weather Service, but quickly learned these same tools can be applied to other parts of the world as well. The machine learning model we developed can be used to help forecasters in the U.S. better identify when and where costly flash flood events are likely to occur. Meanwhile, we are starting to work with NGOs and governments to enable them to use this same model to help mitigate and respond to flash flood disasters elsewhere in the world.
Manually identified characteristics are wonderful, but it would be better if we could extract and identify these environments automatically from large numbers of flash flood cases, ensuring that the full range of contributing parameters and values is captured. Machine learning, particularly with random forests, is an ideal candidate for automatically classifying environments that may produce flash floods. We used scikit-learn for this work because it enables rapid experimentation while remaining fast enough for quasi-operational products.
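To give a sense of scale, here is a minimal sketch of what this looks like in scikit-learn. The feature matrix `X` and labels `y` are synthetic stand-ins for the GFS-derived predictors and flash flood labels described below, and the settings are illustrative, not our production configuration.

```python
# Minimal sketch of a random forest classifier in scikit-learn.
# X and y are synthetic stand-ins: one row of atmospheric predictors
# per case, with y = 1 where a flash flood was reported and 0 otherwise.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 144))            # 144 predictors per case (see below)
y = (rng.random(10_000) < 0.003).astype(int)  # flood cases are rare

forest = RandomForestClassifier(
    n_estimators=300,  # number of trees "voting" on each case
    n_jobs=-1,         # use all available cores
    random_state=42,
)
forest.fit(X, y)

# Roughly the fraction of trees voting "flood" for each case
# (more on how we use this below).
vote_fraction = forest.predict_proba(X)[:, 1]
```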
Supervised machine learning requires a labeled training data set. In this example, we use human-verified reports of flash flood impacts in the United States as the target we seek to predict. These impact reports come from the “Storm Data” publication issued by the U.S. National Weather Service (we work from a processed version of the original data for this project). We would like this system to eventually forecast flash flood environments anywhere in the world; however, the need for high-quality training data limits where we can build and validate the initial model. We have worked with partners at the Red Cross to identify flash flood data sets they have collected over other parts of the developing world. These data sets will be used in the future, once they have been compiled and quality controlled.
One of the nice things about a random forest is its ability to chew through large amounts of data in relatively short periods of time, so our biggest worry here is having enough past cases to generate interesting and relevant statistics. In other words, we would rather have too much data than too little data. For that reason, we obtained archived Global Forecast System (GFS, a weather model run by the U.S. government) forecasts dating back to March 2004; this starting date is set by how far back in time the U.S. government’s archive of these forecasts extends. Our labeled flash flood cases (the “Storm Data” reports) go back to October 2006.
So starting in October 2006, we associate each of these “Storm Data” reports—in space and time—with vectors of atmospheric and land surface information from the U.S.’s Global Forecast System weather model. (You can think of these vectors of data as “forecasts” of the state of the atmosphere and the land surface at a particular time and place.) Each vector contains information about the location, structure, and magnitude of interesting atmospheric quantities, including moisture, winds, and many other quantities. To help the random forest better distinguish between flash flood and non-flash flood environments, we can introduce non-linear relationships between some of these relatively simple atmospheric quantities; these non-linear relationships are drawn from forecaster experiences and scientific case studies collected in the literature over the last half-century or so. The GFS model is run every six hours, every day of the year. Putting all this together (covering the lower 48 states on a 1° × 1° grid for about a 10-year period) yields about 12 million “cases,” one every six hours for every grid cell, and each of these “cases” has 144 predictor variables associated with it.
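A derived, non-linear predictor can be as simple as a ratio of two basic fields. The sketch below is a hypothetical illustration in that spirit (the actual combinations in our model come from forecaster experience and the literature): it ties together the two ingredients named earlier, precipitable water and storm motion, into a single "lots of moisture moving slowly" number.

```python
# Hypothetical example of a derived, non-linear predictor; the exact
# formulas we use come from the literature, not from this sketch.
import numpy as np

def moisture_residence_proxy(precipitable_water_mm, storm_speed_ms):
    """Ratio of column moisture to storm motion: large values suggest
    heavy rain lingering over one basin. Units and the speed floor
    are illustrative only."""
    # Guard against division by (near-)zero storm motion.
    speed = np.maximum(storm_speed_ms, 0.5)
    return precipitable_water_mm / speed

pwat = np.array([55.0, 30.0, 60.0])   # precipitable water, mm
speed = np.array([3.0, 12.0, 0.2])    # storm motion, m/s
print(moisture_residence_proxy(pwat, speed))  # third case stands out
```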
When we run the random forest, we split these cases into testing and training data sets. The testing data are all the cases that occurred on the 5th, 10th, 15th, 20th, 25th, or 30th day of any month. This draws cases from all seasons and all regions, and is a nice, easy way to set aside about 20% of the total cases for testing. The remaining cases are used to train the random forest. Because flash floods are rare, we need to “rebalance” the training data: there are about 340 non-flood cases for every flood case, so without rebalancing, the random forest could simply predict “no flood” every time and be right 340 times out of 341. We therefore randomly undersample the dominant class (the non-flood cases) until there is one flash flood case for every non-flash-flood case in the resulting training set. Then we let the random forest loose and test the resulting model on the original, un-resampled testing cases.
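The split-and-rebalance step looks roughly like the sketch below, assuming the cases live in a pandas DataFrame; the table here is synthetic, and the column names (`valid_time`, `flood`, `pwat`) are stand-ins for our real ones.

```python
# Sketch: split cases by day of month, then undersample the majority class.
import numpy as np
import pandas as pd

# Synthetic stand-in for the cases table: one row per (grid cell, time).
rng = np.random.default_rng(0)
n = 50_000
cases = pd.DataFrame({
    "valid_time": pd.date_range("2006-10-01", periods=n, freq="6h"),
    "flood": (rng.random(n) < 1 / 341).astype(int),  # ~1 flood per 340 non-floods
    "pwat": rng.normal(30, 10, n),                   # stand-in predictor
})

# Testing data: every case on the 5th, 10th, ..., 30th of any month (~20%).
TEST_DAYS = {5, 10, 15, 20, 25, 30}
is_test = cases["valid_time"].dt.day.isin(TEST_DAYS)
train, test = cases[~is_test], cases[is_test]

# Rebalance training data only: keep every flood case and randomly sample
# an equal number of non-flood cases (roughly 1 in 340 survives).
floods = train[train["flood"] == 1]
non_floods = train[train["flood"] == 0].sample(n=len(floods), random_state=0)
train_balanced = pd.concat([floods, non_floods]).sample(frac=1, random_state=0)

# The test set keeps its natural class balance for honest evaluation.
```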
Because the random forest is an ensemble of individual decision-makers (in this case, classification trees), we can use the fraction of trees voting “yes, there was a flood” as a proxy for the confidence of the entire forest in a particular prediction. When 299 of 300 trees vote “yes,” we are more confident in the prediction than when 151 of 300 trees vote “yes.” Doing some curve-fitting and calibration exercises, we can convert these fractions into probabilities.
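One reason this calibration step matters is that the undersampling above distorts the class prior, so the raw vote fractions overstate the true flood frequency. A minimal, self-contained sketch of one reasonable way to do the curve fitting (isotonic regression on a held-out set; we are not claiming this is the exact method we used, and all data below are synthetic):

```python
# Sketch: map vote fractions to calibrated probabilities via isotonic
# regression fit on held-out cases. Data and model are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(5_000, 10))
y = (X[:, 0] + rng.normal(size=5_000) > 2).astype(int)  # rare positive class

# Fit the forest on one chunk, hold out the rest for calibration.
forest = RandomForestClassifier(n_estimators=300, random_state=1)
forest.fit(X[:4_000], y[:4_000])

# Roughly the fraction of trees voting "yes" on the held-out cases.
votes = forest.predict_proba(X[4_000:])[:, 1]

# Fit a monotonic map from vote fraction to observed event frequency.
calibrator = IsotonicRegression(out_of_bounds="clip").fit(votes, y[4_000:])
prob = calibrator.predict(votes)  # calibrated probabilities
```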
The results are promising. In the U.S., the statistics over our whole time period of interest indicate that random-forest forecasts of flash flood environments from the GFS are skillful. Even better, we showed that these forecasts are useful for large-scale synoptically forced events (the kind that show up well in relatively coarse global weather models) five or even seven days out. Most intriguingly, we showed that a random forest model trained on U.S. storm reports and used to forecast flash floods in Europe (verified against an analogous database of European flash floods from the European Severe Storms Laboratory) has the same skill as a random forest trained on European flash floods and verified against U.S. reports. In other words, we can extend this methodology globally (at least in the mid-latitudes) to regions in which flash flood report databases are hard to come by or simply do not exist.
We are headed to the Regional Center for Mapping of Resources for Development (RCMRD) in Nairobi, Kenya, this weekend as part of a NASA SERVIR applied science team. While in Nairobi, we will begin to collect data that can be used to validate this machine learning model over the East Africa region. The staff at RCMRD will also receive introductory training on how the system works and be able to provide feedback on any concerns they may have over the method. The goal at the end of this three-year project is to have a system in place at RCMRD that can be used by Kenya and neighboring countries to prevent loss of life due to flash floods.