Alignment and Preprocessing
Once the data is made available as detailed in Data Ingestion, the next step is to ensure the data has been appropriately reshaped and aligned across data sources for consumption by the machine learning pipeline (or other analysis or simulation steps).
We'll be aggregating data across several years, using small versions of the Landsat images used in the Walker Lake example. See that topic for more work on calculating the difference between the water levels over time.
import intake
import numpy as np
import xarray as xr
import holoviews as hv
import cartopy.crs as ccrs
import geoviews as gv
import hvplot.xarray
hv.extension('bokeh', width=80)
Recap: Loading data
We'll use intake to read in the small versions of the landsat_5 and landsat_8 data.
cat = intake.open_catalog('../catalog.yml')
landsat_5_da = cat.landsat_5_small.read_chunked()
landsat_8_da = cat.landsat_8_small.read_chunked()
landsat_8_da
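Before going further, it can be helpful to confirm what the catalog provides and what we actually loaded. This quick check is not part of the original workflow; it just iterates over the catalog's entry names and inspects the arrays defined above.
print(list(cat))                              # names of the catalog entries
print(landsat_5_da.dims, landsat_5_da.shape)  # dimensions and size of each array
print(landsat_8_da.dims, landsat_8_da.shape)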
Do computations
We can do regular computations with these xarray.DataArrays. For instance, we can calculate the NDVI (vegetation index) for each of these image sets. Note that in Landsat 5 the Red and NIR bands are stored in bands 3 and 4 respectively, whereas in Landsat 8 the Red and NIR are bands 4 and 5.
NDVI_1988 = (landsat_5_da.sel(band=4) - landsat_5_da.sel(band=3)) / (landsat_5_da.sel(band=4) + landsat_5_da.sel(band=3))
NDVI_1988
NDVI_2017 = (landsat_8_da.sel(band=5) - landsat_8_da.sel(band=4)) / (landsat_8_da.sel(band=5) + landsat_8_da.sel(band=4))
NDVI_2017
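Because the band numbering differs between the two sensors, it can be convenient to wrap the calculation in a small helper. This is a hypothetical convenience function, not part of the original notebook; it reproduces the two cells above.
def ndvi(da, red_band, nir_band):
    """NDVI = (NIR - Red) / (NIR + Red) for a DataArray with a 'band' dimension."""
    red = da.sel(band=red_band)
    nir = da.sel(band=nir_band)
    return (nir - red) / (nir + red)

# Equivalent to the computations above:
# ndvi(landsat_5_da, red_band=3, nir_band=4)
# ndvi(landsat_8_da, red_band=4, nir_band=5)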
We can calculate the difference between these two years by subtracting one from the other.
diff = NDVI_2017 - NDVI_1988
diff.hvplot.image(x='x', y='y', width=400, cmap='coolwarm', clim=(-1,1))
Notice how pixelated that image looks. What is going on here? To figure it out, let's take a look at the shape of diff.
diff.shape
That's a lot smaller than our NDVI for each year. What is happening is that when we compute the difference, we only get values where both years have a value at the same grid cell. Since the two grids have different resolutions and offsets, their coordinates only coincide occasionally. What we'd rather do is interpolate both years onto the same grid and then do our computations on that.
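One way to see this directly is with xr.align, which with join='inner' keeps only the coordinates shared by both arrays; this inner join is what xarray's arithmetic does under the hood. A quick diagnostic sketch:
# Keep only the coordinates present in both years; the result matches diff.shape.
aligned_1988, aligned_2017 = xr.align(NDVI_1988, NDVI_2017, join='inner')
print(aligned_1988.shape)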
Combine data from overlapping grids
These two sets of Landsat bands cover roughly the same area but were taken in 1988 and 2017. We have modified them to have different resolutions, different numbers of grid cells per image, and different x and y offsets. We can see that by printing the first x value for each year and observing that they are not equivalent.
print(NDVI_1988.x[0].values)
print(NDVI_2017.x[0].values)
We can also do a quick check of resolution by subtracting the first x value from the second x value for each year.
print((NDVI_1988.x[1] - NDVI_1988.x[0]).values)
print((NDVI_2017.x[1] - NDVI_2017.x[0]).values)
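To see all of these differences at once, a small helper (hypothetical, not part of the original notebook) can summarize each grid's shape, cell size, and origin:
def grid_summary(da, name):
    """Print shape, x/y cell size, and origin of a gridded DataArray."""
    dx = float(da.x[1] - da.x[0])
    dy = float(da.y[1] - da.y[0])
    print(f"{name}: shape={da.shape}, dx={dx:.1f}, dy={dy:.1f}, "
          f"x0={float(da.x[0]):.1f}, y0={float(da.y[0]):.1f}")

grid_summary(NDVI_1988, 'NDVI_1988')
grid_summary(NDVI_2017, 'NDVI_2017')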
Select region of interest
We'll define a central point in lat, lon, convert that point to the coordinate reference system (CRS) of the data, and then use the area around the central point as the Region of Interest (ROI).
The first step is getting the CRS. This information is stored in the attributes of our original landsat data. Let's take a look at it:
landsat_8_da.crs
So what we have is a proj4 string. We can convert that to a cartopy.crs object using the geoviews.util.proj_to_cartopy function.
crs = gv.util.proj_to_cartopy(landsat_8_da.crs)
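As an optional sanity check, the result is a regular cartopy CRS, so we can inspect its proj4 parameters and confirm they match the original string:
print(type(crs))        # a cartopy.crs class
print(crs.proj4_params) # dictionary of proj4 parameters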
Now with that cartopy.crs object, we can transform our lat, lon point - which is in cartopy.crs.PlateCarree() - to our data's crs.
x_center, y_center = crs.transform_point(-118.7081, 38.6942, ccrs.PlateCarree())
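To make sure the transformation behaved as expected, we can transform the projected point back to lon, lat and check that we approximately recover the original coordinates (a sketch, not in the original notebook):
# Round-trip check: project back to lon/lat; should be close to (-118.7081, 38.6942).
lon_back, lat_back = ccrs.PlateCarree().transform_point(x_center, y_center, crs)
print(lon_back, lat_back)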
Now we just need to define the area that we are interested in around this point. In this case we'll use a 30 km box around the center point (a 15 km buffer on each side).
buffer = 1.5e4
xmin = x_center - buffer
xmax = x_center + buffer
ymin = y_center - buffer
ymax = y_center + buffer
bounding_box = [[[xmin, ymin], [xmin, ymax], [xmax, ymax], [xmax, ymin]]]
Let's just check that the bounding box captures the lake:
gv.tile_sources.EsriImagery * gv.Polygons(bounding_box, crs=crs).options(alpha=0.4)
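If we only wanted to crop the original arrays to this ROI (rather than regrid them, as we do next), label-based slicing would suffice. A sketch: Landsat rasters often store y in descending order, so the y slice runs from ymax down to ymin; swap the bounds if your data has ascending y.
# Optional crop of one year to the ROI (not used in the regridding below).
NDVI_2017_roi = NDVI_2017.sel(x=slice(xmin, xmax), y=slice(ymax, ymin))
NDVI_2017_roi.shape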
Regrid
We can use this region to define a new grid onto which we will interpolate our data.
res = 200
x = np.arange(xmin, xmax, res)
y = np.arange(ymin, ymax, res)
We will use linear interpolation to calculate the values for each grid cell. This operation is not yet supported for dask arrays, so we'll first load in the data.
NDVI_1988.load()
NDVI_2017.load()
Now we can use the new coordinates defined above to interpolate from the input coordinates to the new grid. The interpolation options are nearest and linear, with linear being the default.
NDVI_2017_regridded = NDVI_2017.interp(x=x, y=y)
NDVI_1988_regridded = NDVI_1988.interp(x=x, y=y)
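If nearest-neighbour interpolation were preferable (for example for categorical data), the method can be passed explicitly. A sketch, not used in the rest of this notebook:
# Same regridding, but with nearest-neighbour instead of the default linear method.
NDVI_2017_nearest = NDVI_2017.interp(x=x, y=y, method='nearest')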
Combining the data
Now that we have our data on the same grid, we can combine our two years into one xarray object. We will treat the years as names and create an xarray.Dataset - a group of named xarray.DataArrays that share some of the same coordinates.
ds_regridded = xr.Dataset({'NDVI_1988': NDVI_1988_regridded, 'NDVI_2017': NDVI_2017_regridded})
ds_regridded
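An alternative layout (a sketch, not used below) is to stack the two years along a new 'year' dimension rather than storing them as separate variables; this can be convenient when looping or reducing over years.
# Concatenate along a new 'year' dimension and label it with the two years.
ndvi_by_year = xr.concat(
    [NDVI_1988_regridded, NDVI_2017_regridded], dim='year'
).assign_coords(year=[1988, 2017])
ndvi_by_year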
Visualizing output
We can plot the arrays side by side:
ds_regridded.NDVI_1988.hvplot.image(x='x', y='y', crs=crs, width=350, clim=(-3, 1)).relabel('1988') +\
ds_regridded.NDVI_2017.hvplot.image(x='x', y='y', crs=crs, width=350, clim=(-3, 1)).relabel('2017')
Or we can calculate and plot the difference between the two years:
diff_regridded = ds_regridded['NDVI_2017'] - ds_regridded['NDVI_1988']
diff_regridded
diff_regridded.hvplot.image(x='x', y='y', crs=crs, width=400, cmap='coolwarm', clim=(-1,1))
Side-note: Resampling
We can down-sample an xarray object by grouping the values into bins based on the desired resolution and taking the mean of each of those bins.
res_1000 = 1e3
x_1000 = np.arange(xmin, xmax, res_1000)
y_1000 = np.arange(ymin, ymax, res_1000)
We'll use the left edge as the label for now.
diff_res_1000 = (diff_regridded
.groupby_bins('x', x_1000, labels=x_1000[:-1]).mean(dim='x')
.groupby_bins('y', y_1000, labels=y_1000[:-1]).mean(dim='y')
.rename(x_bins='x', y_bins='y')
)
diff_res_1000
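For grids like this one, where the coarser resolution is an integer multiple of the finer one (1000 m / 200 m = 5 cells per block), xarray's coarsen offers a more concise alternative to groupby_bins. A sketch; note that coarsen labels each block with the mean of its coordinates rather than the left edge used above.
# Block-aggregate 5x5 groups of 200 m cells into 1 km cells, trimming any remainder.
diff_coarse = diff_regridded.coarsen(x=5, y=5, boundary='trim').mean()
diff_coarse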
# Exercise: Try to aggregate using a method other than mean.
See the GeoViews docs for more information on resampling grids.
Next:
Now that the data have been aligned and preprocessed appropriately, we can use them in subsequent steps of the workflow. Next we'll learn about Machine Learning.