⚠️ IMPORTANT NOTICE ⚠️
This site is ARCHIVED and will NO LONGER BE UPDATED.
For updated Tutorial material, please visit the Pythia landsat-ml cookbook.
For Topic Examples, head over to the HoloViz Examples website.

Alignment and Preprocessing

Right click to download this notebook from GitHub.


Once the data is made available as detailed in Data Ingestion, the next step is to ensure the data has been appropriately reshaped and aligned across data sources for consumption by the machine learning pipeline (or other analysis or simulation steps).

We'll be aggregating data across several years, using small versions of the Landsat images used in the Walker Lake example. See that topic for more work on calculating the difference between the water levels over time.

In [1]:
import intake
import numpy as np
import xarray as xr

import holoviews as hv

import cartopy.crs as ccrs
import geoviews as gv

import hvplot.xarray

hv.extension('bokeh', width=80)

Recap: Loading data

We'll use intake to read in the small versions of the landsat_5 and landsat_8 data.

In [2]:
cat = intake.open_catalog('../catalog.yml')
landsat_5_da = cat.landsat_5_small.read_chunked()
landsat_8_da = cat.landsat_8_small.read_chunked()
In [3]:
landsat_8_da
Out[3]:
<xarray.DataArray (band: 7, y: 286, x: 286)>
dask.array<shape=(7, 286, 286), dtype=float64, chunksize=(1, 50, 50)>
Coordinates:
  * y        (y) float64 4.321e+06 4.321e+06 4.32e+06 ... 4.261e+06 4.261e+06
  * x        (x) float64 3.183e+05 3.185e+05 3.187e+05 ... 3.779e+05 3.782e+05
  * band     (band) int64 1 2 3 4 5 6 7
Attributes:
    transform:   (210.0, 0.0, 318195.0, 0.0, -210.0, 4321005.0)
    crs:         +init=epsg:32611
    res:         (210.0, 210.0)
    is_tiled:    0
    nodatavals:  (nan,)

Do computations

We can do regular computations with these xarray.DataArrays. For instance, we can calculate the NDVI (vegetation index) for each of these image sets. Note that in Landsat 5 the red and NIR bands are stored in bands 3 and 4 respectively, whereas in Landsat 8 they are bands 4 and 5.

In [4]:
NDVI_1988 = (landsat_5_da.sel(band=4) - landsat_5_da.sel(band=3)) / (landsat_5_da.sel(band=4) + landsat_5_da.sel(band=3))
NDVI_1988
Out[4]:
<xarray.DataArray (y: 300, x: 300)>
dask.array<shape=(300, 300), dtype=float64, chunksize=(50, 50)>
Coordinates:
  * y        (y) float64 4.309e+06 4.309e+06 4.309e+06 ... 4.264e+06 4.264e+06
  * x        (x) float64 3.324e+05 3.326e+05 3.327e+05 ... 3.771e+05 3.772e+05
In [5]:
NDVI_2017 = (landsat_8_da.sel(band=5) - landsat_8_da.sel(band=4)) / (landsat_8_da.sel(band=5) + landsat_8_da.sel(band=4))
NDVI_2017
Out[5]:
<xarray.DataArray (y: 286, x: 286)>
dask.array<shape=(286, 286), dtype=float64, chunksize=(50, 50)>
Coordinates:
  * y        (y) float64 4.321e+06 4.321e+06 4.32e+06 ... 4.261e+06 4.261e+06
  * x        (x) float64 3.183e+05 3.185e+05 3.187e+05 ... 3.779e+05 3.782e+05

We can calculate the difference between these two years by subtracting one from the other.

In [6]:
diff = NDVI_2017 - NDVI_1988
diff.hvplot.image(x='x', y='y', width=400, cmap='coolwarm', clim=(-1,1))
Out[6]:

Notice how pixelated that image looks. What is going on here? To figure it out, let's take a look at the shape of diff.

In [7]:
diff.shape
Out[7]:
(42, 43)

That's a lot smaller than our NDVI for each year. What is happening is that when we compute the difference, we only get values where both years have a value at exactly the same coordinates. Since the two grids have different resolutions and offsets, that only happens occasionally. What we'd rather do is interpolate both datasets to a common grid and then do our computations on that.
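This label-based alignment behavior can be seen on a toy pair of 1-D grids (the 150 m and 210 m spacings mirror the two Landsat grids, but the arrays here are made-up placeholders, not the NDVI data):

```python
import numpy as np
import xarray as xr

# Arithmetic between xarray objects aligns on coordinate labels, so only
# x values present on *both* grids survive. With 150 m and 210 m spacing,
# that is every 1050 m (their least common multiple).
a = xr.DataArray(np.ones(15), coords={'x': np.arange(0, 2101, 150.)}, dims='x')
b = xr.DataArray(np.ones(11), coords={'x': np.arange(0, 2101, 210.)}, dims='x')

diff = a - b
print(diff.x.values)  # only 3 of the original 15 (and 11) cells remain
```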

Combine data from overlapping grids

These two sets of Landsat bands cover roughly the same area but were taken in 1988 and 2017. We have modified them to have different resolutions, different numbers of grid cells per image, and different x and y offsets. We can confirm this by printing the first x value for each year and noting that they differ.

In [8]:
print(NDVI_1988.x[0].values)
print(NDVI_2017.x[0].values)
332400.0
318300.0

We can also do a quick check of resolution by subtracting the first x value from the second for each year.

In [9]:
print((NDVI_1988.x[1] - NDVI_1988.x[0]).values)
print((NDVI_2017.x[1] - NDVI_2017.x[0]).values)
150.0
210.0
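Comparing only the first two values can be misleading if a grid is irregular; a fuller check looks at every step. This is a sketch, with `coords` standing in for `NDVI_1988.x.values`:

```python
import numpy as np

# Stand-in for NDVI_1988.x.values: 300 cells at 150 m spacing,
# starting from the first x value printed above.
coords = np.arange(332400.0, 332400.0 + 150.0 * 300, 150.0)

steps = np.diff(coords)                 # spacing between consecutive cells
assert np.allclose(steps, steps[0])     # uniform grid: every step is equal
print(steps[0])                         # the common resolution
```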

Select region of interest

We'll define a central point in lat, lon, convert that point to the coordinate reference system (CRS) of the data, and then use the area around it as the Region of Interest (ROI).

The first step is getting the CRS. This information is stored in the attributes of our original landsat data. Let's take a look at it:

In [10]:
landsat_8_da.crs
Out[10]:
'+init=epsg:32611'

So what we have is a PROJ.4 string. We can convert that to a cartopy.crs object using the geoviews.util.proj_to_cartopy function.

In [11]:
crs = gv.util.proj_to_cartopy(landsat_8_da.crs)

Now, with that cartopy.crs object, we can transform our lat, lon - which are in cartopy.crs.PlateCarree() - to our data's CRS.

In [12]:
x_center, y_center = crs.transform_point(-118.7081, 38.6942, ccrs.PlateCarree())

Now we just need to define the area of interest around this point. In this case, we'll use a 30 km box centered on it.

In [13]:
buffer = 1.5e4

xmin = x_center - buffer
xmax = x_center + buffer
ymin = y_center - buffer
ymax = y_center + buffer

bounding_box = [[[xmin, ymin], [xmin, ymax], [xmax, ymax], [xmax, ymin]]]

Let's just check that the bounding box captures the lake:

In [14]:
gv.tile_sources.EsriImagery * gv.Polygons(bounding_box, crs=crs).options(alpha=0.4)
Out[14]:

Regrid

We can use this region to define a new grid onto which we will interpolate our data.

In [15]:
res = 200
x = np.arange(xmin, xmax, res)
y = np.arange(ymin, ymax, res)

We will use linear interpolation to calculate the values for each grid cell. This operation is not yet supported for dask arrays, so we'll first load in the data.

In [16]:
NDVI_1988.load()
NDVI_2017.load()
Out[16]:
<xarray.DataArray (y: 286, x: 286)>
array([[0.499094, 0.154549, 0.151772, ..., 0.108119, 0.100703, 0.084654],
       [0.212411, 0.161451, 0.165759, ..., 0.092169, 0.073178, 0.092607],
       [0.326255, 0.430742, 0.170662, ..., 0.086179, 0.071393, 0.098268],
       ...,
       [0.171988, 0.171437, 0.220445, ..., 0.154233, 0.130262, 0.127033],
       [0.153172, 0.155023, 0.150425, ..., 0.12414 , 0.088485, 0.088379],
       [0.181222, 0.170052, 0.175138, ..., 0.150632, 0.124844, 0.08195 ]])
Coordinates:
  * y        (y) float64 4.321e+06 4.321e+06 4.32e+06 ... 4.261e+06 4.261e+06
  * x        (x) float64 3.183e+05 3.185e+05 3.187e+05 ... 3.779e+05 3.782e+05

Now we can use the new coordinates defined above to interpolate from the input grid to the new one. The available methods are nearest and linear, with linear as the default.

In [17]:
NDVI_2017_regridded = NDVI_2017.interp(x=x, y=y)
NDVI_1988_regridded = NDVI_1988.interp(x=x, y=y)
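To make the method choice explicit, interp accepts a method keyword. A minimal sketch on made-up data (not the NDVI arrays) contrasts the two; nearest-neighbour can be preferable for categorical data, where linear interpolation would invent intermediate values:

```python
import numpy as np
import xarray as xr

# Toy 4x4 grid whose values vary linearly with the coordinates.
da = xr.DataArray(
    np.arange(16, dtype=float).reshape(4, 4),
    coords={'y': [0., 10., 20., 30.], 'x': [0., 10., 20., 30.]},
    dims=('y', 'x'),
)

linear = da.interp(x=[4., 16.], y=[4., 16.])                    # default: method='linear'
nearest = da.interp(x=[4., 16.], y=[4., 16.], method='nearest')  # snaps to closest cell
```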

Combining the data

Now that we have our data on the same grid we can combine our two years into one xarray object. We will treat the years as names and create an xarray.Dataset - a group of named xarray.DataArrays that share some of the same coordinates.

In [18]:
ds_regridded = xr.Dataset({'NDVI_1988': NDVI_1988_regridded, 'NDVI_2017': NDVI_2017_regridded})
ds_regridded
Out[18]:
<xarray.Dataset>
Dimensions:    (x: 150, y: 150)
Coordinates:
  * x          (x) float64 3.365e+05 3.367e+05 3.369e+05 ... 3.661e+05 3.663e+05
  * y          (y) float64 4.269e+06 4.269e+06 4.27e+06 ... 4.299e+06 4.299e+06
Data variables:
    NDVI_1988  (y, x) float64 0.08224 0.1283 0.1236 ... 0.0652 0.07704 0.0721
    NDVI_2017  (y, x) float64 0.1551 0.1523 0.1462 ... 0.0836 0.08673 0.1005

Visualizing output

We can plot the arrays side by side:

In [19]:
ds_regridded.NDVI_1988.hvplot.image(x='x', y='y', crs=crs, width=350, clim=(-3, 1)).relabel('1988') +\
ds_regridded.NDVI_2017.hvplot.image(x='x', y='y', crs=crs, width=350, clim=(-3, 1)).relabel('2017')
Out[19]:

Or we can calculate and plot the difference between the two years:

In [20]:
diff_regridded = ds_regridded['NDVI_2017'] - ds_regridded['NDVI_1988']
diff_regridded
Out[20]:
<xarray.DataArray (y: 150, x: 150)>
array([[ 0.072858,  0.023945,  0.022558, ...,  0.022373,  0.026544,  0.018706],
       [ 0.093092,  0.113596,  0.162406, ...,  0.018458,  0.017184,  0.016239],
       [ 0.103948,  0.151742,  0.143821, ...,  0.018667,  0.013534,  0.0148  ],
       ...,
       [-0.105646,  0.01861 ,  0.066806, ...,  0.011637,  0.018419,  0.009683],
       [ 0.017486, -0.060214,  0.03494 , ...,  0.037579,  0.021413,  0.034056],
       [ 0.013113,  0.084052, -0.023846, ...,  0.018395,  0.009683,  0.028359]])
Coordinates:
  * x        (x) float64 3.365e+05 3.367e+05 3.369e+05 ... 3.661e+05 3.663e+05
  * y        (y) float64 4.269e+06 4.269e+06 4.27e+06 ... 4.299e+06 4.299e+06
In [21]:
diff_regridded.hvplot.image(x='x', y='y', crs=crs, width=400, cmap='coolwarm', clim=(-1,1))
Out[21]:

Side-note: Resampling

We can down-sample an xarray object by grouping the values into bins based on the desired resolution and taking the mean of each bin.

In [22]:
res_1000 = 1e3
x_1000 = np.arange(xmin, xmax, res_1000)
y_1000 = np.arange(ymin, ymax, res_1000)

We'll use the left edge as the label for now.

In [23]:
diff_res_1000 = (diff_regridded
    .groupby_bins('x', x_1000, labels=x_1000[:-1]).mean(dim='x')
    .groupby_bins('y', y_1000, labels=y_1000[:-1]).mean(dim='y')
    .rename(x_bins='x', y_bins='y')
)
diff_res_1000
Out[23]:
<xarray.DataArray (y: 29, x: 29)>
array([[0.068691, 0.131777, 0.154695, ..., 0.021748, 0.018803, 0.013594],
       [0.08347 , 0.087767, 0.135485, ..., 0.020225, 0.018137, 0.015641],
       [0.093605, 0.092893, 0.1206  , ..., 0.022006, 0.016734, 0.007764],
       ...,
       [0.056683, 0.076643, 0.088305, ..., 0.029144, 0.029847, 0.029519],
       [0.061665, 0.043628, 0.067579, ..., 0.017919, 0.028404, 0.028303],
       [0.049258, 0.078702, 0.128735, ..., 0.027532, 0.029182, 0.023603]])
Coordinates:
  * y        (y) float64 4.269e+06 4.27e+06 4.271e+06 ... 4.296e+06 4.297e+06
  * x        (x) float64 3.365e+05 3.375e+05 3.385e+05 ... 3.635e+05 3.645e+05
In [24]:
# Exercise: Try to aggregate using a method other than mean.

See the GeoViews docs for more information on resampling grids.
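For regular grids like ours, xarray's coarsen method is a more concise alternative to the groupby_bins approach above. A sketch on toy data (for the real data the call would be along the lines of diff_regridded.coarsen(x=5, y=5, boundary='trim').mean()):

```python
import numpy as np
import xarray as xr

# Toy 10x10 grid; coarsen groups every 5 cells along each dimension
# and reduces each 5x5 block, here with the mean.
da = xr.DataArray(
    np.arange(100, dtype=float).reshape(10, 10),
    coords={'y': np.arange(10), 'x': np.arange(10)},
    dims=('y', 'x'),
)

coarse = da.coarsen(x=5, y=5).mean()   # 10x10 -> 2x2
print(coarse.values)
```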

Next:

Now that the data have been aligned and preprocessed, they are ready for the subsequent steps in the workflow. The next section covers Machine Learning.