⚠️ IMPORTANT NOTICE ⚠️
This site is ARCHIVED and will NO LONGER BE UPDATED.
For updated Tutorial material, please visit the Pythia landsat-ml cookbook.
For Topic Examples, head over to the HoloViz Examples website.

Machine Learning



With the data preparation complete, this step demonstrates how you can configure a scikit-learn or dask_ml pipeline; any library, algorithm, or simulator could be used at this stage, as long as it accepts array data. In the next step of the tutorial, Data Visualization, you will learn how to visualize the output of this pipeline and how to diagnose whether the inputs to the pipeline have the expected structure.

In [1]:
import intake
import numpy as np
import xarray as xr

import holoviews as hv

import cartopy.crs as ccrs
import geoviews as gv

import hvplot.xarray

hv.extension('bokeh', width=80)

Recap: Loading data

Note that in this tutorial we will use the small version of the Landsat data due to time constraints. If you prefer to work with full-scale data, use cat.landsat_5.read_chunked() instead.

In [2]:
cat = intake.open_catalog('../catalog.yml')
landsat_5_da = cat.landsat_5_small.read_chunked()
landsat_5_da.shape
Out[2]:
(6, 300, 300)

Reshaping Data

We'll need to reshape the image into the layout that dask-ml / scikit-learn expect: (n_samples, n_features), where n_features is the number of bands and n_samples is the total number of pixels in each band. Essentially, we'll be creating a bag of pixels out of each image, where each pixel has multiple features (bands), but the ordering of the pixels is no longer relevant. In this case we start with an array that is n_bands by n_y by n_x (6, 300, 300), and we need to reshape it to an array that is (n_samples, n_features), i.e. (90000, 6). We'll first look at using NumPy, then Xarray.

NumPy

Data can be reshaped at the lowest level using NumPy, by getting the underlying values from the xarray.DataArray, and using flatten and transpose to get the right shape.

In [3]:
arr = landsat_5_da.values
arr.shape
Out[3]:
(6, 300, 300)

Since we want to flatten along the x and y axes but not along the band axis, we need to iterate over each band and flatten its data.

In [4]:
flattened_npa = np.array([arr[i].flatten() for i in range(arr.shape[0])])
flattened_npa
Out[4]:
array([[ 640.,  842.,  864., ..., 1309., 1636., 1199.],
       [ 810., 1096., 1191., ..., 1736., 2250., 1736.],
       [1007., 1345., 1471., ..., 2202., 2783., 1994.],
       [1221., 1662., 1809., ..., 2755., 3431., 2223.],
       [1819., 2596., 2495., ..., 3067., 3802., 2665.],
       [1682., 2215., 2070., ..., 2860., 3724., 2333.]])
In [5]:
flattened_npa.shape
Out[5]:
(6, 90000)
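As an aside, the per-band loop above can be replaced by a single reshape call, since flattening each band in C order is the same as collapsing the last two axes. A minimal sketch (equal_nan=True guards against any NaN nodata values):

# Collapse the y and x axes into one; -1 infers 90000 from the remaining size
flattened_alt = arr.reshape(arr.shape[0], -1)
print(np.array_equal(flattened_alt, flattened_npa, equal_nan=True))  # should print True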

To get our flattened data into the shape (n_samples, n_features), we'll reorder the dimensions using .transpose:

In [6]:
flattened_t_npa = flattened_npa.transpose()
flattened_t_npa.shape
Out[6]:
(90000, 6)

Since NumPy arrays are not labeled data, the semantics of the data are lost over the course of these operations: the necessary metadata simply does not exist at the NumPy level.

Xarray

By using xarray methods to flatten the data, we can keep track of the coordinate labels ('x' and 'y') along the way. This means that we have the ability to reshape back to our original array at any time with no information loss.

In [7]:
flattened_xda = landsat_5_da.stack(z=('x','y'))
flattened_xda
Out[7]:
<xarray.DataArray (band: 6, z: 90000)>
dask.array<shape=(6, 90000), dtype=float64, chunksize=(1, 2400)>
Coordinates:
  * band     (band) int64 1 2 3 4 5 7
  * z        (z) MultiIndex
  - x        (z) float64 3.324e+05 3.324e+05 3.324e+05 ... 3.324e+05 3.324e+05
  - y        (z) float64 4.309e+06 4.309e+06 4.309e+06 ... 4.305e+06 4.305e+06
Attributes:
    transform:   (150.0, 0.0, 332325.0, 0.0, -150.0, 4309275.0)
    crs:         +init=epsg:32611
    res:         (150.0, 150.0)
    is_tiled:    0
    nodatavals:  (nan,)

We can reorder the dimensions using DataArray.transpose:

In [8]:
flattened_t_xda = flattened_xda.transpose('z', 'band')
flattened_t_xda.shape
Out[8]:
(90000, 6)
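As a quick check that no information has been lost, we can round-trip the stacked array: unstacking z and transposing back to the original dimension order should recover the original DataArray. A minimal sketch (equals treats NaNs in matching positions as equal):

roundtrip = flattened_xda.unstack('z').transpose('band', 'y', 'x')
print(roundtrip.equals(landsat_5_da))  # should print True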

Rescaling Data

Rescale (standardize) the data before feeding it to the algorithm, since the ML pipeline that we have selected expects input values to be small. Here we'll demonstrate doing this in both NumPy and xarray.

In [9]:
(flattened_t_npa - flattened_t_npa.mean()) / flattened_t_npa.std()
Out[9]:
array([[-1.29960701, -1.10062865, -0.87004784, -0.6195692 ,  0.08036645,
        -0.0799867 ],
       [-1.0631739 , -0.76587681, -0.47443204, -0.10339592,  0.98981461,
         0.54386898],
       [-1.03742375, -0.65468302, -0.32695396,  0.06866184,  0.87159805,
         0.37415215],
       ...,
       [-0.51656863, -0.01678181,  0.52865299,  1.1759179 ,  1.54110171,
         1.2988163 ],
       [-0.1338279 ,  0.58483512,  1.2086908 ,  1.9671495 ,  2.40139051,
         2.31009455],
       [-0.64531934, -0.01678181,  0.28519712,  0.55323267,  1.07057641,
         0.68198338]])
In [10]:
rescaled = (flattened_t_xda - flattened_t_xda.mean()) / flattened_t_xda.std()
rescaled.compute()
Out[10]:
<xarray.DataArray (z: 90000, band: 6)>
array([[-1.299607, -1.100629, -0.870048, -0.619569,  0.080366, -0.079987],
       [-1.117015, -0.765877, -0.573921, -0.218101,  0.595369,  0.147083],
       [-0.907503, -0.543489, -0.228635,  0.296902,  1.304669,  0.770938],
       ...,
       [-1.059663, -0.786945, -0.59499 , -0.525932, -0.415909, -0.721399],
       [-1.059663, -0.788116, -0.693308, -0.525932, -0.297692, -0.496671],
       [-0.645319, -0.016782,  0.285197,  0.553233,  1.070576,  0.681983]])
Coordinates:
  * band     (band) int64 1 2 3 4 5 7
  * z        (z) MultiIndex
  - x        (z) float64 3.324e+05 3.324e+05 3.324e+05 ... 3.324e+05 3.324e+05
  - y        (z) float64 4.309e+06 4.309e+06 4.309e+06 ... 4.305e+06 4.305e+06

NOTE: Since the xarray object is backed by a dask array, the actual computation isn't performed until .compute() is called.

In [11]:
# Exercise: Inspect the numpy array at rescaled.values to check that it matches the numpy array above. You could use == combined with .all() for this.
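One possible solution: note that the NumPy version flattened pixels in (y, x) order, while stack(z=('x', 'y')) flattens them in (x, y) order, so the rows must be reordered before comparing (np.allclose is a bit safer than == for floating-point values):

rescaled_npa = (flattened_t_npa - flattened_t_npa.mean()) / flattened_t_npa.std()
# Reorder rows from y-major to x-major to match the stacked xarray layout
rescaled_npa_xy = rescaled_npa.reshape(300, 300, 6).transpose(1, 0, 2).reshape(-1, 6)
print(np.allclose(rescaled.values, rescaled_npa_xy, equal_nan=True))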

Side-note: Other preprocessing

Although it isn't needed in this instance, sometimes getting the data into the right shape requires adding or removing axes. Here is an example of adding an axis with NumPy and with xarray.

In [12]:
np.expand_dims(flattened_t_npa, axis=2).shape
Out[12]:
(90000, 6, 1)
In [13]:
flattened_t_xda.expand_dims(dim='e', axis=2).shape
Out[13]:
(90000, 6, 1)
In [14]:
# Exercise: Try removing the extra axis using np.squeeze or .squeeze on the xarray object
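A possible solution for both cases:

print(np.squeeze(np.expand_dims(flattened_t_npa, axis=2), axis=2).shape)  # (90000, 6)
print(flattened_t_xda.expand_dims(dim='e', axis=2).squeeze('e').shape)    # (90000, 6)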

ML pipeline

The Machine Learning pipeline shown below is just for the purpose of understanding the shaping/reshaping of the data; in practice you will likely be using a more sophisticated pipeline. Here we will use SpectralClustering from dask_ml, a scalable equivalent of the scikit-learn implementation, which clusters pixels based on their similarity across all bands (which makes it spectral clustering by spectra!).

In [15]:
from dask_ml.cluster import SpectralClustering
from dask.distributed import Client
In [16]:
client = Client(processes=False)
client
Out[16]:

Client (Workers: 1, Cores: 2, Memory: 8.36 GB)

Now we will compute and persist the rescaled data to feed into the ML pipeline. Notice that X has the shape (n_samples, n_features) discussed above.

In [17]:
X = client.persist(rescaled)
X.shape
Out[17]:
(90000, 6)

First we will set up the model with the number of clusters and other options.

In [18]:
clf = SpectralClustering(n_clusters=4, random_state=0, gamma=None,
                         kmeans_params={'init_max_iter': 5},
                         persist_embedding=True)

Next we'll fit the model to our data X. This is the slow part, taking a noticeable amount of time: something like 1 minute for the data in this tutorial, or 9 minutes for a full-size Landsat image.

In [19]:
%time clf.fit(X)
CPU times: user 37.1 s, sys: 4.31 s, total: 41.4 s
Wall time: 39.3 s
Out[19]:
SpectralClustering(affinity='rbf', assign_labels='kmeans', coef0=1, degree=3,
                   eigen_solver=None, eigen_tol=0.0, gamma=None,
                   kernel_params=None, kmeans_params={'init_max_iter': 5},
                   n_clusters=4, n_components=100, n_init=10, n_jobs=1,
                   n_neighbors=10, persist_embedding=True, random_state=0)
In [20]:
# Exercise: Open the dask status dashboard and watch the workers in progress.
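With recent versions of dask.distributed, the dashboard address is available directly on the client object (the exact URL will differ on your machine):

print(client.dashboard_link)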
In [21]:
labels = clf.assign_labels_.labels_.compute()
labels.shape
Out[21]:
(90000,)

Un-flattening

Once the computation is done, the output can be used to create a new array with the same structure as the input array. This new output array will have the coordinates needed to be unstacked in the same way as they were stacked. One of the main benefits of using xarray for this stacking and unstacking is that it keeps track of the coordinate information for us.

Since the original array is n_samples by n_features (90_000, 6) and the output contains only one label per sample (90_000,), the template structure for this data needs to have the shape (n_samples,). We achieve this by just taking one of the bands.

In [22]:
template = flattened_t_xda[:, 0]
output_array = template.copy(data=labels)
output_array
Out[22]:
<xarray.DataArray (z: 90000)>
array([0, 0, 2, ..., 0, 0, 2], dtype=int32)
Coordinates:
    band     int64 1
  * z        (z) MultiIndex
  - x        (z) float64 3.324e+05 3.324e+05 3.324e+05 ... 3.324e+05 3.324e+05
  - y        (z) float64 4.309e+06 4.309e+06 4.309e+06 ... 4.305e+06 4.305e+06
Attributes:
    transform:   (150.0, 0.0, 332325.0, 0.0, -150.0, 4309275.0)
    crs:         +init=epsg:32611
    res:         (150.0, 150.0)
    is_tiled:    0
    nodatavals:  (nan,)

With this new output array in hand, we can unstack back to the original dimensions:

In [23]:
unstacked = output_array.unstack()
unstacked
Out[23]:
<xarray.DataArray (x: 300, y: 300)>
array([[0, 0, 2, ..., 2, 2, 1],
       [0, 0, 0, ..., 2, 2, 2],
       [0, 0, 2, ..., 2, 2, 2],
       ...,
       [2, 2, 2, ..., 2, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 2]], dtype=int32)
Coordinates:
    band     int64 1
  * x        (x) float64 3.324e+05 3.326e+05 3.327e+05 ... 3.771e+05 3.772e+05
  * y        (y) float64 4.309e+06 4.309e+06 4.309e+06 ... 4.264e+06 4.264e+06
Attributes:
    transform:   (150.0, 0.0, 332325.0, 0.0, -150.0, 4309275.0)
    crs:         +init=epsg:32611
    res:         (150.0, 150.0)
    is_tiled:    0
    nodatavals:  (nan,)
In [24]:
landsat_5_da.sel(band=4).hvplot(x='x', y='y', width=400, height=400, datashade=True, cmap='greys').relabel('Image') + \
               unstacked.hvplot(x='x', y='y', width=400, height=400, cmap='Category10', colorbar=False).relabel('Clustered')
Out[24]:

Geographic plot

The plot above is useful and quick to generate, but it isn't referenced against the underlying geographic coordinates, which is crucial if we want to overlay the data on any other geographic data sources. Setting geo=True in the hvplot call ensures that the data is properly positioned in space, using the coordinate reference system stored in the array's metadata. This geo-referencing is made very straightforward because of the way xarray persists metadata. We can even add tiles underneath.

In [25]:
gv.tile_sources.EsriImagery * unstacked.hvplot(x='x', y='y', geo=True, height=500, cmap='Category10', alpha=0.7)
Out[25]:
In [26]:
# Exercise: Try adding a different set of map tiles. Use tab completion to find others.
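For instance, OSM tiles are one of the other options available in gv.tile_sources:

gv.tile_sources.OSM * unstacked.hvplot(x='x', y='y', geo=True, height=500, cmap='Category10', alpha=0.7)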

Next:

Now that your analysis is complete, you are ready for Data Visualization, where you will learn how to visualize the output of this pipeline and diagnose whether the inputs to the pipeline have the expected structure.