Land or Water? - numpy with large datasets
==========================================

Let's look at how we handle large data sets in numpy. As an example, we'll take
a world map at 1km resolution per pixel.

Dataset
-------

The example data comes
from (the now closed)
`http://www.landcover.org/data/landcover/ <https://web.archive.org/web/20190324155156/http://www.landcover.org/data/landcover/index.shtml>`_.
The University of Maryland Department of Geography generated this global land cover
classification collection in 1998. Imagery from the AVHRR satellites acquired
between 1981 and 1994 were analyzed to classify the world's surface
into fourteen land cover types at a resolution of 1 km [#f2]_ .


Download
........

First, let's get all files we need, then we'll discuss them afterwards.
Create a new empty directory to work in, and...

* Download `landcover.zip <https://folk.uib.no/dgr061/landcover/landcover.zip>`_
  and extract the data file from it. It is called ``gl-latlong-1km-landcover.bsq``
  and should be 933120000 bytes big. Put the unzipped data file in your working directory.
  This is a raw binary file of data, with no internal description of how we can
  interpret it. Nowadays, one would use a self-describing format such as
  `NetCDF <https://en.wikipedia.org/wiki/NetCDF>`_ for this kind of geographical data.

* The data format is described in a separate file :download:`gl0500bs.txt` (click to download).

  .. literalinclude:: gl0500bs.txt
       :language: none
       :emphasize-lines: 2,3

  We'll make sense of it later. For now, let's note the first 2 lines: it's meant to be
  an image of 21600 lines with 43200 pixels. And the values are stored as 8-bit
  unsigned integer.

* The possible integer values go from 0 to 13, as we can see in the encoding of the
  land types mentioned on the project's website:  :download:`landcover_legend.png`
  (click to download)

  .. image:: landcover_legend.png


Making sense of it
..................

What is actually in the data file? A search for "numpy load binary data" suggests
``numpy.fromfile``. Let's try it::

    import numpy as np
    data = np.fromfile('gl-latlong-1km-landcover.bsq')

    print('data', data)
    print('minmax', data.min(), data.max())
    print('shape', data.shape)

We get::

    data [0.00000000e+000 0.00000000e+000 0.00000000e+000 ... 1.22416778e-250
 1.22416778e-250 1.22416778e-250]
    minmax 0.0 8.309872195179385e-246
    shape (116640000,)

The values here make no sense. What has gone wrong? We were expecting integer
values in the data file, but got very small float values instead. Also, from a file of size
933120000, we only got 116640000 values. We must be misinterpreting the binary data.

Let's look at the documentation of
`np.fromfile <https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html>`_
We see that there is a ``dtype`` parameter we can specify, to determine the data type.
By default it is `float`, which in Python is a 64-bit data type. So we are taking 64 bits
from the file and interpreting them as a float number. Instead, the description told
us it was 8-bit unsigned integer. In numpy, this is represented as ``uint8``
(see `numpy basic types <https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html>`_)

Let's try to specify this explicitly::

    import numpy as np
    data = np.fromfile('gl-latlong-1km-landcover.bsq', dtype='uint8')

    print('data', data)
    print('minmax', data.min(), data.max())
    print('shape', data.shape)

The result is a lot better::

    data [ 0  0  0 ... 12 12 12]
    minmax 0 13
    shape (933120000,)

We see that the values are integers, and they go from 0 to 13 like we expect.
Only the shape is still strange. It should be 2-dimensional with 43200 pixels
and 21600 lines. Instead, we have a one-dimensional long array with 933 million
values. But :math:`43200\times 21600=933120000`!

Let's reshape the array into the right form::

    data = np.fromfile('gl-latlong-1km-landcover.bsq', dtype='uint8')
    data = data.reshape(21600, 43200)

Now the output is the right shape::

    data [[ 0  0  0 ...  0  0  0]
     [ 0  0  0 ...  0  0  0]
     [ 0  0  0 ...  0  0  0]
     ...
     [12 12 12 ... 12 12 12]
     [12 12 12 ... 12 12 12]
     [12 12 12 ... 12 12 12]]
     minmax 0 13
     shape (21600, 43200)

Each position in the array approximately corresponds to a 1km x 1km square on
Earth. The value at that position represents the dominant landcover.

Visualisation
-------------

Remember that we can use ``plt.imshow()`` to look at 2-dimensional arrays. Let's
try that here. Viewing almost 1GB of data will be very resource-intensive. So
let's do a subsample of the data. We go from beginning to end in steps of 50: ``::50``.
This will reduce the data size for plotting by :math:`50\times 50=2500` to approximately
400kB::

    viz_data = data[::50, ::50]

    import matplotlib.pyplot as plt
    plt.imshow(viz_data)
    plt.show()

The result looks promising:

.. image:: map1.png


We can also use the original data set, as long as we don't try to plot it all at
the same time. Try these two alternatives. The last one nicely shows how detailed the
data set is::

    plt.imshow(data[2000:7000, 20000:25000])

    plt.imshow(data[3400:3700, 22120:22520])


Colours
.......

In the default colour scheme, it is hard to tell apart water from forest. Instead,
let us use the  RGB colour values given in the overview table:

.. image:: landcover_legend.png

In maptlotlib, we can use this information to create a colour map. Googling how
to do this points us at `some examples <https://matplotlib.org/tutorials/colors/colormap-manipulation.html>`_.
Let's look at the completed code:

.. literalinclude:: landcover_ex1.py

.. image:: bergen.png


Memory mapping
..............

Maybe you noticed that ``np.fromfile`` takes a little while to load up the data set.
The whole 1GB is copied into memory immediately.

We have an alternative, ``np.memmap``, where the data is left on disk, and only
loaded when it is needed. `The syntax <https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html>`_
is very similar::

    data = np.memmap('gl-latlong-1km-landcover.bsq',
                     mode='r',
                     shape=(21600,43200)
                    )

We can directly specify the shape here. The option ``mode='r'`` makes the file read-only.
If we do not set this, any changes we make to ``data`` will be immediately changed
in the file, too!

The rest of the program can stay unchanged, but should run faster now if we don't use
all the data points.


Task: Map locations
-------------------

Write a program that asks for a Latitude (North-South) and Longitude (East-West)
number as decimals (such as ``Lat: 48.95`` ``Lon: 9.13``), and
prints out **if that point is on land or on water**. By convention, North and East
are positive numbers, South and West are negative numbers.

You'll need two functions that translate from coordinates to the array-indexes
that you can use with ``data``.

To help you check if you found the right place, you can draw the map around the
chosen point. `Google Maps <https://www.google.com/maps>`_ also allows input of
coordinates. Compare the two to see if you got it right.

Solution
........

We need a translation between geographical coordinates running from 90 to -90
(up-down) and -180 to 180 (left-right) into the pixel index in the array. See this
sketch to help with the positioning. The black numbers are the index numbers of each pixel.
The blue numbers are geographical coordinates:

.. image:: coords.png

Let's take the latitude / north-south / up-down direction first. We need a linear function
that maps the latitude 90 to 0 and -90 to 21600.

.. math::

    \mathit{lat} \rightarrow 21600\cdot\frac{90-\mathit{lat}}{180}

Array indices need to be integers, so we truncate::

    def lat_to_pixel(lat):
        return int(21600 * (90-lat) / 180)

The same for longitude / west-east / left-right.
-180 should return pixel 0 and 180 should return 43200:

.. math::

    \mathit{lon} \rightarrow 43200\cdot\frac{\mathit{lon}-180}{360}

::

    def lon_to_pixel(lon):
        return int(43200 * (lon-180) / 360)


Now we can write the function that replies with the landcover type at
the requested spot::

    def landcover_at_coord(lat, lon):
        i = lat_to_pixel(lat)
        j = lon_to_pixel(lon)
        return data[i,j]

Edge cases
..........

It's often easy to get the function working in 99% of all cases. But what about
the edges? Exactly at latitude -90 our function returns 21600. But this
is not a valid array index, the last valid index is 21599. We can fix this in
different ways, but there's no single right answer here.

We can make sure the highest index is 21599. The downside is that this shifts
all the other pixel boundaries very slightly, too::

    def lat_to_pixel__option_1(lat):
        return int(21599 * (90-lat) / 180)

My preferred option is a fixed pixel value for the problematic coordinate value::

    def lat_to_pixel__option_2(lat):
        if lat == -90:
            return 21599
        else:
            return int(21600 * (90-lat) / 180)

This could now be extended to dealing with all sorts of nonsensical latitudes::

    def lat_to_pixel(lat):
        if lat <= -90:
            return 21599
        elif lat > 90:
            return 0
        else:
            return int(21600 * (90-lat) / 180)


In the East-West direction things are a bit easier, since the coordinates actually
wrap around. We can treat -180 and +180 as the same place. The easiest way to fix that
is by using the modulus operation ``%``. It gives us a cycling behaviour, for example::

    [ (i % 4) for i in range(10) ]

gives us ``[0, 1, 2, 3, 0, 1, 2, 3, 0, 1]``

Using the modulus for longitude keeps all return values between 0 and 43199::

    def lon_to_pixel(lon):
        return int(43200 * (lon-180) / 360) % 43200


When we're addressing edge cases in production code, it's
important to make consistent choices, and to document which choice has been made and why.


Finished program
----------------

The finished program also includes a little plot option,
to look at the area around the chosen point. The full code can be downloaded
at this link :download:`landcover_at_coord.py`.

.. literalinclude:: landcover_at_coord.py


..
    Task 3: Earthquake list
    -----------------------

    Decide for each earthquake in the following file if it occurred on land or on water and add that information as an additional semicolon-separated column to the data file:

    * :download:`Earthquake data <events_4.5.txt>`


References
..........

.. [#f2] Hansen, M., R. DeFries, J.R.G. Townshend, and R. Sohlberg (1998),
         UMD Global Land Cover Classification, 1 Kilometer, 1.0, Department of
         Geography, University of Maryland, College Park, Maryland, 1981-1994.

         Hansen, M., R. DeFries, J.R.G. Townshend, and R. Sohlberg (2000), Global
         land cover classification at 1km resolution using a decision tree classifier,
         International Journal of Remote Sensing. 21: 1331-1365.