Handling CSV files - an example
===============================

In this chapter, we'll look at handling structured data from spreadsheets in Python.
The CSV module's main documentation is at
`CSV File Reading and Writing <https://docs.python.org/3.8/library/csv.html>`_.


Example dataset
---------------

For our example we will be using aquaculture data which is freely available from
`Fiskeridirektoratet <https://www.fiskeridir.no/Akvakultur/Registre-og-skjema/Akvakulturregisteret>`_.

You can get a copy of the dataset also from this local link
:download:`Akvakulturregisteret.csv` (Source: Fiskeridirektoratet, 13.04.2020).

The dataset contains an overview of all registered aquaculture setups in Norway,
showing which species they grow, where they're located, and so on.

The CSV file format is the most common import and export format for spreadsheets
and databases, and is a plain text file. Let's open it in an editor to see what columns it
contains. Spreadsheets often have the columns explained in the first few lines::

  AKVAKULTURTILLATELSER PR. 13.04.2020;;;;;;;;;;;;;(Enhet i stk. oppgis i 1000) ;;;;;;;;(Enhet i stk. oppgis i 1000) ;;;
  TILL_NR;ORG.NR/PERS.NR;NAVN;ADRESSE;POSTNR;POSTSTED;TILDELINGSTIDSPUNKT;TIDSBEGRENSET;TILL_KOMNR;TILL_KOM;FORMÅL;PRODUKSJONSFORM;ART;TILL_KAP;TILL_ENHET;LOK_NR;LOK_NAVN;LOK_KOMNR;LOK_KOM;LOK_PLASS;VANNMILJØ;LOK_KAP;LOK_ENHET;N_GEOWGS84;Ø_GEOWGS84

Here, it looks like the second line is what we need. The rest of the file contains the data,
one line for each data point, and the entries are separated with semicolons.


CSV module
----------

Since the data is stored as plain text, we could use our usual file reading code
to read it in, and use :code:`line.split(';')` to separate out the columns.

The python library also comes with a specialised module to read CSV files, that
does a lot of the repetitive work for us. Take a look at the
`CSV File Reading and Writing <https://docs.python.org/3.8/library/csv.html>`_
section in the Python manual. Let's try to use the simple example they give there
with our data instead. Compared to the first example on the Python CSV website,
I have changed the *delimiter* option to :code:`;` and removed the
*quotechar* option, since it looks like no quotes are used in our CSV file.
::

  import csv
  with open('Akvakulturregisteret.csv', newline='') as csvfile:
      akvareader = csv.reader(csvfile, delimiter=';')
      for row in akvareader:
          print(row)

Trying to run this, you may see
:code:`UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc5 in position 230: invalid continuation byte`.
This indicates that the encoding of the text in the file is not in the UTF-8
format that your Python expects as standard. Given it's a Norwegian dataset, the
other likely encoding is the older :code:`iso-8859-1` format. Let's specify that
explicitly::

  import csv
  with open('Akvakulturregisteret.csv', newline='', encoding='iso-8859-1') as csvfile:
      akvareader = csv.reader(csvfile, delimiter=';')
      for row in akvareader:
          print(row)

This seems to work! We get a printout of all datasets.

Application examples
--------------------

The last two fields in each entry are latitude and longitude. Let's try to plot
the locations on a map. First, we check if we can extract the values::

  import csv

  lats = []
  lons = []

  with open('Akvakulturregisteret.csv', newline='', encoding='iso-8859-1') as csvfile:
      akvareader = csv.reader(csvfile, delimiter=';')
      for row in akvareader:
          lat = float(row[-2]) # latitude is second last
          lon = float(row[-1]) # longitude is last
          lats.append(lat)
          lons.append(lon)

  print(lats,lons)

We get stuck on the first line in the CSV-file: :code:`ValueError: could not convert string to float: ''`.
Right, they didn't contain any data, just the header information. One way
of skipping them is to handle this exception with :code:`try ... except`. If we can't
get a valid latitude, let's continue to the next line:

.. literalinclude:: akva2.py

This works and prints long lists of numbers. Let's plot them:

.. literalinclude:: akva3.py

.. image:: akvakultur.png