Pandas examples
===============
.. warning::
   Pandas is NOT part of the exam
Here we will take a brief look at the `Pandas `_ library,
using the example dataset from the previous chapter.
Pandas is a very useful data analysis library that makes many common tasks
easy to handle. For all the details, have a look at the introductions here:
* https://pandas.pydata.org/docs/getting_started/overview.html
* https://pandas.pydata.org/docs/getting_started/10min.html
* https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html
Akvakultur
----------
Load dataset
............
The data file is the same as before: :download:`../csv/Akvakulturregisteret.csv`.
The `instructions on opening files `_
tell us to use :code:`read_csv`. The initial attempt ::

    import pandas as pd
    akva = pd.read_csv('Akvakulturregisteret.csv')
    print(akva.columns)

fails with some errors. We need to tell it about the non-standard delimiter :code:`;` and
text encoding :code:`iso-8859-1`. Also, unusually, the header with column names is in row 1, not 0.
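Let's provide all these as options::

    import pandas as pd
    import matplotlib.pyplot as plt

    akva = pd.read_csv(
        'Akvakulturregisteret.csv',
        delimiter=';',          # fields are separated by semicolons, not commas
        encoding='iso-8859-1',  # text encoding of the file
        header=1                # column names are in row 1, not row 0
    )
    print(akva.columns)
    print(akva)
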
This looks like it works already. Compared to the CSV module, we have much more information
in our pandas dataframe :code:`akva`. The column names are picked up automatically from the
header row, and we can print some information with e.g.::

    print(akva['ART'])
    print(akva['POSTSTED'].min(), akva['POSTSTED'].max())

Even slicing and filtering work like we've seen in numpy::

    # a boolean mask; named is_laks to avoid shadowing the built-in filter()
    is_laks = akva['ART'] == 'Laks'
    print(akva[is_laks])

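To try the loading options without the real file, here is a minimal sketch that reads a tiny made-up dataset from memory. The rows, the title line and the names ``raw``, ``akva_demo`` and ``laks_demo`` are assumptions for illustration only; just the three :code:`read_csv` options are taken from the text above.

```python
import io

import pandas as pd

# Made-up miniature file: a title line in row 0, the column names in row 1,
# ';' as delimiter, and ISO-8859-1 encoded bytes.
raw = (
    "Eksempel på register;;\n"
    "NAVN;ART;Ø_GEOWGS84\n"
    "Anlegg A;Laks;5.1\n"
    "Anlegg B;Ørret;5.3\n"
    "Anlegg C;Laks;6.0\n"
).encode("iso-8859-1")

# Same options as for the real file, but reading from an in-memory buffer.
akva_demo = pd.read_csv(
    io.BytesIO(raw),
    delimiter=";",
    encoding="iso-8859-1",
    header=1,
)
print(list(akva_demo.columns))

# Boolean mask: True for every row where the species is 'Laks'.
is_laks = akva_demo["ART"] == "Laks"
laks_demo = akva_demo[is_laks]
print(laks_demo)
```

Leaving out any one of the three options on this small example is a quick way to see which kind of problem each of them solves.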
Plotting
........
Let's look at visualization. Again, the beginner
`tutorial on plotting `_
is very informative. It looks like we only need to add one more line to see a scatterplot
of all locations:
.. literalinclude:: akva_pd_1.py
.. image:: akva_pd_1.png
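As a rough, self-contained illustration of that one extra line, the sketch below calls :code:`plot.scatter` on a small made-up dataframe. It is not the content of ``akva_pd_1.py``: the coordinates are invented, and the ``Agg`` backend is chosen only so that the example also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Invented coordinates; in the real dataset these columns come from the CSV file.
demo = pd.DataFrame(
    {
        "Ø_GEOWGS84": [5.1, 5.3, 6.0, 10.4],
        "N_GEOWGS84": [60.4, 60.1, 62.5, 63.4],
    }
)

# The single plotting line: one dot per row, slightly transparent.
ax = demo.plot.scatter(x="Ø_GEOWGS84", y="N_GEOWGS84", alpha=0.2)

# With an interactive backend, plt.show() would now open the figure window.
```

The call returns a regular matplotlib axes object, so titles, labels and further plots can be added with the matplotlib functions we already know.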
Tasks from CSV chapter
......................
The different common data analysis tasks we saw before can be done easily with pandas:
* Count the different species (*ART*, column 12). Googling *count categories in pandas*
  suggested :code:`value_counts()`::

      print(akva['ART'].value_counts())

* Plot only those that grow *Laks*. Here, we can use filters::

      laks = akva[akva['ART'] == 'Laks']
      laks.plot.scatter(x='Ø_GEOWGS84', y='N_GEOWGS84', alpha=0.2)

* Plot *FERSKVANN* in one colour and *SALTVANN* in another (*VANNMILJØ*, column 20). Again we can use filters.
  The scatter plots here are the usual matplotlib plots, not the ones from pandas. You can see
  that the libraries work well with each other:

.. literalinclude:: akva_pd_2.py
.. image:: akva_pd_2.png
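The counting and the two-colour plot from the list above can be sketched on a small made-up dataframe as follows. This is an illustration under assumptions, not the content of ``akva_pd_2.py``: the rows, the colour names and the ``Agg`` backend are chosen only for this example.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Invented rows using the column names mentioned in the text.
demo = pd.DataFrame(
    {
        "ART": ["Laks", "Laks", "Ørret", "Laks"],
        "VANNMILJØ": ["SALTVANN", "SALTVANN", "FERSKVANN", "SALTVANN"],
        "Ø_GEOWGS84": [5.1, 5.3, 6.0, 10.4],
        "N_GEOWGS84": [60.4, 60.1, 62.5, 63.4],
    }
)

# Count how often each species occurs.
counts = demo["ART"].value_counts()
print(counts)

# One matplotlib scatter call per category, each with its own colour.
fig, ax = plt.subplots()
for label, colour in [("FERSKVANN", "tab:blue"), ("SALTVANN", "tab:orange")]:
    subset = demo[demo["VANNMILJØ"] == label]  # filter with a boolean mask
    ax.scatter(
        subset["Ø_GEOWGS84"],
        subset["N_GEOWGS84"],
        color=colour,
        alpha=0.5,
        label=label,
    )
ax.legend()
```

Passing pandas columns straight to :code:`ax.scatter` works because matplotlib accepts them like numpy arrays.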
Weather data
------------
One of the strengths of Pandas is in analysing time series of measurements. Just to show what is possible,
let's take an example from https://www.bergensveret.no/ by UiB's skolelab. These three data files
contain weather data recorded every 10 minutes over more than 3 years (2016-01-01 to 2019-09-16, judging by the file names).
* :download:`Garnes-2016-01-01-2019-09-16.csv`
* :download:`Sandgotna-2016-01-01-2019-09-16.csv`
* :download:`Haukeland-2016-01-01-2019-09-16.csv`
Comparison
..........
This code compares the three stations during July 2018:
.. literalinclude:: 26-sammenligning.py
.. image:: 26.png
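The central step in such a comparison is selecting one month from a time-indexed dataframe. The sketch below shows this on synthetic data; it is not the content of ``26-sammenligning.py``. The random temperatures and the layout (one column per station) are assumptions, and the real CSV files may be organised differently.

```python
import numpy as np
import pandas as pd

# Synthetic 10-minute timestamps for June to August 2018.
index = pd.date_range("2018-06-01", "2018-08-31 23:50", freq="10min")
rng = np.random.default_rng(0)

# One made-up temperature column per station (values are random, not measured).
stations = pd.DataFrame(
    {
        name: 15 + offset + rng.normal(0, 2, len(index))
        for name, offset in [("Garnes", 0.0), ("Sandgotna", 0.5), ("Haukeland", 1.0)]
    },
    index=index,
)

# Partial string indexing: all rows whose timestamp falls in July 2018.
july = stations.loc["2018-07"]

# Reduce the 10-minute values to one mean per day and station.
daily_mean = july.resample("D").mean()
print(daily_mean.head())
```

Calling :code:`daily_mean.plot()` would then draw one line per station in a single figure.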
Grouping and averaging
......................
We can also take one station and look at the average temperature during the day,
for different months:
.. literalinclude:: 27-gruppering.py
.. image:: 27.png
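Averaging by time of day for each month can be expressed with :code:`groupby`. The following sketch uses a synthetic temperature series, not the station data and not the content of ``27-gruppering.py``; the formula that generates the values and the names ``temperature`` and ``profile`` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic 10-minute timestamps for the whole of 2018.
index = pd.date_range("2018-01-01", "2018-12-31 23:50", freq="10min")
day = index.dayofyear.to_numpy()
hour = (index.hour + index.minute / 60).to_numpy()

# Made-up temperatures: warmer in summer and warmer in the afternoon.
values = (
    8
    - 8 * np.cos(2 * np.pi * (day - 15) / 365)
    + 3 * np.sin(2 * np.pi * (hour - 9) / 24)
)
temperature = pd.Series(values, index=index, name="temperature")

# Group by (month, hour of day) and average within each group.
by_month_hour = (
    temperature.groupby([index.month, index.hour])
    .mean()
    .rename_axis(["month", "hour"])
)

# Turn the months into columns: 24 rows (hours) by 12 columns (months).
profile = by_month_hour.unstack("month")
print(profile.round(1))
```

Each column of ``profile`` is the average daily temperature curve of one month, so :code:`profile.plot()` would draw one line per month.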
Summary
-------
Pandas is one of the most widely used data analysis tools in science, and offers far more than we can show
within an introductory lecture on Python. If you find it useful, start on the pandas website and
follow the tutorials with your own data in mind.