This is the internet!

Note

Code for this session in: https://gitlab.fabcloud.org/barcelonaworkshops/code-club/tree/2020/01_python_basics and https://gitlab.fabcloud.org/barcelonaworkshops/code-club/tree/2020/02_python_internet

Basics

Warning

If you know this, you know it all!

Some basic Python structures:

Functions

A function is a block of code which only runs when it is called. You can pass data, known as parameters, into a function. A function can return data as a result.

# This is how you define a function
def function(arg1, arg2):
    '''
        Some nice documentation about what your function does
    '''
    if arg1 == 1:
        print (arg2)
    return arg1 + arg2

# This is how you call a function, using keyword arguments
function(arg1=1, arg2=2)
# Alternatively, you can call it with positional arguments
function(1, 2)

Class

Python is an object-oriented programming (OOP) language. Almost everything in Python is an object, with its properties and methods. A class is like an object constructor, or a blueprint for creating objects.

Easy book example:

class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age

  def sayname(self):
    print("Hello my name is " + self.name)

person_1 = Person("John", 36)
person_1.sayname() 

Another example:

class Furniture:
  def __init__(self, type, legs):
    self.type = type
    self.legs = legs

  def define(self):
    print("Hello I am a " + self.type + " with " + str(self.legs) + " legs")

piece_1 = Furniture("table", 4)
piece_1.define() 

piece_2 = Furniture("chair", 3)
piece_2.define() 

piece_3 = Furniture("stool", 4)
piece_3.define()

Organising information

There are three main structures for organising information: list, dict and tuple:

List

A list is a collection which is ordered and changeable. It allows duplicate members.

students = ["Andrew", "Antonio", "Manolito"]
print(students)
print(students[0])

Dictionary

A dictionary is a collection which is changeable and indexed by keys (since Python 3.7 it also preserves insertion order). In Python, dictionaries are written with curly brackets, and they have keys and values.

car =  {
  "brand": "Ford",
  "model": "Mustang",
  "year": 1964
}
print(car)
print(car["model"])

Tuples

A tuple is a collection which is ordered and unchangeable. In Python tuples are written with round brackets.

fruits = ("apple", "banana", "cherry")
print(fruits)
print(fruits[0])

Getting around

Some tricks for getting help:

help()

Shows the documentation for an object, or starts the interactive help system when called without arguments.

help(list)

type()

Returns the type of an object.

a = list()
type(a)

dir()

If called without an argument, return the names in the current scope. Else, return an alphabetized list of names comprising (some of) the attributes of the given object, and of attributes reachable from it.

a = list()
dir(a)

Interacting with the internet

With Python we can connect to the internet and do many things: post and download files, make requests…

What is all this?

Check this out: How web works

What is parsing?

Parsing, syntax analysis, or syntactic analysis is the process of analysing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar.
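
As a tiny illustration using only the standard library, parsing turns a flat string into structured data according to known rules (the date string below is made up for the example):

from datetime import datetime

# parse a date string according to the rules of the given format
raw = "2020-01-28 18:30"
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M")
print (parsed.year, parsed.month, parsed.hour) # 2020 1 18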

Before we start

Let’s take a selfie!

And also, install some requirements: pip install requests bs4 imageio

Learning with examples

We have 3 (three!!!) examples for you today. We do not expect to cover all of them, but they are all super-trendy:

BBC News web scraping

One way to get content from the internet is to scrape the HTML directly from a website via an HTTP request. This is called WEB SCRAPING.

Definition

Web scraping is a technique used to retrieve data from websites with a direct HTTP request. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Read before starting -> BeautifulSoup quickstart.
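
As a minimal taste of what BeautifulSoup does (the HTML string below is made up for the example):

from bs4 import BeautifulSoup

# parse a small, made-up HTML document and pull out the h3 elements
doc = "<html><body><h3>First headline</h3><h3>Second headline</h3></body></html>"
soup = BeautifulSoup(doc, 'html.parser')
for h3 in soup.select('h3'):
    print (h3.text)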

Planning the application

Code an application to get BBC News Headlines and show them in an interface:

Requesting the data

For this, we will use the requests library:

from contextlib import closing
from requests import get
from requests.exceptions import RequestException

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers.get('Content-Type', '').lower()
    return (resp.status_code == 200
            and 'html' in content_type)


def log_error(e):
    """
    Just prints the error for now (also included in the full script below).
    """
    print(e)

simple_get('http://www.bbc.com/news')

Parsing the content

For this, we use the result from the previous point:

from bs4 import BeautifulSoup
raw_html = simple_get('https://www.bbc.com/news')
html = BeautifulSoup(raw_html, 'html.parser')

Storing it

We will use a list. Lists in Python are used to store data in a sequential way. They can be defined as:

>>> a = list()
>>> b = []
>>> print (type(a), type(b))
<class 'list'> <class 'list'>
>>> c = [1, 2, 3, 4, 'hello', 'goodbye']
>>> print (c)
[1, 2, 3, 4, 'hello', 'goodbye']

In the example, we iterate over the h3 elements in the HTML and put each text field (p.text) in the bbcnews list:

bbcnews = []
raw_html = simple_get('https://www.bbc.com/news')
html = BeautifulSoup(raw_html, 'html.parser')
for p in html.select('h3'):
    if p.text not in bbcnews:
        bbcnews.append(p.text)

bbcnews.remove("BBC World News TV")
bbcnews.remove("News daily newsletter")
bbcnews.remove('Mobile app')
bbcnews.remove('Get in touch')
bbcnews.remove('BBC World Service Radio')

Refreshing the information

In the example, the function Refresher is called every 2 seconds using tkinter's after method:

import tkinter

def Refresher(frame=None):
    print ('refreshing')
    frame = Draw(frame)  # Draw is defined in the next step
    frame.after(2000, Refresher, frame) # schedule the next refresh in 2 seconds

Refresher()

And displaying it

Using tkinter, we make a window with a yellow background and display a random headline from the list:

from random import randint

def Draw(oldframe=None):
    frame = tkinter.Frame(window, width=1000, height=600, relief='solid', bd=0)
    label = tkinter.Label(frame, bg="yellow", fg="black", pady=350,
                          font=("Times New Roman", 20),
                          text=bbcnews[randint(0, len(bbcnews)-1)])
    label.pack()
    frame.pack()
    if oldframe is not None:
        oldframe.pack_forget()
        #.destroy() # cleanup
    return frame

window = tkinter.Tk()
w, h = '1200', '800'
window.geometry('{}x{}'.format(w, h))
window.configure(bg='yellow')    # to tell the root window apart from the Frame
window.resizable(False, False)

# to rename the title of the window
window.title("BBC Live News")
# start the refresh loop and hand control to tkinter
Refresher()
window.mainloop()

Putting it all together

Get the file here:

from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
import tkinter
from random import randint

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers.get('Content-Type', '').lower()
    return (resp.status_code == 200
            and 'html' in content_type)


def log_error(e):
    """
    This function just prints the error, but you can
    make it do anything.
    """
    print(e)

def Draw(oldframe=None):
    frame = tkinter.Frame(window,width=1000,height=600,relief='solid',bd=0)
    label = tkinter.Label(frame, bg="yellow", fg="black", pady=350,
                          font=("Times New Roman", 20),
                          text=bbcnews[randint(0, len(bbcnews)-1)])
    label.pack()
    frame.pack()
    if oldframe is not None:
        oldframe.pack_forget()
        #.destroy() # cleanup
    return frame

def Refresher(frame=None):
    print ('refreshing')
    frame = Draw(frame)
    frame.after(2000, Refresher, frame) # refresh in 2 seconds

bbcnews = []
raw_html = simple_get('https://www.bbc.com/news')
html = BeautifulSoup(raw_html, 'html.parser')
for p in html.select('h3'):
    if p.text not in bbcnews:
        bbcnews.append(p.text)

bbcnews.remove("BBC World News TV")
bbcnews.remove("News daily newsletter")
bbcnews.remove('Mobile app')
bbcnews.remove('Get in touch')
bbcnews.remove('BBC World Service Radio')
window = tkinter.Tk()
w = '1200'
h = '800'
window.geometry('{}x{}'.format(w, h))
window.configure(bg='yellow')    # to tell the root window apart from the Frame
window.resizable(False, False)

# to rename the title of the window
window.title("BBC Live News")
# start the refresh loop and hand control to tkinter
Refresher()
window.mainloop()

API requests

Sometimes websites are not very happy when they are scraped. For instance, IMDB says so in their terms and conditions:

For cases like this, another means of interacting with online content is provided in the form of an API:

We can connect to an API directly through its endpoints:

An example of an open API is the SmartCitizen API:
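
As a quick sketch (assuming the SmartCitizen API exposes device data under /v0/devices; check their documentation for the current endpoints):

import requests

# assumption: device data lives at https://api.smartcitizen.me/v0/devices/<id>
response = requests.get('https://api.smartcitizen.me/v0/devices/1')
if response.status_code == 200:
    print (response.json())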

Data Format

The data is generally available in JSON format. JSON packs data as key-value pairs between curly brackets ({}):

{ "glossary": { "title": "example glossary", "GlossDiv": { "title": "S", "GlossList": { "GlossEntry": { "ID": "SGML", "SortAs": "SGML", "GlossTerm": "Standard Generalized Markup Language", "Acronym": "SGML", "Abbrev": "ISO 8879:1986", "GlossDef": { "para": "A meta-markup language, used to create markup languages such as DocBook.", "GlossSeeAlso": ["GML", "XML"] }, "GlossSee": "markup" } } } } }

With Python, we can make requests to APIs via the requests library and store the returned data in a dictionary.

Planning the app

We’ll make an application that gets a word, looks for all the movies in OMDB that contain that word in the title and makes a gif animation with the posters of those movies. For example, with the word Laboratory we want this:

To plan for this:

Exploring the API data

First, in some cases we will need an API key to access the data. Normally, we want to store the key in a secret file (in this case a .env file) rather than in the code:

import os
import re
from os import getcwd
from os.path import join

# read KEY=VALUE pairs from .env into environment variables
with open(join(getcwd(), '.env')) as environment:
    for var in environment:
        key = var.split('=')
        os.environ[key[0]] = re.sub('\n', '', key[1])

API_KEY = os.environ['apikey']

In this example, we’ll have a look at the API’s data from OMDB.

Basic API Request Structure

The way we request data to an API comes with the following format:
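
In short: a base URL, followed by parameter=value pairs joined with & after a ?. With the requests library, the params argument builds this query string for us (OMDB parameter names shown; 123456 is a placeholder key):

import requests

# requests encodes the params dict into the query string
response = requests.get('http://omdbapi.com/', params={'s': 'peter', 'apikey': '123456'})
print (response.url) # full URL, query string included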

Requesting data

Using the same library as before, we make the get request to the API:

import requests, json

title = 'peter'
baseurl = "http://omdbapi.com/?s=" #only submitting the title parameter

API_KEY = 'XXXXXXX'  # replace with your own OMDB API key

def make_request(search):
    response = requests.get(baseurl + search + "&apikey=" + API_KEY)
    movies = {}
    if response.status_code == 200:
        movies = json.loads(response.text)
    else:
        raise ValueError("Bad request")

    return movies

movies = make_request(title)

Exploring the data

The data from the API is returned in a dict:

>>> print (movies.keys())
dict_keys(['Search', 'totalResults', 'Response'])

And the values with:

>>> print (movies['Search'])
[{'Title': 'Peter Pan', 'Year': '1953', 'imdbID': 'tt0046183', 'Type': 'movie',...]
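
Each entry in movies['Search'] is itself a dict, so we can pick out single fields (a quick sketch based on the output above):

first = movies['Search'][0]
print (first['Title'], first['Year']) # Peter Pan 1953
print (first['Poster']) # URL of the poster image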

Making use of it

import urllib.request

def get_poster(_title, _link):
    try:
        print ('Downloading poster for', _title, '...')
        _file_name = _title + ".jpg"
        urllib.request.urlretrieve(_link, _file_name)
        return _file_name
    except Exception:
        # some results have no poster (Poster == 'N/A'); skip those
        return ''

file_name = get_poster(movie['Title'], movie['Poster'])  # movie: one entry of movies['Search']

Yields something like:

https://m.media-amazon.com/images/M/MV5BMzIwMzUyYTUtMjQ3My00NDc3LWIyZjQtOGUzNDJmNTFlNWUxXkEyXkFqcGdeQXVyMjA0MDQ0Mjc@._V1_SX300.jpg

Make the gif

We will use imageio:

import imageio
from os import getcwd
from os.path import join

# read every downloaded poster, then write them all out as a single gif
# (list_movies and args.title come from the full script below)
images = []
for filename in list_movies:
    images.append(imageio.imread(filename))
imageio.mimsave(join(getcwd(), args.title + '.gif'), images)

Putting it all together! Get the file here:

import os
from os import getcwd
from os.path import join
import requests, json
import urllib.request
import argparse
import re
import imageio

baseurl = "http://omdbapi.com/?s=" #only submitting the title parameter
with open(join(getcwd(), '.env')) as environment:
    for var in environment:
        key = var.split('=')
        os.environ[key[0]] = re.sub('\n','',key[1])

API_KEY = os.environ['apikey']

def make_request(search):
    # build the request URL, e.g. http://omdbapi.com/?s=peter&apikey=123456
    url_search = baseurl + search + "&apikey=" + API_KEY
    response = requests.get(url_search)
    movies = dict()
    if response.status_code == 200:
        movies = json.loads(response.text)
    else:
        raise ValueError("Bad request")

    return movies

def get_poster(_title, _link):
    try:
        print ('Downloading poster for', _title, '...')
        _file_name = _title + ".jpg"
        urllib.request.urlretrieve(_link, _file_name)
        return _file_name
    except Exception:
        # some results have no poster (Poster == 'N/A'); skip those
        return ''

if __name__ == '__main__':

    parser = argparse.ArgumentParser()
    parser.add_argument("--title", "-t", help="Movie title query")

    args = parser.parse_args()
    movies = make_request(args.title)

    list_movies = list()
    images = []

    if movies:
        for movie in movies['Search']:
            print (movie['Title'])
            print (movie['Poster'])
            file_name = get_poster(movie['Title'], movie['Poster'])
            if file_name != '':
                list_movies.append(file_name)

        for filename in list_movies:
            images.append(imageio.imread(filename))
        imageio.mimsave(join(getcwd(), args.title + '.gif'), images)

The result!

$ python api_request.py -t "Peter"
...