
Python vs the internet

Python can connect to the internet and interact with it: it can upload and download files, make HTTP requests, and much more. These are some of the ways in which Python can interact with the internet that we will cover here:

  • HTTP requests: the requests library lets you send HTTP/1.1 requests without any manual labour. Keep-alive and HTTP connection pooling are 100% automatic.
  • urllib: this module provides a high-level interface for fetching data across the World Wide Web. In particular, the urlopen() function is similar to the built-in function open(), but accepts Uniform Resource Locators (URLs) instead of filenames. Some restrictions apply — it can only open URLs for reading, and no seek operations are available.
  • BeautifulSoup: Beautiful Soup is a package for parsing HTML and XML documents. It does not make HTTP requests itself; it is perfect for analysing the HTML content returned by one (see the sketch after this list).
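As a quick, minimal sketch (not part of the original example), this is how fetching the same page looks with urllib and with requests; the BBC URL is simply the one used later in this guide:

from urllib.request import urlopen
import requests

url = 'https://www.bbc.com/news'

# urllib: returns a file-like object that can only be read, not seeked
with urlopen(url) as resp:
    raw_bytes = resp.read()

# requests: higher-level interface, connection pooling handled automatically
response = requests.get(url)
raw_text = response.text

print(len(raw_bytes), len(raw_text))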

What is parsing?

Parsing, syntax analysis, or syntactic analysis is the process of analysing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar.
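As a tiny illustration of parsing (a sketch added here for clarity, using the html.parser grammar that ships with Python), BeautifulSoup turns an HTML string into a tree of objects we can query:

from bs4 import BeautifulSoup

doc = '<html><body><h3>A headline</h3><p>Some text</p></body></html>'
tree = BeautifulSoup(doc, 'html.parser')
print(tree.h3.text)   # -> A headline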

WEB SCRAPING

One way to get content from the internet is to scrape the code directly from a web page via an HTTP request. This is called WEB SCRAPING.

Definition

Web scraping is a technique used to retrieve data from websites with a direct HTTP request. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Read before starting: the BeautifulSoup quickstart.

Planning an application in Python for web scraping

Code an application to get BBC News Headlines and show them in an interface:

  • How do we request the data?
  • How do we parse it?
  • Where do we store the data?
  • How often do we request the information?
  • How do we show it?

Requesting the data

For this, we will use the requests library:

from requests import get
from requests.exceptions import RequestException
from contextlib import closing


def log_error(e):
    """Logs errors; here we simply print them."""
    print(e)


def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of the response is some kind of HTML/XML, return
    the raw content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers.get('Content-Type')
    return (resp.status_code == 200
            and content_type is not None
            and content_type.lower().find('html') > -1)

simple_get('http://www.bbc.com/news')

Parsing the content

For this, we use the result from the previous point:

from bs4 import BeautifulSoup
raw_html = simple_get('https://www.bbc.com/news')
html = BeautifulSoup(raw_html, 'html.parser')
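At this point html is a BeautifulSoup object that we can query; for instance (a small illustrative sketch, not part of the original code):

print(html.title.text)          # the page <title>
headlines = html.select('h3')   # every <h3> element, used in the next step
print(len(headlines))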

Storing it

We will use a list. Lists in Python store data sequentially. They can be defined as:

>>> a = list()
>>> b = []
>>> print (type(a), type(b))
<class 'list'> <class 'list'>
>>> c = [1, 2, 3, 4, 'hello', 'goodbye']
>>> print (c)
[1, 2, 3, 4, 'hello', 'goodbye']

In the example we iterate over the <h3> elements of the parsed HTML and append their text (p.text) to the bbcnews list:

bbcnews = []
raw_html = simple_get('https://www.bbc.com/news')
html = BeautifulSoup(raw_html, 'html.parser')
for p in html.select('h3'):
    if p.text not in bbcnews:
        bbcnews.append(p.text)

bbcnews.remove("BBC World News TV")
bbcnews.remove("News daily newsletter")
bbcnews.remove('Mobile app')
bbcnews.remove('Get in touch')
bbcnews.remove('BBC World Service Radio')
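Note that remove() raises a ValueError if the item is not in the list, and the BBC page changes over time. A more defensive variant (just a sketch of an alternative, not the original approach) filters out the unwanted entries instead:

unwanted = {"BBC World News TV", "News daily newsletter", "Mobile app",
            "Get in touch", "BBC World Service Radio"}
bbcnews = [headline for headline in bbcnews if headline not in unwanted]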

Refreshing the information

In the example, the function Refresher is scheduled to run every 2 seconds using tkinter's after() method:

import tkinter

def Refresher(frame=None):
    print ('refreshing')
    frame = Draw(frame)                  # Draw() and window are defined in the next section
    frame.after(2000, Refresher, frame)  # schedule the next refresh in 2 seconds

Refresher()

And displaying it

Using tkinter, we make a window with a yellow background and display a random headline from the list:

def Draw(oldframe=None):
    frame = tkinter.Frame(window,width=1000,height=600,relief='solid',bd=0)
    label = tkinter.Label(frame, bg="yellow", fg="black", pady=350,
                          font=("Times New Roman", 20),
                          text=bbcnews[randint(0, len(bbcnews) - 1)])
    label.pack()
    frame.pack()
    if oldframe is not None:
        oldframe.pack_forget()
        #.destroy() # cleanup
    return frame

window.geometry('{}x{}'.format(w, h))   # window, w and h are created in the full script below
window.configure(bg='yellow')           # to tell the root window apart from the Frame
window.resizable(False, False)

# to rename the title of the window
window.title("BBC Live News")
# start the refresh loop, then enter tkinter's main loop
Refresher()
window.mainloop()

Putting it all together

from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
import tkinter
from random import randint

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers.get('Content-Type')
    return (resp.status_code == 200
            and content_type is not None
            and content_type.lower().find('html') > -1)


def log_error(e):
    """
    It is always a good idea to log errors. This function
    just prints them, but you can make it do anything.
    """
    print(e)

def Draw(oldframe=None):
    frame = tkinter.Frame(window,width=1000,height=600,relief='solid',bd=0)
    label = tkinter.Label(frame, bg="yellow", fg="black", pady=350,
                          font=("Times New Roman", 20),
                          text=bbcnews[randint(0, len(bbcnews) - 1)])
    label.pack()
    frame.pack()
    if oldframe is not None:
        oldframe.pack_forget()
        #.destroy() # cleanup
    return frame

def Refresher(frame=None):
    print ('refreshing')
    frame = Draw(frame)
    frame.after(2000, Refresher, frame) # refresh in 2 seconds

bbcnews = []
raw_html = simple_get('https://www.bbc.com/news')
html = BeautifulSoup(raw_html, 'html.parser')
for p in html.select('h3'):
    if p.text not in bbcnews:
        bbcnews.append(p.text)

bbcnews.remove("BBC World News TV")
bbcnews.remove("News daily newsletter")
bbcnews.remove('Mobile app')
bbcnews.remove('Get in touch')
bbcnews.remove('BBC World Service Radio')
window = tkinter.Tk()
w = '1200'
h = '800'
window.geometry('{}x{}'.format(w, h))
window.configure(bg='yellow')    # to tell the root window apart from the Frame
window.resizable(False, False)

# to rename the title of the window
window.title("BBC Live News")
# start the refresh loop, then enter tkinter's main loop
Refresher()
window.mainloop()

API requests

Sometimes websites are not very happy when they are scraped. For instance, IMDb says in its terms and conditions:

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.

For this reason, another means of interacting with online content is provided in the form of an API:

A Web API is an application programming interface for either a web server or a web browser. It is a web development concept, usually limited to a web application’s client-side (including any web frameworks being used), and thus usually does not include web server or browser implementation details such as SAPIs or APIs unless publicly accessible by a remote web application.

We can connect to an API directly via its endpoints:

Endpoints are important aspects of interacting with server-side web APIs, as they specify where resources lie that can be accessed by third party software. Usually the access is via a URI to which HTTP requests are posted, and from which the response is thus expected.

An example of an open API is the SmartCitizen API.
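As a sketch of what such a request could look like (the exact endpoint path is an assumption; check the SmartCitizen API documentation for the current reference):

import requests

# assumed endpoint listing devices; consult the SmartCitizen docs before relying on it
response = requests.get('https://api.smartcitizen.me/v0/devices')
if response.ok:
    print(response.json())   # parsed JSON payload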

Data Format

The data is generally available in JSON format. JSON packs data into objects delimited by {} (and arrays delimited by []), for example:

{
  "glossary": {
    "title": "example glossary",
    "GlossDiv": {
      "title": "S",
      "GlossList": {
        "GlossEntry": {
          "ID": "SGML",
          "SortAs": "SGML",
          "GlossTerm": "Standard Generalized Markup Language",
          "Acronym": "SGML",
          "Abbrev": "ISO 8879:1986",
          "GlossDef": {
            "para": "A meta-markup language, used to create markup languages such as DocBook.",
            "GlossSeeAlso": ["GML", "XML"]
          },
          "GlossSee": "markup"
        }
      }
    }
  }
}
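In Python, the json module maps this format directly onto dictionaries and lists; for example, with a small fragment of the structure above:

import json

raw = '{"glossary": {"title": "example glossary"}}'
data = json.loads(raw)             # JSON text -> Python dict
print(data['glossary']['title'])   # -> example glossary
print(json.dumps(data, indent=2))  # dict -> formatted JSON text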

With Python, we can make requests to APIs via the requests library and store the returned data in a Python dictionary.

Planning an application for API requests

We’ll make an application that takes a word, looks for all the movies in OMDB that contain that word in the title, and makes a GIF animation from the posters of those movies (for example, with the word Laboratory).

To plan for this:

  • We need to find the right API (in this case OMDB) and understand how the data is stored
  • Request the data
  • Explore the received data and store it
  • Make use of the data
  • Download images and make the gif

Exploring the API data

First, in some cases we will need an API key to access the data. Normally we would like to store the key in a secret file (in this case a .env file):

import os
import re
from os import getcwd
from os.path import join

with open(join(getcwd(), '.env')) as environment:
    for var in environment:
        key = var.split('=')
        os.environ[key[0]] = re.sub('\n', '', key[1])

API_KEY = os.environ['apikey']
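For this loop to work, the .env file is assumed to hold one KEY=value pair per line; for instance (placeholder value):

apikey=XXXXXXX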

In this example, we’ll have a look at the API’s data from OMDB.

Basic API Request Structure

The way we request data from an API has the following format:

  • Base URL: http://www.omdbapi.com/
  • Query string: ? followed by parameter=value pairs. The available parameters can be found in the API documentation, and several parameters are separated by &. An example (see the sketch after this list): http://www.omdbapi.com/?s=jose&plot=full&apikey=2a31115
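requests can also build the query string for us from a dictionary of parameters, which avoids concatenating strings by hand. A small sketch, with a placeholder API key:

import requests

params = {'s': 'jose', 'plot': 'full', 'apikey': 'XXXXXXX'}  # placeholder key
response = requests.get('http://www.omdbapi.com/', params=params)
print(response.url)   # -> http://www.omdbapi.com/?s=jose&plot=full&apikey=XXXXXXX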

Requesting data

Using the same library as before, we make the get request to the API:

import requests, json

title = 'peter'
baseurl = "http://omdbapi.com/?s=" #only submitting the title parameter

API_KEY = 'XXXXXXX'  # placeholder: use your own OMDB API key

def make_request(search):
    response = requests.get(baseurl + search + "&apikey=" + API_KEY)
    movies = {}
    if response.status_code == 200:
        movies = json.loads(response.text)
    else:
        raise ValueError("Bad request")

    return movies

movies = make_request(title)

Exploring the data

The data from the API is returned as a Python dictionary.

Python Dicts

Python dicts map keys to values. We can explore the keys of a dict by typing:

print (movies.keys())

dict_keys(['Search', 'totalResults', 'Response'])

And the values by:

print (movies['Search'])

[{'Title': 'Peter Pan', 'Year': '1953', 'imdbID': 'tt0046183', 'Type': 'movie',...]
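Each entry in movies['Search'] is itself a dictionary, so we can loop over the results; a short sketch using the keys visible in the output above:

for movie in movies['Search']:
    print(movie['Title'], movie['Year'])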

Making use of it

import urllib.request

def get_poster(_title, _link):
    try:
        print ('Downloading poster for', _title, '...')
        _file_name = _title + ".jpg"
        urllib.request.urlretrieve(_link, _file_name)
        return _file_name
    except Exception:        # e.g. the poster link is missing or 'N/A'
        return ''

# movie is one of the entries in movies['Search']
file_name = get_poster(movie['Title'], movie['Poster'])

Here movie['Poster'] is a link like:

https://m.media-amazon.com/images/M/MV5BMzIwMzUyYTUtMjQ3My00NDc3LWIyZjQtOGUzNDJmNTFlNWUxXkEyXkFqcGdeQXVyMjA0MDQ0Mjc@._V1_SX300.jpg

Make the gif

We will use imageio:

import imageio
from os import getcwd
from os.path import join

images = []
for filename in list_movies:   # filenames of the downloaded posters
    images.append(imageio.imread(filename))

# args.title comes from argparse in the full script below
imageio.mimsave(join(getcwd(), args.title + '.gif'), images)

Putting it all together

#!/Users/macoscar/anaconda2/envs/python3v/bin/python3
import os
from os import getcwd, pardir
from os.path import join, abspath
import requests, json
import urllib.request
import argparse
import re
import imageio


baseurl = "http://omdbapi.com/?s=" #only submitting the title parameter
with open(join(getcwd(), '.env')) as environment:
    for var in environment:
        key = var.split('=')
        os.environ[key[0]] = re.sub('\n','',key[1])

API_KEY = os.environ['apikey']

def make_request(search):
    # build the query URL, e.g. http://omdbapi.com/?s=peter&apikey=123456
    url_search = baseurl + search + "&apikey=" + API_KEY
    response = requests.get(url_search)
    movies = dict()
    if response.status_code == 200:
        movies = json.loads(response.text)
    else:
        raise ValueError("Bad request")

    return movies

def get_poster(_title, _link):
    try:
        print ('Downloading poster for', _title, '...')
        _file_name = _title + ".jpg"
        urllib.request.urlretrieve(_link, _file_name)
        return _file_name
    except Exception:        # e.g. the poster link is missing or 'N/A'
        return ''

if __name__ == '__main__':

    parser = argparse.ArgumentParser()
    parser.add_argument("--title", "-t", help="Movie title query")

    args = parser.parse_args()
    movies = make_request(args.title)

    list_movies = list()
    images = []

    if movies:
        for movie in movies['Search']:
            print (movie['Title'])
            print (movie['Poster'])
            file_name = get_poster(movie['Title'], movie['Poster'])
            if file_name != '':
                list_movies.append(file_name)

        for filename in list_movies:
            images.append(imageio.imread(filename))

        # write all frames to a single GIF once the images are collected
        imageio.mimsave(join(getcwd(), args.title + '.gif'), images)

The result!

$ ./api_request.py -t "Peter"
...