Python vs the internet¶
Python can connect to the internet and interact with it: it can upload and download files, make HTTP requests, and much more. Here are some of the ways in which Python can interact with the internet that we will cover:
- HTTP requests: the requests library allows sending HTTP/1.1 requests without the need for manual labor. Keep-alive and HTTP connection pooling are 100% automatic.
- urllib: this module provides a high-level interface for fetching data across the World Wide Web. In particular, the urlopen() function is similar to the built-in function open(), but accepts Uniform Resource Locators (URLs) instead of filenames. Some restrictions apply — it can only open URLs for reading, and no seek operations are available (a short fetch example using both requests and urllib appears below).
- BeautifulSoup: Beautiful Soup is a package for parsing HTML and XML documents. It is perfect for analysing the HTML content returned by an HTTP request.
What is parsing?
Parsing, syntax analysis, or syntactic analysis is the process of analysing a string of symbols, whether in a natural language, a computer language, or a data structure, according to the rules of a formal grammar.
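As a minimal sketch of the first two options (assuming network access; example.com is only a placeholder URL), the same page can be fetched with either library:

import requests
import urllib.request

# with requests: keep-alive and connection handling are automatic
resp = requests.get('https://www.example.com')
print(resp.status_code, len(resp.text))

# with urllib: urlopen() behaves like open(), but for URLs (read-only, no seek)
with urllib.request.urlopen('https://www.example.com') as page:
    print(page.status, len(page.read()))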
WEB SCRAPING¶
One way to get content from the internet is to scrape the code directly from a web page via an HTTP request. This is called WEB SCRAPING.
Definition
Web scraping is a technique used to retrieve data from websites with a direct HTTP request. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
Read before starting -> BeautifulSoup quickstart.
Planning an application in Python for web scraping¶
We will code an application to get the BBC News headlines and show them in an interface:
- How do we request the data?
- How do we parse it?
- Where do we store the data?
- How often do we request the information?
- How do we show it?
Requesting the data¶
For this, we will use the requests library:
from requests import get
from requests.exceptions import RequestException
from contextlib import closing


def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of the response is some kind of HTML/XML, return
    the text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200
            and content_type is not None
            and content_type.find('html') > -1)


def log_error(e):
    """
    This function just prints errors, but you can make it do anything.
    """
    print(e)


simple_get('http://www.bbc.com/news')
Parsing the content¶
For this, we use the result from the previous point:
from bs4 import BeautifulSoup

raw_html = simple_get('https://www.bbc.com/news')
html = BeautifulSoup(raw_html, 'html.parser')
Storing it¶
We will use a list. Lists in Python are used to store data in a sequential way. They can be defined as:
>>> a = list()
>>> b = []
>>> print(type(a), type(b))
<class 'list'> <class 'list'>
>>> c = [1, 2, 3, 4, 'hello', 'goodbye']
>>> print(c)
[1, 2, 3, 4, 'hello', 'goodbye']
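Lists can also grow and shrink after they are created; the scraping code below relies on append() and remove(). Continuing the session above:

>>> c.append('again')
>>> c.remove('hello')
>>> print(c)
[1, 2, 3, 4, 'goodbye', 'again']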
In the example, we iterate over the items in the HTML and put the text field (p.text) in the bbcnews list:
bbcnews = []
raw_html = simple_get('https://www.bbc.com/news')
html = BeautifulSoup(raw_html, 'html.parser')

# collect the text of every <h3> element, avoiding duplicates
for p in html.select('h3'):
    if p.text not in bbcnews:
        bbcnews.append(p.text)

# drop items that are navigation links rather than headlines
bbcnews.remove("BBC World News TV")
bbcnews.remove("News daily newsletter")
bbcnews.remove('Mobile app')
bbcnews.remove('Get in touch')
bbcnews.remove('BBC World Service Radio')
Refreshing the information¶
In the example, the function Refresher is called every 2 seconds (2000 milliseconds):
import tkinter

def Refresher(frame=None):
    print('refreshing')
    frame = Draw(frame)
    frame.after(2000, Refresher, frame)  # refresh in 2 seconds

Refresher()
And displaying it¶
Using tkinter, we make a window with a yellow background and display a random item from the list:
def Draw(oldframe=None):
    frame = tkinter.Frame(window, width=1000, height=600, relief='solid', bd=0)
    lalabel = tkinter.Label(frame, bg="yellow", fg="black", pady=350,
                            font=("Times New Roman", 20),
                            text=bbcnews[randint(0, len(bbcnews)-1)]).pack()
    frame.pack()
    if oldframe is not None:
        oldframe.pack_forget()  # .destroy() # cleanup
    return frame

window.geometry('{}x{}'.format(w, h))
window.configure(bg='yellow')  # to differentiate between root & Frame
window.resizable(False, False)
# to rename the title of the window
window.title("BBC Live News")
# pack is used to show the object in the window
Refresher()
window.mainloop()
Putting it all together¶
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
import tkinter
from random import randint


def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of the response is some kind of HTML/XML, return
    the text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200
            and content_type is not None
            and content_type.find('html') > -1)


def log_error(e):
    """
    This function just prints errors, but you can make it do anything.
    """
    print(e)


def Draw(oldframe=None):
    frame = tkinter.Frame(window, width=1000, height=600, relief='solid', bd=0)
    lalabel = tkinter.Label(frame, bg="yellow", fg="black", pady=350,
                            font=("Times New Roman", 20),
                            text=bbcnews[randint(0, len(bbcnews)-1)]).pack()
    frame.pack()
    if oldframe is not None:
        oldframe.pack_forget()  # .destroy() # cleanup
    return frame


def Refresher(frame=None):
    print('refreshing')
    frame = Draw(frame)
    frame.after(2000, Refresher, frame)  # refresh in 2 seconds


bbcnews = []
raw_html = simple_get('https://www.bbc.com/news')
html = BeautifulSoup(raw_html, 'html.parser')
for p in html.select('h3'):
    if p.text not in bbcnews:
        bbcnews.append(p.text)

# drop items that are navigation links rather than headlines
bbcnews.remove("BBC World News TV")
bbcnews.remove("News daily newsletter")
bbcnews.remove('Mobile app')
bbcnews.remove('Get in touch')
bbcnews.remove('BBC World Service Radio')

window = tkinter.Tk()
w = '1200'
h = '800'
window.geometry('{}x{}'.format(w, h))
window.configure(bg='yellow')  # to differentiate between root & Frame
window.resizable(False, False)
# to rename the title of the window
window.title("BBC Live News")
# pack is used to show the object in the window
Refresher()
window.mainloop()
API requests¶
Sometimes websites are not very happy when they are scraped. For instance, IMDb says in its terms and conditions:
Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.
For this reason, another means of interacting with online content is provided in the form of an API:
A Web API is an application programming interface for either a web server or a web browser. It is a web development concept, usually limited to a web application’s client-side (including any web frameworks being used), and thus usually does not include web server or browser implementation details such as SAPIs or APIs unless publicly accessible by a remote web application.
We can connect to an API directly through its endpoints:
Endpoints are important aspects of interacting with server-side web APIs, as they specify where resources lie that can be accessed by third party software. Usually the access is via a URI to which HTTP requests are posted, and from which the response is thus expected.
An example of an open API is the SmartCitizen API:
Data Format
The data is generally available in JSON format. JSON packs data between curly braces {}:
{
"glossary": {
"title": "example glossary",
"GlossDiv": {
"title": "S",
"GlossList": {
"GlossEntry": {
"ID": "SGML",
"SortAs": "SGML",
"GlossTerm": "Standard Generalized Markup Language",
"Acronym": "SGML",
"Abbrev": "ISO 8879:1986",
"GlossDef": {
"para": "A meta-markup language, used to create markup languages such as DocBook.",
"GlossSeeAlso": ["GML", "XML"]
},
"GlossSee": "markup"
}
}
}
}
}
With Python, we can make requests to APIs via the requests library and store the returned data in a Python dictionary.
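As a small illustrative sketch, the standard json module turns JSON text like the glossary above into nested Python dictionaries (the string below is a shortened version of that example):

import json

raw = '{"glossary": {"title": "example glossary"}}'  # shortened version of the JSON above
data = json.loads(raw)                               # JSON text -> Python dict
print(data['glossary']['title'])                     # prints: example glossary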
Planning an application for API requests¶
We'll make an application that gets a word, looks for all the movies in OMDb that contain that word in the title, and makes a GIF animation with the posters of those movies. For example, with the word Laboratory we want a GIF that cycles through the posters of all the 'Laboratory' movies.
To plan for this:
- We need to find the right API (in this case OMDb) and understand how the data is stored
- Request the data
- Explore the received data and store it
- Make use of the data
- Download images and make the gif
Exploring the API data¶
First, in some cases, we will need an API key to access the data. Normally, we would like to store the key in a secret file (in this case a .env file):
import os, re
from os import getcwd
from os.path import join

# read key=value pairs from the .env file and put them in the environment
with open(join(getcwd(), '.env')) as environment:
    for var in environment:
        key = var.split('=')
        os.environ[key[0]] = re.sub('\n', '', key[1])

API_KEY = os.environ['apikey']
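The code above assumes the .env file contains one key=value pair per line, for example (the value shown is a placeholder, not a real key):

apikey=XXXXXXX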
In this example, we'll have a look at the data from the OMDb API.
Basic API Request Structure
The way we request data from an API has the following format:
- Base URL: http://www.omdbapi.com/
- Query: ? + parameter + value. The available parameters can be found in the API documentation. Several parameters can be separated by &. An example: http://www.omdbapi.com/?s=jose&plot=full&apikey=2a31115 (a short sketch of building such a query with requests follows this list).
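A minimal sketch of letting requests build that query string for us (the apikey value is a placeholder, not a real key):

import requests

baseurl = "http://www.omdbapi.com/"
# requests encodes this dict as ?s=jose&plot=full&apikey=XXXXXXX
params = {'s': 'jose', 'plot': 'full', 'apikey': 'XXXXXXX'}
response = requests.get(baseurl, params=params)
print(response.url)          # the full URL that was actually requested
print(response.status_code)  # 200 if everything went well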
Requesting data¶
Using the same library as before, we make the GET request to the API:
import requests, json

title = 'peter'
baseurl = "http://omdbapi.com/?s="  # only submitting the title parameter
API_KEY = 'XXXXXXX'  # placeholder: use your own OMDb API key


def make_request(search):
    response = requests.get(baseurl + search + "&apikey=" + API_KEY)
    movies = {}
    if response.status_code == 200:
        movies = json.loads(response.text)
    else:
        raise ValueError("Bad request")
    return movies


movies = make_request(title)
Exploring the data¶
The data from the API is returned as a Python dictionary.
Python Dicts
Python Dicts have keys and values, mapped together. We can explore the keys of a dict by typing:
print (movies.keys())
dict_keys(['Search', 'totalResults', 'Response'])
And the values by:
print (movies['Search'])
[{'Title': 'Peter Pan', 'Year': '1953', 'imdbID': 'tt0046183', 'Type': 'movie',...]
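For instance, a quick sketch that walks the 'Search' list and prints the fields we will use next (Title and Poster):

for movie in movies['Search']:
    print(movie['Title'], movie['Poster'])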
Making use of it¶
import urllib.request


def get_poster(_title, _link):
    try:
        print('Downloading poster for', _title, '...')
        _file_name = _title + ".jpg"
        urllib.request.urlretrieve(_link, _file_name)  # download the poster image
        return _file_name
    except:
        return ''


# `movie` is one item from movies['Search'], as in the loop sketched above
file_name = get_poster(movie['Title'], movie['Poster'])
The poster link (movie['Poster']) looks something like:
https://m.media-amazon.com/images/M/MV5BMzIwMzUyYTUtMjQ3My00NDc3LWIyZjQtOGUzNDJmNTFlNWUxXkEyXkFqcGdeQXVyMjA0MDQ0Mjc@._V1_SX300.jpg
Make the gif¶
We will use imageio:
import imageio

images = []
for filename in list_movies:
    images.append(imageio.imread(filename))  # read each downloaded poster

imageio.mimsave(join(getcwd(), args.title + '.gif'), images)  # write them all as one GIF
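If the posters flick past too quickly, the GIF writer also accepts a duration argument; note this is an assumption about the imageio version in use (imageio v2 interprets it as seconds per frame, v3 as milliseconds):

# assumption: imageio v2 semantics, roughly half a second per frame
imageio.mimsave(join(getcwd(), args.title + '.gif'), images, duration=0.5)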
Putting it all together!:
#!/Users/macoscar/anaconda2/envs/python3v/bin/python3
import os
from os import getcwd, pardir
from os.path import join, abspath
import requests, json
import urllib.request
import argparse
import re
import imageio

baseurl = "http://omdbapi.com/?s="  # only submitting the title parameter

# load the API key from the .env file into the environment
with open(join(getcwd(), '.env')) as environment:
    for var in environment:
        key = var.split('=')
        os.environ[key[0]] = re.sub('\n', '', key[1])

API_KEY = os.environ['apikey']


def make_request(search):
    url_search = baseurl + search + "&apikey=" + API_KEY  # e.g. http://omdbapi.com/?s=peter&apikey=123456
    response = requests.get(url_search)
    movies = dict()
    if response.status_code == 200:
        movies = json.loads(response.text)
    else:
        raise ValueError("Bad request")
    return movies


def get_poster(_title, _link):
    try:
        print('Downloading poster for', _title, '...')
        _file_name = _title + ".jpg"
        urllib.request.urlretrieve(_link, _file_name)
        return _file_name
    except:
        return ''


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--title", "-t", help="Movie title query")
    args = parser.parse_args()

    movies = make_request(args.title)
    list_movies = list()
    images = []

    if movies:
        for movie in movies['Search']:
            print(movie['Title'])
            print(movie['Poster'])
            file_name = get_poster(movie['Title'], movie['Poster'])
            if file_name != '':
                list_movies.append(file_name)

        for filename in list_movies:
            images.append(imageio.imread(filename))

        imageio.mimsave(join(getcwd(), args.title + '.gif'), images)
The result!
$ ./api_request.py -t "Peter" ...