Web Scraping
We talk about data all the time, right? We say there is so much of it that the world now runs on bits and bytes, with gigabytes of new data produced every day.
So where is all this data coming from? The Internet is one of the most powerful technologies we have. It not only shrinks communication distances, it is also a huge provider of knowledge and wisdom or, in technical terms, information. And information is simply data that has been organized.
There are gigabytes upon gigabytes of data available on the Internet today, but how do we extract it?
We have mining tools to extract gems and diamonds from Mother Earth. But how do we mine data from such a source of wisdom and information as the Internet?
This is where web scraping comes in. We scrape the website bit by bit until we have the data we want. That's it. That is the whole idea of web scraping.
Web scraping is a technique for extracting the data you want from a website's pages.
Let me show you how cool this is. Just a single request and boom! You've got the data.
You have heard of telnet, right? It is used to log in to a remote computer. But what if I told you that you can also use telnet to get data from a website? Scary, huh?
We will get to the technical terms shortly, but first see how the simple telnet tool can fetch a page from a website.
Type this in a Linux terminal:
telnet 127.0.0.1 80
Output:
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
GET http://127.0.0.1/demoHtm.html HTTP/1.0
HTTP/1.1 200 OK
Date: Fri, 07 Aug 2020 11:13:25 GMT
Server: Apache/2.4.41 (Ubuntu)
Last-Modified: Fri, 07 Aug 2020 11:02:39 GMT
ETag: "ff-5ac478b0d89fb"
Accept-Ranges: bytes
Content-Length: 255
Vary: Accept-Encoding
Connection: close
Content-Type: text/html
<html>
<head>
<title>
Welcome To The Local-Host Website
</title>
</head>
<body>
<h1>Heading 1, whats up folks looking at my website huh?</h1>
<p>Web scrapping is amazing, telnet is just an example to show how response cycle works
</p>
</body>
</html>
Connection closed by foreign host.
See? With a single telnet request, we got the source code of the requested webpage. This exchange, where the client sends a request and the server sends back a response, is known as the HTTP response cycle.
HTTP Response Cycle
In the telnet demo above, notice that I wrote telnet 127.0.0.1 80. What does that mean?
127.0.0.1 is the IP address of localhost, i.e. your own computer. In practice, you would write the domain name or IP address of the website you want to scrape.
For this tutorial, I am scraping my own webpage, hosted on my local machine by an Apache server. That is why I used the localhost IP address.
Now, let's talk about the 80. It is the default port on which web servers listen for HTTP requests.
So, in a real-life experiment, if you wanted a webpage or a document from the Amazon website, you would write:
telnet www.amazon.in 80
Now, what is this command doing?
Simple. It opens a connection between you and the web server of the intended website.
This is known as opening a socket connection.
It's just like a tunnel connecting your machine to the web server.
Once the connection is established, you will get a message like this:
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
Here you have to enter the request for the webpage you want the web server to serve.
The GET keyword does that: it tells the server that we want to retrieve a page.
Type:
GET http://127.0.0.1/demoHtm.html HTTP/1.0
Here, demoHtm.html is a webpage I created for the tutorial, so this command asks the server to serve the demoHtm.html file to me. After typing it, press Enter twice: the blank line tells the server that the request is complete.
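The same two steps, opening the socket and sending the GET line, can be sketched in Python with the standard socket module. This is a minimal sketch and not the only way to do it: the function names build_get_request and fetch_raw are made up for illustration, and the request uses the short path form (GET /demoHtm.html) plus a Host header instead of the full URL typed into telnet above.

```python
import socket


def build_get_request(host, path):
    """Build a minimal HTTP/1.0 GET request as raw bytes (hypothetical helper)."""
    lines = [
        f"GET {path} HTTP/1.0",  # request line: method, path, protocol version
        f"Host: {host}",         # tells the server which site we are asking for
        "",                      # the blank line ends the request,
        "",                      # just like pressing Enter twice in telnet
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_raw(host, path, port=80, timeout=10.0):
    """Open a TCP socket, send the request, and read until the server closes."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # an empty read means the server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)


# Usage, assuming the local Apache server from this post is running:
# print(fetch_raw("127.0.0.1", "/demoHtm.html").decode("utf-8", errors="replace"))
```

Because the request says HTTP/1.0, the server closes the connection once it has sent the page, which is why reading until an empty chunk is enough here.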
If the server accepts your request, you will get back the full response, starting with the response header, like this:
Response Header:
HTTP/1.1 200 OK
Date: Fri, 07 Aug 2020 11:13:25 GMT
Server: Apache/2.4.41 (Ubuntu)
Last-Modified: Fri, 07 Aug 2020 11:02:39 GMT
ETag: "ff-5ac478b0d89fb"
Accept-Ranges: bytes
Content-Length: 255
Vary: Accept-Encoding
Connection: close
Content-Type: text/html
Webpage:
<html>
<head>
<title>
Welcome To The Local-Host Website
</title>
</head>
<body>
<h1>Heading 1, whats up folks looking at my website huh?</h1>
<p>Web scrapping is amazing, telnet is just an example to show how response cycle works
</p>
</body>
</html>
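The response above has two parts, the header and the webpage, and on the wire they are separated by a single blank line. Here is a small sketch of how you could split them apart in Python. The function name split_response and the shortened sample response are only for illustration.

```python
def split_response(raw):
    """Split a raw HTTP response into (status_line, headers, body)."""
    # The first blank line separates the header block from the page itself.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]  # e.g. "HTTP/1.1 200 OK"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# A shortened stand-in for the response shown above.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: Apache/2.4.41 (Ubuntu)\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body><h1>Hello</h1></body></html>"
)

status_line, headers, body = split_response(sample)
print(status_line)
print(headers["content-type"])
print(body.decode("utf-8"))
```

Header names are lowercased in the dictionary because HTTP header names are case-insensitive.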
We will talk about this response cycle in detail in later posts, because it requires some discussion of concepts such as the 7 layers of the OSI networking model.
So, what now, after getting the source code of the webpage you want?
Now it is up to you how you extract the data: use a regular expression to pull out the anchor tags, or apply whatever logic you want.
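As a quick illustration of the regular-expression route, here is a rough Python sketch that pulls the href values out of anchor tags. The pattern and the name extract_links are my own choices for this example, and the pattern only handles quoted href attributes, so treat it as a starting point.

```python
import re

# Matches <a ... href="..."> or <a ... href='...'>, in any letter case.
HREF_PATTERN = re.compile(r'<a\s[^>]*?href=["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(html):
    """Return the href value of every anchor tag found in the HTML string."""
    return HREF_PATTERN.findall(html)


page = (
    '<p>See <a href="/about.html">About</a> and '
    "<A class=\"ext\" HREF='https://example.com/'>Example</A>.</p>"
)
print(extract_links(page))  # ['/about.html', 'https://example.com/']
```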
We will use Python for all further discussion of web scraping. There is a module called BeautifulSoup which can make our work easy.
We will also discuss basic modules such as socket and urllib.
urllib in particular does the telnet task above in a single function call, and you can then save the source of the webpage to a file.
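To give a taste of that, here is a minimal sketch using urllib.request.urlopen from the standard library. The helper names fetch_page and save_page are hypothetical, and the fallback to UTF-8 when the server does not declare a charset is an assumption of this sketch.

```python
from urllib.request import urlopen


def fetch_page(url, timeout=10.0):
    """Fetch a URL and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset the server declares, or fall back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def save_page(url, filename):
    """Fetch a URL and write its source to a local file."""
    with open(filename, "w", encoding="utf-8") as handle:
        handle.write(fetch_page(url))


# Usage, assuming the local Apache server from this post is running:
# save_page("http://127.0.0.1/demoHtm.html", "demoHtm.html")
```

urlopen takes care of the socket, the GET line, and the response header, so only the webpage itself comes back from read().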
What happens next is up to you. Suppose you want to extract the data stored in a table: you could use a regular expression to list all the <td> and <tr> tags and then scrape out the data.
But real-world HTML is often sloppy. Even if you forget to close a tag, the browser will still render the page.
So a regular expression will not work in some cases.
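Here is a small Python example of that failure. The pattern is a deliberately naive one that I wrote for this demonstration: it works on tidy HTML, but returns nothing once the closing </td> tags are missing, even though a browser would render both rows just fine.

```python
import re

# Naive pattern: expects every cell to look exactly like <td>...</td>.
CELL_PATTERN = re.compile(r"<td>(.*?)</td>", re.DOTALL)


def extract_cells(html):
    """Return the text of every <td> cell the naive pattern can see."""
    return CELL_PATTERN.findall(html)


tidy = "<tr><td>Apple</td><td>10</td></tr>"
sloppy = "<tr><td class='name'>Apple<td>10</tr>"  # attribute added, </td> never written

print(extract_cells(tidy))    # ['Apple', '10']
print(extract_cells(sloppy))  # []
```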
For such misbehaving HTML, we have an amazing tool: BeautifulSoup.
BeautifulSoup makes the task so simple that you just have to name the tag you want; the remaining logic headache is the module's responsibility.
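A first taste of what that looks like, as a rough sketch. BeautifulSoup is a third-party package (installed with pip install beautifulsoup4), and the sample page here is made up: its <h1> is never closed and the closing </html> is missing, yet naming the tags is still all it takes.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Sloppy on purpose: <h1> is never closed and </html> is missing.
page = (
    "<html><body><h1>Prices"
    "<table>"
    "<tr><td>Apple</td><td>10</td></tr>"
    "<tr><td>Banana</td><td>20</td></tr>"
    "</table></body>"
)

# "html.parser" is the parser that ships with Python, so nothing else is needed.
soup = BeautifulSoup(page, "html.parser")

# Name the tags you want: every <tr>, and inside it every <td>.
rows = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in soup.find_all("tr")
]
print(rows)
```

Each inner list is one table row, so rows is ready to be written to a CSV file or processed further.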
This post is just an introduction to what web scraping is and how we can use the concept in general.
I hope you find it helpful.
Do share the post if you like it.
Do follow and support the blog.