Introduction to Web Protocols and the HTTP Request and Response Cycle

Web Protocols


So, here's the thing: to understand web scraping, we must first understand the web.

How does everything work so smoothly on the web? I mean, we have gigabytes upon gigabytes of data flowing all around us, and still the web manages to handle it. Isn't that amazing?

One important thing to note is that even the web works on the principle of modularization.
Now, what is modularization?

It is nothing but the idea of dividing work into categories so that related tasks fall under one roof, and only one particular block of code, method, or function is responsible for managing them.

So how does the web do modularization?

To look deeper, we have to think about things at their root level.

Remember how, in earlier days, people would take a thread or wire, connect it to the ends of two cups, and then one person would speak while the other listened, and vice versa?

This communication looks something like this:

[Image: tin-can telephone communication]
Yeah, from this to now, we have come a long way!

But the roots are the same. Communication over the internet started from this very idea. The only difference is that the cans are now replaced by our computers, which we call clients or hosts, and the wire is now the network medium we study as LAN, MAN, and WAN.

Everything is organized now. Communication is divided into levels, and every module or level has its own protocols to follow, which makes the whole job easier.

Look at the diagram below. This is the simple web architecture, right?

[Figure: block diagram of the TCP/IP layers and their protocols]

The data flows from one computer down through the different protocol layers, hops through routers, and travels back up the layers on the destination computer. And that's it, communication is over! It can all happen within a second, depending on the speed of your network.

Every function you perform on the web has to follow a particular protocol. We may not be aware of them, but these architectures handle them for us.

Want to send a file over the internet? Follow the FTP protocol. Want to send and receive emails? Follow the SMTP and POP protocols.

And the main protocol we are studying all this for is HTTP, which is used to surf the internet and retrieve webpages.

We will concentrate mainly on the HTTP protocol for web scraping. Now, let's talk about one more thing we are going to use: port numbers.

Port Numbers

Think of a port as a telephone or mobile number. If someone wants to call you, they have to call a specific mobile number, right? You will be listening to them on that phone number only.

Similarly, the applications and services on the internet each have their own "mobile number", the port number they are listening on.

Now, what are the services? They are nothing but servers, always alert and listening for your requests. An FTP server, for example, stays continuously active and listens for any file transfer request that comes in.

To reach that server, we need the port number on which that server is listening.

So, we know that websites are managed by web servers, right? And what is the common port that web servers listen on? It's 80 for HTTP (and 443 for HTTPS).

Below are the port numbers for different services you will encounter in your web development career.

Telnet:   23        Login
SSH:      22        Secure login
HTTP:     80        Web services
HTTPS:    443       Secure HTTP
SMTP:     25        Mail sending
POP:      109/110   Mail retrieving (POP2/POP3)
FTP:      21        File transfer
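
If you would rather keep these numbers in code than in your head, here is a minimal sketch in Python. The dictionary name WELL_KNOWN_PORTS and the helper port_for are my own illustrative names, not part of any library, and the values are simply the ones from the table.

```python
# Port numbers copied from the table above, keyed by service name.
# The keys "pop2" and "pop3" are illustrative labels for the two POP ports.
WELL_KNOWN_PORTS = {
    "ftp": 21,
    "ssh": 22,
    "telnet": 23,
    "smtp": 25,
    "http": 80,
    "pop2": 109,
    "pop3": 110,
    "https": 443,
}


def port_for(service):
    """Return the well-known port for a service name (case-insensitive)."""
    return WELL_KNOWN_PORTS[service.lower()]


print(port_for("HTTP"))   # 80
print(port_for("https"))  # 443
```

A plain dictionary is enough here because the list is short and fixed. An unknown service name raises a KeyError, which is a reasonable way to find out that a name is missing.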

Now, the main protocol on which we focus is the HTTP protocol.

HTTP Protocol

The HTTP protocol is used to surf the web: to request and retrieve information, webpages, documents, and images from the web server behind a website.

HTTP stands for HyperText Transfer Protocol. It has certain rules that users have to follow, and one of the most important conventions we rely on each day is the URL, or Uniform Resource Locator.

In earlier days, there was a specific way of writing out a request for a document from the web, with every part spelled out. For example:

http://www.creationcodes.org:80/document.html

The above URL says: follow the HTTP protocol, go to the web server serving the website creationcodes.org, and get me the file document.html. And remember that the web server is listening on port number 80.

But now, browsers fill in the defaults for us, which allows us to just enter the URL and boom, the work is over.

Today, when we type the URL, the browser will automatically do a request/response cycle and get you the requested page.
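
To see the pieces of a URL for yourself, here is a small sketch using Python's standard urllib.parse module. It assumes the port is written in the host:port form, and the helper name describe_url and the DEFAULT_PORTS table are my own illustrative choices. When the URL carries no port, the sketch falls back to the usual default for the scheme, which is roughly what a browser does for you.

```python
from urllib.parse import urlsplit

# Fallback ports used when the URL does not spell one out.
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_url(url):
    """Split a URL into the protocol, host, port, and path it names."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": port,
        "path": parts.path or "/",
    }


# Everything spelled out, including the port.
print(describe_url("http://www.creationcodes.org:80/document.html"))

# No port given, so the default for the scheme is filled in.
print(describe_url("https://www.creationcodes.org/document.html"))
```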

Socket Connection

Look at the layers of protocols we discussed above. To make any request to a web server, we have to open a socket connection.

A socket connection is like a hotline, a direct connection that sits between the Application and Transport layers.

By establishing a socket connection, we connect our computer to the port number the web server is listening on.

So, a socket connection is a connection between your port and the server port.
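
Here is a minimal sketch of that idea using Python's socket module. To keep it runnable without touching the internet, it assumes a stand-in "server" on your own machine (127.0.0.1) instead of a real web server, and it lets the operating system pick a free port rather than using 80.

```python
import socket

# A stand-in "server" on this machine. Port 0 asks the OS for any free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
server.settimeout(5)
server_port = server.getsockname()[1]

# The client opens a socket connection to the port the server listens on.
client = socket.create_connection(("127.0.0.1", server_port), timeout=5)

# The server picks up the call and learns the caller's address and port.
connection, client_address = server.accept()

client_port = client.getsockname()[1]
peer_port = client.getpeername()[1]
print(f"your port {client_port} <-> server port {peer_port}")

# Hang up on both ends.
connection.close()
client.close()
server.close()
```

Notice that the client also gets a port of its own, chosen by the operating system. That is the "your port" half of the connection.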

HTTP Request-Response Cycle

So, when we type any URL, say www.creationcodes.org/document.html, there's a lot of work going on behind the scenes in the browser.

The browser first opens a socket connection between you and the web server at port 80.
A socket connection is similar to the connection we make when calling someone.

Then, once the connection is established, it sends a request using the GET keyword, which looks something like this:

GET http://www.creationcodes.org/demohtm.html HTTP/1.0

If the web server accepts your request, it sends a response back to you with all the details, such as the response header and the requested document or webpage.

If you want to see what a response header looks like, read the post where I demonstrated how we can send a manual GET request.
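
To make the cycle concrete, here is a rough sketch in Python of what gets sent and how a response can be taken apart. The function names build_get_request, parse_response, and fetch are my own, and the sample response is made up for illustration; it is not real output from creationcodes.org. The request uses the common short form with a path plus a Host header.

```python
import socket


def build_get_request(host, path):
    """Build the raw bytes of a minimal HTTP/1.0 GET request."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "",  # blank line marks the end of the request headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw):
    """Split a raw HTTP response into (status code, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    # The status line looks like: HTTP/1.0 200 OK
    status_code = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return status_code, headers, body


def fetch(host, path, port=80):
    """Open a socket, send the GET request, and read until the server closes."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return parse_response(b"".join(chunks))


# A made-up response, so the parsing step can be tried without a network.
sample = b"HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
status, headers, body = parse_response(sample)
print(status)
print(headers)
print(body)
```

With a network connection you could call fetch("www.creationcodes.org", "/document.html") to run the whole cycle against a real server, though what comes back depends entirely on that server.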

What if I tell you that you can also read the response header using your browser? The browser normally shows us only the things relevant to us. But if you want to know what is really going on, open the developer tools provided by your browser.

The hotkey combination to open them is usually Ctrl+Shift+I.

There you can view every request your browser makes, in full detail. Go to every section and explore what it has to tell you.

The Console section shows you details about the JavaScript code that is running, and the Network section shows you the response headers and other data you may want.

So, now you are familiar with the HTTP protocol and how we communicate with the web server to request a particular webpage.
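
You can peek at the same response headers from code too. Here is a small sketch using Python's standard http.client module. So that it runs anywhere, it assumes a tiny stand-in web server on your own machine (the DemoHandler class is my own invention for this demo) rather than a real website.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Stand-in server that answers every GET with a tiny page."""

    def do_GET(self):
        page = b"<html><body>Hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(page)))
        self.end_headers()
        self.wfile.write(page)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# One full request/response cycle against the stand-in server.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
conn.request("GET", "/document.html")
response = conn.getresponse()
status = response.status
headers = dict(response.getheaders())
body = response.read()
conn.close()

server.shutdown()
server.server_close()

print(status)
for name, value in headers.items():
    print(f"{name}: {value}")
```

The printed lines are the same kind of name and value pairs you see in the Network section of the developer tools.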

I hope you liked the post.
Do follow and subscribe for frequent updates.