Web Scraping Data from a Table in a Web Page Using Python


(Graphical view of coronavirus live updates - using Python)


In this article, we are going to extract data from a table on a website (https://www.worldometers.info/coronavirus/), store it as CSV and JSON, and visualize it using D3.js.
What is web scraping?
In simple terms, it is the process of gathering information or data from web pages (HTML sources). The data gathered this way can be used to build datasets or databases for applications such as data analysis, price-comparison tools, and so on.
Prerequisites:-
1.     A basic understanding of Python 3 programming.
2.     Python 3.0 or above installed on your PC (don't forget to add Python to the PATH while installing).
Libraries we are using:-
1.    BeautifulSoup.
2.    Pandas.
3.    Requests.

The following are the steps to proceed with the project.
Step-1:- Creating the virtualenv (same for Windows and Linux).
Creating a virtualenv makes the project self-contained: all the libraries required for this project are installed into it.
#upgrading pip
python -m pip install --upgrade pip

#installing virtualenv
pip install virtualenv

#creating the virtualenv (enter the name of the env without the brackets)
virtualenv [Name of environment]
virtualenv env

Step-2:- Activating the virtualenv and installing the required libraries.
Windows:-
If required, open Windows PowerShell as administrator and allow script execution (needed to activate the env in a PowerShell window) with the command below:
Set-ExecutionPolicy RemoteSigned

Now, to activate the env:-
env\Scripts\activate

Once the env is activated, you will see (env) at the beginning of the prompt.

In Linux:-
source env/bin/activate
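Whenever you are done working in the environment, it can be left with a single command:

deactivate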
Installing the required libraries:-
#installing BeautifulSoup
pip install bs4

#installing pandas
pip install pandas

#installing requests
pip install requests
It is always best practice to freeze the required libraries into requirements.txt:
pip freeze > requirements.txt
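Anyone can then recreate the same environment from that file with a single command:

pip install -r requirements.txt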
Step-3:- Open the web page and navigate to the table you want to collect data from > right-click > click on Inspect.

Now study the HTML structure of the table.
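Before hard-coding any cell positions, it helps to confirm the table's id and column order from code. A minimal check, using the id shown in the Inspect panel for this page (main_table_countries_today):

import requests
from bs4 import BeautifulSoup

#fetching and parsing the page
resp = requests.get("https://www.worldometers.info/coronavirus/")
soup = BeautifulSoup(resp.content, 'html.parser')

#locating the table by its id and printing the header cells, so the
#td indexes used later in the scraper match the real column order
table = soup.find("table", attrs={'id': 'main_table_countries_today'})
print([th.text.strip() for th in table.find_all('th')])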


Step-4:- Now proceed with the program.

#importing the libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

#creating the function
def getPageSource():

    #fetching the URL
    url = requests.get("https://www.worldometers.info/coronavirus/")

    #parsing the HTML page
    soup = BeautifulSoup(url.content, 'html.parser')

    #navigating to the table we want
    table = soup.find("table", attrs={'id': 'main_table_countries_today'})
    rows = table.tbody.findAll('tr')

    #creating empty lists to store the data
    countries = []
    Old_cases = []
    New_cases = []
    Old_Deaths = []
    New_Deaths = []
    Total_recovered = []
    Active_cases = []

    #scraping the data from each row of the table and storing it in the lists
    for row in rows:
        cells = row.findAll('td')
        countries.append(cells[0].text)
        Old_cases.append(cells[1].text)
        New_cases.append(cells[2].text)
        Old_Deaths.append(cells[3].text)
        New_Deaths.append(cells[4].text)
        Total_recovered.append(cells[5].text)
        Active_cases.append(cells[6].text)

    #creating the pandas DataFrame from the lists
    df = pd.DataFrame({'countries': countries, 'Old_cases': Old_cases,
                       'New_cases': New_cases, 'Old_Deaths': Old_Deaths,
                       'New_Deaths': New_Deaths, 'Total_recovered': Total_recovered,
                       'Active_cases': Active_cases})

    #replacing every character except digits with an empty string,
    #then converting each column to numeric (empty cells become NaN)
    for col in ['Old_cases', 'New_cases', 'Old_Deaths', 'New_Deaths',
                'Total_recovered', 'Active_cases']:
        df[col] = df[col].replace(to_replace=r'\D', value='', regex=True)
        df[col] = pd.to_numeric(df[col], errors='coerce')

    #filling NaN/None with zeros
    newdf = df.fillna(0)

    #naming the index column
    newdf.index.name = 'Sl_No'

    #adding two columns of the DataFrame together
    newdf['Total_cases'] = newdf['Old_cases'] + newdf['New_cases']
    newdf['Total_Deaths'] = newdf['Old_Deaths'] + newdf['New_Deaths']

    #creating a new DataFrame with only the required columns
    Data = newdf[['countries', 'Total_cases', 'Total_Deaths',
                  'Total_recovered', 'Active_cases']]

    #saving the data to CSV
    Data.to_csv('../Data.csv')

    #saving the data in JSON format
    Data.to_json('../Data.json', orient='records')

    #opening the D3.js chart template (it contains the placeholder string 'record')
    html_file = open('../Chart.html', 'r')

    #reading its content
    html_content = html_file.read()

    #copying the text to a temporary variable
    replacecont = html_content

    #opening the JSON data file
    json_file = open('../Data.json', 'r')

    #reading the data from the JSON file
    json_content = json_file.read()

    #replacing the placeholder string 'record' in the HTML with the JSON data
    newrec = replacecont.replace('record', json_content)

    #writing the new HTML, now filled with data, so it can be visualized
    update_html = open('../Chart2.html', 'w')
    update_html.write(newrec)

    #emptying the variable and closing the files
    replacecont = " "
    update_html.close()
    json_file.close()
    html_file.close()


getPageSource()
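As an aside, pandas can read HTML tables directly, so the extraction step can be condensed. A minimal sketch, assuming the table keeps the same id and that a parser backend such as lxml is installed (pip install lxml):

import requests
import pandas as pd

#fetching the page with requests, as above, then letting pandas parse it;
#read_html returns a list of DataFrames, filtered here by the table's id
resp = requests.get("https://www.worldometers.info/coronavirus/")
tables = pd.read_html(resp.text, attrs={'id': 'main_table_countries_today'})
print(tables[0].head())

Note that the column names here come straight from the page's headers, so the cleanup steps above would still need to be adapted.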



Output:- the D3.js chart rendered from the scraped data (image).



The generated JSON data (image).


