Web Scraping Data from a Table in a Web Page Using Python
(Graphical view of coronavirus live updates - Using Python)
In this article, we are going to extract data from the table on a website (https://www.worldometers.info/coronavirus/), store it in CSV and JSON formats, and visualize it using D3.js.
What is web scraping?
In simple terms, it is the process of gathering information or data from web pages (HTML sources). The data gathered this way can be used to build datasets or databases for applications such as data analysis, price-comparison tools, and more.
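As a minimal sketch of the idea, the following fetches a page and prints its title using requests and BeautifulSoup (the coronavirus page is used here only as an example; any URL would do):

import requests
from bs4 import BeautifulSoup

#fetch the page and parse its html
response = requests.get("https://www.worldometers.info/coronavirus/")
soup = BeautifulSoup(response.content, "html.parser")

#print the page title to confirm we received real html
print(soup.title.text)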
Prerequisites:-
1. Basic understanding of Python 3 programming.
2. Python 3.0 or above installed on your PC (don't forget to add Python to PATH while installing).
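You can confirm that Python is installed and on PATH by running:

python --version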
Libraries we are using:-
1. BeautifulSoup.
2. Pandas.
3. Requests.
The following are the steps to proceed with the project.
Step 1:- Creating the virtualenv (same for Windows and Linux).
Creating a virtualenv keeps our project self-contained: all the libraries required for this project are installed into it rather than system-wide.
#Upgrading pip
python -m pip install --upgrade pip

#Installing virtualenv
pip install virtualenv

#Creating the virtualenv (replace "env" with any name you like)
virtualenv env
Step 2:- Activating the virtualenv and installing the required libraries.
Windows:-
If required, open Windows PowerShell as administrator and allow the activation script to run with the command below.

Set-ExecutionPolicy RemoteSigned

Now activate the env:-

env\Scripts\activate

If the env is activated, you will see (env) at the beginning of the next prompt.
Linux:-

source env/bin/activate
Installing Required Libraries:-

#Installing BeautifulSoup
pip install bs4

#Installing pandas
pip install pandas

#Installing requests
pip install requests
It is good practice to freeze the installed libraries into a requirements.txt file:

pip freeze > requirements.txt
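If you (or anyone cloning the project) need to recreate the environment later, the same file reinstalls everything in one step:

pip install -r requirements.txt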
Step 3:- Open the web page, navigate to the table you want to collect data from, right-click on it, and click Inspect.
Study the HTML structure of the table now: here the table carries the id main_table_countries_today, and each row (tr) holds one country's figures in its td cells.
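A quick way to confirm you are targeting the right table is to print the cells of its first row. This small sketch uses the same table id that the full program below relies on:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.worldometers.info/coronavirus/")
soup = BeautifulSoup(response.content, "html.parser")

#locate the table by its id and print the first row's cell texts
table = soup.find("table", attrs={"id": "main_table_countries_today"})
first_row = table.tbody.find("tr")
print([td.text.strip() for td in first_row.find_all("td")])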
Step 4:- Now proceed with the program.
D3.js Chart template:-
Save this into an .html file; the Python program below expects to find it at ../Chart.html relative to the script.
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <title>D3: A simple packed Bubble Chart</title>
    <script type="text/javascript" src="https://d3js.org/d3.v4.min.js"></script>

    <style type="text/css">
        /* No style rules here yet */
    </style>
</head>
<body>
<script type="text/javascript">
    // "record" is a placeholder: the Python program replaces it
    // with the scraped JSON data to produce Chart2.html
    dataset = {
        "children": record
    };

    var diameter = 600;
    var color = d3.scaleOrdinal(d3.schemeCategory20);

    // Pack layout sizes the bubbles to fit the square
    var bubble = d3.pack()
        .size([diameter, diameter])
        .padding(1.5);

    var svg = d3.select("body")
        .append("svg")
        .attr("width", diameter)
        .attr("height", diameter)
        .attr("class", "bubble");

    // Bubble area is proportional to the total case count
    var nodes = d3.hierarchy(dataset)
        .sum(function(d) { return d.Total_cases; });

    var node = svg.selectAll(".node")
        .data(bubble(nodes).descendants())
        .enter()
        .filter(function(d) {
            return !d.children;  // keep only leaf nodes (countries)
        })
        .append("g")
        .attr("class", "node")
        .attr("transform", function(d) {
            return "translate(" + d.x + "," + d.y + ")";
        });

    // Tooltip: the hierarchy wraps each datum, so read from d.data
    node.append("title")
        .text(function(d) {
            return d.data.countries + ": " + d.data.Total_cases;
        });

    node.append("circle")
        .attr("r", function(d) {
            return d.r;
        })
        .style("fill", function(d, i) {
            return color(i);
        });

    // Country name, truncated so it fits inside the bubble
    node.append("text")
        .attr("dy", ".2em")
        .style("text-anchor", "middle")
        .text(function(d) {
            return d.data.countries.substring(0, d.r / 3);
        })
        .attr("font-family", "sans-serif")
        .attr("font-size", function(d) {
            return d.r / 5;
        })
        .attr("fill", "white");

    // Total case count below the name
    node.append("text")
        .attr("dy", "1.3em")
        .style("text-anchor", "middle")
        .text(function(d) {
            return d.data.Total_cases;
        })
        .attr("font-family", "Gill Sans, Gill Sans MT, sans-serif")
        .attr("font-size", function(d) {
            return d.r / 5;
        })
        .attr("fill", "white");

    d3.select(self.frameElement)
        .style("height", diameter + "px");
</script>
</body>
</html>
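Once the Python program in the next section has produced Chart2.html, you can preview it by opening the file in a browser. If your browser restricts local files, one simple option is Python's built-in web server, run from the directory that holds Chart2.html:

python -m http.server 8000

Then open http://localhost:8000/Chart2.html.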
Python Programming:-
#importing libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

#creating function
def getPageSource():

    #fetch the web page
    page = requests.get("https://www.worldometers.info/coronavirus/")

    #parse the html (an explicit parser avoids a bs4 warning)
    soup = BeautifulSoup(page.content, 'html.parser')

    #navigating to the table we want
    table = soup.find("table", attrs={'id': 'main_table_countries_today'})
    rows = table.tbody.findAll('tr')

    #creating empty lists to store data
    countries = []
    Old_cases = []
    New_cases = []
    Old_Deaths = []
    New_Deaths = []
    Total_recovered = []
    Active_cases = []

    #scraping the data from each row of the table and storing it in the lists
    for row in rows:
        cells = row.findAll('td')
        countries.append(cells[0].text)
        Old_cases.append(cells[1].text)
        New_cases.append(cells[2].text)
        Old_Deaths.append(cells[3].text)
        New_Deaths.append(cells[4].text)
        Total_recovered.append(cells[5].text)
        Active_cases.append(cells[6].text)

    #creating the pandas DataFrame from the lists
    df = pd.DataFrame({'countries': countries,
                       'Old_cases': Old_cases,
                       'New_cases': New_cases,
                       'Old_Deaths': Old_Deaths,
                       'New_Deaths': New_Deaths,
                       'Total_recovered': Total_recovered,
                       'Active_cases': Active_cases})

    #stripping every non-digit character (commas, plus signs, spaces),
    #then converting each column to numeric;
    #errors='coerce' turns blank cells into NaN instead of raising
    numeric_columns = ['Old_cases', 'New_cases', 'Old_Deaths',
                       'New_Deaths', 'Total_recovered', 'Active_cases']
    for col in numeric_columns:
        df[col] = df[col].replace(to_replace=r'\D', value='', regex=True)
        df[col] = pd.to_numeric(df[col], errors='coerce')

    #filling NaN/None with zeros
    newdf = df.fillna(0)

    #naming the index column
    newdf.index.name = 'Sl_No'

    #adding two columns of the dataframe to get the overall totals
    newdf['Total_cases'] = newdf['Old_cases'] + newdf['New_cases']
    newdf['Total_Deaths'] = newdf['Old_Deaths'] + newdf['New_Deaths']

    #creating a new dataframe with only the required columns
    Data = newdf[['countries', 'Total_cases', 'Total_Deaths',
                  'Total_recovered', 'Active_cases']]

    #saving data to csv
    Data.to_csv('../Data.csv')

    #saving data in json format (one object per row)
    Data.to_json('../Data.json', orient='records')

    #reading the D3.js chart template
    with open('../Chart.html', 'r') as html_file:
        html_content = html_file.read()

    #reading the data back from the json file
    with open('../Data.json', 'r') as json_file:
        json_content = json_file.read()

    #replacing the placeholder string "record" in the template html
    #with the actual json data
    newrec = html_content.replace('record', json_content)

    #writing the new html, ready to visualize
    with open('../Chart2.html', 'w') as update_html:
        update_html.write(newrec)

getPageSource()
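Run the script (for example, python scraper.py, assuming that is the name you saved it under) and it writes Data.csv, Data.json, and Chart2.html one directory above the script. A quick sanity check is to load the CSV back with pandas:

import pandas as pd

#load the saved csv and show the first few countries
df = pd.read_csv('../Data.csv')
print(df.head())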
Output:- [D3.js packed bubble chart rendered by Chart2.html] and [the scraped JSON data in Data.json]
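With orient='records', each entry in Data.json is one object per table row. The field names come from the DataFrame above; the values below are purely illustrative:

[
  {"countries": "ExampleLand", "Total_cases": 1000, "Total_Deaths": 50,
   "Total_recovered": 800, "Active_cases": 150}
]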
Github link:-
https://github.com/saicharankr/WebScrap