Web Scraping Data from a Table in a Web Page Using Python


(Graphical view of coronavirus live updates - using Python)


In this article, we are going to extract data from a table on a website (https://www.worldometers.info/coronavirus/), store it as CSV and JSON, and visualize it using D3.js.
What is web scraping?
In simple terms, it is the process of gathering information or data from web pages (HTML sources). The data gathered this way can be used to build datasets or databases for applications such as data analysis, price-comparison tools, and so on.
Prerequisites:-
1.     A basic understanding of Python 3 programming.
2.     Python 3.0 or above installed on your PC (don't forget to add Python to the PATH while installing).
Libraries we are using:-
1.    BeautifulSoup.
2.    Pandas.
3.    Requests.

The following are the steps to proceed with the project.
Step-1:- Creating the virtualenv (same for Windows and Linux).
Creating a virtualenv makes the project self-contained: all the libraries required for this project are installed into it.
#upgrading pip
python -m pip install --upgrade pip

#installing virtualenv
pip install virtualenv

#creating the virtualenv (enter the name of the env without the brackets)
virtualenv [Name of environment]
virtualenv env

Step-2:- Activating the virtualenv and installing the required libraries.
Windows:-
If required, open Windows PowerShell as administrator and allow script execution (needed to activate the env in a PowerShell window) with the command below:
Set-ExecutionPolicy RemoteSigned

Now, to activate the env:-
env\Scripts\activate

Once the env is activated, you will see (env) at the beginning of the prompt.

In Linux:-
source env/bin/activate
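Whenever you are done working in the environment, it can be left with a single command:

deactivate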
Installing the required libraries:-
#installing BeautifulSoup
pip install bs4

#installing pandas
pip install pandas

#installing requests
pip install requests
It is always best practice to freeze the required libraries into requirements.txt:
pip freeze > requirements.txt
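Anyone can then recreate the same environment from that file with a single command:

pip install -r requirements.txt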
Step-3:- Open the web page and navigate to the table you want to collect data from > right-click > click on Inspect.

Now study the HTML structure of the table.
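Before hard-coding any cell positions, it helps to confirm the table's id and column order from code. A minimal check, using the id shown in the Inspect panel for this page (main_table_countries_today):

import requests
from bs4 import BeautifulSoup

#fetching and parsing the page
resp = requests.get("https://www.worldometers.info/coronavirus/")
soup = BeautifulSoup(resp.content, 'html.parser')

#locating the table by its id and printing the header cells, so the
#td indexes used later in the scraper match the real column order
table = soup.find("table", attrs={'id': 'main_table_countries_today'})
print([th.text.strip() for th in table.find_all('th')])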


Step-4:- Now proceed with the program.

#importing the libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

#creating the function
def getPageSource():

    #fetching the URL
    url = requests.get("https://www.worldometers.info/coronavirus/")

    #parsing the HTML page
    soup = BeautifulSoup(url.content, 'html.parser')

    #navigating to the table we want
    table = soup.find("table", attrs={'id': 'main_table_countries_today'})
    rows = table.tbody.findAll('tr')

    #creating empty lists to store the data
    countries = []
    Old_cases = []
    New_cases = []
    Old_Deaths = []
    New_Deaths = []
    Total_recovered = []
    Active_cases = []

    #scraping the data from each row of the table and storing it in the lists
    for row in rows:
        cells = row.findAll('td')
        countries.append(cells[0].text)
        Old_cases.append(cells[1].text)
        New_cases.append(cells[2].text)
        Old_Deaths.append(cells[3].text)
        New_Deaths.append(cells[4].text)
        Total_recovered.append(cells[5].text)
        Active_cases.append(cells[6].text)

    #creating the pandas DataFrame from the lists
    df = pd.DataFrame({'countries': countries, 'Old_cases': Old_cases,
                       'New_cases': New_cases, 'Old_Deaths': Old_Deaths,
                       'New_Deaths': New_Deaths, 'Total_recovered': Total_recovered,
                       'Active_cases': Active_cases})

    #replacing every character except digits with an empty string,
    #then converting each column to numeric (empty cells become NaN)
    for col in ['Old_cases', 'New_cases', 'Old_Deaths', 'New_Deaths',
                'Total_recovered', 'Active_cases']:
        df[col] = df[col].replace(to_replace=r'\D', value='', regex=True)
        df[col] = pd.to_numeric(df[col], errors='coerce')

    #filling NaN/None with zeros
    newdf = df.fillna(0)

    #naming the index column
    newdf.index.name = 'Sl_No'

    #adding two columns of the DataFrame together
    newdf['Total_cases'] = newdf['Old_cases'] + newdf['New_cases']
    newdf['Total_Deaths'] = newdf['Old_Deaths'] + newdf['New_Deaths']

    #creating a new DataFrame with only the required columns
    Data = newdf[['countries', 'Total_cases', 'Total_Deaths',
                  'Total_recovered', 'Active_cases']]

    #saving the data to CSV
    Data.to_csv('../Data.csv')

    #saving the data in JSON format
    Data.to_json('../Data.json', orient='records')

    #opening the D3.js chart template (it contains the placeholder string 'record')
    html_file = open('../Chart.html', 'r')

    #reading its content
    html_content = html_file.read()

    #copying the text to a temporary variable
    replacecont = html_content

    #opening the JSON data file
    json_file = open('../Data.json', 'r')

    #reading the data from the JSON file
    json_content = json_file.read()

    #replacing the placeholder string 'record' in the HTML with the JSON data
    newrec = replacecont.replace('record', json_content)

    #writing the new HTML, now filled with data, so it can be visualized
    update_html = open('../Chart2.html', 'w')
    update_html.write(newrec)

    #emptying the variable and closing the files
    replacecont = " "
    update_html.close()
    json_file.close()
    html_file.close()


getPageSource()
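As an aside, pandas can read HTML tables directly, so the extraction step can be condensed. A minimal sketch, assuming the table keeps the same id and that a parser backend such as lxml is installed (pip install lxml):

import requests
import pandas as pd

#fetching the page with requests, as above, then letting pandas parse it;
#read_html returns a list of DataFrames, filtered here by the table's id
resp = requests.get("https://www.worldometers.info/coronavirus/")
tables = pd.read_html(resp.text, attrs={'id': 'main_table_countries_today'})
print(tables[0].head())

Note that the column names here come straight from the page's headers, so the cleanup steps above would still need to be adapted.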



Output:- the D3.js chart rendered from the scraped data (image).



The generated JSON data (image).


