Web Scrapers (webscrappers)
A set of functions that use the BeautifulSoup module to scrape data from various websites.
Attention
Source Code: GH/sqlutils
In a fast-developing world, data is crucial! The webscrappers module contains a set of functions for easily retrieving data into Python.
It uses BeautifulSoup at its core, along with modules like requests for retrieving data.
git clone https://gist.github.com/ZenithClown/809642277fba2d8d2309e55ab307615f.git webscrappers
export PYTHONPATH="${PYTHONPATH}:webscrappers"
Basics of Web Scraping
“Web scraping is the process of using bots to extract content and data from a website.” Given an HTML page, a web scraper
extracts information from HTML tags or elements into a desired format. In Python,
Beautiful Soup is a popular package for parsing HTML and XML documents. Some good tutorials on bs4 that I personally followed:
Beautiful Soup: Build a Web Scraper With Python - RealPython,
A Practical Introduction to Web Scraping in Python - RealPython,
Web Scraping with Python - Beautiful Soup Crash Course - freeCodeCamp.org
In addition, an introduction to Google Chrome DevTools or Microsoft Edge DevTools may be helpful for inspecting the HTML of a page.
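To make the idea concrete, here is a minimal sketch of turning an HTML table into Python records with Beautiful Soup. The HTML snippet, the `ipo` class name and the `table_to_records` helper are invented for illustration and are not part of webscrappers; the HTML is an inline string, so no network request is made.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
HTML = """
<html><body>
  <table class="ipo">
    <tr><th>Company</th><th>Price</th></tr>
    <tr><td>Alpha Ltd</td><td>100</td></tr>
    <tr><td>Beta Ltd</td><td>250</td></tr>
  </table>
</body></html>
"""


def table_to_records(html, class_name):
    """Return the rows of the first matching table as a list of dicts."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": class_name})
    if table is None:
        return []

    rows = table.find_all("tr")
    if not rows:
        return []

    # The first row is assumed to hold the header cells (<th>).
    header = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        values = [cell.get_text(strip=True) for cell in row.find_all("td")]
        records.append(dict(zip(header, values)))
    return records


records = table_to_records(HTML, "ipo")
print(records)
```

All cell values come back as strings; converting them to numbers or dates is a separate cleaning step.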
Web Scraping Bots
- tablescraper.chittorgarh(weburl: str, **kwargs) -> DataFrame
Scrape Web Table(s) from https://www.chittorgarh.com/ Website
The website is a gold mine of information about IPOs that are to be listed on the NSE and/or BSE. The function scrapes tables from the page and returns them as a list of DataFrames.
Keyword Argument(s)
The function accepts all the keyword arguments accepted by the readtable() function. In addition, the following are specific to formatting data from the website:
dropcols (list): A list of original column names to be dropped from the parsed dataframe object. This is the first step in data cleaning, so each name must match the one on the website exactly and is case sensitive. By default, "Listing at▲▼", "Lead Manager▲▼", "Compare" are dropped.
columns (list): A list of column names to be assigned to the final parsed pandas dataframe object. The default names are in PascalCase: "CompanyName", "OpeningDate", "ClosingDate", "ListingDate", "IssuePrice", "TotalIssueAmount". This is the second step in data cleaning, and from this step onwards all columns are referred to by their transformed names.
dropindex (list): A list of index values to be dropped from the parsed dataframe object. Default value is [0], which is the first row of the table, as it is always blank in the source table.
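The three cleaning steps described above (drop columns, rename columns, drop rows by index) can be sketched with pandas as follows. This is an illustration under assumptions, not the module's actual implementation: the `clean_table` helper, the toy data, and the "Company▲▼" and "Issue Price▲▼" column names are invented for the example.

```python
import pandas as pd

# Toy stand-in for a freshly parsed table; names and values are invented.
# The first row is blank, mirroring the behaviour described above.
raw = pd.DataFrame(
    {
        "Company▲▼": [None, "Alpha Ltd", "Beta Ltd"],
        "Issue Price▲▼": [None, "100", "250"],
        "Lead Manager▲▼": [None, "X", "Y"],
        "Compare": [None, "", ""],
    }
)


def clean_table(frame, dropcols, columns, dropindex):
    """Apply the three cleaning steps in the documented order."""
    # Step 1: drop unwanted columns using their original (case-sensitive) names.
    cleaned = frame.drop(columns=dropcols)
    # Step 2: assign the new column names; the count must match what is left.
    cleaned.columns = columns
    # Step 3: drop rows by index label, e.g. the blank first row.
    return cleaned.drop(index=dropindex)


cleaned = clean_table(
    raw,
    dropcols=["Lead Manager▲▼", "Compare"],
    columns=["CompanyName", "IssuePrice"],
    dropindex=[0],
)
print(cleaned)
```

Because the rename happens after the drop, the `columns` list only needs to cover the columns that survive step 1.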
- tablescraper.readtable(weburl: str, html_class_tag: Dict[str, object], selenium: bool = False, **kwargs) -> List[str]
A Generic Function to Scrape Simple Table Data from Web URLs
The function scrapes tables specified as a <table class=?> element in the specified URL. The function is generic, and is called by each of the site-specific functions.
- Parameters:
weburl (str) – A generic URL of a site from which tables are to be scraped. The page can contain multiple tables; the code handles this.
html_class_tag (dict) – General attribute to search for in the webpage using the bs4 module. The attribute should be restricted to class attributes; check the underlying functions for more details.
- Added in version 2025-08-23: Switch between the bs4 and selenium modules to load contents from the underlying webpage.
- Parameters:
selenium (bool) – Boolean value to switch between the bs4 and selenium modules. Default value is False. Modern webpages often load their data through JavaScript, and such content is typically not fetched when using the bs4 module directly. To resolve this, use Selenium webdrivers to fetch the page after all the contents are loaded. Set the Selenium driver options using the keyword arguments.
Keyword Arguments
The function is designed to work with minimal keyword arguments, which are typically passed on to the underlying functions.
markup (str): A string passed to the bs4 module to select the parser for the webpage. The default value is html.parser, which is the default parser.
verify_https_request (bool): A boolean passed to the requests module to verify the HTTPS request. The default value is True. In a restricted environment, use False to bypass security (certificate) checks.
verbose (bool): Boolean value to print the connection status to the console, or False to disable. Default value is False (no print).
webdriver (object): A Selenium driver object used to load the contents of the webpage. The driver is typically initialized from the selenium.webdriver module. Defaults to a webdriver.Chrome(...) driver. The default option is selenium.webdriver.chrome.options.Options(), which is made to work in the background.
waitime (int): Time (in seconds) to wait after the driver is initialized. The default value is 0, i.e., do not make the page wait.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# by default, the function is set to work in the background
# else, enduser can set or define custom driver as required
driver = webdriver.Chrome(options = Options(...))
tables = readtable(
    ...,
    selenium = True,
    webdriver = driver
)
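One way to picture how these keyword arguments and their documented defaults could be handled is the sketch below. It is an assumption about the pattern, not the module's code: the `split_options` helper and the `DOCUMENTED_DEFAULTS` name are hypothetical, and the `webdriver` default is left out because it is an object rather than a simple value.

```python
# Defaults copied from the keyword-argument descriptions above.
# The helper itself is hypothetical and only illustrates the pattern.
DOCUMENTED_DEFAULTS = {
    "markup": "html.parser",
    "verify_https_request": True,
    "verbose": False,
    "waitime": 0,
}


def split_options(**kwargs):
    """Split kwargs into (known options with defaults applied, remaining kwargs)."""
    options = {
        key: kwargs.get(key, default)
        for key, default in DOCUMENTED_DEFAULTS.items()
    }
    # Anything else is kept aside so a wrapper function can use it.
    extras = {
        key: value
        for key, value in kwargs.items()
        if key not in DOCUMENTED_DEFAULTS
    }
    return options, extras


options, extras = split_options(verbose=True, dropindex=[0])
print(options)
print(extras)
```

Keeping unknown keys instead of rejecting them lets a site-specific wrapper accept its own arguments alongside the generic ones.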
Example Usage
Consider a webpage of the below format with more than one embedded table but with the same class attribute:
...
<table class="bar ...", ...> ... </table>
...
<table class="bar foo ...", ...> ... </table>
...
Then we can either use BeautifulSoup.find() to find the first element, or get all as an iterable list. The function defaults to finding all the elements of the same attribute.
from tablescraper import readtable
tables = readtable(
    weburl = "https://www.example.com",
    html_class_tag = {"class" : "bar"}
)
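The difference between finding the first table and finding all of them can be sketched directly with Beautiful Soup on an inline HTML string. The markup and cell contents below are invented for illustration. Note that Beautiful Soup treats class as a multi-valued attribute, so searching for "bar" also matches a table whose class is "bar foo".

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables sharing the "bar" class and one without it.
HTML = """
<table class="bar"><tr><td>first</td></tr></table>
<table class="bar foo"><tr><td>second</td></tr></table>
<table class="baz"><tr><td>other</td></tr></table>
"""

soup = BeautifulSoup(HTML, "html.parser")
attrs = {"class": "bar"}

# find() returns only the first matching element (or None if there is none).
first = soup.find("table", attrs)
# find_all() returns every matching element as a list-like result.
every = soup.find_all("table", attrs)

first_text = first.get_text(strip=True)
all_texts = [table.get_text(strip=True) for table in every]
print(first_text)
print(all_texts)
```

The table with class "baz" is not matched, because class values are compared as whole tokens and not as substrings.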
- tablescraper.wikitable(weburl: str, **kwargs) -> Iterable[DataFrame]
Scrape Table(s) from Wikipedia (the Free Encyclopedia) Pages
Wikipedia often contains detailed information about a topic, and this information can often be scraped using various tools to create tables, charts, etc. This utility function extracts information using the requests and BeautifulSoup Python modules.
The function searches for all the tables present in a Wikipedia webpage and returns each table as a pandas dataframe. For example, on providing the URL for “List of cities in India by population” (https://en.wikipedia.org/wiki/List_of_cities_in_India_by_population), the function returns a list of dataframes holding the population counts provided on the page.
- Parameters:
weburl (str) – A Wikipedia URL from which tables are to be scraped. The function scrapes all the tables present.
Keyword Arguments
The function accepts all keyword arguments that are passed to the readtable() function. Check the parent function for details.
verbose (bool): Boolean value to print information at important steps of the process. Default value is False (no print). The same attribute is used by both the wikitable() and readtable() functions.