Web Scrapers (webscrappers)

A set of functions that use the BeautifulSoup module to scrape data from various websites.

Attention

Source Code: GH/sqlutils


In a fast-developing world, data is crucial! webscrappers contains a set of functions that make it easy to retrieve data into Python. It uses BeautifulSoup at its core, along with modules like requests for fetching pages.

git clone https://gist.github.com/ZenithClown/809642277fba2d8d2309e55ab307615f.git webscrappers
export PYTHONPATH="${PYTHONPATH}:webscrappers"
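
Where changing the shell environment is not convenient, roughly the same effect can be obtained from inside Python. The snippet below is a minimal sketch. It assumes the gist was cloned into a webscrappers directory under the current working directory, as in the command above; adjust the path otherwise.

```python
import sys
from pathlib import Path

# Assumption: the gist was cloned into ./webscrappers (see the git command above)
repo_dir = Path("webscrappers").resolve()

# Add the directory to the module search path once, for this session only
if str(repo_dir) not in sys.path:
    sys.path.append(str(repo_dir))
```

Unlike the export, this only affects the running interpreter.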

Basics of Web Scraping

“Web scraping is the process of using bots to extract content and data from a website.” Given an HTML page, a web scraper extracts information from HTML tags or elements into a desired format. In Python, Beautiful Soup is a popular package for parsing HTML and XML documents. Some good tutorials on bs4 that I personally followed:

In addition, an introduction to Google Chrome DevTools or Microsoft Edge DevTools may be useful for inspecting a page's tags and classes.
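
As a minimal sketch of the idea described above, the snippet below parses a small HTML string with Beautiful Soup and pulls text out of specific tags. The HTML, the tag names and the class names are made up for illustration and do not come from any particular website.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for illustration only
html = """
<html>
  <body>
    <h1 class="title">Quarterly Report</h1>
    <ul>
      <li class="item">alpha</li>
      <li class="item">beta</li>
    </ul>
  </body>
</html>
"""

# "html.parser" is the parser that ships with Python
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag
title = soup.find("h1", {"class": "title"}).get_text(strip=True)

# find_all() returns every matching tag
items = [li.get_text(strip=True) for li in soup.find_all("li", {"class": "item"})]

print(title)
print(items)
```

The class names passed to find() and find_all() are the kind of detail that the browser developer tools help to discover.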

Web Scraping Bots

tablescraper.chittorgarh(weburl: str, **kwargs) → DataFrame

Scrape Web Table(s) from https://www.chittorgarh.com/ Website

The website is a rich source of information about IPOs that are to be listed on the NSE and/or BSE. The function scrapes tables from the page and returns them as a list of DataFrames.

Keyword Argument(s)

The function accepts all keyword arguments accepted by the readtable() function. In addition, the following are specific to formatting data from this website:

  • dropcols (list): A list of original column names to be dropped from the parsed dataframe object. This is the first step in data cleaning, so each name must match the name on the website exactly and is case sensitive. By default, "Listing at▲▼", "Lead Manager▲▼" and "Compare" are dropped.

  • columns (list): A list of column names to be assigned to the final parsed pandas dataframe object. The defaults use PascalCase: "CompanyName", "OpeningDate", "ClosingDate", "ListingDate", "IssuePrice", "TotalIssueAmount". This is the second step in data cleaning, and from this step onwards all columns are referred to by their transformed names.

  • dropindex (list): A list of index values which are to be dropped from the parsed dataframe object. Default value is [0] which is the first row of the table, as it is always blank in the source table.
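
The cleaning steps described above can be sketched with plain pandas. This is an illustration only, not the function's actual implementation: the helper clean_table, the sample rows and the six original headers that are kept are all hypothetical, and applying dropindex as the third step is an assumption (the text only numbers the first two steps).

```python
import pandas as pd

# Hypothetical raw table: the first row is blank and the last three columns
# are the ones dropped by default. All values and the first six headers are
# made up for illustration.
raw = pd.DataFrame(
    {
        "Issuer Company▲▼": [None, "Example Co A", "Example Co B"],
        "Open▲▼": [None, "Jan 06, 2025", "Feb 10, 2025"],
        "Close▲▼": [None, "Jan 08, 2025", "Feb 12, 2025"],
        "Listing▲▼": [None, "Jan 13, 2025", "Feb 17, 2025"],
        "Issue Price▲▼": [None, "100", "250"],
        "Issue Size▲▼": [None, "50", "120"],
        "Listing at▲▼": [None, "NSE, BSE", "BSE"],
        "Lead Manager▲▼": [None, "Manager X", "Manager Y"],
        "Compare": [None, None, None],
    }
)


def clean_table(frame, dropcols, columns, dropindex):
    """Hypothetical helper: drop columns, rename, then drop rows."""
    # Step 1: drop columns by their original (case-sensitive) names
    cleaned = frame.drop(columns=dropcols)
    # Step 2: assign the new column names, matched by position
    cleaned.columns = columns
    # Step 3 (assumed order): drop rows by index label
    return cleaned.drop(index=dropindex)


cleaned = clean_table(
    raw,
    dropcols=["Listing at▲▼", "Lead Manager▲▼", "Compare"],
    columns=[
        "CompanyName", "OpeningDate", "ClosingDate",
        "ListingDate", "IssuePrice", "TotalIssueAmount",
    ],
    dropindex=[0],
)
print(cleaned)
```

Because the new names are assigned by position, the columns list must have exactly as many entries as there are columns left after dropcols is applied.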

tablescraper.readtable(weburl: str, html_class_tag: Dict[str, object], selenium: bool = False, **kwargs) → List[str]

A Generic Function to Scrape Simple Table Data from Web URLs

The function scrapes the tables matching <table class=?> at the specified URL. The function is generic and is called by each of the site-specific functions.

Parameters:
  • weburl (str) – The URL of a page from which tables are to be scraped. The page can contain multiple tables; all of them are handled.

  • html_class_tag (dict) – The attributes to search for in the webpage using the bs4 module. The attributes should be restricted to class attributes; see the site-specific functions for more details.

Added in version 2025-08-23: Switch between bs4 and selenium modules to load contents from the underlying webpage.

Parameters:

selenium (bool) – Boolean value to switch between bs4 and selenium modules. Default value is False. Modern webpages often load their data through JavaScript, which is not executed when the page is simply requested and parsed with the bs4 module. To resolve this, use a selenium webdriver to fetch the page after all of its content has loaded. Set the selenium driver options using the keyword arguments.

Keyword Arguments

The function is designed to work with a minimal set of keyword arguments, which are typically passed on to the underlying functions.

  • markup (str): A string which is passed to the bs4 module to select the parser for the webpage. The default value is html.parser, the parser that ships with Python.

  • verify_https_request (bool): A boolean which is passed to the requests module to verify the HTTPS request. The default value is True. In a restricted environment, set it to False to skip certificate verification.

  • verbose (bool): Boolean value to print connection status to console, or set to False to disable. Default value is False (no print).

  • webdriver (object): A selenium driver object used to load the contents of the webpage. The driver is typically created from the selenium.webdriver module. Defaults to a webdriver.Chrome(...) driver. The default options object is selenium.webdriver.chrome.options.Options(), configured to run in the background.

  • waitime (int): A time value (in seconds) to wait after the driver is initialized. The default value is 0, i.e., do not wait.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

from tablescraper import readtable

# by default, the function runs the browser in the background;
# otherwise, define and pass a custom driver as required
driver = webdriver.Chrome(options = Options(...))
tables = readtable(
    ...,
    selenium = True,
    webdriver = driver
)
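
To make the switch between the two loading modes concrete, here is a rough sketch of how such a branch could look. It is not the actual readtable() code: the helper name load_html, the 30 second timeout and the ValueError are assumptions, and unlike the documented default the sketch does not create a Chrome driver by itself. The driver is only used through get() and page_source, both of which are part of the selenium WebDriver API.

```python
import time

import requests


def load_html(weburl, selenium=False, webdriver=None, waitime=0,
              verify_https_request=True, verbose=False):
    """Hypothetical helper: return page HTML via requests or a selenium driver."""
    if selenium:
        # Assumption: the caller passes an initialized driver; this sketch
        # does not build a default one
        if webdriver is None:
            raise ValueError("pass a selenium driver when selenium=True")
        webdriver.get(weburl)         # the browser loads the page and runs its scripts
        time.sleep(waitime)           # optional extra wait, in seconds
        html = webdriver.page_source  # HTML as currently rendered by the browser
    else:
        # verify=False skips TLS certificate verification; timeout is an assumption
        response = requests.get(weburl, verify=verify_https_request, timeout=30)
        response.raise_for_status()
        html = response.text

    if verbose:
        print(f"Loaded {len(html)} characters from {weburl}")
    return html
```

The returned string can then be handed to BeautifulSoup together with the markup argument, whichever branch produced it.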

Example Usage

Consider a webpage of the following form, with more than one embedded table sharing the same class attribute:

...
<table class="bar ...", ...>
    ...
</table>
...
<table class="bar foo ...", ...>
    ...
</table>
...

Then we can either use BeautifulSoup.find() to get the first matching element, or BeautifulSoup.find_all() to get all matching elements as a list. The function defaults to finding all the elements with the given attribute.

from tablescraper import readtable
tables = readtable(
    weburl = "https://www.example.com",
    html_class_tag = {"class" : "bar"}
)
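
The difference between the two lookups can be checked directly with bs4 on a small page. The HTML below is hypothetical and mirrors the layout shown above. Note that a search for the class "bar" also matches a tag whose class attribute is "bar foo", because bs4 treats class as a multi-valued attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables that share the class "bar"
html = """
<table class="bar"><tr><td>first</td></tr></table>
<p>some text</p>
<table class="bar foo"><tr><td>second</td></tr></table>
<table class="other"><tr><td>ignored</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
attrs = {"class": "bar"}

# find() stops at the first match
first = soup.find("table", attrs)

# find_all() returns every match; a tag matches if "bar" is one of its classes
every = soup.find_all("table", attrs)

print(first.get_text(strip=True))
print([table["class"] for table in every])
```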

tablescraper.wikitable(weburl: str, **kwargs) → Iterable[DataFrame]

Scrape Table(s) from Wikipedia, the Free Encyclopedia

Wikipedia often contains detailed information about a topic, and this information can often be scraped with various tools to create tables, charts, etc. The utility function provided below extracts information using the requests and BeautifulSoup Python modules.

The function searches for all the tables present in a Wikipedia webpage and returns each table as a pandas dataframe. For example, on providing the URL for “List of cities in India by population” (https://en.wikipedia.org/wiki/List_of_cities_in_India_by_population) the query returns a list of dataframes, one for each population table on the page.

Parameters:

weburl (str) – A Wikipedia URL from which tables are to be scraped. The function scrapes all the tables present.

Keyword Arguments

The function accepts all keyword arguments which are passed to the readtable() function. Check the parent function for details.

  • verbose (bool): Boolean value to print information at important steps of the process. Default value is False (no print). The same attribute is used by both the wikitable() and readtable() functions.
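
As a rough illustration of turning a parsed table into a pandas dataframe, the sketch below converts a simple table tag by hand. It is not the actual wikitable() implementation: the helper table_to_frame and the sample figures are made up, and searching by the class "wikitable" (which Wikipedia commonly uses for data tables) is an assumption about how the tables are located.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup in the style of a Wikipedia table; the figures are made up
html = """
<table class="wikitable">
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>City A</td><td>1,000,000</td></tr>
  <tr><td>City B</td><td>750,000</td></tr>
</table>
"""


def table_to_frame(table):
    """Hypothetical helper: convert one simple <table> tag (no merged cells)."""
    rows = [
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        for row in table.find_all("tr")
    ]
    # The first row is treated as the header, the rest as data
    header, body = rows[0], rows[1:]
    return pd.DataFrame(body, columns=header)


soup = BeautifulSoup(html, "html.parser")
frames = [
    table_to_frame(table)
    for table in soup.find_all("table", {"class": "wikitable"})
]
print(frames[0])
```

All cell values come back as strings here, so numeric columns such as Population would still need to be converted before any analysis.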