If you want to explore the final product of this project, please visit: https://pinktaty.github.io/EpidemiologicLibraryWeb/
This project was carried out as part of the “Carreras con Impacto” program during the 14-week mentorship phase. You can find more information about the program in this entry.
Summary
A website called "Epidemiologic Library" was developed to gather information on infectious diseases in Mexico over the past 25 years. The main objective is to facilitate the analysis of epidemiological data to identify trends and, based on them, to create risk management plans for future health crises.
Problem Description
Pandemics are considered a Global Catastrophic Risk due to their potential to cause devastating consequences for society, the economy, and, most critically, public health, with the loss of life being the most severe impact. A recent and significant example of this impact is the COVID-19 pandemic, which, by 2021, had resulted in more than 2 million deaths worldwide. Moreover, in the past decade, the frequency of new epidemiological outbreaks has increased due to factors such as environmental changes and technological capabilities, suggesting that the likelihood of future pandemics could rise in the coming years.
To manage future pandemics effectively, it is crucial to analyze data from recent decades, identify potential trends, and prepare adequately for future health crises. Although this information is vital, there are currently no adequate structures for collecting it. The available data are often published in formats that do not lend themselves to large-scale analysis, such as PDF reports rather than structured databases. This significantly slows down the analysis process and, consequently, the ability to respond in real time or to create risk management plans.
For this reason, the "Epidemiologic Library" project aims to use epidemiological data from Mexico as a pilot, collecting the available information from the past 25 years and making it accessible in formats suitable for databases. This will save time and improve the efficiency of analysis by leveraging computational techniques such as web scraping.
Project Description
The project addresses the area of biological risks by recording how Mexico has responded to diseases over the past 25 years. Organizing the data makes it easier to analyze and to identify patterns. By understanding how diseases emerge and spread, we can strengthen our ability to anticipate and address the challenges they pose, and so prepare a more effective and adaptive response to future public health emergencies.
Mexico was chosen for its considerable population and diverse epidemiological history. The data collected will support future analysis of these events, which can inform not only Mexico's response to these risks but also that of other countries: comparing Mexico's response with their own will help them adapt strategies to their particular cases. In addition, the data are accessible through an interactive web page aimed at biosafety professionals, so that they can use the information in their research, data analysis, and decision-making.
Information Sources
All the information was collected from PDF documents published by the Undersecretary of Prevention and Health Promotion of Mexico and the General Directorate of Epidemiology of Mexico.
Motivation
Since the issuing organizations publish epidemiological information in PDF format, this project addresses the need for a presentation that supports effective data analysis. With the data already structured, researchers can generate strategies and knowledge faster, spending their time on the main goals of their projects instead of on collecting data.
Project goals
- Create a solid data source, in the form of a website, on epidemic management in Mexico over the past 25 years for future analyses.
- Increase the availability and transparency of epidemiological information in Mexico.
Personal goals:
- Demonstrate my problem-solving ability through strategic and logical thinking as well as programming.
- Demonstrate my ability to plan and develop a project within a limited time.
Methodology
Following the proposed theory of change, an initial survey was conducted with biosafety professionals to identify their information needs and preferences for features and design of the web page. Based on the survey results, a sketch of the final product was created.
After this, with mentoring from Carreras con Impacto, I selected the diseases with the most epidemiological value and gathered information about them. For the PDF scraping, Python's pdfplumber library was used, along with the ChatGPT API to process and interpret the scraped information, and the Google Sheets API to collect and organize the data. The code can be accessed through this link: https://github.com/pinktaty/EpidemiologicLibrary.
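To illustrate the PDF scraping step, here is a minimal sketch of how tables can be pulled out of a PDF with pdfplumber and turned into structured records. It is not the code from the repository: the function names (`extract_raw_tables`, `table_to_records`, `clean_cell`, `records_to_csv`) are hypothetical, and it assumes that the first row of each detected table is its header, which may not hold for every report. The ChatGPT and Google Sheets steps are left out; the records are simply written as CSV.

```python
import csv
import io


def extract_raw_tables(pdf_path):
    """Yield (page_number, table) for every table pdfplumber detects in the PDF."""
    # Third-party dependency, imported lazily so the helpers below work without it.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_tables() returns a list of tables; each table is a list
            # of rows, and each row is a list of cell strings (or None).
            for table in page.extract_tables():
                yield page.page_number, table


def clean_cell(cell):
    """Replace empty cells (None) with "" and collapse line breaks and extra spaces."""
    return " ".join((cell or "").split())


def table_to_records(table):
    """Convert a table into a list of dicts, assuming the first row is the header."""
    if not table:
        return []
    header = [clean_cell(cell) for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [clean_cell(cell) for cell in row]
        if not any(cells):
            continue  # skip rows that are completely empty
        records.append(dict(zip(header, cells)))
    return records


def records_to_csv(records):
    """Serialize the records as CSV text, using the first record's keys as columns."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

With a report saved locally, each pair yielded by `extract_raw_tables("report.pdf")` (a placeholder path) could be passed to `table_to_records` and then to `records_to_csv`. Real reports usually need extra handling for merged cells and for tables that continue across pages.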
Results
The final product is the Epidemiologic Library website, accessible through this link: https://pinktaty.github.io/EpidemiologicLibraryWeb/. The website serves as the repository for the information collected during the project, which is provided in XLSM format.
This project was concluded successfully, providing valuable insights into the importance of well-structured information and how seemingly simple details can significantly impact the time and resources required for an investigation or project. Through this experience, I developed skills in data scraping, information organization, and user-centered design.
Limitations
Since this project was carried out by a single person, its scope was limited to what could be accomplished within three months. Additionally, tracking the origin of information presented outside of government platforms proved challenging, as the links often led to non-existent pages despite supposedly originating from government sources. It is essential to ensure transparent and reliable access to information regardless of when it is accessed: there is a noticeable trend that the older the information is, the harder it is to find a working link on government platforms.
Perspectives
The project still has room to grow: it can be expanded with information on more diseases and kept up to date as new outbreaks occur. Interactive graphs and other features could also be added, but this requires work beyond the initially established development timeframe.