
Data sustainability

It’s not easy going through thousands of SEC filings when you’re part of the Sustainability Accounting Standards Board (SASB). Sustainability data and machine learning can make that job easier.

Imagine being an investor before Volkswagen’s recent emissions scandal. Now imagine pricing in the risk of Volkswagen’s governance controls relative to their peers before that. Or, imagine being able to account for Chipotle’s food safety risks in their supply chain before their issues with E.coli in recent years.

Logo of United States Securities and Exchange Commission

Hindsight is 20/20, yes. But in both cases there were signs in the companies’ public Securities and Exchange Commission (SEC) filings that these risks were evident relative to their peers. There were also signs that sustainability data, also called Environmental, Social, and Governance (ESG) data, could have made these risks more transparent to others.

Too often, ‘sustainability’ is associated only with environmental issues. Sustainability data also encompasses issues related to company self-governance and product safety. In fact, the first ESG quantitative investment fund, Arabesque Partners, has been using this type of ESG data to exclude companies (read more about Arabesque Partners here). In their recent case study, they mention that they use this data to “not include Toshiba, Valiant and Volkswagen.”[1] These are just a few examples of companies with ESG risks that they were able to identify.

Yes, the benefits of sustainability data have become more mainstream. However, much of this data is still locked away as unstructured text in companies’ SEC filings. One can’t simply read through an industry or sector’s worth of lengthy disclosures to find these differences and compare companies with their peers on an apples-to-apples basis.

This becomes more difficult when you consider reading the text for all of these companies and then classifying it into decision-useful sustainability data. Important questions arise:

  • What topic or category would you come up with?
  • How would you know those categories are important?
  • How would you group companies? By industry? By sector?
  • What classification system would you use and would that apply to sustainability issues?
Logo of Sustainability Accounting Standards Board

This is where the work of my organization, the Sustainability Accounting Standards Board (SASB), comes in. SASB provides guidance on which topics are likely to be material (relevant) for a company, based on its industry classification within our SICS company classification system. Our mission has been to provide industry-specific sustainability standards based on exhaustive research and industry working group participation. These standards focus disclosure on what is likely to be material and relevant.

Our internal SASB research team did exactly this work of classifying corporate SEC disclosures on just a small sample of 5-10 companies and industries. This data has been very useful. Unfortunately, it has taken a team of researchers over 2 years to cover even a fraction of the companies in the total economy. While sampling can attempt to represent the overall distribution of the sustainability data, we knew from our experiences interacting with external stakeholders that a large amount of untapped, valuable data existed.

The goal


We realized that we needed a way to look at the tens of thousands of companies that make annual filings with the SEC and to measure their qualitative text disclosure. This way we could show the changes in disclosure over time on SASB’s sustainability standards. In the words of senior management, creating a way to measure corporate sustainability data disclosure at scale in SEC filings would be “SASB’s performance metric for success” and show our impact on society.
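To make "changes in disclosure over time" concrete, here is a minimal sketch of one way such a metric could be computed: the share of companies per year with at least one excerpt on a given topic. The rows, company names, and topic label are all hypothetical, and this is not a description of SASB's actual metric.

```python
from collections import defaultdict

# Hypothetical rows: (fiscal_year, company, topic found in the filing or None).
filings = [
    (2014, "A Corp", None),
    (2014, "B Corp", "water_management"),
    (2015, "A Corp", "water_management"),
    (2015, "B Corp", "water_management"),
]


def disclosure_rate_by_year(rows, topic):
    """Return, per year, the share of companies with an excerpt on `topic`."""
    companies = defaultdict(set)   # all companies that filed in a given year
    disclosing = defaultdict(set)  # companies that disclosed on the topic
    for year, company, found_topic in rows:
        companies[year].add(company)
        if found_topic == topic:
            disclosing[year].add(company)
    return {
        year: len(disclosing[year]) / len(companies[year])
        for year in sorted(companies)
    }


rates = disclosure_rate_by_year(filings, "water_management")
print(rates)
```

A rate that rises from one year to the next would suggest that more companies are disclosing on that topic, which is the kind of trend the paragraph above describes.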

We reasoned that showing existing disclosure on our topics could better incentivize companies to improve not only disclosure but also actual management of these issues, because of heightened attention from investors, regulators, and other key stakeholders. In the spirit of Justice Brandeis’ philosophy on transparency, disclosure would bring “sunlight” to parts of corporate disclosure that investors and the general public would not be able to find without our standards and lens for finding this sustainability data.

The solution: Sustainability data

A calculator and data on paper

We chose to look into machine learning as a way to scale our efforts. We started our project as a small pilot with just one sector to test its feasibility, with the assistance of two amazing data science contractors.

However, we found that bigger challenges awaited us in scaling this effort from a single sector to the full economy of 10 sectors that SASB has standards for. As the program manager for this project, I realized that I needed a deeper, practical understanding of data science to make better decisions about the design and implementation of this pipeline.

I had taken some data science courses online and had gained practical experience working with our data science contractors. But to run the pilot program, I realized that I needed more training and exposure on how to apply machine learning in practice.

Just five short months ago, I was fortunate enough to be accepted as the first nonprofit fellow for Data Science Dojo. In one short week I was able to build a data pipeline end-to-end, compete in my first contest on Kaggle, the crowd-sourced data science platform (read more here), and develop a fundamental understanding of the core steps of building a classifier.

Before this bootcamp, I had taken plenty of data science courses and had worked with data scientists, but it was hard to connect the pieces, from data exploration and analysis and feature engineering to developing performance metrics and testing different hyperparameters.

After the bootcamp, I went back to my organization understanding how to modify existing parts of our machine learning pipeline to scale and also integrate it with manual components such as Mechanical Turk. Using skills I learned at the Data Science Dojo bootcamp, I was able to tweak parameters of our classifier and test an entirely different approach to semi-supervised learning. In particular, I was able to use my feature engineering skills to create and select features that I would not otherwise have considered for the classifier. In addition, I could augment our process by building checkpoints that help with data quality validation. I attribute this to the fact that I could now see how things were all connected.
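As an illustration of what a data quality checkpoint between a classifier and a manual review step could look like, here is a minimal sketch. The topic labels, the `Excerpt` record, the `checkpoint` function, and the 0.8 threshold are all assumptions made for this example; the article does not describe the actual checks in SASB's pipeline.

```python
from dataclasses import dataclass

# Hypothetical topic labels; the real SASB topic list is not given here.
ALLOWED_TOPICS = {"water_management", "air_quality", "product_safety"}


@dataclass
class Excerpt:
    company: str
    text: str
    topic: str         # label predicted by the classifier
    confidence: float  # classifier score, expected in [0, 1]


def checkpoint(excerpts, min_confidence=0.8):
    """Split classified excerpts into accepted, needs-review, and rejected."""
    accepted, review, rejected = [], [], []
    for ex in excerpts:
        # Reject structurally invalid records: empty text or unknown label.
        if not ex.text.strip() or ex.topic not in ALLOWED_TOPICS:
            rejected.append(ex)
        # Reject scores outside the valid range.
        elif not 0.0 <= ex.confidence <= 1.0:
            rejected.append(ex)
        # Route uncertain predictions to manual review (e.g. crowd workers).
        elif ex.confidence < min_confidence:
            review.append(ex)
        else:
            accepted.append(ex)
    return accepted, review, rejected


sample = [
    Excerpt("A Corp", "We monitor water withdrawal at all plants.", "water_management", 0.93),
    Excerpt("B Corp", "Recalls may affect our results.", "product_safety", 0.55),
    Excerpt("C Corp", "   ", "air_quality", 0.90),
    Excerpt("D Corp", "Emissions controls are reviewed annually.", "governance", 0.88),
]
accepted, review, rejected = checkpoint(sample)
print(len(accepted), len(review), len(rejected))
```

Keeping the review bucket separate from the rejected one means low-confidence but valid excerpts can be sent to a manual step instead of being discarded.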

Without the fundamentals that I learned at the Data Science Dojo bootcamp, it would have been challenging to be at the point we are at today. We are nearing the completion of a machine learning pipeline with almost 1M classified excerpts and a corresponding web application displaying this data that we will launch to the public in the fall.

Our hope is that this data will change the way capital markets view sustainability and that investors will be able to use this sustainability data to influence the decisions that are made by companies in regard to the material sustainability issues that my organization has researched. This can help to shift the allocation of funds from organizations that focus on less sustainable outcomes to organizations that account for the greatest challenges in climate change, air quality, water management, hazardous materials, material sourcing, and other important sustainability issues.


Michael D’Andrea: As the former data science program manager for SASB, Michael managed and analyzed large, diverse and unstructured sustainability datasets for trends that support the greater disclosure of sustainability information for the public. He has an M.A. in computer education and was a Data Science Dojo fellow.

Sabrina Dominguez: Sabrina holds a B.S. in Business Administration with a specialization in Marketing Management from Central Washington University. She has a passion for search engine optimization and marketing.

June 14, 2022
