Estimating coal mine workforce size

From Global Energy Monitor
This article is part of the Global Coal Mine Tracker, a project of Global Energy Monitor.

Global Energy Monitor's Global Coal Mine Tracker has gathered data on the number of coal miners currently employed at operations in its dataset.

When information on employment is inaccessible, the Global Coal Mine Tracker uses a machine-learning model to estimate the workforce size of a particular coal mine based on other input features.

Global Energy Monitor first published its coal mine employment estimates in April 2023, followed in October 2023 by a report on its findings, "Scraping By: Global Coal Miners and the Urgency of a Just Transition."

Coal mine employment

Coal mine employment information is crucial to coal phase-outs and just transitions.[1] But that information is not always accessible or transparent. Global Energy Monitor has gathered employment data on coal mines since 2021 using publicly available information published in corporate, government, and reliable media sources.

As of April 2023, the Global Coal Mine Tracker covers 4,300 active and proposed coal mines and projects responsible for more than 90% of global coal production.

Need for granular data

Top energy and labor organizations have identified the necessity for granular information about mining jobs.

In 2020, after conducting a series of listening sessions with fossil fuel workers and stakeholders, the Labor Network for Sustainability pointed out the necessity for information on “where fossil-fuel activity is occurring, such as fossil-fuel power plants and extraction sites” and “the timeline for drawing down these activities” to ensure that “communities plan proactively for transition ahead of closure, rather than dealing with the situation reactively once a closure has been announced.”[2]

In 2022, the International Energy Agency (IEA) similarly noted the inherent limits of the current overreliance on national figures for coal mining workforce statistics, and stated that there is no “substitute for better data collection, especially when it comes to subnational data.”[3]

Gaps in previous assessments

Without project-level information, direct coal mine employment had been difficult to determine. As of September 2023, the most recent global assessments of coal mining employment, from the World Bank and the International Energy Agency (IEA), relied on nationally reported job figures from 2019.[3][4] Both organizations separated direct from indirect jobs to varying degrees. The IEA also sought to distinguish employment across sectors (extraction, transport, washing, processing, construction, and equipment manufacturing) and to separate employment at operating mines from employment building new projects or modernizing existing operations.

The IEA published its inaugural World Energy Employment Report in 2022 and reported that around 6.3 million employees worked in coal supply (extraction, transport, washing, processing, construction, and equipment manufacturing) in 2019. Coal mining itself, or the “raw materials” sector, comprised 60% (3.78 million) of those jobs. Some of those workers were involved in building new projects rather than operating existing assets, but it is unclear exactly how many. The World Bank had previously put coal and lignite mining employment at 4.7 million for the same year (2019). That figure included both formal and informal workforces and was based on aggregated data from the 20 largest coal producing countries (out of 70 coal producing countries).[4][3]

As such, the tallies of coal mining jobs presented by the World Bank and the IEA for the same year (2019) differ by nearly one million workers (4.7 million and 3.78 million respectively).[4][3]

Methodology

In an effort to help bridge this data gap, Global Energy Monitor built a machine learning tool in 2021-2022 to estimate the size of coal mine workforces when that data was otherwise unavailable. The project aimed to identify workers directly employed in extraction, rather than those in coal processing and transportation, informal workforces, or contract work.

In 2023, the Global Coal Mine Tracker added information on direct mine-level employment for 3,233 operating coal mines. Those figures include 1,349 coal mines for which company or government data on coal miners was publicly available, and 1,884 coal mines for which GEM used a machine learning model to estimate the operating workforce size. The model was prepared in collaboration with in-house data scientists and postgraduate researchers at the Massachusetts Institute of Technology (MIT).

GEM’s estimates of coal mine employment used a random forest model, which staff found performed better than a linear model at inferring the relationship between the input features and the employment size.
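As a rough illustration of why a random forest can outperform a linear model, the sketch below compares the two on synthetic data. Everything here is an assumption made for illustration, not GEM's data or code: the two features, the labour intensities (400 and 60 workers per Mt), and the use of R² on a held-out split as the comparison metric. The synthetic workforce depends on an interaction between output and mine type, which an additive linear model cannot represent.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in features (hypothetical values, not GEM data).
coal_output_mt = rng.uniform(0.1, 20.0, size=n)
is_underground = rng.integers(0, 2, size=n)

# Assumed relationship: labour intensity differs by mine type, so workforce
# depends on the product of the two features (an interaction).
workers_per_mt = np.where(is_underground == 1, 400.0, 60.0)
workforce = coal_output_mt * workers_per_mt + rng.normal(0, 50, size=n)

X = np.column_stack([coal_output_mt, is_underground])
X_train, X_test, y_train, y_test = train_test_split(
    X, workforce, test_size=0.2, random_state=0
)

# .score() returns R^2 on the held-out data for both estimators.
linear_r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
forest_r2 = (
    RandomForestRegressor(n_estimators=100, random_state=0)
    .fit(X_train, y_train)
    .score(X_test, y_test)
)

print(f"linear R^2: {linear_r2:.3f}")
print(f"random forest R^2: {forest_r2:.3f}")
```

On real data the gap between the two models depends on how non-linear the underlying relationship actually is.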

What is a Random Forest model?

A random forest is an ensemble of "decision trees". A decision tree can be thought of as a flowchart for making a decision, e.g., "should I bring a rain jacket today?" In such a case, the tree structure depends on the information we have at hand, such as the weather forecast, the season, the wind, and so forth. These are called the "features," and each decision point of the flowchart is called a "node" (e.g., "is the forecast value more or less than 50%?").

The starting point of a decision tree is often called the "root node." The end points, where you arrive at "yes, I should bring a rain jacket" or "no, I don’t need a rain jacket," are called "leaf nodes".[5][6]
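The rain jacket flowchart can be written as a small hand-coded decision tree. This is only a sketch: the 50% forecast threshold comes from the example above, while the second split on season and wind is a hypothetical rule added for illustration.

```python
def should_bring_rain_jacket(forecast_rain_pct, season, windy):
    """Walk a tiny hand-written decision tree from the root node to a leaf."""
    # Root node: split on the forecast value (the 50% threshold from the text).
    if forecast_rain_pct >= 50:
        return True  # leaf node: "yes, I should bring a rain jacket"
    # Internal node (hypothetical rule): only reached when the forecast is low.
    if season == "winter" and windy:
        return True  # leaf node: "yes"
    return False  # leaf node: "no, I don't need a rain jacket"


decisions = [
    should_bring_rain_jacket(80, "summer", False),
    should_bring_rain_jacket(20, "winter", True),
    should_bring_rain_jacket(20, "summer", False),
]
print(decisions)  # [True, True, False]
```

In a trained model the splits and thresholds are learned from data rather than written by hand.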

A collection of decision trees makes up the "forest" of a random forest model.

The "random"-ness has two parts: one is the random selection of a subset of data from the whole dataset for constructing a tree; the other is the random selection of a subset of features from the whole set of features available. While decision trees are prone to overfitting (that is, the model is too specific to the training data and does not generalize well to new data), the ensemble nature of random forest can make the model robust.[7]
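The two sources of randomness can be sketched with the standard library alone. The feature names, sample count, and helper function below are hypothetical. For simplicity the sketch draws one feature subset per tree; many implementations, including scikit-learn's, instead draw a fresh feature subset at each split.

```python
import random

# Hypothetical feature names, loosely following the kinds of inputs a mine model might use.
FEATURES = ["coal_output", "mine_depth", "reserve_total",
            "mine_type", "coal_type", "world_region"]


def draw_tree_inputs(n_samples, feature_names, n_features, rng):
    """Pick the rows and features that one tree of the forest would see."""
    # Randomness 1: bootstrap the rows (sample with replacement), so some
    # rows repeat and others are left out of this tree entirely.
    row_indices = [rng.randrange(n_samples) for _ in range(n_samples)]
    # Randomness 2: choose a subset of features (without replacement).
    chosen_features = rng.sample(feature_names, n_features)
    return row_indices, chosen_features


rng = random.Random(42)
forest_inputs = [draw_tree_inputs(8, FEATURES, 3, rng) for _ in range(3)]
for rows, feats in forest_inputs:
    print(sorted(rows), feats)
```

Because each tree sees a different slice of the data, their individual errors tend to differ, and averaging their predictions smooths them out.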

The previous rain jacket example has discrete outcomes, "yes" or "no," which makes it a "classification" task. In contrast, a "regression" task estimates the value of a continuous variable, e.g., "what is the price of an egg tomorrow?" Estimating the size of a coal mine workforce is a regression task.

Methods and data

GEM's random forest machine learning tool uses "supervised training" to estimate workforce size at coal mines. The supervised training, in this case, consisted of a routine where a known workforce size was withheld from the model and then compared to its prediction to assess the accuracy of the estimate. In this project, we used scikit-learn’s random forest regressor to train an estimator for coal mine workforce size. Scikit-learn is a Python module, described in the Journal of Machine Learning Research in 2011, that integrates "a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems."[8]
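A minimal sketch of that routine with scikit-learn's `RandomForestRegressor` might look as follows. The data is synthetic, and the two features, the hyperparameters, and the choice of mean absolute error are assumptions for illustration; the source does not state GEM's actual settings or metric.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical, not GEM's dataset).
coal_output_mt = rng.uniform(0.1, 20.0, size=300)
mine_depth_m = rng.uniform(0.0, 800.0, size=300)
X = np.column_stack([coal_output_mt, mine_depth_m])
workforce = (
    50 + 120 * coal_output_mt + 0.5 * mine_depth_m + rng.normal(0, 30, size=300)
)

# Withhold 30 known workforce sizes from training.
X_train, X_test, y_train, y_test = train_test_split(
    X, workforce, test_size=30, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Compare the withheld values with the model's predictions.
predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
print(f"held-out samples: {len(y_test)}, mean absolute error: {mae:.1f}")
```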

Features

The data we used for predicting coal mine workforce size consisted of two categories: label and input data.

The label data is the quantity that our model needs to predict: in this case, the size of a coal mine workforce.

The input is the data that the label depends on, that is, variables such as coal production, mine size, and other factors. In our model, GEM used 12 input features: 7 numerical and 5 categorical.

Table 1: Features of random forest model

| Feature | Type | Categories | Unit(s) |
| Coal output | Numerical | | Mt |
| Mine depth | Numerical | | m |
| Reserve total* | Numerical | | Mt |
| Reserve to production ratio (R/P)* | Numerical | | years |
| Coordinates | Numerical | Latitude and Longitude | deg |
| Gross Domestic Product | Numerical | | $ |
| Population | Numerical | | |
| Mine type | Categorical | Underground, Surface, Mixed | |
| Coal output type | Categorical | Production or Capacity | |
| Mining method | Categorical | Longwall, Shovel & Spade, Semi-mechanized, Strip Mining, Mixed, Continuous, Bord and Pillar, Open Pit | |
| Coal Type | Categorical | Lignite, Subbituminous, Bituminous, Anthracite | |
| World region | Categorical | East Asia, South Asia, Southeast Asia, US & Canada, Latin America, Europe (EU-27), Europe (non-EU), Australia & NZ, Africa & Middle East, Eurasia | |

*Note that missing data points for 'Reserves total (Proven & Probable)' and 'Reserve to production ratio (R/P)' features were assigned the mean of the existing data.
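The split into label and input data, and the mean fill-in for missing reserve values, can be sketched in plain Python. The three records are invented, and the one-hot encoding of the categorical feature is an assumption: the source does not say how GEM encoded categorical inputs.

```python
from statistics import mean

# Hypothetical records; None marks a missing reserve figure.
mines = [
    {"coal_output_mt": 2.0, "reserve_total_mt": 40.0,
     "mine_type": "Underground", "workforce": 900},
    {"coal_output_mt": 5.5, "reserve_total_mt": None,
     "mine_type": "Surface", "workforce": 400},
    {"coal_output_mt": 1.2, "reserve_total_mt": 20.0,
     "mine_type": "Mixed", "workforce": 650},
]
MINE_TYPES = ["Underground", "Surface", "Mixed"]

# Mean of the reserve values that do exist: (40 + 20) / 2 = 30.0
known_reserves = [m["reserve_total_mt"] for m in mines
                  if m["reserve_total_mt"] is not None]
reserve_mean = mean(known_reserves)


def to_input_row(mine):
    """Build one numeric input row: numerical features plus a one-hot mine type."""
    reserve = mine["reserve_total_mt"]
    if reserve is None:
        reserve = reserve_mean  # missing value replaced by the mean
    one_hot = [1.0 if mine["mine_type"] == t else 0.0 for t in MINE_TYPES]
    return [mine["coal_output_mt"], reserve] + one_hot


X = [to_input_row(m) for m in mines]   # input data
y = [m["workforce"] for m in mines]    # label data
print(X[1])  # [5.5, 30.0, 0.0, 1.0, 0.0]
```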

Cross validation

We performed 10-fold cross validation. One portion of the data was set aside for testing (or validation) and a model was trained using the rest of the data. After the training, model accuracy was evaluated using the test portion. This train-and-evaluate process was repeated for all of the 10 portions of the dataset. The model with the best evaluation result was selected as the best model.

There were 1,349 samples in total: 1,217 were used for training and 132 for evaluation.
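A plain 10-fold train-and-evaluate loop could be sketched as below, using scikit-learn's `KFold`. The synthetic data, the forest size, and the use of R² as the selection metric are assumptions; the source does not name the metric GEM used to pick the best model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical, not GEM's dataset).
X = rng.uniform(0.1, 20.0, size=(200, 2))
y = 100 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 25, size=200)

best_score, best_model = -np.inf, None
fold_sizes = []

splitter = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in splitter.split(X):
    # Train on nine portions, evaluate on the one that was set aside.
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    score = model.score(X[test_idx], y[test_idx])  # R^2 on the held-out portion
    fold_sizes.append(len(test_idx))
    # Keep the model with the best evaluation result.
    if score > best_score:
        best_score, best_model = score, model

print(fold_sizes)
print(f"best fold R^2: {best_score:.3f}")
```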

Due to data availability, our 10-fold split had an additional 'approximate' element. The available training data was skewed heavily towards the lower values of the label: workforce size was dominated by relatively small values. The smallest workforce size was on the order of 1 and the largest on the order of 11,000, but the majority of the data was under 2,000.

For this reason, we strategically sampled the data into 10 portions according to how many samples fell in each workforce-size range. Ranges with more than 100 samples were divided into 10 equal portions. Ranges with at least 20 but fewer than 100 samples were likewise separated into 10 disjoint portions of approximately equal size. For ranges with at least 10 but fewer than 20 samples, two samples were drawn randomly 10 times. For ranges with at least 5 but fewer than 10 samples, one sample was drawn randomly 10 times. For ranges with fewer than 5 samples, none were drawn for the test sample.
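The sampling rules above can be sketched as a small function applied to each workforce-size range. This is an illustration under stated assumptions, not GEM's code: the range boundaries and sample counts are invented, the helper name is hypothetical, and a range with exactly 100 samples (which the rules do not cover) is treated like the larger ranges.

```python
import random


def draw_test_portions(sample_ids, rng, n_folds=10):
    """Return n_folds lists of test-sample ids for one workforce-size range."""
    n = len(sample_ids)
    if n >= 20:
        # Enough data: shuffle, then deal into disjoint, near-equal portions.
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)
        return [shuffled[k::n_folds] for k in range(n_folds)]
    if n >= 10:
        # Two random samples per fold; folds may overlap with each other.
        return [rng.sample(sample_ids, 2) for _ in range(n_folds)]
    if n >= 5:
        # One random sample per fold.
        return [rng.sample(sample_ids, 1) for _ in range(n_folds)]
    # Too few samples: keep them all for training.
    return [[] for _ in range(n_folds)]


rng = random.Random(0)

# Hypothetical ranges and counts, skewed towards small workforces.
ranges = {
    "0-500": list(range(0, 130)),
    "500-2000": list(range(130, 175)),
    "2000-5000": list(range(175, 187)),
    "5000-8000": list(range(187, 193)),
    "8000+": list(range(193, 196)),
}

folds = [[] for _ in range(10)]
for ids in ranges.values():
    for k, portion in enumerate(draw_test_portions(ids, rng)):
        folds[k].extend(portion)

print([len(fold) for fold in folds])
```

Each fold's test portion then contains samples from every range that has enough data, so evaluation is not dominated entirely by the smallest mines.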

Results

We trained a random forest model to estimate coal mine employment size and estimated the workforce size for 1,884 operating coal mines.

Our results demonstrated that a random forest model outperformed a linear estimator. This was especially true for coal mines with smaller workforces, primarily because fewer data samples were available for larger coal mines. For larger mines, which had less data, our predictions were typically smaller than the "truth" samples (Figure 1).

Figure 1: Coal mining workforce size prediction


The analytic results, including aggregated data at the country level, are published in "Scraping By: Global Coal Miners and the Urgency of a Just Transition" (October 2023).

Future work

In the future, GEM's staff may analyze feature importance from the tree structures, and continue to refine its workforce estimates.
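One common way to analyze feature importance from the tree structures is scikit-learn's impurity-based `feature_importances_` attribute on a fitted forest. The sketch below uses synthetic data and hypothetical feature names, including a deliberately irrelevant "noise" column; whether GEM would use this particular measure is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical feature names; the third column has no effect on the label.
feature_names = ["coal_output_mt", "mine_depth_m", "noise"]
X = rng.uniform(0.0, 1.0, size=(300, 3))
y = 1000 * X[:, 0] + 100 * X[:, 1] + rng.normal(0, 10, size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances: one value per feature, summing to 1.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```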

The Global Coal Mine Tracker will also continue to update and find new data on coal mine employment as it becomes available in public sources.

References

  1. "For a Just Transition Away from Coal, People Must Be at the Center". World Bank. Retrieved 2023-10-06.
  2. "Just Transition - Labor Network for Sustainability". Labor Network for Sustainability - Making a Living on a Living Planet. 2020-02-10. Retrieved 2023-10-06.
  3. "World Energy Employment – Analysis - IEA". IEA. Retrieved 2023-10-06.
  4. "Global Perspective on Coal Jobs and Managing Labor Transition Out of Coal". World Bank. Retrieved 2023-10-06.
  5. Breiman, L. Random Forests. Machine Learning 45, 5–32 (2001).
  6. Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to Algorithms, 3rd Ed., MIT Press, Cambridge, MA, 2009.
  7. Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, Cambridge, MA, 2016.
  8. Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Édouard Duchesnay (2011). "Scikit-learn: Machine Learning in Python". Journal of Machine Learning Research. 12(85): 2825−2830.