Artificial intelligence to improve the graduation rate of Quebec colleges (part 2)

Simon Bouchard

Wednesday, January 12, 2022

To read or revisit the first part of this blog, here is the link: Artificial intelligence to improve the graduation rate of Quebec colleges (part 1)

The Optania team produces quality artificial intelligence algorithms for the Quebec educational system. To do so, the team devotes a large part of its development time to preparing quality data. Indeed, in his campaign for a data-centric approach to artificial intelligence, Andrew Ng observes that data preparation occupies 80% of the time of artificial intelligence developers. The other 20% of the effort is essentially spent training the model, evaluating its performance and deploying it.

But why spend so much time preparing the data, rather than refining the choice of a model and its hyper-parameters?

For many applications in artificial intelligence, the choice of predictive model is not the main factor limiting prediction quality. Indeed, now that models (neural networks, gradient boosting, etc.) have matured in terms of performance, effort must go into the quality of the data used to train them (Press, 2021).

A snowball effect

A team of researchers at Google goes even further by elaborating on the concept of data cascades. Data cascades appear when one or more problems arising in an early phase of predictive model development go uncorrected before the model goes into production. In this case, the accumulation of errors from the phases prior to algorithm deployment is seen as technical debt (Sambasivan et al., 2021). This debt is paid in the form of inconsistent predictions once the model is deployed.

Initial phases of predictive model development:

The problem statement requiring the application of artificial intelligence;
Data preparation: pre-processing, data collection, labeling, analysis and cleaning;
Model selection;
Model training.

An example of a data cascade would be a model that predicts high school grades and is trained on historical data from before 2020. The problem with this data is that it does not reflect the impact of COVID-19 on grading in the Quebec educational system. Indeed, many high school courses were offered in hybrid and virtual mode. Departmental tests were cancelled and some grading systems were completely changed: some groups of students did not have final numerical grades and only received a "pass" or "fail" grade. The changes in the Quebec educational environment have been so great that an artificial intelligence team that fails to account for them when designing its algorithm risks deploying one with questionable performance.
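A change of grading format like the one described above can often be caught before training by looking at how grades are recorded in each school year. The following is only a minimal sketch under stated assumptions, not Optania's method: the (year, grade) record layout, the "pass"/"fail" labels, the function names and the 10% threshold are all hypothetical choices made for illustration.

```python
from collections import defaultdict

def non_numeric_share_by_year(records):
    """Return {year: share of grades that are not numeric} for (year, grade) pairs."""
    totals = defaultdict(int)
    non_numeric = defaultdict(int)
    for year, grade in records:
        totals[year] += 1
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(grade, bool) or not isinstance(grade, (int, float)):
            non_numeric[year] += 1
    return {year: non_numeric[year] / totals[year] for year in totals}

def flag_regime_changes(records, threshold=0.10):
    """List the years whose share of non-numeric grades exceeds the threshold.

    The 10% default is an arbitrary value chosen for this example.
    """
    shares = non_numeric_share_by_year(records)
    return sorted(year for year, share in shares.items() if share > threshold)

# Hypothetical records: (school year, final grade)
sample_records = [
    (2018, 72), (2018, 85), (2018, 64),
    (2019, 78), (2019, 90), (2019, 58),
    (2020, "pass"), (2020, "fail"), (2020, 81),
]
flagged_years = flag_regime_changes(sample_records)
print(flagged_years)
```

A year flagged by such a check is a signal to involve domain experts before mixing that cohort with earlier ones, rather than a reason to drop the data automatically.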

Avoidable problems

Different mistakes can be made leading to data cascades:

Training a model on clean data and putting it into production on noisy data (data drift);
Interactions between a model trained in a closed environment and the real world;
Inadequate application of domain expertise in data preparation;
Poor data collection;
Poor inter-organizational documentation.

Figure 1: (Sambasivan et al., 2021)

These problems can result in limited performance once the model is in production, harm to the beneficiaries of the predictions, and even the abandonment of an entire AI project. It is therefore essential to address each of these points in order to avoid prediction problems when deploying models.
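The first mistake in the list above, data drift, can be monitored by comparing the distribution of a feature at training time with its distribution in production. The sketch below uses the Population Stability Index (PSI), a technique swapped in here for illustration and not one named by the sources cited in this post. The function name, the number of bins, the synthetic grades and the 0.25 alert level (a commonly quoted rule of thumb) are assumptions.

```python
import math

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a training sample (expected) and a production sample (actual)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the training range are clamped into the edge bins
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # A small floor avoids log(0) and division by zero for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Synthetic grades for illustration only
train_grades = [60 + (i % 30) for i in range(300)]   # spread over 60..89
shifted_grades = [g + 15 for g in train_grades]      # same shape, moved upward

psi_same = population_stability_index(train_grades, train_grades)
psi_shifted = population_stability_index(train_grades, shifted_grades)
print(psi_same, psi_shifted > 0.25)
```

Identical samples give a PSI of zero, and the value grows as the production distribution moves away from the training one. Which features to monitor and which level should trigger retraining remain project-specific decisions.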

AI product evolution and research perspective

As data quality is a necessity in the development of quality algorithms, Optania implements many strategies to achieve this end. The company strives to design sound and beneficial algorithms using the latest advances in statistics and computer science. Partnerships are established with government and higher education institutions to advance research in predictive analytics applied to education:

Research projects:
- Université du Québec à Trois-Rivières. Ongoing project on the explainability of predictions;
- Université Laval. Master's thesis to develop a method for the imputation of missing data, adapted to educational data;
- Université du Québec à Chicoutimi. Exploratory analysis of the data used in the prediction of Quebec college students' academic success.
Support from the National Research Council of Canada in research activities.

In this way, Optania ensures that the scientific rigor in the development of its artificial intelligence technologies meets the highest standards.

Simon Bouchard

Data scientist

References

Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021, May). "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1-15).
Press, G. (2021, June 16). Andrew Ng Launches A Campaign For Data-Centric AI.