How to better control the environmental impact of data-science projects - Part 2
- Posted on October 11, 2021
- Estimated reading time 6 minutes
In physics, entropy* measures the degree of disorder of matter. Minimizing entropy means organizing elements as far from randomness as possible, and doing so has become a major issue of our time, one that directly concerns the organization of human activities. What if we applied the idea to our data projects?
In Part 1 of this blog, I introduced the concept of data-entropy and some concrete ideas for applying it to data projects. In this post, I delve into it a little further.
Data structuring must be driven by use cases
When data is used for analysis or modelling, it is first studied, cleaned up and, above all, prepared so that it can be combined with other data. The most common example of preparation is bringing two data sets to the same time scale. Only then can these data “communicate”, allowing us to infer a relationship between the variables or phenomena.
However, across multiple projects, the same data sources are often prepared separately, each time according to the other sources involved and their respective granularities. It is therefore not uncommon to find, within the same data lake, the steps above repeated on the same data source. More of this work could be shared, avoiding redundant transformations and storage.
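As a minimal sketch of this most common preparation step, here is how two hypothetical sources at different granularities (per-minute sensor readings and hourly energy figures, both invented for illustration) can be brought to the same time scale with pandas before being joined:

```python
import pandas as pd

# Hypothetical data: a per-minute sensor series and an hourly energy series.
minute_idx = pd.date_range("2021-10-11", periods=180, freq="min")
sensor = pd.Series(range(180), index=minute_idx, name="sensor")

hourly_idx = pd.date_range("2021-10-11", periods=3, freq="h")
energy = pd.Series([10.0, 12.0, 9.0], index=hourly_idx, name="energy")

# Aggregate the fine-grained series up to the coarser hourly scale...
sensor_hourly = sensor.resample("h").mean()

# ...so that the two series share one index and can "communicate".
combined = pd.concat([sensor_hourly, energy], axis=1)
print(combined)
```

If this alignment step is packaged as a shared, documented function rather than rewritten per project, the redundancy described below largely disappears.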
The keys to avoiding this redundancy are:
- Pipeline sharing: the visual workflows of loading, preparation and transformation associated with a project, allowing all or part of them to be reused
- Intelligent access to documentation (intelligent search) and project visualization
- Ongoing analysis of project interactions
If used properly, the new tools for pipeline design and for building and monitoring DataOps and MLOps projects can meet at least some of these objectives.
Sampling and surveys: what volume for what usage?
For each analysis or modelling exercise, it is necessary to determine what volume of data is needed to establish a sufficient approximation of what is to be demonstrated. To make a decision, we very rarely need an analysis precise to the decimal point. Depending on the scenario, 10% of the data is sometimes enough to establish the desired observation and to follow the evolution of a phenomenon.
The sampling techniques used for populations can be applied to data: simple, stratified, or cluster random sampling. Depending on the case, the appropriate method is chosen to estimate a result at the desired level of precision.
Unlike political polls, survey techniques applied to data do not save on the cost of collecting it, but they do save some of the resources needed to process it. As with conventional surveys, the desired level of precision determines the sampling rate required, and a confidence interval (or margin of error) accompanies the result.
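The link between precision and sample size can be made concrete with the classic sample-size formula for estimating a proportion, n = z²·p·(1−p) / e², where z is the z-score for the chosen confidence level, p the expected proportion (0.5 is the conservative worst case) and e the margin of error:

```python
import math

def sample_size(margin_of_error, confidence_z=1.96, p=0.5):
    """Required sample size for estimating a proportion:
    n = z^2 * p * (1 - p) / e^2, rounded up."""
    return math.ceil(confidence_z**2 * p * (1 - p) / margin_of_error**2)

# A 3-point margin of error at 95% confidence needs about 1,068
# observations, whether the source table has 1e5 or 1e9 rows.
print(sample_size(0.03))  # → 1068
```

Note that the required size depends on the desired precision, not on the size of the underlying population, which is exactly why sampling saves processing resources on large tables.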
Let’s take the example of dashboard latency. Given the volume of data now available for many projects, it is not uncommon to see dashboards that are sluggish. The mistake is sometimes to connect them to a very large volume of data (aggregated or not) and then look for a solution on the infrastructure side, mobilizing new IT resources, when an intelligently selected subset of that data would be enough to solve the latency problem with the same resources.
The marginal utility of data
The additional detail provided by one extra unit of information corresponds to the marginal utility of the data. This gain can be weighed against the incremental energy required to process that unit. For each scenario, there is a threshold beyond which the precision contributed by an additional data point is zero, close to zero, or even negative. Beyond that threshold, there is no point in collecting or using the data.
There are several scenarios in which this indicator can be applied to decide whether additional data should be used:
- Granularity of production and transmission: for example, an IoT system that produces and transmits a temperature reading every second while the actual use of this temperature is at the level of the minute.
- Obsolescence: does using an additional, older historical data point improve or degrade the quality of my forecast? The latter situation can be called negative marginal predictability of the data.
- The volume of data estimated to be necessary for the desired level of precision (see the section on sampling above).
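The first scenario can be sketched in a few lines: a simulated sensor (invented data, normally distributed around 20 °C) emits one reading per second, but the use case only needs minute-level values, so transmitting per-minute aggregates divides the volume by 60 with no loss of useful precision:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor: one temperature reading per second for 10 minutes.
rng = np.random.default_rng(0)
seconds = pd.date_range("2021-10-11", periods=600, freq="s")
temp = pd.Series(20 + rng.normal(0, 0.1, size=600), index=seconds)

# Downsample to the granularity the use case actually needs.
per_minute = temp.resample("1min").mean()
print(len(temp), "->", len(per_minute))  # 600 -> 10
```

The same resampling step applied at the edge, before transmission, also reduces network and storage load, not just processing.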
Thus, the production, transmission and use of data must always be calibrated against an objective, a need, and the desired level of precision.
Conclusion: The need for frugal approaches
The rise of Big Data sometimes creates environments whose complexity and sprawl limit our mastery of how data is structured and used. The underlying material and energy resources are often substantial. Given the volumes of data available, it is essential to develop techniques that minimise the energy costs of implementing these projects, especially since the results of this work lead to new services that spread quickly and massively.
The statistical modelling that sits, often unmentioned, behind today’s talk of artificial intelligence and data science is primarily a set of approximation methods. Take the example of weather forecasts: we can only estimate a future weather situation, not predict it exactly. The volume of data needed to arrive at an estimate is certainly large, but beyond a certain threshold the gain brought by each additional data point diminishes. Frugal approaches are therefore needed to avoid generating more data than necessary.
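The diminishing gain can be illustrated with the simplest possible estimator: when estimating a mean, the standard error shrinks like 1/√n, so going from 100 to 10,000 observations improves precision far more than going from 10,000 to 1,000,000 (a sketch on invented normally distributed data):

```python
import numpy as np

# Hypothetical population: one million values with mean 50, std 10.
rng = np.random.default_rng(1)
population = rng.normal(50, 10, size=1_000_000)

# Standard error of the mean falls as 1/sqrt(n): each order of
# magnitude of extra data buys less and less additional precision.
for n in (100, 10_000, 1_000_000):
    se = population.std() / np.sqrt(n)
    print(f"n={n:>9,}  standard error ~ {se:.3f}")
```

The same 1/√n logic underpins the sampling-rate discussion earlier: past a use-case-dependent threshold, extra data costs energy without buying meaningful precision.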
Beyond classical documentation, new tools for building and sharing project pipelines make it easier to reuse the data transformations already established in information systems and, for instance, to avoid redundant data preparation.
Faced with these issues of structuring data and better controlling the negative externalities generated by its use, the notion of data entropy seems to me to cover all the dimensions on which we could act, at several scales. But since I am not a physicist, I remain entirely open to discussion and criticism about the use of this term in this article.
(*): in the sense of the second law of thermodynamics (Sadi Carnot), not of Shannon’s information theory.