How to better control the environmental impact of data-science projects - Part 1
- Posted on September 30, 2021
- Estimated reading time 6 minutes
In physics, entropy* corresponds to the degree of disorder of matter. Minimizing entropy means organizing elements as far from randomness as possible. This has become a major issue of our time, one that directly concerns the organization of human activities. What if we applied it to our data projects? Here are some concrete ideas, introducing the concept of data-entropy.

Moving out of the “more data = more value” paradigm
The human tendency to produce and store more food than we need is rooted in an ancestral drive to anticipate shortages. This tendency seems particularly pronounced in our data management! As a precaution, we collect and store as much data as possible, to maximize the likelihood that “nothing is forgotten” when the time comes to analyze and use it. Before any data accumulation, however, the question should now be: do we need all this data, and for what intended uses?
In other words, if a piece of data is not useful today, its value is actually negative, because it already generates an energy expenditure through its collection, storage and circulation.

Data is knowledge
Like knowledge, data can be transmitted, disseminated and copied to multiple destinations without disappearing from its origin. Generating a piece of data is therefore only proof of its cost and of its multiplicative potential, not of its value.
Every experiment, however, rests on hypotheses, and verifying them requires generating data. But there is no need to generate or process more than necessary! Data in itself is merely information; it is its transformation that gives it value and turns it into knowledge.

Verify and assess the value of a data item
The questions related to the tangible interest of a given piece of data are numerous:
- What purpose justifies the collection and computerization of this data?
- What set of phenomena can this data help to describe?
- Which data best represents a given phenomenon?
- Which existing data becomes obsolete when new data is collected?
It is therefore necessary to ascertain the relevance of a given piece of data and its ability to represent a phenomenon faithfully, but also to determine the appropriate scales of interest: at what frequency, in what format, with what transmissions, what duplications, and on what infrastructure should it be handled to maximize its interest while minimizing its environmental impact?

A “useful” volume of data must be rich in a diversity of observations
Just because I have a lot of data in a specific field doesn’t mean my predictions based on that data will be better. To build better models, I need repeated situations (to make my conclusions statistically relevant) but also a diversity of observations. Too much information about a single, fixed situation leads to overfitting, in which case additional data can actually degrade the quality of the overall forecast.

Allocate your “energy budget” according to the priority of the projects
Developing and running machine learning algorithms in production can be particularly energy-intensive. From simple linear regression to convolutional neural networks, resource requirements can easily increase tenfold, which is not always desirable. The choice of algorithm should therefore be made according to the gain in accuracy it brings and the explanatory power required, taking into account the necessary resources and the priority level of the project.

Modelling sometimes proves to be a compensation for a lack of connectivity between information systems
If companies record and process so much data, it is often not for the purpose of statistical modelling but for that of a personalized, directly connected service. Modelling has its value in generalizing a phenomenon and should not be confused with the direct connection of information systems.
Let us take a simple example. The staff restaurant of an office building wants to forecast, as accurately as possible, the number of diners for that day’s lunch and the assortment of meals to provide. Each morning it prepares the planned number and variety of meals, including a margin of error so as to satisfy all the guests, even those whose behavior is hardest to anticipate. To achieve this, the restaurant has two options.
Option 1: It can develop one or more predictive models based on historical meal records and data correlated with those observations, such as the weather or the season. It can also agree with the companies in the building to collect certain information that affects the number and type of guests (presence of employees at the workplace, external training sessions, etc.) and thus improve the quality of its forecasts.
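A minimal sketch of what Option 1 could look like: an ordinary least-squares fit on synthetic history. All figures here (the 70% take-up rate, the rain effect, the day counts) are illustrative assumptions, not real data:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic history over 60 service days (all figures are illustrative)
present = rng.integers(150, 400, size=60)   # employees on site that day
rainy = rng.integers(0, 2, size=60)         # 1 if rain is forecast at lunchtime
# Assumed ground truth: ~70% of on-site staff eat in, a few more when it rains
diners = 0.7 * present + 25 * rainy + rng.normal(0, 10, size=60)

# Least-squares fit of diners ~ present + rainy + intercept
X = np.column_stack([present, rainy, np.ones(60)])
coef, *_ = np.linalg.lstsq(X, diners, rcond=None)

# Forecast for tomorrow: 320 employees expected on site, rain announced
forecast = np.array([320, 1, 1]) @ coef
print(f"meals to prepare (with 5% margin): {np.ceil(1.05 * forecast):.0f}")
```

The accuracy of such a model lives or dies by the correlated features it can access — which is exactly where the connectivity between information systems comes in.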
Option 2: It can develop an application that lets employees of the companies in the building declare both their attendance and their meal choices, up until the day itself.
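Option 2 needs no model at all: the information system simply aggregates declarations. A minimal sketch (the names and meal categories are invented for illustration):

```python
from collections import Counter

# Each employee declares attendance and a meal choice through the app
declarations = [
    ("alice", "vegetarian"),
    ("bram", "fish"),
    ("carla", "vegetarian"),
    ("dieter", "meat"),
    ("elena", "fish"),
]

# The restaurant's information system derives its needs directly:
# no forecasting, no margin of error for the declared guests
meals_needed = Counter(choice for _, choice in declarations)
print(f"covers to set: {len(declarations)}")
for meal, count in sorted(meals_needed.items()):
    print(f"  {meal}: {count}")
```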
The first option is thus based on probabilistic statistical models whose accuracy depends on the ability to retrieve data correlated with the activity, that is, on the ability to connect different information systems to each other (for example, anonymized access to the electronic calendars of the employees of the building’s various companies).
The second option feeds the restaurant’s information system directly, ensuring that its needs are met while minimizing waste. Statistical modelling can then find its place in longer-term consumption forecasting, with a view to optimizing stocks. In reality, it is still rare to see supply assessed through such a direct collection of demand. Statistical models thus partly compensate for missing declarative tools or non-existent connections between information systems.

Conclusion: Integrate energy levers to redesign your data projects roadmap
First, it is important to continually question the value of using AI for a particular topic, and the answer must prioritize the fundamental issues of our time: transparency, ethics, and social and environmental responsibility. Secondly, the value of using AI must be weighed against other possible solutions to the problem. When direct personalization is desirable for a given service, it is sometimes unnecessary to model the underlying phenomena, and the costs saved can be reallocated to other AI projects. Medium-term anticipation is key to evaluating where AI is really useful.
Arbitrations are therefore necessary, guided by the priorities of our time, and to make them possible there is nothing like human and collective intelligence.
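One way to make such arbitration concrete is to estimate, before training anything, the compute footprint of candidate models. A rough back-of-the-envelope sketch, using multiply-accumulate (MAC) operations per prediction as a crude energy proxy — the architectures and figures are illustrative assumptions:

```python
# Compare two candidate models for the same hypothetical 10-feature
# tabular problem: parameter counts and MACs per prediction serve as
# a crude proxy for energy consumption at inference time.

n_features = 10

# Candidate 1: linear regression — one weight per feature plus a bias
linear_params = n_features + 1
linear_macs = n_features

# Candidate 2: a small MLP with two hidden layers of 64 units
layers = [n_features, 64, 64, 1]
mlp_params = sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))
mlp_macs = sum(n_in * n_out for n_in, n_out in zip(layers, layers[1:]))

print(f"linear regression: {linear_params} parameters, {linear_macs} MACs/prediction")
print(f"small MLP:         {mlp_params} parameters, {mlp_macs} MACs/prediction")
print(f"inference cost ratio ≈ {mlp_macs // linear_macs}x")
```

If the more complex model’s extra accuracy does not justify a several-hundred-fold inference cost, the simpler model frees that energy budget for higher-priority projects.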
We often talk about the energy improvements made on the hardware side, especially through optimizing the energy efficiency of data centers; however, it is important to also exploit the levers that exist on the software and project-management sides. This involves choosing technical architectures and selecting and implementing data projects while systematically integrating the constraint of minimizing the amount of energy used — especially since the results of this work lead to new services that spread very quickly and massively.
Are you ready to redefine your data projects roadmap by integrating this dimension? Check out Part 2 of this blog to know more.
(*): in the sense of Sadi Carnot’s second law of thermodynamics, not of Shannon’s information theory.