Let's talk about the GenAI sustainability problem:
How to scale GenerativeAI integration sustainably
Is AI content generation actually sustainable, and what happens when we scale this up?
The adoption of generative AI is rapidly expanding across brands, agencies, and production studios, and reaching an ever-growing number of industries and sectors worldwide.
It's no longer sector-, industry-, or even purpose-specific - it is becoming ubiquitous.
We're not just talking about text prediction anymore - we're talking image, audio, music, video, and so much more!
LLMs are giving way to multi-modal models featuring much richer physical, visual, and audio datasets, all designed to make their capabilities more realistic, responsive, and useful. But as the complexity of these generative AI models increases, it's easy to forget the impact this is having on the compute process.
The first LLMs were around 1.5 billion parameters (roughly 6-7GB of weights), while GPT-4 is reported to be around 1.8 trillion parameters (roughly 6.6TB). But as we start to scale out multimodal components, model and dataset sizes are growing exponentially. Anthropic's latest models are rumoured to exceed 10 trillion parameters (which would be over 37TB).
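As a rough sanity check on those bracketed storage figures, here's a back-of-the-envelope sketch. The 4-bytes-per-parameter assumption (32-bit weights) is mine; quantised or 16-bit checkpoints would be smaller.

```python
# Rough storage estimate: parameters x bytes per parameter.
# Assumes 32-bit (4-byte) weights; 16-bit or quantised checkpoints would be smaller.
def weight_storage_tib(params: float, bytes_per_param: int = 4) -> float:
    return params * bytes_per_param / 2**40  # tebibytes

for name, params in [("early LLM (~1.5B params)", 1.5e9),
                     ("GPT-4 (reported ~1.8T params)", 1.8e12),
                     ("rumoured 10T-param model", 10e12)]:
    print(f"{name}: ~{weight_storage_tib(params):.2f} TiB")

# early LLM (~1.5B params): ~0.01 TiB  (roughly 6 GB)
# GPT-4 (reported ~1.8T params): ~6.55 TiB
# rumoured 10T-param model: ~36.38 TiB
```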
And it's the continuous training and inference of these new models that is becoming extremely energy intensive.
Energy scaling is superlinear - energy consumption grows faster than model size. Doubling an LLM's parameters can lead to a 4-6 times increase in energy use.
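To put "superlinear" in concrete terms: if we assume energy follows a simple power law in parameter count (an illustrative assumption, not a measured model), a 4-6x increase per doubling implies an exponent between roughly 2 and 2.6.

```python
import math

# If E grows like N**k, then doubling N multiplies E by 2**k.
# Solve for k given the reported 4-6x increase per doubling.
for energy_multiplier in (4, 6):
    k = math.log2(energy_multiplier)
    print(f"{energy_multiplier}x per doubling -> implied exponent k = {k:.2f}")

# 4x per doubling -> implied exponent k = 2.00
# 6x per doubling -> implied exponent k = 2.58
```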
Training the first LLMs consumed the equivalent energy of running 120 households for a full year. And projections suggest that by 2027, the AI sector's annual energy consumption could be equivalent to that of the Netherlands, driven largely by the pursuit of multi-modal LLMs.
So, what's the solution here? Are GenAI solutions naturally hampered by their gross training scales, and are there ways of expanding the capabilities of LLMs without losing control of the storage and compute demands?
Tackling the GenAI scaling issue in Scope 1.
Replacing traditional production processes with modern virtual and generative production processes is one of the best ways to immediately cut carbon emissions across the content architecture. But when we get into Scope 3, is this really sustainable at scale?
Add GenAI to the sustainability table.
Use low-carbon-intensity energy in off-peak periods.
This may seem straightforward, but IT demand (and compute power) can have radically different peak periods from other office utility functions.
This method has been used in the CGI and VFX industry for decades. Where rendering would require days of compute time and energy to deliver finished frames, studios would shift rendering to off-peak and overnight systems, or package the files for offsite render farms located where energy is cheaper and more renewable.
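A minimal sketch of what carbon-aware scheduling can look like for AI workloads. The `get_grid_carbon_intensity()` helper and the 200 gCO2/kWh threshold are hypothetical placeholders - in practice you would query your grid operator's or cloud provider's carbon-intensity API.

```python
import time

CARBON_THRESHOLD_G_PER_KWH = 200  # hypothetical "clean enough" threshold


def get_grid_carbon_intensity() -> float:
    """Placeholder: return current grid carbon intensity in gCO2/kWh.
    Replace with a real call to your grid operator's or cloud provider's API."""
    return 180.0  # dummy value so the sketch runs


def run_when_grid_is_clean(job, poll_minutes: int = 30) -> None:
    """Defer a compute-heavy training or inference batch until the grid is greener."""
    while get_grid_carbon_intensity() > CARBON_THRESHOLD_G_PER_KWH:
        time.sleep(poll_minutes * 60)  # wait for an off-peak / lower-carbon window
    job()  # dispatch the workload


if __name__ == "__main__":
    run_when_grid_is_clean(lambda: print("Starting overnight batch inference..."))
```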
Use managed services for base models.
Depending on your expertise and specific use case, consider opting for more established serverless, fully managed systems that provide access to a range of foundation models (LLMs) through any number of APIs, and build locally around these infrastructures.
Using a managed service is a great way to shift the responsibility for maintaining high utilisation, scaling with demand, and sustainability optimisation to a provider with the architecture to really tackle these challenges. Providers will also have more accurate data to support the validation process for Net Zero ambitions.
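As one illustrative example (assuming the OpenAI Python SDK and an `OPENAI_API_KEY` in the environment - the same pattern applies to other managed providers), calling a hosted foundation model rather than self-hosting looks roughly like this:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The provider handles hosting, utilisation, and scaling of the base model;
# we only pay (in cost and energy) for the tokens we actually use.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # a smaller hosted model, chosen here purely for illustration
    messages=[
        {"role": "system", "content": "You are a concise production assistant."},
        {"role": "user", "content": "Summarise this shot list in three bullet points: ..."},
    ],
    max_tokens=150,  # cap output tokens to limit compute per call
)
print(response.choices[0].message.content)
```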
Define the right level of customisation.
With content and operational AI functions, there are several strategies to enhance the efficiency of your AI results without adding further training data to the base models.
These range from simple prompt control and engineering all the way to full-scale T-RAG (tree-enhanced Retrieval-Augmented Generation) and fine-tuning.
Picking the right solution is specific to the demand you have for the AI function. The added bonus of a modular augmentation is that you can scale or minimise the function demand at any time.
Selecting the right base model
In almost every "generative" function, the base model is the LLM. These may vary based on function and purpose (multimodal or GANs), but in this context we are going to talk about "base models" and augmentations. There are a number of factors that dictate choosing the right base model for sustainable growth:
We've talked about the sustainability problems with training our own base model, so let's look at how we could leverage a managed-service LLM and augment it to perform better and faster for our specific purposes. There are a number of great models currently available, and these are continually being enhanced and improved, so it can be difficult to know where to place the foundations of your infrastructure. But first, what makes a good LLM, and what factors should a business evaluate in selecting the right foundation?
Accuracy:
The model's ability to generate correct, relevant, and coherent responses. More accurate models tend to perform better across multiple tasks and sectors. There are multiple ways of validating and benchmarking base LLMs.
Performance:
This includes things like "thinking" speed and handling complex queries efficiently. Faster models with better performance can make significant energy and token savings by arriving at the desired response quicker. There are plenty of good performance benchmarks out there.
Cost:
This covers initial setup costs, ongoing usage fees, increased token consumption for deeper queries, and maintenance expenses. This is a tricky one, as you need to consider the expected demand on the LLM and where your fine-tuning effort will taper off.
Customisation potential:
The ability to fine-tune or adapt the model is the reason we are here. Some commercial LLMs make this difficult to achieve, whereas others have invested significant effort in streamlining this connection. A streamlined API is the more sustainable solution in the long run.
Data Security and privacy:
Most major models conform to the highest security standards and accreditations for handling sensitive data. However, if your use case is extremely specific, additional data wrangling on the client side can keep your LLM options more open down the line.
Model size and computational requirements:
There is a place for smaller models. Look at GPT-3.5 versus GPT-4: these models are very different in capacity and capability, yet for certain commercial tasks the smaller model will be more sustainable and cheaper than the larger one.
Specialisation:
Multimodal, video, and audio models may excel in specific services or tasks, making them superior choices for those particular use cases. How you tap into these specialists helps lower the overall expectations placed on a single LLM. We will eventually use agents to query these specialists and aggregate the data into a centralised hub.
Update frequency:
The frequency of updates is crucial, and ongoing improvements to the model can impact its long-term viability. If the business uses data from the base model as a reference or analytics tool, update frequency becomes vital.
Ease of integration:
This factor is a mixture of customisation potential and specialisation. The base model will be linked to various components in the business, and how it handles these connected interactions can have a significant impact on sustainability and speed.
Ethical heritage:
This is a constantly evolving field. Factors such as bias mitigation, hallucination management, and alignment with ethical AI governance and national (and international) legislation become crucial for effective control of the outputs.
This seems like a lot to consider! How are we ever going to get started?
It's true, there are many factors that will influence the scalability of the AI architecture a business implements. However, there are some easy steps to ensure we don't need to replace the foundations later down the line:
- Build your foundation with modular, node-based inputs – Most of what we do here should be built on a framework whose parts are interchangeable. As our understanding of and commercial relationship with AI develops, so will our expectations. It's like learning to drive again! We don't start with a racing car - but the foundations of how to drive, and what fuels the car, are interchangeable.
- Reduce the need for expensive customisation – Gather information using public resources such as open LLM leaderboards, holistic evaluation benchmarks, or model cards to compare different LLMs. Depending on your use cases, consider domain-specific or multilingual models to reduce the need for additional customisation.
- Start with a small model size and small context window – Large model sizes and context windows (the number of tokens that can fit in a single prompt) can offer more performance and capabilities, but they also require more energy and resources for inference. Consider the smaller available versions of models before investing in your own development, and build "knowledge bases" to augment these expectations (see the token-counting sketch below).
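To make the token point concrete, here's a small sketch using the tiktoken tokeniser. The `cl100k_base` encoding is the one used by recent OpenAI models; the two example prompts are illustrative - swap in the tokeniser that matches your chosen base model.

```python
import tiktoken

# cl100k_base is the encoding used by recent OpenAI models;
# use the tokeniser that matches your own base model.
enc = tiktoken.get_encoding("cl100k_base")

verbose_prompt = ("Please could you, if at all possible, provide me with a detailed, "
                  "thorough and complete summary of the following production notes ...")
concise_prompt = "Summarise these production notes in 3 bullets: ..."

for label, prompt in [("verbose", verbose_prompt), ("concise", concise_prompt)]:
    print(f"{label}: {len(enc.encode(prompt))} tokens")

# Fewer input tokens means less memory and compute per request.
```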
Let's have a look at how various changes to the way we interact with LLMs can affect the cost, time, and sustainability of the AI workflow:
Prompt engineering
Effective prompt engineering can enhance the performance and efficiency of generative AI models.
By carefully crafting prompts, you can guide the model’s behaviour, reducing unnecessary iterations and repeat calls.
We have a full series on effective prompt engineering across image, video, and animation sources. Here's a short guide for effective prompting:
- Keep prompts concise and avoid unnecessary details – I know this is probably the opposite of your experience so far, BUT longer prompts mean more tokens, and as the token count increases the model consumes more memory and computational resources. Consider incorporating zero-shot or few-shot prompting. This methodology will become more effective as larger LLMs adapt to more semantic and contextual inference (think of Apple Intelligence or the GPT-4o releases).
- Experiment with different prompts gradually – Refine the prompt architecture slowly. It's really easy to remove factors that you feel are irrelevant before you should. Depending on your task, try looking at self-consistency prompting. There are plenty of examples of great prompt architecture available.
- Use reproducible prompts – With templates such as LangChain prompt templates, you can save and load your prompt history as files. This enhances prompt tracking, versioning, and reusability. Once you know which prompts produce the best answers for each model, you can reduce redundant prompt iterations and experiments across different use cases (a short sketch follows below).
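A minimal sketch of that reproducibility idea with LangChain's prompt templates. The template wording and the file name are illustrative placeholders.

```python
from langchain.prompts import PromptTemplate, load_prompt

# Define a parameterised prompt once, version it as a file, and reuse it.
template = PromptTemplate.from_template(
    "Write a {tone} product description for {product}, in under {word_limit} words."
)
template.save("product_description_prompt.json")  # track alongside your code

# Later (or in another service), reload the known-good prompt instead of
# re-experimenting - fewer iterations means fewer redundant model calls.
prompt = load_prompt("product_description_prompt.json")
print(prompt.format(tone="playful", product="a recycled notebook", word_limit=60))
```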
What if we need more control over the output parameters?
This is likely going to be the case in every commercial application of AI. So, how do we integrate an AI infrastructure into our pipeline that delivers consistency and control, reduces the chance of hallucinations, but doesn't carry the energy demands that building our own model would?
Retrieval-Augmented Generation (RAG), Parameter-Efficient Fine-Tuning (PEFT), and traditional fine-tuning are three distinct methods used to enhance the performance of large language models (LLMs), each with its own benefits and capabilities. The choice between RAG, PEFT, and fine-tuning depends on the specific needs of the outcome (and the value and scale of the data demand), the availability of data, and the desired balance between precision and adaptability. Which is more sustainable?
Retrieval Augmented Generation (RAG)
Retrieval Augmented Generation (RAG) is a really effective way of augmenting model capabilities by retrieving and integrating classified external information from a predefined dataset. On a smaller scale these datasets are known as "knowledge bases", and they shorten both the time to solution and the energy used to arrive at it.
Because existing LLMs are used as-is, this strategy avoids the energy and resources needed to train the model on new data or build a new model from scratch. RAG is also evolving to support additional types of data alongside text. This multimodal capability is where the greatest benefits are achieved in a computer vision capacity - using video, audio, and imagery to drive query inputs!
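A stripped-down sketch of the RAG pattern over a tiny "knowledge base". The documents, model names, and the simple cosine-similarity lookup are illustrative assumptions; a production system would use a vector database, but the shape of the workflow is the same: retrieve, then generate.

```python
from openai import OpenAI

client = OpenAI()

knowledge_base = [
    "Brand guideline: the logo must always sit on an off-white background.",
    "Campaign brief: spring focuses on recycled packaging.",
    "Tone of voice: friendly, concise, no jargon.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

doc_vectors = embed(knowledge_base)

def answer(question: str) -> str:
    q_vec = embed([question])[0]
    # Retrieve the most relevant snippet instead of retraining the base model.
    best_doc = max(zip(knowledge_base, doc_vectors),
                   key=lambda dv: cosine(q_vec, dv[1]))[0]
    prompt = (f"Context: {best_doc}\n\n"
              f"Question: {question}\nAnswer using only the context.")
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(answer("What background should the logo sit on?"))
```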
Parameter-Efficient Fine-Tuning (PEFT)
Parameter-Efficient Fine-Tuning (PEFT) techniques - such as LoRA (low-rank adaptation) and embedding-based adapters - are a fundamental aspect of sustainability in generative AI.
PEFT achieves similar benefits to full fine-tuning while using far fewer trainable parameters, by freezing the weights of the pre-trained LLM and training small adapter layers on top.
As an example, studies have shown that LoRA can reduce the number of trainable parameters by a factor of up to 10,000 and the GPU memory requirement by around 3x, while preserving output consistency.
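A minimal sketch of that parameter saving using the Hugging Face peft library. GPT-2 is used purely as a small open example, and the target module name is specific to that architecture - larger models use different module names.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a small open base model and put LoRA adapters on top of its frozen weights.
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # low-rank dimension of the adapter matrices
    lora_alpha=16,
    target_modules=["c_attn"],  # GPT-2's attention projection layer
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)

# Only the small adapter matrices are trainable; the base weights stay frozen,
# which is where the parameter and GPU-memory savings come from.
model.print_trainable_parameters()
# e.g. "trainable params: ~0.3M || all params: ~124M || trainable%: ~0.24"
```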
Fine-tuning
Fine-tune the entire pre-trained model with the additional data. This approach may achieve higher consistency and performance but is more resource-intensive than PEFT.
For example, if you anticipate reusing the model within a specific function or business unit, you may prefer a multimodal-focused adaptation. On the other hand, instruction-based fine-tuning is better suited for general use across multiple tasks.
Fine-tuning is extremely useful if you're augmenting multimodal LLMs with changes to the source data, such as targeted data components in view - where we need to control consistency across certain inputs within a video or image.
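For contrast, here is a bare-bones sketch of full fine-tuning with the Hugging Face Trainer. The model choice and the `domain_corpus.txt` file are placeholders; the key point is that every weight in the model is updated, which is what makes this path so much more resource-intensive than PEFT.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # small open model used purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)  # all weights trainable

# "domain_corpus.txt" is a hypothetical in-house text dataset.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="full-finetune", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # updates every parameter - far heavier than a PEFT adapter run
```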
Sustainability is a growing consideration in how we build, integrate, and deploy AI processes.
As generative AI models become bigger and bigger, it's inevitable that environmental impact is going to influence how we commission these models.
As we discussed earlier, the comparison to VFX and CGI rendering is apt in describing these 'hand-off' points throughout the process. As humans we are not built to 'know everything'... we augment our knowledge with teams, staff, and other resources to help us protect relevancy, efficiency, and consistency across tasks - and for the time being, this is going to be the same for LLMs.