Sooooo…How Are You Getting Your AI Back After a Disaster?
After decades in IT and data protection, I can tell you the moment I dread most in IT meetings. It’s a moment coming to your meeting rooms very shortly. You know the one: a seemingly simple question gets asked, and none of the smart people around the table can even begin to answer it, or even guess at an answer. Yeah, that one. That moment is always a sign your organization just tripped over a big IT stumbling block. And now your job is to fall forward without getting hurt.
This moment really hit me while I was reading an NVIDIA press release from March 18, 2024, and came across this statement:
“(NVIDIA) provide industry-standard APIs for domains such as language, speech and drug discovery to enable developers to quickly build AI applications using their proprietary data hosted securely in their own infrastructure.”
Ummmm, what’s this “proprietary data hosted securely in their own infrastructure” stuff? Huge red flags went up for me, enough to make a May Day parade look black and white. AI solutions are far more complex than the solutions that preceded them, with data set sizes potentially beyond anything the typical data protection administrator has ever dealt with. That’s when I asked myself the technical question that I sadly knew would be met with silence: how do you fully protect, and restore, an AI solution? I sat there awkwardly alone in my meeting of one, with no answer.
Oh sure, the challenge of coordinating on-prem and remote (i.e., “cloud”) data and resources has been around forever. Plus, everyone has DR plans to coordinate the restoration and reactivation of the on-prem part of every IT solution with the remote staff and resources it depends on, right? SaaS adds a spin to these challenges, but it’s still not scary for those who prepare. Hopefully you’re thinking “sure” to all of this. But that’s where the similarities between AI solutions and those old-school DR challenges we know so well end. Don’t let the comfortable part of this architecture fool you. AI has far more to it, and I’ve found no one who has put together even a minimal, let alone comprehensive, document on how to protect it and get it back. The silence continued in the virtual IT meeting in my head about AI backups and restores.
I started asking those I knew in data protection who’re much smarter than me, and even some AI developers, what they knew about DR for AI. I heard crickets in return. The awkward silence just kept going. That’s when I felt my foot hit the AI protection-and-recovery stumbling block, hard and squarely.
AI is a relatively new, rapidly expanding, and to most people technically mysterious part of IT, and it requires the DR of something new: “learning.” All forms of AI must be trained, or learn, how to do their jobs. That learning is also the most valuable and therefore most prized part of an AI system. And hackers will stop at nothing to steal your learning and resell it to others who don’t want to create their own.
How does one back up and restore this learning? Where does this learning even live, and what would you restore it back into? Hopefully it’s stored in a commonly known database with a nice safe squishy backup API and protocols to make that easy. If not, how do you quiesce it to get a consistent snapshot or backup of it? Can you quiesce it without damaging it? Do you even know where, or in what, your organization plans to store its own proprietary AI data? Or will you find out later, when the data’s lost or damaged and it’s your job to put it back? And the learning part is just the first AI protection/restoration challenge.
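To make that quiescing question concrete: if the learning lives in model checkpoints rather than a friendly database, a consistent backup means pausing training before anything gets copied. Here’s a minimal sketch, assuming a PyTorch training loop; the function name and checkpoint layout are my own invention, not any product’s API:

```python
import hashlib
import torch

def quiesced_checkpoint(model, optimizer, epoch, path):
    """Serialize a consistent snapshot of the "learning" and return a
    fingerprint for verifying later restores. Assumes the caller has
    already paused the training loop (the quiesce) before calling."""
    state = {
        "model": model.state_dict(),          # the learned weights themselves
        "optimizer": optimizer.state_dict(),  # momentum/LR state needed to resume
        "epoch": epoch,
        "torch_rng": torch.get_rng_state(),   # RNG state for a reproducible resume
    }
    torch.save(state, path)
    with open(path, "rb") as f:  # fingerprint the on-disk bytes, not the in-memory copy
        return hashlib.sha256(f.read()).hexdigest()
```

Even a toy sketch like this surfaces the real questions: who pauses the loop, where that fingerprint gets cataloged, and whether the optimizer state (often as large as the model itself) is being protected at all.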
What about neural networks, also known as artificial neural networks (ANNs) or simulated neural networks (SNNs)? Those are custom software and hardware decision-making configurations storing hard-won and expensive learning, built from connected units, or nodes, called artificial neurons. Strange: I can’t find any documentation on how to back up and restore neural networks anywhere, and I’m a PM for Veritas with 25 years of data protection experience.
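Until that documentation exists, the restore side is worth sketching too. Continuing the hypothetical example above, here’s a restore that refuses to load a checkpoint whose fingerprint no longer matches the one recorded at backup time (again, an illustration, not a real Veritas or PyTorch interface):

```python
import hashlib
import torch

def verified_restore(model, optimizer, path, expected_sha256):
    """Reload a neural network's learned state only if the on-disk
    checkpoint still matches the fingerprint recorded at backup time."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != expected_sha256:
        raise IOError(f"checkpoint {path} failed its integrity check; "
                      "roll back to an earlier version")
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])          # put the learning back
    optimizer.load_state_dict(state["optimizer"])  # and the state needed to resume
    torch.set_rng_state(state["torch_rng"])
    return state["epoch"]
```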
If the above wasn’t enough, what are you doing about your AI storage? Large language models (LLMs) are at the top of your list to protect, because they embody the expensive training your AI solutions run on. Building LLMs is not something for the timid. They can be immense, which means you’re not restoring one over lunch on a whim. AI administrators also don’t have the networking bandwidth or the real-time luxury of keeping just one copy of an LLM. Most likely they’ll have multiple storage nodes getting “just in time” delivery of limited data to keep multiple AI solutions running. There isn’t time to completely synchronize all the copies. One’s probably a primary copy, with only the changed bits going out to the secondary copies. Don’t confuse that with true synchronization. Just because it’s running in memory on the secondaries doesn’t mean it’s been committed to disk, let alone identical across all the other nodes. Most likely it isn’t. As a data protection administrator, you must remember that AI administrators put AI performance first; data storage is second at best. They don’t have time to worry about your job.
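There’s no standard tool I know of for proving those copies actually agree, but the check itself is easy to sketch. Here’s a hypothetical divergence report that fingerprints the model shards on each storage node and flags secondaries that have drifted from the primary; the directory layout and “.shard” naming are invented for illustration:

```python
import hashlib
from pathlib import Path

def shard_fingerprints(node_root):
    """Hash every model shard stored under one node's root directory."""
    prints = {}
    for shard in sorted(Path(node_root).glob("*.shard")):
        h = hashlib.sha256()
        with open(shard, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):  # hash 1 MiB at a time
                h.update(chunk)
        prints[shard.name] = h.hexdigest()
    return prints

def report_divergence(primary_root, secondary_roots):
    """Compare each secondary's on-disk shards against the primary's.
    "Running fine in memory" proves nothing about the bytes on disk."""
    primary = shard_fingerprints(primary_root)
    for root in secondary_roots:
        for name, digest in shard_fingerprints(root).items():
            if primary.get(name) != digest:
                print(f"{root}/{name} differs from the primary -- not a true sync")
```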
A lot of your AI storage will be in the cloud. Now is the time to make very sure you understand the data protection clauses of your contracts with your cloud vendor(s) for your AI assets. Do not make the mistake of assuming your vendors can recover everything your AI relies on to function. We’ve seen many customers learn the hard way what their contracts really covered, especially around recovery time windows. Cloud vendors do their best to meet the terms of their agreements, but stuff happens. If you’ve been in IT more than five years and read the news headlines, you know there are no IT assets that are 100% reliable. Even the best people with the best equipment get into unrecoverable positions. That’s not incompetence or laziness on a vendor’s part. It’s called real IT life. Partner very heavily with your cyber security team(s) and ask the hard question: “what do we do if we lose that AI asset for a while?” Work out a plan A, B, C, etc. You will need them eventually.
Now, hackers don’t just want to encrypt or delete your LLM or AI learning data (of course, they’ll still do those); they want to steal a copy of it and resell it, or reuse it for their own nefarious ends. Training an AI can cost millions of dollars, which makes a trained model as valuable as digital currency. Congratulations: your AI is not only a bigger target but an even bigger cash cow for hackers than your other data ever was. That means you’re now dealing with a completely new suite of attacks you haven’t previously faced, centered on data integrity. And data integrity is absolutely key to proper AI functionality.
AI hacks are much more subtle and malicious than previous attacks. Hackers use so-called “data poisoning” to bias AI models, create inaccurate or skewed answers, and generally undermine the reliability and usefulness of AI-generated outputs. Oh, and have you heard of prompt injection attacks? That’s where hackers craft malicious input prompts that override your AI’s instructions and manipulate its behavior; for systems that keep learning from user interactions, a steady stream of them can corrupt the learning itself. It’s like a modern version of a denial-of-service (DoS) attack: they pound your AI with skewed or nonsense input until it learns to answer the wrong questions, or answer them incorrectly. Eventually the AI administrators will discover this “drift” in their AI, and you’ll get an emergency call that the learning data has to go back to a given state in a hurry. Better have a strong versioning recovery tool like NetBackup, so you can pick the exact rollback moment you want, confirm it has already been scanned for malware (or scan it now), and put the learning back in a hurry. These attacks mean you’d better up your data restore game way beyond what you were doing for VMs, databases, mail systems, etc. The “good old days” of simple malware hacks now look old-school, simple, and easy to prepare for. AI hacks are a whole new frontier.
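NetBackup does the heavy lifting here in practice; purely to make the rollback logic concrete, here’s a toy sketch of choosing that exact moment from a catalog of versioned snapshots. The catalog record and its fields are invented for illustration:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SnapshotRecord:
    taken_at: datetime     # when this version of the learning data was captured
    sha256: str            # integrity fingerprint recorded at backup time
    malware_scanned: bool  # already scanned clean, so the restore can start now

def last_known_good(catalog, drift_detected_at):
    """Pick the newest clean snapshot taken before the drift was observed --
    the exact data protection rollback moment the AI team will demand."""
    candidates = [s for s in catalog
                  if s.taken_at < drift_detected_at and s.malware_scanned]
    if not candidates:
        raise LookupError("no scanned snapshot predates the drift; "
                          "scan older versions before restoring")
    return max(candidates, key=lambda s: s.taken_at)
```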
So now, as you finish reading this post, I’ll ask you that awkward question again, a little differently, in our virtual meeting. Considering the above, what will you do if malware corrupts or wipes out your AI data or systems? How will you roll your learning back to a reliable state at very high speed to avoid critical losses in service? How will you identify the point in time at which your AI data was last reliable? What reporting will you use to make sure you’ve got the data protection canned and on the shelf in your digital pantry, ready to use in a hurry? What extra administrative steps are you adding to your DR and cyber recovery plans to accommodate AI? Answer these soon, before your phone rings for an AI restore.