Machine Surveillance is Being Super-Charged by Large AI Models


Imagine an America where multiple police officers and security guards stand watch on every block, in every park, in every store, and in every other public space around the clock. Imagine these officers watching us constantly, not only scrutinizing our every move for signs of “suspicious” behavior, but also noting many details about us — such as what we’re wearing and carrying, who we’re with and what our relationship appears to be with them — and recording those details in a searchable database.
That’s not going to happen. Nobody wants to pay for human officers to stand watch in places where almost nothing ever happens. But we are moving toward a future where we might end up with an automated machine equivalent. Video analytics, the technology to make that possible, has been developing for many years. Now, the same generative artificial intelligence (AI) techniques that have revolutionized large language models (LLMs) like ChatGPT are in the process of creating a new, more powerful generation of this technology that could super-charge video surveillance.
In 2019, the ACLU published a report on how video analytics makes it possible for machines not just to record oceans of video, but to “watch” that video, in the sense that they’re able to analyze, in real time, what’s happening in a video feed, and to send alarms to humans when certain conditions are met. Video data, which was formerly very difficult to search and analyze, has also become increasingly searchable through queries such as “find me a male wearing a purple shirt and carrying a violin case” that (like face recognition) can now be run across vast amounts of video data. Since then, video analytics technologies have become widely available, with most commercial surveillance cameras including some form of the technology built in.
The older generation of video analytics, however, is limited to detecting a narrow set of objects on which it has been laboriously trained, and often performs poorly — “sold and marketed way beyond real-life performance,” as one industry player put it. But today the revolutionary advances in large language models are in the process of spawning a new generation of the technology. While language models, per their name, are mostly focused on text, the techniques and advances that led to those models’ breakthrough success are spilling into machine vision as well — specifically programs dubbed “Vision Language Models” (VLMs) that can understand both visual and natural-language textual inputs. In computer science terms, these new machine vision programs are based on the same technology as language models, called transformers, as opposed to classic machine vision work, which is based on a technology called convolutional neural networks (CNNs). While both technologies continue to be used and sometimes combined, and video technologies are still evolving fast, this appears to be a big change.
The advent of vision language models will have three important effects.
1. They make the technology more powerful and capable.
VLMs are able to generalize much better than the older, CNN-based video analytics programs because they combine image recognition with the general world-knowledge that large language models gain as part of their training on all the Internet’s textual data.
In the older form of machine vision, for example, a CNN might be shown millions of pictures of horses and elephants and thus laboriously learn to identify and distinguish them. A VLM, on the other hand, might be able to find a zebra in a video even if it had never seen a photo of a zebra before, simply by leveraging its world knowledge (that a zebra is like a horse with stripes). Instead of being limited to a closed set of predefined things, VLMs are able to recognize an enormous variety of objects, events, and contexts without being specifically trained on each of them. VLMs also appear to be much better at contextual and holistic understandings of scenes.
The CEO of security analytics company Ambient.ai declared this shift “the most significant technology evolution in the history of video analytics ever,” saying “it solves all the problems that kept traditional analytics from getting the last miles to large-scale adoption.” Logan Kilpatrick, manager of Google DeepMind, told a podcaster that
my guess is that as [VLM-based] vision becomes more and more prominent, we’re going to see [startups] go after all of these eco-systems and industries where they’re using domain-specific vision models and not using a general purpose model. And… you unlock all these use-cases which those models are just not actually capable of doing; they’re very very rigid and can’t be fault tolerant in a lot of those cases.
Anybody can gain a sense of the power of the new models using this site created by a former Google engineer to teach people just how much information can be extracted from their photos by AI, or by going directly to a site like Google’s AI Studio and playing around with uploading photos and videos. In addition to detailed descriptions of objects and people, the models can make observations on things like emotional state and even social class.
2. They make analytics much cheaper and more broadly available.
In December the technologist Simon Willison calculated that to analyze all of the 68,000 images in his personal photo library using the Google Gemini model would cost $1.68. It’s also possible to stream videos to models like Gemini and have them analyze the contents, which appears to cost roughly 10 cents per hour of video. Such low costs mean that as the technology is refined, and as understanding of these capabilities spreads, it will not be confined to Google and a handful of other AI developers. The technology will become easily accessible to a broad variety of security companies and find its way into the products that are used to monitor us across a wide variety of contexts, from private spaces like stores and shopping malls, to those public spaces where police departments have deployed surveillance cameras.
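As a rough back-of-the-envelope check on these figures, the short Python sketch below works out the implied cost per image and extrapolates the quoted hourly video rate to a single camera. The dollar amounts are the ones cited above; the assumption that a camera streams around the clock for a full year, along with the function and constant names, is purely illustrative and not from the source, and real pricing varies by model and changes over time.

```python
# Figures quoted in the article; actual pricing varies by model and over time.
LIBRARY_IMAGES = 68_000       # images in the photo library
LIBRARY_COST_USD = 1.68       # quoted cost to analyze the whole library
VIDEO_CENTS_PER_HOUR = 10     # "roughly 10 cents per hour of video"


def cost_per_image_usd(total_usd: float, images: int) -> float:
    """Average cost of analyzing one image, in dollars."""
    return total_usd / images


def yearly_camera_cost_usd(cents_per_hour: int,
                           hours_per_day: int = 24,
                           days: int = 365) -> float:
    """Cost of streaming one camera for a year, in dollars.

    Assumes (hypothetically) continuous streaming at a flat hourly rate.
    The arithmetic is done in whole cents to avoid float rounding.
    """
    return cents_per_hour * hours_per_day * days / 100


if __name__ == "__main__":
    per_image = cost_per_image_usd(LIBRARY_COST_USD, LIBRARY_IMAGES)
    per_year = yearly_camera_cost_usd(VIDEO_CENTS_PER_HOUR)
    print(f"Per image: {per_image * 100:.4f} cents")
    print(f"One camera, 24/7 for a year: ${per_year:,.2f}")
```

Under those assumptions, the quoted figures work out to about 0.0025 cents per image and roughly $876 per camera per year of continuous analysis, which illustrates why cost is unlikely to be a meaningful barrier to deployment.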
As with LLMs, the models may also increasingly become possible to run locally, without having to connect to the servers of, and share data with, OpenAI, Google, or other big companies. It’s good if AI technologies are democratized rather than being controlled by big players, but that also means that guardrails — such as those we recommended in our report — are going to become vital as various parties, well-intentioned and not, deploy them.
3. Their natural language interfaces make machine vision much more approachable and easy to use.
Instead of being confined to precisely worded menus or tags of objects and behaviors that a model has been trained to recognize, users can just issue commands using everyday speech, such as “text me if the dog jumps on the couch,” “let me know if any kids walk on my lawn,” or, troublingly, “alert me if a Black man enters the neighborhood” or “if someone is behaving suspiciously.”
The tech still fails
It’s important to keep in mind that, like large language models, vision language models are unreliable. The surveillance industry analysis firm IPVM tested one security company’s new VLM-powered product and observed that it “returned some results that were incredibly impressive but also some results that were incredibly bad.” A group of academic and industry experts explained in a recent paper that
connecting language to vision is not completely solved. For example, most models struggle to understand spatial relationships or count… They often ignore some part of the input prompt [and] can also hallucinate and produce content that is neither required nor relevant. As a consequence, developing reliable models is still a very active area of research.
As with face recognition (which is actually a subset of video analytics), there are reasons to worry about this technology when it works poorly, and other reasons to worry when it works well. If VLMs remain unreliable, but just reliable enough that people depend on them and don’t double-check that results are accurate, that could lead to false accusations and other injustices in security contexts. But to the extent the technology becomes more intelligent, that will also allow for more and richer information to be collected about people, and for people to be scrutinized, monitored, and subjectively judged in more and more contexts.
In the end, nobody knows how capable this technology will become or how quickly. But policymakers need to know that advancing AI means surveillance cameras are no longer the classic cameras of yesterday that do nothing more than record. Already we’re seeing AI used for monitoring in an increasing number of contexts, including vehicle driver monitoring, workplace monitoring, gun detection, and the enforcement of rules. If we let it happen, we can expect that nearly every rule, regulation, law, and employer dictate that can be enforced through visual monitoring of human beings will become subject to these unblinking and increasingly intelligent yet unreliable artificial eyes.