FROM THE PAGE: An Excerpt from Matthew Connelly’s The Declassification Engine

By Luis Diaz | March 13, 2023 | Humanities & Social Sciences

Using the latest techniques in data science, historian Matthew Connelly analyzes the millions of state documents both accessible to the public and still under review to unearth not only what the government does not want us to know, but what that secrecy says about the very authority we bequeath to our leaders. By culling this research and carefully studying a series of pivotal moments in recent history from Pearl Harbor to drone strikes, Connelly sheds light on the drivers of state secrecy—especially consolidating power or hiding incompetence—and how the classification of documents has become untenable. What results is an astonishing study of power: of the greed that develops out of its possession, of the negligence that it protects, and of what we lose as citizens when it remains unchecked.


Should This Book Be Legal?

There I was, sitting at a massive conference table inside of a multibillion-dollar foundation, staring at the wood-paneled walls. I was facing a battery of high-powered attorneys, including the former general counsel to the National Security Agency, and another who had been chief of the Major Crimes Unit at the U.S. Attorney’s Office in the Southern District of New York. The foundation was paying each of them about $1,000 an hour to determine whether I could be prosecuted under the Espionage Act. 

I am a history professor, and all I had done was apply for a research grant. I proposed to team up with data scientists at Columbia University to investigate the exponential growth in government secrecy. Earlier that year, in 2013, officials reported that they had classified information more than ninety-five million times over the preceding twelve months, or three times every second. Every time one of these officials decided that some transcript, or email, or PowerPoint presentation, was “confidential,” “secret,” or “top secret,” it became subject to elaborate protocols to ensure safe handling. No one without a security clearance would see these records until, decades from now, other government officials decided disclosure no longer endangered national security. The price for keeping all these secrets was growing year-by-year, including everything from retinal scanners to barbed wire fencing to personnel training programs, and already totaled well over eleven billion dollars. But so too were the number and size of data breaches and leaks. At the same time, archivists were overwhelmed by the challenge of managing just the first generation of classified electronic records, dating to the 1970s. Charged with identifying and preserving the subset of public records with enduring historical significance, they were recommending the deletion of hundreds of thousands of State Department cables, memoranda, and reports sight unseen. The costs in terms of democratic accountability were incalculable but included the loss of public confidence in political institutions, the proliferation of conspiracy theories, and the increasing difficulty even historians now have in reconstructing what our leaders do under the cloak of secrecy. 

We wanted to assemble a database of declassified documents and use algorithms to reveal patterns and anomalies in the way bureaucrats decide what information must be kept secret and what information can be released. To what extent were these decisions balanced and rule-based, as official spokesmen have long claimed, consistent with federal laws and executive orders requiring the preservation of public records, and prompt disclosure absent compelling national security reasons? Were the exceptions so numerous as to prove the existence of unwritten rules that really served the interests of a “deep state”? Or was the whole system so dysfunctional as to be random and inexplicable, as other critics insist?

We were not just satisfying our own curiosity. We wanted to determine whether we could reverse-engineer these processes, and develop technology that could help identify truly sensitive information. If we assembled millions of documents in databases, and harnessed the power of high-performance computing clusters, it might be possible to train algorithms to rank order records requiring the closest scrutiny and accelerate the release of everything else. If someone did not start prototyping a “declassification engine” of this sort, the exponential increase in government secrets—more and more of them consisting of data rather than paper documents—might make it impossible for public officials to meet their own legal responsibilities to maximize transparency. Even if we failed to get the government to adopt this kind of technology, testing these tools and techniques would reveal gaps and distortions in the public record, whether from official secrecy or archival destruction. 

But what if this research revealed specific things the government was trying to hide for good reason? The lawyers in front of me started to discuss the worst-case scenarios, and the officers of the foundation grew visibly uncomfortable. What if my team was able to reveal the identity of covert operatives? What if we uncovered information that would help someone build a nuclear weapon? The foundation’s lawyers warned that if it gave us the money, its staff might be prosecuted for aiding and abetting a criminal conspiracy. Why, the most senior program officer asked, should they help us build “a tool that is purpose-built to break the law”?

The only one who did not seem nervous was the former ACLU lawyer whom Columbia had hired to represent us. He had argued cases before the Supreme Court. He had defended people who published schematics of nuclear weapons—and won. He had shown how any successful prosecution required proving that someone had possession of actual classified information. How could the government go after scholars doing research on declassified documents?

The ex-government lawyers pointed out that we were not just academics making highly educated guesses about state secrets—not when we were using high-performance computers and sophisticated algorithms. True, no journalist, no historian, can absorb hundreds of thousands of documents, analyze all of the words in them, instantly recall every one, and rank each according to one or multiple criteria. But scientists and engineers can turn millions of documents into billions of data points and use machine learning to detect patterns and make predictions. Every time you watch the movie Netflix recommends, or buy the book that Amazon suggests, the results of machine learning speak for themselves. If we threw enough data at the problem of parsing redacted documents, couldn’t these techniques “recommend” the words most likely to be hiding behind the black boxes, words that presumably were hidden for good reason?

We tried to explain that this was beyond the scope of our project. This kind of research is really difficult. It can take years just to aggregate and “clean” messy data, and there is nothing more disorganized than millions of government documents spread across multiple archives and online databases. So even if we were to stray into classified realms, there would be plenty of time to ponder the results and pull the plug before any supercomputer started revealing state secrets. Lawyers, however, are paid to dream up other people’s nightmares, and these former prosecutors had succeeded in spooking the program officers. 

Then one of the program officers—the youngest one—asked a simple question, the most important question of all: Why did we even want to do this research? He reminded me of one of my seminar students, and for a moment it felt like I was back in the classroom. I realized he had handed me an opportunity to remind everyone why we were there—why, that is, professors do research, why foundations try to help them, and why we need a government that protects all of us from people who would take away this kind of freedom—the freedom to seek out the truth wherever it takes us. 

We were and are trying to better understand state secrecy because it is a matter of enormous and growing public importance. A whole series of intensely polarizing incidents have made the American people ever more skeptical of whether they can trust government officials—from WikiLeaks and Iraq war crimes to Edward Snowden’s revelation of NSA “exploits,” from Hillary Clinton’s private email server to FBI surveillance of the Trump campaign. Suspicion that a secretive “deep state” is unaccountable even to presidential power fuels conspiracy theories that are sapping the strength of our democracy.


Now, with computational methods, we could not only help ensure citizens are informed of government actions, we could start doing advanced research to better inform public policy on a whole range of vital topics, whether international trade, early warning of wars and revolutions, or the spread of weapons of mass destruction. For decades scholars have struggled to piece together scraps of archival evidence in order to probe the inner workings of what we call “the official mind.” Now, with data science and enormous collections of declassified documents, we can carry out the functional equivalent of CT scans and magnetic resonance imaging. Should the risks to national security of forgoing these insights, and forgetting the lessons of the past, not also weigh in the balance?

Even in terms of the narrower concern about information security, the government’s current policies and practices are increasingly ineffective. China has exfiltrated tens of millions of records revealing personal information about virtually everyone who has ever worked for the U.S. government, and Russia has infiltrated hundreds of government and corporate networks. A whole series of government committees and commissions have identified the same fundamental problem: Officials have not clearly and consistently specified what information actually requires safekeeping, making it impossible to prioritize what is truly sensitive. They will painstakingly study forty-year-old military service records page-by-page because of the infinitesimal risk that a nuclear bomb design might slip through. Meanwhile, sniper manuals and recipe books for high explosives—documents that could easily kill people—are accidentally released and left on the open shelves of the National Archives.

Unless someone is allowed to systematically analyze what kind of information is classified, and why, we cannot begin to develop the practical techniques required for a more rational, risk-management approach to releasing nonsensitive records. Without such techniques, the CIA was telling us that the whole system for declassification would shut down. It was simply impossible to manually review and redact billions of classified emails, text messages, and PowerPoint presentations. If instead these records were withheld indefinitely, or destroyed, it would be impossible even for historians to reconstruct what officials did decades ago under the cloak of secrecy. And if, at the end of the day, our government is not even accountable in the court of history, it truly is accountable to no one.

I looked around the room, and it seemed that some of the tension had lifted. People began to joke about the idea of Columbia professors run amok. If we did uncover anything we thought posed a potential risk to national security, we assured them we could first show it to people with the knowledge and the responsibility to act, much as newspaper editors have done for decades. Even the ex-NSA lawyer admitted we had a very strong defense against any over-reaching prosecutor: The First Amendment to the U.S. Constitution. As Chief Justice Earl Warren wrote in 1957, “Scholarship cannot flourish in an atmosphere of suspicion and distrust. Teachers and students must always remain free to inquire, to study and to evaluate, to gain new maturity and understanding; otherwise our civilization will stagnate and die.”

In the end, there were handshakes all around, and the most senior program officer took me aside to discuss next steps. He confided that the foundation president himself had said he did not understand why the project had been subject to this kind of scrutiny. He would support the grant and send it to their Board of Trustees, though with some conditions to be specified later. Soon after, the Board gave its approval, subject to final legal review. In the meantime, we received provisional approval for a much bigger grant from the MacArthur Foundation. It seemed we were on our way. 

We were astonished when, three months later, the foundation told us what conditions they planned to impose through an intellectual property agreement: Every member of our team would have to sign nondisclosure agreements that would bar us from discussing the work with anyone else. Even student volunteers could not talk about the project with their professors. A steering committee of national security experts would first have to approve any such communication. And while the grant was only for one year, this confidentiality agreement would apply forever. A project tackling the problem of excessive secrecy would itself be the subject of extraordinary secrecy.

We were told that all the conditions were nonnegotiable. The program officer seemed very blunt about his aim: “The point in this case is to control the use of the resource that we are funding and to put mechanisms in place to do so.” It would be a poison pill, because no one would want to participate in the project if they could never even talk about it. But if we lost this funding we might lose everything. How could we expect the MacArthur Foundation to move ahead when another foundation, with more time to consider the risks, had already backed out? With no funds to hire engineers or software developers, our “declassification engine” would begin to shut down before it even started.  

But in that moment I understood better than ever before why it was essential that we persevere. I could now see with my own eyes that official secrecy had grown completely out of control. It has created fear far outside the confines of government offices, defense contractors, and even newsrooms. Even in the halls of academia, and inside elite foundations, people have come to fear prosecution just for doing research on state secrecy. I have seen this fear—seen, for instance, an eminent computer scientist break down in tears, explaining that colleagues had warned him to immediately withdraw from the project or risk deportation. I’ve seen graduate students told by their professors that they dare not present this research in public. My own students regularly joke—nervously—about whether we might all be under surveillance. Many data scientists depend on contracts from the Pentagon and the Intelligence Community, so even the perception of official displeasure can be enough to drive them away from this kind of work.

But in the end, our university stood by us, and the MacArthur Foundation was undeterred. Columbia’s president, Lee Bollinger, is a scholar of the First Amendment, and he would go on to make the danger of exponential growth in official secrecy a keynote in his 2014 commencement address. Robert Gallucci, MacArthur’s leader at the time, personally approved the project. As a former diplomat, one who previously led nuclear weapons negotiations with North Korea, Gallucci was not the kind of person to be intimidated by high-priced lawyers preaching about national security. 

With the MacArthur grant, I created History Lab, a team of data scientists and social scientists dedicated to exploring the past in order to discover lessons for the present and the future. I still get warnings from people with security clearances that some government officials think I could be prosecuted under the Espionage Act. There are risks in data-mining declassified documents, and this book will take some risks in revealing what is possible. But the technology for surveilling government secrecy, machine learning, is technology the government itself began to develop decades ago in order to surveil us. It starts with aggregating all the data we can gather, much as the NSA assembles communications and personal information in databases. When we can’t access the content of those communications, such as when they are still classified, we analyze the metadata, just as the data scientists do at Fort Meade. And like them, we use all the tools in our toolbox to sift and sort through the information to generate insights. These tools range from traffic analysis, which can reveal “bursts” of communications corresponding to unusual events, to anomaly detection, which uncovers documents that have been misclassified or inconsistently redacted, to predictive analytics, which can identify what kind of information government reviewers seemed particularly intent on concealing. Rather than a simple machine, The Declassification Engine described in this book is best understood as a platform that combines big data, high-performance computing, and sophisticated algorithms to reveal what the government did not want us to know, and why it did not want us to know it.

Obviously we don’t have the resources of the NSA. But over the last seven years, we have built the world’s largest database of declassified documents. The data scientists I have been privileged to work with in academia are at least as good as those working for the government. And while university-based, high-performance computing clusters are not as fast as those employed by the government, they are much more powerful than anything available to the NSA when it was creating these secrets decades ago.  

It may still seem like a story of David versus Goliath. The goal, however, is not to vanquish government secrecy but to knock some sense into the thick skull of “the official mind.” By combining data science with deep historical research, The Declassification Engine demonstrates how we can and must take a more rational, risk-management approach to government secrecy. We need to distinguish the kind of information that really does require protecting from the information that citizens urgently need to hold their leaders to account. This is the only way to uphold both democratic accountability and national security. There is nothing more dangerous—both to itself, and to others—than a nuclear-armed superpower that is not even answerable to its own people.

As citizens, we have been on the losing end of struggles over state secrecy for far too long. We have to use every legal means at our disposal to reclaim our right to know what leaders do in our name. That does not necessarily mean we need to know the details of particular secrets, or specific things the government still has good reason to keep classified. It means we need to know the kinds of things officials guard most closely. Most importantly, we need to understand why they have insisted on keeping those secrets for so long. 

As I write this in 2021, the government can no longer even estimate how many secrets it creates each year, or how much it is spending to try to protect them all. The National Archives cannot cope with the impossibly large task of reviewing all the government’s classified records, and next year will not even accept any more paper documents because it has no place to put them. Meanwhile, potentially dangerous information is continuing to slip out. And historically vital information is being destroyed, or buried under a mountain of useless data, making it impossible even for future historians to hold our leaders to account, or for future generations to learn what they have lost. 

To save America’s old and honorable tradition of open government—and to save history itself—we must arm ourselves with the power that knowledge gives, including the power of artificial intelligence. If we do, we can turn the weapons of state surveillance into tools to rebuild our republic.

Copyright © 2023 by Matthew Connelly. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.


MATTHEW CONNELLY is a professor of international and global history at Columbia University and the principal investigator at History Lab, an NSF-funded project to apply data science to the problem of preserving the public record and accelerating its release. He received his B.A. from Columbia and his Ph.D. from Yale. His previous publications include A Diplomatic Revolution: Algeria’s Fight for Independence and the Origins of the Post-Cold War Era, and Fatal Misconception: The Struggle to Control World Population.

The Declassification Engine
What History Reveals About America's Top Secrets
A captivating study of US state secrecy that combines data science and incisive history to uncover the vast system government officials use to hoard power and evade democratic accountability
$32.50 US
Feb 14, 2023
560 Pages