It was, above anything else, an editorial imperative. The right thing to do for our readers.
As The Atlantic’s deputy executive editor Sarah Yager puts it, making journalism is really about creating a record of the world around us. It’s about documenting what is happening and hopefully helping readers to make sense of it. And there’s enormous value in being able to learn from what was happening in the past, and how history was interpreted in real time.
The Atlantic had wanted to digitize its archive ever since it started publishing online in 1995. The website gave people a new way to read, online, the journalism we were writing that day. The magazine, however, had been in continuous print since the autumn of 1857, so there were a lot of stories that our readers couldn’t access online. From time to time editors would manually reproduce archival articles, but the massive task of making the whole archive available went unfinished for almost three decades.
How does one take tens of thousands of words from our print pages and publish them on the internet? How would The Atlantic suddenly go from not offering this content to having it in its data structures and on its website? What are the mechanical steps to getting there? These were some of the questions that our product and technology colleagues asked themselves in May 2021, when the magazine committed to the challenge of bringing the archive online.
At the beginning of a project of this scale, there are many different visions of what it should look like and of all the features it could include. As Executive Director of Product Carson Trobich explains, it can be hard to figure out how to break something that large down into first steps.
“You need to identify the limits to your ambition and put the initial excitement into research.”
Carson Trobich, Executive Director, Product
To orient our visions and find first steps, our product colleagues researched 20 publishers to see how they were resurfacing and repackaging archival content.
The team found that some publishers’ archives consisted only of scans of printed pages, while others transformed pages into digital text. An archive could be fully available online or just in part. It could live on a publisher’s website adjacent to modern content, or it could be its own separate product, with additional functionalities. Some archives even live off-platform.
The Atlantic decided early on that it was our ambition to make the full archive available. For transparency to our readers, and for the historical record, we wanted to share it all — from our most enduring reporting to some stories that have rightly fallen into obscurity. As our editor in chief Jeffrey Goldberg wrote in an editor’s note introducing the project, “It’s all here: the good, the bad, the brilliant, the offensive, the ridiculous. We knew from the start that we would engage in no censorship, trimming, or dodging.”
To give the archive a home on the current website, our product colleagues worked on digitizing past articles and presenting them in our modern article template. These are the steps they followed to get there:
1) Transcribing the content: The Atlantic came into this project with PDF scans of all the pages that it had ever published. To make sense of all that information, our engineering team worked with a vendor specializing in digitizing media magazine archives. The contractors used optical character recognition and high-resolution scans to identify different regions and zones within each page, mapping the position of everything The Atlantic ever printed.
This first step also required a schema definition, which taught the vendor to recognize what they were digitizing and laid the foundation for content ingestion. This way, the vendor learned how to identify content types (e.g. headlines or page numbers) and tag them in a way that our own internal systems could understand.
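To make the idea of a schema definition more concrete, here is a minimal sketch in Python of what a shared contract between a vendor and an internal ingestion system might look like. It is an illustration only, not The Atlantic’s actual schema: the `Zone` fields, the `ContentType` tags, and the `validate_zone` and `assemble_article` helpers are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class ContentType(Enum):
    """Hypothetical tags a vendor could assign to each region of a page."""
    HEADLINE = "headline"
    BYLINE = "byline"
    BODY = "body"
    PAGE_NUMBER = "page_number"


@dataclass(frozen=True)
class Zone:
    """One OCR-detected region on a scanned page (assumed shape)."""
    page: int
    x: float
    y: float
    width: float
    height: float
    text: str
    content_type: ContentType


REQUIRED_FIELDS = ("page", "x", "y", "width", "height", "text", "content_type")


def validate_zone(raw: Dict) -> Zone:
    """Check one vendor-supplied record against the schema and return a typed Zone."""
    missing = [key for key in REQUIRED_FIELDS if key not in raw]
    if missing:
        raise ValueError(f"zone is missing fields: {missing}")
    return Zone(
        page=int(raw["page"]),
        x=float(raw["x"]),
        y=float(raw["y"]),
        width=float(raw["width"]),
        height=float(raw["height"]),
        text=str(raw["text"]),
        # Enum lookup raises ValueError for a tag the schema does not define.
        content_type=ContentType(raw["content_type"]),
    )


def assemble_article(zones: List[Zone]) -> Dict:
    """Group tagged zones into a simple article record, in reading order."""
    # Sort by page, then top to bottom, then left to right.
    ordered = sorted(zones, key=lambda z: (z.page, z.y, z.x))
    headline = " ".join(
        z.text for z in ordered if z.content_type is ContentType.HEADLINE
    )
    body = [z.text for z in ordered if z.content_type is ContentType.BODY]
    # Page numbers and other furniture are tagged but left out of the article.
    return {"headline": headline, "body": body}
```

In a sketch like this, the value of the schema is that both sides agree in advance on the set of tags and fields: a record with an unknown content type or a missing coordinate is rejected at validation time, rather than surfacing later as a malformed article on the website.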