Personal Details and Political Bias of 198 Million US Citizens Leaked
The data was left publicly exposed by Deep Root Analytics, a contractor used by the Republican Party.
This is not a case of hacking, but of misconfiguration: all of the records were kept in an unprotected Amazon S3 bucket.
What is the data?
The exposed data included "names, dates of birth, home addresses, phone numbers, and voter registration details, as well as data described as “modeled” voter ethnicities and religions," according to UpGuard, the firm that discovered the at-risk bucket.
Over 1.1 terabytes of data was present in the S3 bucket dubbed 'dra-dw' (Deep Root Analytics Data Warehouse), and all of it was publicly downloadable, with no password needed. More than 24 terabytes of additional data was also stored in the data warehouse, but that part was configured to prevent public access.
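The difference between the downloadable part and the protected part comes down to configuration. As a minimal sketch, the check below flags public grants in a bucket ACL. The source does not say whether the exposure came from an ACL or from a bucket policy, so treating it as an ACL issue is an assumption, and example_acl is a hypothetical value shaped like the response of boto3's get_bucket_acl.

```python
# URIs of the predefined S3 grantee groups that make a grant public.
ALL_USERS = "http://acs.amazonaws.com/groups/global/AllUsers"
AUTHENTICATED_USERS = "http://acs.amazonaws.com/groups/global/AuthenticatedUsers"
PUBLIC_GROUPS = {ALL_USERS, AUTHENTICATED_USERS}


def public_grants(acl):
    """Return the (group URI, permission) pairs that expose a bucket publicly."""
    exposed = []
    for grant in acl.get("Grants", []):
        grantee = grant.get("Grantee", {})
        if grantee.get("Type") == "Group" and grantee.get("URI") in PUBLIC_GROUPS:
            exposed.append((grantee["URI"], grant.get("Permission")))
    return exposed


# Hypothetical ACL for illustration only; not the real 'dra-dw' configuration.
example_acl = {
    "Owner": {"ID": "owner-id-placeholder"},
    "Grants": [
        {
            "Grantee": {"Type": "CanonicalUser", "ID": "owner-id-placeholder"},
            "Permission": "FULL_CONTROL",
        },
        {"Grantee": {"Type": "Group", "URI": ALL_USERS}, "Permission": "READ"},
    ],
}

print(public_grants(example_acl))
```

An empty result does not prove a bucket is private, since a bucket policy can also grant public access; this only covers the ACL side.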
Where does the data come from?
The RNC paid Deep Root Analytics $983,000 last year, but its S3 data warehouse contained records from multiple sources, such as The GOP Data Trust, which received over $6.7 million from the party in 2016.
Other sources include TargetPoint, i360, and the Koch brothers' Americans for Prosperity. Other files containing in-depth information about political ads (costs, audience, demographics) come from The Kantar Group.
Finally, 170 GB of data was scraped from Reddit, including the highly controversial r/fatpeoplehate. Two hypotheses emerge about the presence of so many scraped subreddits in the data trove: it could either be training data for a natural language processing AI, or it could be an effort to associate Reddit users with voter registration records.
But why r/fatpeoplehate? FiveThirtyEight used a machine learning technique called latent semantic analysis to analyze over 50,000 active subreddits. This allowed them to measure commenter overlap and do what they call "subreddit algebra": adding or subtracting two subreddits and seeing which third subreddit the result's commenters most closely resemble.
What happens when you filter out commenters’ general interest in politics? To figure that out, we can subtract r/politics from r/The_Donald. The result most closely matches r/fatpeoplehate, a now-banned subreddit that was dedicated to ridiculing and bullying overweight people. ~ FiveThirtyEight
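The subtraction described in the quote can be sketched as vector arithmetic followed by a cosine similarity search. This is only an illustration of the idea: the subreddit names and three-dimensional vectors below are invented placeholders, whereas FiveThirtyEight derived its vectors from commenter overlap using latent semantic analysis.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def subtract(a, b):
    """Element-wise a - b: removes b's 'interest' from a."""
    return [x - y for x, y in zip(a, b)]


def closest(target, vectors, exclude=()):
    """Name of the vector most similar to target, skipping excluded names."""
    candidates = {name: vec for name, vec in vectors.items() if name not in exclude}
    return max(candidates, key=lambda name: cosine(target, candidates[name]))


# Invented placeholder vectors; real ones would come from commenter overlap.
vectors = {
    "r/sub_a": [0.9, 0.8, 0.1],
    "r/sub_b": [0.9, 0.1, 0.0],
    "r/sub_c": [0.1, 0.9, 0.1],
    "r/sub_d": [0.8, 0.0, 0.6],
}

# "r/sub_a minus r/sub_b": what remains once the shared interest is removed.
result = subtract(vectors["r/sub_a"], vectors["r/sub_b"])
print(closest(result, vectors, exclude={"r/sub_a", "r/sub_b"}))
```

The two operands are excluded from the search so that the result is compared only against other subreddits.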
What is the impact?
The most worrying impact of this leak is identity theft. Even though some states like Ohio make voter data available online, having so much data compiled in one place makes it dangerously powerful. Moreover, records like these are what have powered re-identifications of anonymized data, as in the famous case of Professor Latanya Sweeney, who re-identified Governor William Weld in anonymized medical data.
in 2000, she [Sweeney] showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex. ~ Ars Technica
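A re-identification of this kind is essentially a join on the three quasi-identifiers. The sketch below is a toy illustration with entirely invented records and field names; it is not Sweeney's actual method or data.

```python
from collections import defaultdict

# The three attributes named in the quote above.
QUASI_IDENTIFIERS = ("zip", "birthdate", "sex")


def key(record):
    """Tuple of quasi-identifier values used to link records."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(anonymous_records, named_records):
    """Link each anonymous record to a name when exactly one person matches."""
    names_by_key = defaultdict(list)
    for record in named_records:
        names_by_key[key(record)].append(record["name"])
    matches = []
    for record in anonymous_records:
        names = names_by_key.get(key(record), [])
        if len(names) == 1:  # a unique match means the record is re-identified
            matches.append((record["record_id"], names[0]))
    return matches


# Invented voter-roll style records (names and values are placeholders).
voters = [
    {"name": "Alice Example", "zip": "00001", "birthdate": "1970-01-01", "sex": "F"},
    {"name": "Bob Example", "zip": "00001", "birthdate": "1980-02-02", "sex": "M"},
    {"name": "Carl Example", "zip": "00001", "birthdate": "1980-02-02", "sex": "M"},
]

# Invented "anonymous" records that still carry the quasi-identifiers.
anonymous = [
    {"record_id": "rec-1", "zip": "00001", "birthdate": "1970-01-01", "sex": "F"},
    {"record_id": "rec-2", "zip": "00001", "birthdate": "1980-02-02", "sex": "M"},
]

print(reidentify(anonymous, voters))
```

Only rec-1 is linked to a name here, because two voters share the second combination; the more complete the voter file, the more often such combinations are unique.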
This data was collected by only one party, yet it includes private details about voters from the whole political spectrum.
Does prevention exist?
If your company manages large confidential datasets, applying security best practices will lower your risks.
- Start with asset discovery: you can't protect what you don't know about. Even though asset used to mean physical servers, with the cloud, the definition must be broadened to include VMs, containers, and databases, along with your network infrastructure.
- Start with a secure configuration. This may include policy compliance.
- Monitor changes: nothing is static. Throughout the lifecycle of your assets, policy violations and accidental misconfigurations might happen. If you can detect changes when they happen, you can apply remediation faster. In the case of this leak, Deep Root says it is confident nobody accessed the data while it was vulnerable. But how can they know?
- Apply Vulnerability Management: new software vulnerabilities and threats constantly emerge. Moving your infrastructure to the cloud does not make it automagically immune to vulnerabilities: even though you don't have to worry about hardware woes anymore, software vulnerabilities are still up to you.
- Monitor logs: incidents may happen, even if you do everything right. Having logs allows for more complete investigations.
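The "Monitor changes" item above can be sketched as a comparison between an approved baseline and the current configuration snapshot. This is a minimal illustration, assuming configurations are available as flat dictionaries; the setting names (public_read, logging_enabled, encryption) are hypothetical and not tied to any particular cloud provider.

```python
ABSENT = "<absent>"  # marker for a setting missing from one snapshot


def diff_config(baseline, current):
    """Return {setting: (baseline value, current value)} for every change."""
    changes = {}
    for setting in sorted(set(baseline) | set(current)):
        old = baseline.get(setting, ABSENT)
        new = current.get(setting, ABSENT)
        if old != new:
            changes[setting] = (old, new)
    return changes


# Hypothetical snapshots of one storage bucket's settings.
baseline = {"public_read": False, "logging_enabled": True, "encryption": "aes256"}
current = {"public_read": True, "logging_enabled": True}

for setting, (old, new) in diff_config(baseline, current).items():
    print(f"{setting}: {old} -> {new}")
```

Running such a comparison on a schedule, and alerting on any non-empty result, shortens the window during which a misconfiguration goes unnoticed.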
For the full technical analysis of the leak, visit UpGuard's website.