So much data, why?


There have been a lot of reports lately about the loss of large volumes of data (thousands, tens of thousands, and hundreds of thousands of records in both the public and enterprise space), but there is one question I rarely see asked: Why so much data? And what should be done about it?

I understand wanting to take some work home with you, or to bring some data from one location to another, but much of the data being grabbed in large quantities is unnecessary. Individuals tend to use data in three ways: bulk processing, aggregate analysis, and detailed analysis.

For each of these three uses, I believe there are simple rules that would stop most of these problems before they start. These could be implemented at the database level and thus enforced programmatically in ways that are very difficult to get around and require no human intervention to activate, just human intervention to circumvent. Here is my modest proposal for data access restrictions:

Bulk processing--bulk processing is for things like sending out notifications that your data has been compromised or mailing out checks. This kind of processing should always be done on secure computers with access limited to those programs which have been certified to run in special environments that are used for this purpose. These jobs tend to involve very large numbers of recipients, often require special hardware or printers and are generally not the kinds of things that people need to use laptops for. However, they do require access to huge volumes of user-specific data and therefore need to be protected by the most secure processes.

Aggregate analysis--I believe most of these breaches in security and privacy are caused by people who move data to their laptops in order to perform aggregate analysis. In these cases, the people performing analysis may indeed need large amounts of data, but as the amount of data increases, the need to have identifying information diminishes. Had just the name, street address, SSN, and other personally identifying information been removed, most of these breaches would have been unimportant, and the removal would not have impeded the ability of the employees to do their jobs. Some of the address data (ZIP or even ZIP+4) could even be kept, which would allow sophisticated geospatial analysis and comparison against census data without compromising the privacy of the people covered by the data. To make this work, certain elements of the database description should be marked as "personally identifying" and limited to export only if the request qualifies as detailed analysis (the next category). De-personalized data (data containing no personally identifying information) can be downloaded by personnel to perform analysis.
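As a rough illustration of the "personally identifying" marking described above, here is a minimal Python sketch. The column names, the PII_COLUMNS set, and the depersonalize function are all hypothetical; a real system would keep these tags in the database schema and enforce them there, not in application code.

```python
# Hypothetical set of columns tagged as "personally identifying".
# In practice this tag would live in the database's own metadata.
PII_COLUMNS = {"name", "street_address", "ssn"}


def depersonalize(records, pii_columns=PII_COLUMNS):
    """Return copies of the records with every PII-tagged column removed."""
    return [
        {col: val for col, val in rec.items() if col not in pii_columns}
        for rec in records
    ]


# Made-up sample record, for illustration only.
records = [
    {
        "name": "A. Farmer",
        "street_address": "1 Main St",
        "ssn": "000-00-0000",
        "zip": "12345",
        "benefit_amount": 1200,
    },
]

# The export keeps ZIP for geospatial work but drops the identifying fields.
export = depersonalize(records)
```

The ZIP code survives the export, so the analyst can still compare against census data, while the original records are left untouched on the server side.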

Detailed analysis--here is where your personal information actually needs to be in the hands of a government (or corporate) worker. If a service provider is going to be visiting (or visited by) a series of customers over the course of a day or a week, they might legitimately need full copies of their records (consider someone from a government agency who is physically visiting farms or veterans) in order to discuss detailed issues. Although the amount of detail necessary is large and will include personally identifying information, the volume of records is small, and thus if the laptop were to be misplaced or stolen, the number of people affected by it would be tiny.
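The trade-off between detail and volume could be expressed as a single export check, sketched below in Python. The cap of 50 records, the column names, and the names check_export and ExportDenied are assumptions made for illustration, not part of any existing guideline or product.

```python
# Hypothetical columns tagged as personally identifying.
PII_COLUMNS = {"name", "street_address", "ssn"}

# Assumed cap on how many records may leave the database with PII attached.
MAX_PII_ROWS = 50


class ExportDenied(Exception):
    """Raised when a request asks for both bulk volume and personal detail."""


def check_export(columns, row_count, max_pii_rows=MAX_PII_ROWS):
    """Allow bulk exports without PII, or small exports with PII, never both."""
    wants_pii = bool(PII_COLUMNS & set(columns))
    if wants_pii and row_count > max_pii_rows:
        raise ExportDenied(
            f"{row_count} rows with personal data exceeds the cap of {max_pii_rows}"
        )
    return True


# A large de-personalized export and a small detailed export both pass.
check_export(["zip", "benefit_amount"], 100_000)
check_export(["name", "street_address", "zip"], 20)
```

A request for 100,000 rows that includes the ssn column would raise ExportDenied, which is the case that someone would have to deliberately circumvent.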

The new OMB guidelines (reported about in this article from the Washington Post this morning) are a good step toward securing the small amount of data that should actually live on employees' laptops, but they stop short of encouraging a review of how much data is retrieved from databases for any given purpose.

Guidelines like this are already being implemented by some jurisdictions that allow internet access to public data (such as tax rolls). In these cases, organizations will allow aggregated tax data for analysis purposes to be exported in large volumes or will allow detailed data to be looked up by a particular address or street, but never both at the same time. In these systems, it is not impossible to get all of the data, but it is much more difficult, time-consuming, and likely to be watched. And by limiting the number of queries per day from certain IP addresses (as an example), you can stretch out the time necessary to get a full database dump to years.
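To put a number on "years", here is a small Python sketch of a per-IP daily query limit and the resulting time to dump a whole database. The class and function names, the limit of 100 queries per day, and the database size of one million records are all assumed for illustration.

```python
import math
from collections import defaultdict


class DailyQueryLimiter:
    """Hypothetical per-IP limiter: at most max_per_day queries per calendar day."""

    def __init__(self, max_per_day):
        self.max_per_day = max_per_day
        self._counts = defaultdict(int)  # (ip, day) -> queries already allowed

    def allow(self, ip, day):
        """Return True and count the query if the IP is still under its limit."""
        key = (ip, day)
        if self._counts[key] >= self.max_per_day:
            return False
        self._counts[key] += 1
        return True


def days_to_dump(total_records, records_per_query, queries_per_day):
    """Days needed to pull every record at the maximum allowed rate."""
    per_day = records_per_query * queries_per_day
    return math.ceil(total_records / per_day)


# Assumed figures: one million records, one record per lookup, 100 lookups a day.
days = days_to_dump(1_000_000, 1, 100)
years = days / 365
```

Under those assumptions a full dump takes 10,000 days, a little over 27 years, from a single address. An attacker could spread the work across many addresses, which is why the lookups should also be watched.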

I'll also note that the OMB Security Guidelines still omit one other thing that my proposed guidelines do not: nefarious access by government workers. I believe that most workers have no intention of accessing data for personal gain, but I don't see any reason why such a tempting morsel should be left sitting out in the open. If database security limited access to either bulk de-personalized data or small amounts of personal data, the temptation would be gone. Breaches that have been traced to employees, like the ones at various credit card companies and ISPs (such as the sale of personal data to a spammer by an employee of AOL), would be a thing of the past.

Further, these kinds of database-level restrictions would also limit the ability of intruders on most systems to access bulk data. If appropriate security is maintained on the computers that can process bulk transactions (they, for example, shouldn't be accessible from the Internet), then the access limitations would prevent the theft of bulk data by somebody who illegally accesses a single front-end computer that has access to the same databases only for inserting customer data and reporting on order status (one customer at a time).