Hoarding in the Cloud
You can tell a lot about your co-workers, and about yourself, by looking at a home directory, specifically at how its owner organizes data (or doesn't). Some people have never used folders or created a directory. For them there is only one place to store things – at the top of their personal storage space. Others go to great lengths to create descriptive hierarchies and may even over-categorize their documents, staring blankly, like a confused animal, unsure whether the latest reports should go in “…CorporateInternalMarketingDrafts” or “…PersonalDraftsWorkMarketing”. I will not speculate here about what either of these habits says about us as individuals. However, there is a common trait that both camps share – a lack of desire to archive data until the current working capacity reaches its maximum. In other words, no one bothers deleting unnecessary files until presented with the error “Unable to save – Disk is full.”
As disks and storage capacities continue to grow, this behavioral pattern leads us to hoard data simply because we can. We are confronted less and less frequently with the decision to assign value to individual items of data, and so we avoid having to choose one piece over another. We become the technological equivalent of hoarders at worst, or at best lazy homeowners who never clean up.
The recent trend toward cloud-storage-based “bottomless disks” will likely exacerbate three related problems. First, the signal-to-noise ratio in our stored data will decrease as more and more low-value data accumulates. Second, inefficient application search features will bog down under large data sets. And third, long-term data retention issues such as format compatibility will become the bane of document management systems as less and less individual thought goes into what data should be stored and how.
I’ll save the latter two issues for another day. Here I would like to focus on the high- vs. low-value data proposition, since that was our starting point. High-value data is content that is original, or that would take so much time to rebuild that the price of regeneration exceeds the cost of keeping the data. Examples include a manuscript, source code for an application or operating system, and video output from a digital rendering compute farm. Low-value data, on the other hand, can easily be regenerated from its high-value counterpart. For example, if we already maintain a revision control repository, then a checkout of a specific, unchanged revision has no additional value. In terms of transfer bandwidth and storage space, the “checkout” is simply noise.
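The high- vs. low-value distinction can be sketched as a simple comparison. The following is only an illustration, reading “the cost of the data” as the cost of storing it. The `StoredItem` type, its two cost fields, and the example numbers are all hypothetical and are not drawn from any real storage system.

```python
from dataclasses import dataclass


@dataclass
class StoredItem:
    """A hypothetical stored item with two made-up cost figures."""
    name: str
    storage_cost: float       # assumed cost of keeping the item around
    regeneration_cost: float  # assumed cost of rebuilding it if it were lost


def is_high_value(item: StoredItem) -> bool:
    """High-value: rebuilding the item would cost more than keeping it."""
    return item.regeneration_cost > item.storage_cost


def noise_fraction(items: list[StoredItem]) -> float:
    """Share of total storage cost spent on low-value (regenerable) items."""
    total = sum(item.storage_cost for item in items)
    if total == 0:
        return 0.0
    noise = sum(item.storage_cost for item in items if not is_high_value(item))
    return noise / total


# Illustrative numbers only: an original manuscript versus a checkout of an
# unchanged revision that the repository can reproduce on demand.
items = [
    StoredItem("manuscript", storage_cost=1.0, regeneration_cost=500.0),
    StoredItem("unchanged-checkout", storage_cost=20.0, regeneration_cost=0.5),
]

for item in items:
    label = "high-value" if is_high_value(item) else "low-value (noise)"
    print(f"{item.name}: {label}")
```

In this toy data set the checkout accounts for most of the storage cost while adding nothing that the repository does not already hold, which is the sense in which it is noise.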
The closing question is whether everyone can be trained to “Save Responsibly” or whether we should require a “License to Store in the Cloud”. The counterpoint is that moving to the cloud should require no change in existing behavior. Rather, it is our data management software techniques that should be fixed to scale to larger and larger data sets. Personally, as a software developer and as an end-user, I believe the answer is a combination of both.
By Gerald (Jerry) Carter – Manager of the Likewise Open project for Likewise Software.