Containing endless data growth

20 September 2021

One of the inevitable consequences of living in a “data-driven” world is that we all need to store and manage increasing amounts of data. As someone who has used a PC for years, over time I have accumulated a myriad of different files and a growing quantity of digital content stored on other devices such as tablets, mobiles, cloud storage and USB drives.

If I were to have the time and inclination, I would no doubt find that much of the data I have stored is either redundant, out of date or duplicated elsewhere. However, I never find the task of deleting this data rises to the top of my “to do” list. Even when faced with the prospect of running out of disk space, I often simply upgrade, migrate all my current data and carry on.

From a personal perspective, this behaviour and the consequent accumulation of “cold data” is not a major issue. However, for universities and research establishments that have thousands of users, millions of files, countless different IT systems and laboratories generating huge quantities of data, this “never delete” approach can be a major problem, especially as the rate of data growth continues unabated.

At NetApp, we are not in the business of telling people which files they should remove, which research data should be retained, or for how long. However, we can help our customers manage their digital content, avoid data duplication and reduce the storage costs associated with cold data.

So how do we recommend that universities, colleges and research-driven organisations approach the challenge of data growth, and cold data in particular?

Firstly, it starts with the recognition that quotas are a tried and tested method of containing data growth. Quotas allow soft or hard limits to be imposed on shared storage systems, either in terms of disk space or number of files. However, as average file sizes have grown over time, these quotas can quickly become too restrictive and push users to explore alternative options, which leads to data duplication and increased risk from cyber-crime and ransomware.
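To make the distinction between soft and hard limits concrete, here is a minimal sketch in Python. It is purely illustrative and does not describe how any particular storage system enforces quotas. The names `Quota`, `QuotaStatus` and `check_quota`, and the limits used, are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum

GIB = 1024 ** 3  # bytes in one gibibyte


class QuotaStatus(Enum):
    OK = "within limits"
    SOFT_EXCEEDED = "soft limit exceeded: warn the user"
    HARD_EXCEEDED = "hard limit exceeded: refuse further writes"


@dataclass(frozen=True)
class Quota:
    """Hypothetical per-user quota on both disk space and file count."""
    soft_bytes: int
    hard_bytes: int
    soft_files: int
    hard_files: int


def check_quota(quota: Quota, used_bytes: int, used_files: int) -> QuotaStatus:
    """Classify current usage against the soft and hard limits."""
    # A hard limit on either dimension takes precedence over a soft one.
    if used_bytes > quota.hard_bytes or used_files > quota.hard_files:
        return QuotaStatus.HARD_EXCEEDED
    if used_bytes > quota.soft_bytes or used_files > quota.soft_files:
        return QuotaStatus.SOFT_EXCEEDED
    return QuotaStatus.OK


if __name__ == "__main__":
    # Made-up limits: 8 GiB soft, 10 GiB hard, 90,000 / 100,000 files.
    quota = Quota(8 * GIB, 10 * GIB, 90_000, 100_000)
    print(check_quota(quota, used_bytes=9 * GIB, used_files=1_000).value)
```

In this sketch a user with a few very large files trips the space limit long before the file-count limit, which is the situation that growing file sizes make more common.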

“Personally, I have noted an almost 10-fold increase in the size of the individual presentations I have created in the last 5 years.”

Data duplication, for example, can be one of the major causes of wasted storage consumption. The ability to distribute files via email, other messaging systems and file transfer protocols across a myriad of different platforms and data centre locations can easily lead to full or partial duplication on a mass scale. Recent innovations in search and discovery tools, such as NetApp Cloud Data Sense, can be used to identify these duplicate files in Microsoft OneDrive, Amazon S3 and other data stores, and to identify their owners so that the files can be archived or deleted as necessary.
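For readers who want a feel for how full duplicates can be found, the sketch below groups files in a local directory by a hash of their content. It is a generic technique and is not a description of how Cloud Data Sense or any other product works. It only finds exact copies, not partial duplicates, and the function names are assumptions made for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root) -> list:
    """Return groups of files under `root` that have identical content."""
    # Step 1: group by size, since files of different sizes cannot match.
    by_size = defaultdict(list)
    for p in Path(root).rglob("*"):
        if p.is_file() and not p.is_symlink():
            by_size[p.stat().st_size].append(p)

    # Step 2: hash only the files that share a size with another file.
    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for p in paths:
            by_digest[file_digest(p)].append(p)

    return [sorted(group) for group in by_digest.values() if len(group) > 1]


if __name__ == "__main__":
    for group in find_duplicates("."):
        print("Identical content:", ", ".join(str(p) for p in group))
```

Grouping by size first avoids reading files that cannot possibly be duplicates, which matters when a share holds millions of files.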

A future blog will discuss this topic in more depth, along with the increased data privacy and compliance risks that arise.

Secondly, there is a strong temptation to consider data storage as a commodity and to use a simple cost per GB as a measure of value when considering options for research data and enterprise file sharing. However, this may be too simplistic when considering the “whole life” cost of data. Many enterprise-class storage arrays and software-defined storage offerings have built-in measures, such as compression and de-duplication, to reduce the consumption of the underlying storage media. Vendors such as NetApp now offer cloud-native software with similar advanced storage efficiency capabilities that can also lead to lower costs for cloud storage.

Even though these features have been offered for many years, some organisations still issue tenders based on “raw” or “installed” capacity rather than “effective” capacity. Given that a modern storage solution is likely to be at least 3 times more efficient than raw disk, this can be a false economy. So, we recommend that organisations allow vendors to propose technologies that offer this benefit, and to seek assurance through efficiency guarantees and the like. By adopting this approach, IT functions can deliver services which can be “thin provisioned,” to provide users and faculties with larger quotas without increasing the underlying storage consumption.
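The difference between “raw” and “effective” capacity is simple arithmetic, sketched below in Python. The prices, capacities and the two bids are invented for illustration. Only the 3:1 efficiency ratio echoes the figure mentioned above, and real ratios depend heavily on the data being stored.

```python
def effective_capacity_tb(raw_tb: float, efficiency_ratio: float) -> float:
    """Capacity available to users once data reduction is applied.

    `efficiency_ratio` is effective-to-raw, so 3.0 means 3:1.
    """
    if raw_tb <= 0 or efficiency_ratio <= 0:
        raise ValueError("raw_tb and efficiency_ratio must be positive")
    return raw_tb * efficiency_ratio


def cost_per_effective_tb(price: float, raw_tb: float,
                          efficiency_ratio: float) -> float:
    """Price divided by effective, not raw, capacity."""
    return price / effective_capacity_tb(raw_tb, efficiency_ratio)


if __name__ == "__main__":
    # Hypothetical bids for 100 TB of raw capacity (figures are made up).
    bid_a = cost_per_effective_tb(price=60_000, raw_tb=100, efficiency_ratio=3.0)
    bid_b = cost_per_effective_tb(price=45_000, raw_tb=100, efficiency_ratio=1.0)
    print(f"Bid A: {bid_a:.2f} per effective TB")
    print(f"Bid B: {bid_b:.2f} per effective TB")
```

With these made-up numbers, bid A costs more per raw TB (600 against 450) but works out at 200 per effective TB against 450 for bid B, which is why a tender based on raw capacity alone can pick the more expensive option.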

Lastly, it should be recognised that solutions are now also available that can automatically “tier” data from expensive primary storage to lower-cost options - in the data centre or in the cloud. Tiering is an excellent way to manage the data lifecycle and lower the costs of hosting data which is never or infrequently accessed. It is also an easy way for organisations to start to use public cloud resources without the need to invest in significant staff re-training.
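As a rough illustration of what “never or infrequently accessed” means in practice, the sketch below lists files on a local file system that have been neither read nor modified for a given number of days. It is a simplified, file-level approximation and not a description of how any vendor's tiering works. It also assumes that the file system records access times, which is not always the case (some are mounted with access-time updates disabled). The function name and the 180-day threshold are assumptions made for this example.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def find_cold_files(root, min_idle_days: float, now: float = None) -> list:
    """Return (path, size_in_bytes) for files idle for at least `min_idle_days`."""
    now = time.time() if now is None else now
    cutoff = now - min_idle_days * SECONDS_PER_DAY
    cold = []
    for p in Path(root).rglob("*"):
        if p.is_file() and not p.is_symlink():
            st = p.stat()
            # Treat the later of last access and last modification as
            # the time the file was last used.
            last_used = max(st.st_atime, st.st_mtime)
            if last_used < cutoff:
                cold.append((p, st.st_size))
    return cold


if __name__ == "__main__":
    cold = find_cold_files(".", min_idle_days=180)
    total = sum(size for _, size in cold)
    print(f"{len(cold)} cold files, {total} bytes that could move to a cheaper tier")
```

Running something like this against a departmental share gives a first estimate of how much primary storage is occupied by data that nobody has touched recently.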

The concept of data lifecycle management is not new. The IT industry has been using the term Information Lifecycle Management (ILM) since the early 2000s and before that, IBM introduced Hierarchical Storage Management (HSM) on their mainframe computers in 1974. However, more recent implementations of storage tiering are typically much easier to implement and more flexible in terms of supported data types. They are also more sensitive to the varying demands of end-users and modern applications, and they work more dynamically than the post-process, batch-operated implementations of the past.

In NetApp terms, storage tiering is designed to work with all types of data, be seamless to end-users and require no change to the application layer. Several pre-defined policies are provided to allow system administrators to “set and forget” the most appropriate policy for their organisation. These policies can vary between volumes - or sub-sets - of data and can apply to active data sets and snapshot copies, as well as to volumes containing finished projects, historical data, backups or archive data. A wide variety of destinations is also offered, both in terms of on-premises storage and large-scale web stores, including Amazon S3, Azure Blob storage and Google Cloud Storage.

“In some cases, up to 90% of the data stored on a primary storage tier can be evacuated to a secondary storage platform.”

By utilising tiering to further reduce their primary storage footprint, universities can avoid expensive upgrades or can offer enhanced services to faculties without increasing their investment. It also allows them to exploit high-performance flash media more readily and this can lead to reductions in their environmental impact by lowering data centre power and cooling.

It goes without saying that the greatest benefit comes from deploying all the aforementioned capabilities together: improved utilisation of deployed systems, lower data storage costs, longer-term retention of valuable digital assets and easy consumption of public cloud resources. What’s not to like?

If you wish to contact NetApp, please email George Duncan, HE Specialist, at george.duncan@netapp.com.

Adrian Cooper

Field CTO, UK Public Sector
NetApp