Data analytics has changed. Is the financial sector ready?
Article by Hitachi Vantara A/NZ vice president and managing director Adrian Johnson.
Economists warn of a deep, long-lasting economic downturn as Australia copes with the COVID-19 pandemic. Financial institutions are going to need every edge they can get. It’s yet another reason why these data-rich companies must turn their vast amounts of unstructured data into valuable assets. Of course, they also have to comply with open banking, ensure data privacy and protect against an onslaught of cyberattacks.
If that weren’t enough, here’s one more challenge: the very architecture of data analytics is changing.
Organisations with data lakes and Hadoop environments in place want to get the most out of existing investments, but they do need to consider storage and data management alternatives that will keep pace with changing data analytics requirements, tasks and challenges.
Data lakes giving way to dataops
Data scientists and analysts are now working with and within organisations to identify specific questions to answer, specific business objectives to accomplish and the data sets required to get the job done. It’s certainly a methodology that helps to get a handle on the massive amounts of data generated and collected by businesses today, and the increasing number of data sources they’ve got to consider. It’s a way to find value in the data and contribute directly to business objectives.
It also means that now there is a lot of work being done on smaller files, or on streaming data. Mining a data lake filled with unstructured data is, therefore, not the best route to effective use of your data.
Hitachi Vantara A/NZ CTO and presales director Chris Drieberg explains, “A modern dataops approach means that, while data still needs to be managed, categorised, enriched with metadata and governed, and then made accessible to the people who need it, you leave data where it is captured or created rather than spending time, money and effort pulling it into one centralised repository. This begins to make the case for handling processing and storage as separate operations with separate solutions.”
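To make the "leave data where it is" idea concrete, here is a minimal sketch of a catalogue that records only metadata and a pointer to each data set, rather than copying the data into a central repository. It is an illustration, not a description of any Hitachi Vantara product: the class names, field names and location strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Catalogue record pointing at data in place (all fields are hypothetical)."""
    name: str
    location: str   # where the data was captured or created
    owner: str      # governance: who is accountable for the data set
    tags: dict = field(default_factory=dict)  # enrichment metadata


class Catalog:
    """Holds metadata only; the underlying data is never copied into it."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def find(self, **tags):
        """Return entries whose tags match every requested key/value pair."""
        return [
            e for e in self._entries.values()
            if all(e.tags.get(k) == v for k, v in tags.items())
        ]


catalog = Catalog()
catalog.register(DatasetEntry(
    name="atm_events",
    location="branch-gateway://atm/stream",  # hypothetical location
    owner="payments-team",
    tags={"source": "atm", "format": "stream"},
))
catalog.register(DatasetEntry(
    name="customer_surveys",
    location="crm://surveys/2020",  # hypothetical location
    owner="cx-team",
    tags={"source": "survey", "format": "csv"},
))

# Analysts look data up by metadata, then read it from where it already lives.
matches = catalog.find(source="atm")
print([m.location for m in matches])
```

The design choice to note is that the catalogue answers "what data exists and where", while reading and processing the data remain separate concerns.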
Separation of data processing & storage
Hadoop is a data lake environment that is running into challenges related to scale for many data-rich organisations. It just isn’t an economical storage solution for the sheer volumes of data involved, such as reams of daily data from ATMs, financial transactions, customer surveys and more. More and more nodes can be added, with the associated cost to purchase and manage them, but there is a tipping point after which maintaining this environment costs more in effort and money than the value you’re getting out of it.
As a processing engine and a significant existing investment, Hadoop isn’t going away. But the evolution of data analytics and the storage limitations of Hadoop necessitate pulling the storage engine out of the Hadoop environment in favour of object storage.
Object storage needs a performance boost
For years now, object storage has been understood to be a scalable and cost-effective solution for archival purposes and to house vast amounts of data to be accessed rarely, or where latency isn’t a concern.
As unstructured data threatens to overwhelm businesses and increasing storage costs have them looking for more economical solutions, more people are seeing object storage as part of the solution to evolving analytics needs.
Says Drieberg, “The beauty of object storage is the metadata. For every file you write into object storage it saves the file plus the metadata about the file. An object storage platform knows the context, file type and all the other machine-generated identifiers plus custom metadata, such as customer name, policy type etc, about each piece of data.”
So object storage stores data in a way that makes it ready and ripe for machine learning, AI and analytics. However, object storage on its own is not a high-performance processing environment.
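The metadata point can be sketched with a small in-memory model. This is a toy illustration under stated assumptions, not the API of any real object storage platform: the `ObjectStore` class, its method names and the sample keys and customer names are all hypothetical. It shows each object being saved together with machine-generated metadata and custom metadata, and the custom metadata then being used to find data without opening the files.

```python
import hashlib
import mimetypes
from datetime import datetime, timezone


class ObjectStore:
    """Toy in-memory object store (hypothetical, for illustration only)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, custom_metadata=None):
        """Save the bytes plus system and custom metadata under one key."""
        system = {
            "size_bytes": len(data),
            "content_type": mimetypes.guess_type(key)[0] or "application/octet-stream",
            "sha256": hashlib.sha256(data).hexdigest(),
            "stored_at": datetime.now(timezone.utc).isoformat(),
        }
        self._objects[key] = {
            "data": data,
            "system": system,
            "custom": dict(custom_metadata or {}),
        }

    def head(self, key):
        """Return the metadata for an object without returning its data."""
        obj = self._objects[key]
        return {"system": obj["system"], "custom": obj["custom"]}

    def search(self, **custom):
        """Return the keys whose custom metadata matches every given field."""
        return sorted(
            key for key, obj in self._objects.items()
            if all(obj["custom"].get(f) == v for f, v in custom.items())
        )


store = ObjectStore()
store.put("claims/1001.pdf", b"example claim document",
          {"customer_name": "A. Citizen", "policy_type": "home"})
store.put("claims/1002.pdf", b"another claim document",
          {"customer_name": "B. Citizen", "policy_type": "motor"})

# Find every object for a given policy type using metadata alone.
print(store.search(policy_type="home"))
print(store.head("claims/1001.pdf")["system"]["size_bytes"])
```

Because the query runs against metadata rather than file contents, an analytics or machine learning job can select its inputs before reading any data.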
Hadoop + object storage + VSP
“The super high-performance physical storage capacity, underneath the object storage, will come from the latest in powerful NVMe virtual storage platforms (VSP),” says Drieberg. “VSP uses block storage, which means it knows you’ve written blocks of data to it, and can guarantee you access, policy-based data protection and high performance for your millions of transactions, but it doesn’t know what the data is.”
The object storage platform, which sits in front of the VSP, is the piece with the smarts about the data stored within it. Because of the way it stores data, it is a high-capacity storage option to service big data analytics. From an infrastructure management point of view, the right VSP will manage everything and give a boost to your legacy systems as well as the object storage.
For the data scientists, analysts and engineers, object storage provides the contextual access to data that lives on the back-end, high-performance block storage, which is the VSP.
Integrate that with the Hadoop environment as the analytics processing engine and the financial institution’s infrastructure is now ready for next-generation data analytics.