IT Tips & Tricks
April 17, 2019
Distributed File Storage for the Cloud: The Good, the Bad and the Ugly
Just like in the 1966 Italian epic spaghetti western directed by Sergio Leone and starring Clint Eastwood, there is more than one side to any story, especially when you decide to migrate your servers to the Cloud, including the use of a DFS (Distributed File Storage or Distributed File System).
If your organization is in the process of a data migration to the Cloud, or even just planning one, you are urged to read the entirety of this article so that you can be advised of not only some of the benefits (the good) but also of the pitfalls (the bad) that you would otherwise likely fall prey to, potentially resulting in downtime (the ugly), hardware failures, frustrated users and lost weekends (yours).
The good news is that if you pre-plan the right actions, you can avoid such consequences (well, most of the time anyway).
Meet the Three Horsemen: Benefits, Pitfalls and Downtime
The fact that you’re jumping through flaming hoops, all in the name of attempting to better manage your distributed storage environment to store ever-increasing amounts of unstructured data, is of little concern to your average user.
But you may be comforted to know that, to address this nerve-racking undertaking and reduce infrastructure costs, many experienced data-migration specialists opt to move their data and operations to the Cloud. This has the additional benefit of improving reliability.
A Brief Summary of the “Bad” (Issues and Potential Pitfalls)
As you know, users can share computing resources over the Internet through the use of elastic, scalable cloud resources such as virtual servers, storage and other services that can be allocated dynamically. Although cloud computing applications are becoming more common, there are a few things that you would be very wise to consider when making this move: one being the skilled individuals needed to maintain this infrastructure, another being the potential price tag involved and the third being the time it takes to complete the migration. In addition, synchronization is important to ensure that the various resources stay up-to-date.
Corporate mainframes are typically used by large numbers of employees, many of whom work in satellite offices (outside the main building).
Additionally, the rise and continuous hyper-adoption of the paperless workplace has resulted in increased demands for instant data access — demands that will only grow with time.
Modern IT departments are forced to contend with the consequences of these demands including fast scaling to varying workloads, scalable storage, performance unpredictability and data transfer bottlenecks, among others.
IT staffs also must contend with the complex challenges of document revision control and distributed file storage, especially where geographically dispersed teams are concerned.
Another area to watch out for is that distributed file storage can result in disparate silos of data. Protecting and managing these silos and their infrastructure (across several locations) adversely impacts IT budgets and business productivity.
In addition, such a decentralized model of storing and managing data can make it difficult for teams in separate locations to collaborate with each other. In a bid to eliminate this difficulty, most employees end up using consumer-grade “shadow IT” solutions such as Dropbox to collaborate and share files or, worse, they e-mail copies of said files to other teams opening the door to data inconsistency.
Such methods increase the overall storage footprint due to the duplication of files across locations — duplicate files which then must be secured and backed up.
The “Good”
All of these issues have solutions. There are efficient, scalable protocols and solutions that enable the creation of a single data set from which employees (no matter their location) can instantly access and perform operations on stored data. Such solutions should also be able to resolve the challenges of handling large-scale distributed data as well as storage-and-compute-intensive applications.
Distributed File System Structure
From your IT 101 textbook: A distributed file system enables small, medium and large enterprises to store, access and secure remote data exactly as they handle their local data. The Hadoop Distributed File System (HDFS) and the Google File System (GFS) are some of the more common systems leveraged by large-scale operations such as Yahoo, Google and Facebook. Let’s go beyond IT 101 and take a more detailed look at such systems.
In distributed file systems, each data file is partitioned into several parts known as chunks, and each chunk is stored on several different machines, enabling the parallel and simultaneous execution of applications. These machines may be in a local network, a private cloud or, in the case of the above-mentioned distributed services (Yahoo, Google, Facebook and others), a public cloud. Data is typically stored in files in a hierarchical tree where each node represents a directory.
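To make this concrete, here is a rough Python sketch of the idea. The 64 MB chunk size, the replication factor of three and the node names are illustrative assumptions, not prescriptions from any particular DFS:

```python
# Illustrative sketch only: split a file into fixed-size chunks and assign each
# chunk to several storage nodes for redundancy. The chunk size, replication
# factor and node names are assumptions for demonstration purposes.
import itertools

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB per chunk (assumed)
REPLICAS = 3                    # copies kept of each chunk (assumed)

def chunk_file(path, machines):
    """Yield (chunk_index, chunk_data, target_machines) for each chunk of a file."""
    targets = itertools.cycle(machines)
    with open(path, "rb") as f:
        for index in itertools.count():
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            # Assign the next REPLICAS machines in round-robin order.
            yield index, data, [next(targets) for _ in range(REPLICAS)]

# Example usage (assumes "report.docx" exists and five storage nodes are available):
# for idx, data, nodes in chunk_file("report.docx", ["node1", "node2", "node3", "node4", "node5"]):
#     print(f"chunk {idx}: {len(data)} bytes -> {nodes}")
```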
Name nodes (often written as one word, “namenodes”) are used to maintain a list of all stored files in the cloud as well as their respective metadata. The name node also must manage several file-related operations such as open, delete, copy, move, update and so on.
It should be noted that such functions are generally not scalable and could result in name nodes becoming a resource bottleneck.
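As a back-of-the-napkin illustration (not the actual HDFS or GFS implementation), the bookkeeping a name node performs boils down to a single authoritative map from file paths to metadata and chunk locations, which is exactly why every file operation funnels through it:

```python
# Illustrative sketch only: the bookkeeping role of a name node reduced to a
# single in-memory map. Real name nodes also journal changes to disk; this is
# just to show why every open/move/delete passes through one component.
class NameNode:
    def __init__(self):
        self.files = {}   # path -> {"chunks": [...], "size": int, "owner": str}

    def create(self, path, chunks, size, owner):
        self.files[path] = {"chunks": chunks, "size": size, "owner": owner}

    def open(self, path):
        # Clients ask the name node where the chunks live, then talk to the
        # data nodes that hold them directly.
        return self.files[path]["chunks"]

    def move(self, old_path, new_path):
        self.files[new_path] = self.files.pop(old_path)

    def delete(self, path):
        self.files.pop(path, None)
```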
The name node is also a single point of failure. If it goes down, the file system goes offline, and when it eventually comes back up, the name node must replay all outstanding operations. For large clusters, this replay process could take hours.
Furthermore, the Hadoop Distributed File System (HDFS) depends on TCP for data transfer operations. Because of TCP’s slow-start behavior, a connection takes several round trips before it can send at the full capacity of the links in the cloud, often resulting in longer download times and low link utilization.
One Bugaboo in a Cloud-based DFS
What if multiple users are editing the same document at the same time?
Distributed file systems make use of a uniform naming convention and mapping scheme to keep tabs on file location. When a client machine retrieves the data from the server, it appears as a normal file that is stored locally. Once users complete operations on the file, the newly updated version is saved and sent back to the server.
When multiple users try to access the same file, there is always the challenge of making the most accurate version of the file available to everyone. The challenge lies in the fact that while users may be able to access the same file, they can’t view the changes that other users are making to their copy of the file. This type of live collaborative work can cause confusion as individuals can make their own changes in a silo.
By the time they upload their edits, those edits may conflict with another individual’s. And since whoever uploads last wins, the last version saved becomes the final version (and not necessarily the best version).
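A tiny, purely illustrative example of this “last writer wins” effect (the document text and edits are made up):

```python
# Purely illustrative: two users edit the same server copy independently, and
# whoever saves last silently wipes out the other's change.
server_copy = "Deliver 100 units to Tampa"

alice_copy = server_copy + " by May 1"                  # Alice edits her local copy
bob_copy = server_copy.replace("Tampa", "Orlando")      # Bob edits his local copy

server_copy = alice_copy    # Alice uploads first
server_copy = bob_copy      # Bob uploads last, so Alice's change is gone
print(server_copy)          # "Deliver 100 units to Orlando"
```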
There is a solution.
Solution (Good): Document Revision Control
Consider a purchase order where several users with editing permissions add, remove and modify items, change the delivery time and location, and renegotiate additional services such as warranties.
You can (and should) have a system that ensures that the final version of the document contains all the changes as well as a record of who changed what and when.
Enter a document revision control mechanism. Don’t forget to get one of these! Document revision control is indispensable in user environments where everyone has editing permissions. In such environments, stored documents are continuously called up and changed, resulting in numerous versions of the same document. Without document revision control, there is no way to trace and track all the changes.
By implementing revision control, multiple versions of the same file are named and made distinguishable from each other, eventually leading up to the final version of the document.
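Here is a minimal sketch of that idea in Python. The class, the document name “PO-1042” and the “_v1”, “_v2” naming scheme are hypothetical, just to show how each save can become a distinguishable, attributable revision:

```python
# Hypothetical sketch of version naming: every save becomes a new, numbered
# revision tagged with author and timestamp, so nothing is silently
# overwritten and "who changed what and when" is always answerable.
from datetime import datetime, timezone

class VersionedDocument:
    def __init__(self, name):
        self.name = name
        self.revisions = []   # list of (revision_number, author, timestamp, content)

    def save(self, author, content):
        rev = len(self.revisions) + 1
        self.revisions.append((rev, author, datetime.now(timezone.utc), content))
        return f"{self.name}_v{rev}"   # e.g. "PO-1042_v2"

    def history(self):
        # The audit trail: who changed what and when.
        return [(rev, author, stamp) for rev, author, stamp, _ in self.revisions]

po = VersionedDocument("PO-1042")
po.save("alice", "Initial purchase order")
po.save("bob", "Added extended warranty")
print(po.history())
```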
There is also a need for consistency protocols that immediately update all replicas of a file once a client modifies one of its versions.
Achieving this requires the protocol to prevent clients from opening outdated replicas. Pretty much any good revision control application can do this for you.
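Conceptually, the check looks something like the sketch below. The server object and its latest_revision and fetch calls are hypothetical placeholders for whatever your DFS or revision control layer actually provides:

```python
# Conceptual sketch of a stale-replica check. The "server" object and its
# latest_revision() and fetch() methods are hypothetical placeholders.
def open_replica(client_cache, server, path):
    local = client_cache.get(path)               # {"revision": int, "data": bytes} or None
    latest_rev = server.latest_revision(path)    # authoritative revision number
    if local is None or local["revision"] < latest_rev:
        # The local copy is missing or outdated: pull the current version first.
        local = {"revision": latest_rev, "data": server.fetch(path, latest_rev)}
        client_cache[path] = local
    return local["data"]
```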
Performance in a Cloud-based Distributed File System
Aside from enabling the creation of a single data set where users (no matter their location) can instantly access and perform operations on stored data, distributed file systems must also achieve performance levels comparable to those of local file systems. Because a commonly tracked key performance metric is the amount of time it takes to satisfy service requests, improving the performance of a cloud-based distributed file system largely comes down to minimizing that response time.
In local file systems, this performance metric is measured by calculating the time to access storage devices plus CPU usage time. In cloud-based distributed systems, however, file operations are executed over a network infrastructure that exchanges messages between clients and servers. This means that both communication and processing time must be taken into consideration when determining system performance.
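A rough back-of-the-envelope model of that request time might look like this. The 20 ms round trip, 5 ms of server processing and 100 Mbps link are assumed numbers for illustration only:

```python
# Back-of-the-envelope model of a single request in a cloud DFS: network round
# trip + server processing + data transfer. All figures are assumed for
# illustration, not measurements.
def request_time_ms(round_trip_ms, processing_ms, transfer_bytes, bandwidth_mbps):
    transfer_ms = (transfer_bytes * 8) / (bandwidth_mbps * 1000)   # bits over bits-per-millisecond
    return round_trip_ms + processing_ms + transfer_ms

# Example: 20 ms round trip, 5 ms of server work, a 1 MB payload on a 100 Mbps link.
print(f"{request_time_ms(20, 5, 1_000_000, 100):.1f} ms")   # roughly 105 ms
```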
Moreover, these types of distributed file systems leverage robust protocols to provide fault-tolerant and scalable services.
These protocols also impact performance since they increase request overhead through the use of network bandwidth and significant CPU cycles.
Improving Performance through Caching
In user environments where files are frequently accessed, distributed file systems would need to deal with a large number of requests. Obviously, this reduces performance since each request is sent as a message to the server and the result/response is also sent back as a message.
Here’s your solution for this: Caching.
The use of caching, rather than remote access, can significantly improve performance. Caching isn’t the same as file replication: it involves copying part of a file, or even the whole file, into main memory. It is particularly useful in providing fault tolerance and scalability. Make sure you set up your system to do this.
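A minimal sketch of what client-side caching can look like, assuming a simple fixed-size, least-recently-used cache in front of the remote reads (the 32-entry capacity and the fetch_from_server callback are illustrative):

```python
# Minimal sketch of client-side caching: a fixed-size, least-recently-used
# cache in front of remote reads. The 32-entry capacity and the
# fetch_from_server callback are illustrative assumptions.
from collections import OrderedDict

class FileCache:
    def __init__(self, fetch_from_server, capacity=32):
        self.fetch = fetch_from_server   # function(path) -> bytes; does the remote read
        self.capacity = capacity
        self.cache = OrderedDict()       # path -> file contents, kept in LRU order

    def read(self, path):
        if path in self.cache:
            self.cache.move_to_end(path)       # cache hit: no message crosses the network
            return self.cache[path]
        data = self.fetch(path)                # cache miss: one request/response round trip
        self.cache[path] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict the least recently used entry
        return data
```

On a cache hit, no message crosses the network at all, which is where the performance gain comes from.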
Handling a Large Number of Small Files
Another challenge that is intrinsic to these types of distributed file systems is handling a massive number of small files, which seriously restricts performance. This is because DFS systems, especially those adopted to support cloud storage, were primarily designed to optimize access to large files. They leverage the combined block storage technique to store files; however, such a technique is inefficient when files are accessed randomly.
Solution: To efficiently manage massive numbers of small files, design a new metadata structure that decreases the size of the original metadata. This helps to increase file access speed.
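One common flavor of this approach, sketched here with simplified, illustrative structures rather than any specific DFS’s metadata format, is to pack many small files into one combined block and keep a lightweight per-block index, so the name node tracks one block instead of thousands of tiny entries:

```python
# Illustrative sketch of one way to tame the small-file problem: pack many
# small files into a single combined block with a lightweight index of
# (offset, length) entries, so the metadata layer tracks one block instead of
# thousands of tiny files. The layout here is a simplified assumption.
def pack_small_files(files):
    """files: dict of name -> bytes. Returns (block_bytes, index)."""
    block = bytearray()
    index = {}                               # name -> (offset, length)
    for name, data in files.items():
        index[name] = (len(block), len(data))
        block.extend(data)
    return bytes(block), index

def read_small_file(block, index, name):
    offset, length = index[name]
    return block[offset:offset + length]

block, index = pack_small_files({"a.txt": b"hello", "b.txt": b"world!"})
print(read_small_file(block, index, "b.txt"))   # b'world!'
```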
Broken Links (Bad), and the Solution (Good)
Broken links can be a major issue when users change the names of files, folders or drives where linked files are stored. The problem also occurs when the folder containing a linked file, or the file itself, is moved to a new location.
Depending on the number of documents that rely on one or more linked files, such data operations can cause thousands of broken links. What’s more, the initial data migration to a DFS (whether Cloud or on-premises) will cause broken links by the thousands or even millions. Ah, but it doesn’t have to. And that’s good news, because manually searching for and updating each file with one or more broken links is an incredibly time-consuming task, one that may take a lot of financial resources to execute, not to mention taking up your nights and weekends (certainly bad for your golf handicap, your vacation plans, your family . . .). So, what is the solution?
Cloud or Local?
Although there are speed bumps on the road to a successful implementation of a Cloud-based distributed file system, it does provide a well-defined interface for applications that use large-scale persistent storage, while its hierarchical namespace provides a flexible way to store large amounts of unstructured data.
A Cloud DFS also simplifies administration and sharing, improves reliability and enables simultaneous access from multiple clients. While the same can be generally said about a locally managed DFS system, it really comes down to your organization’s IT human and technical resources.
Who doesn't want improved admin, sharing and reliability, plus simultaneous access?
Simply put, a local DFS implementation may cost less if you have the talent and resources to manage it.
A Cloud-based DFS implementation, while seemingly more costly, outsources the expertise and technology requirements to your Cloud provider. This may be more advantageous if you do not have the IT requirements necessary to manage it locally.
The cost-benefit analysis must be weighed from this perspective. If you already have the resources internally, then a local DFS might be the better solution, since your pre-existing IT budget saves you the cost of a Cloud provider. If this is not the case, however, then a Cloud-based DFS is likely a more cost-effective spend of your IT budget than investing that budget in the people and technology needed to manage a DFS locally.
As Cloud providers and the industry have matured over the last ten or so years, the fears many organizations had about turning their data over to someone else have become less of an issue. Unless you have already made a significant investment in your existing IT department, or have data so sensitive that it cannot risk being shared, a Cloud-based DFS may be the best route to follow.
Ed Clark
LinkTek COO