Abstract

The purpose of this study is to test the effectiveness of FlexRR, a straggler mitigation technique for the Parameter Server, on important Machine Learning (ML) workloads including Matrix Factorization (MF), Logistic Regression (LR), and Latent Dirichlet Allocation (LDA). The study employs the Bulk Synchronous Parallel (BSP) computational model to examine the straggler problem in Parameter Server-based iterative convergent distributed machine learning. It further analyzes an experimental arrangement of the parameter server strategy for parallel learning problems by injecting common straggler patterns and executing recent mitigation techniques. The findings are significant in that they provide a platform for further research into the problem and allow researchers to compare different methods across applications. The outcome is therefore expected to facilitate the development of new techniques and new perspectives for addressing this problem.

 

Introduction

The Straggler Problem

Many applications rely on iterative convergent algorithms. Parallel execution of such algorithms is typically expressed through the BSP model of computation. Nonetheless, implementation through this model suffers from what is referred to in computing terminology as the straggler problem: every brief delay of a given thread can slow down all of the other threads [2]. Empirical studies of large-scale computing clusters indicate that job completion time is often significantly and unreasonably extended by stragglers, or straggling tasks. These are tasks that are unfortunately assigned to overloaded or failing servers within clusters composed of a multitude of commodity servers [3]. Against this backdrop, machine learning is emerging as an influential building block of contemporary computing services, enterprise processes, and scientific endeavors. ML algorithms are central to Parameter Server-based iterative convergent distributed machine learning. They determine the parameter values that make an assumed mathematical model fit an array of observations or input training data, so that the model can then be applied to new data points [1]. The diagram below is a visual representation of the straggler problem.

Figure 1: The straggler problem

Mitigation of the Straggler Problem

In moderating the straggler problem, recent big data frameworks have adopted different reactive or proactive speculation approaches in which a system judiciously initiates backup or extra copies of tasks on other machines. Currently, two kinds of speculative execution strategies exist: the straggler-detection-based approach and the cloning approach [3]. In the straggler-detection-based procedure, the progress of individual tasks is monitored by the system, and backup copies are launched when stragglers are detected. In the cloning approach, on the other hand, extra task copies are scheduled in parallel with the original job, but only if the anticipated computation cost is low and the availability of system resources is guaranteed [3].

Existing research into the subject shows that single devices cannot keep up with the rapid growth of ML workloads, so applications are scaled out via distributed computing systems. While early implementations of parallel ML passed update messages directly among threads, the parameter server architecture has become a favored method for making scalable ML applications easier to build [2].

In practice, the volume of training data may range from 1 TB to 1 PB, while the number of training parameters can reach 10^9 to 10^12 [2]. State-of-the-art systems for high-performance parallel ML employ the parameter server architecture, which does not accommodate redundant computation. Even with a parameter server, the parameters set by these systems still require constant access by all worker nodes, which causes substantial complications and contention [3].

Figure 2: Mitigation of straggler problem

Background

In Bulk Synchronous Parallel (BSP), the straggler problem is significant due to the frequent and explicit synchronization: the progress of each iteration is determined by the pace of the slowest thread. The problem worsens as the level of parallelism increases. Because of random variations in execution time, as the number of threads grows, the likelihood that some thread will run unusually slowly within a particular iteration increases, delaying the execution of the whole application at every iteration. The straggler problem arises for many reasons, including but not limited to hardware heterogeneity, imbalances in data distribution between tasks, hardware failures, garbage collection in high-level languages, and operating system effects [1]. The Bulk Synchronous Parallel (BSP) model is one of the common ways to implement distributed applications of iterative convergent algorithms under an input-data-parallel methodology. It partitions the input data across worker threads, each of which executes the task on its part of the data, with barrier synchronization at the end of each iteration.

Typically, the model parameters are stored in a shared data store that every worker updates throughout the iteration. The BSP model, however, only guarantees that model data will be visible to a worker after synchronization rather than immediately, which enables workers to efficiently use cached copies. Another design choice is to keep the allocation of work to each worker the same across iterations, which lets workers reuse already-loaded data and thus reduces the overhead of input data movement.

Methodology and Experimental Set-Up

Disrupted Machine Pattern

Through this approach, the study models transient resource contention by simulating a high-priority disruptor process that consumes significant memory or CPU resources. This can be reproduced by initiating a disruptor process with probability x% every d seconds. The disruptor process can be implemented by launching multiple threads that perform heavy computation (or run infinite loops). For a given delay intensity, on a machine with p cores running p worker threads, the disruptor launches (delay% / 100) × p processes; for instance, a 200% delay on an 8-core machine running eight worker threads launches a 16-thread disruptor process [1].
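As a rough illustration, the following Python sketch shows one way such a disruptor could be implemented; the constants (delay percentage, period, duration) and the busy-loop used to burn CPU are illustrative assumptions, not the exact disruptor used in the study.

# Minimal sketch of the disruptor-based delay injection described above.
# All constants are illustrative placeholders.
import multiprocessing
import random
import time

DELAY_PCT = 200        # delay intensity: 200% -> 2 disruptor processes per core
PROBABILITY = 0.25     # x%: chance of launching the disruptor each period
PERIOD_SECS = 10       # d: seconds between disruption decisions
DURATION_SECS = 5      # how long each disruption lasts

def burn_cpu(stop_time):
    # Busy-loop on heavy arithmetic until the deadline, stealing CPU from workers.
    x = 0
    while time.time() < stop_time:
        x = (x * 1664525 + 1013904223) % 2**32

def run_disruptor():
    cores = multiprocessing.cpu_count()
    n_procs = int(cores * DELAY_PCT / 100)   # e.g. 8 cores * 200% = 16 disruptor processes
    while True:
        time.sleep(PERIOD_SECS)
        if random.random() < PROBABILITY:
            stop = time.time() + DURATION_SECS
            procs = [multiprocessing.Process(target=burn_cpu, args=(stop,))
                     for _ in range(n_procs)]
            for p in procs:
                p.start()
            for p in procs:
                p.join()

if __name__ == "__main__":
    run_disruptor()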

Power-Law Pattern

The selection of this methodology was informed by real-world straggler patterns established by existing research. The gist of the approach is that the time a worker takes to finish an iteration follows a power-law distribution [2]. For instance,

p(t) ∝ t^(−α),

where α controls the "skewness" of the distribution. The same sleep-based implementation as the Slow Worker Pattern is therefore employed, with a delay drawn from this distribution added to the execution time. A smaller α leads to a "flatter" (heavier-tailed) distribution. Each iteration uses the same α, regardless of the actual completion time of the iteration.
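As an illustration only, the sketch below samples power-law-distributed delays using Python's standard Pareto sampler and injects them as sleep; the scale factor and base iteration time are assumed values, not those of the original experiments.

# Minimal sketch of power-law delay injection, assuming a Pareto-shaped delay.
import random
import time

ALPHA = 11            # skewness parameter; smaller alpha -> flatter (heavier-tailed) distribution
BASE_ITER_SECS = 2.0  # nominal per-iteration compute time in this toy example

def power_law_delay(alpha, scale=0.1):
    # random.paretovariate(alpha) samples t >= 1 with density proportional to t^-(alpha+1).
    return scale * (random.paretovariate(alpha) - 1.0)

def run_iteration(i):
    time.sleep(BASE_ITER_SECS)    # stand-in for the real iteration work
    extra = power_law_delay(ALPHA)
    time.sleep(extra)             # injected straggler delay
    print(f"iteration {i}: injected {extra:.3f}s of delay")

for i in range(5):
    run_iteration(i)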

Persistent Straggler Pattern

This pattern is caused by a particular node that persistently performs slowly. The study also incorporates the persistent straggler pattern by having some of the machines receive 75% or even 50% of the work per iteration. Such unbalanced assignments can reproduce the case where processing skew in the data is correlated with data placement.

Speculative execution & task cloning (Predict Straggler and task cloning)

Research on large-scale computing clusters has shown that the completion time of a job is often unnecessarily delayed by one or a few "stragglers", straggling tasks assigned to a failing or overloaded server even though a better choice may exist within the cluster [3]. These findings triggered the study of speculative execution to diminish stragglers in data processing systems such as Hadoop, MapReduce, and Spark. Wrangler, for example, describes an approach that predicts when stragglers are going to occur and makes scheduling decisions to circumvent them [1].

To mitigate stragglers, recent big data frameworks, such as the MapReduce system and its alternatives, have implemented numerous reactive or proactive approaches under which the system judiciously adds duplicates of a task on other machines. This technique can significantly reduce delays due to stragglers, since the output of the fastest copy can be used without waiting for the slower ones, at the cost of consuming additional resources.

Full cloning of small jobs avoids waiting and speculation entirely. Cloning small jobs only slightly increases machine utilization, because small jobs consume a small portion of the cluster's resources [1].

 

Predicting Straggler with ML Technique

Besides job rescheduling, speculation is another standard method for straggler mitigation. It predicts stragglers based on the progress score of a task and launches a clone of the task on a different machine. A recent research effort, ML-NA, proposes an ML-based Node Performance Analyzer. By leveraging actual task execution logs, ML-NA categorizes cluster nodes into different classes and predicts their performance on upcoming tasks, so that the scheduler can make better decisions and minimize the possibility of task stragglers. The experiments here include several injected straggler frameworks, while the control setup tests the FlexRR straggler mitigation technique without injected delays. In other words, the control experiment encompasses the Ideal situation, representing the perfect case in which no worker ever waits.

Injected Straggler Framework

The exercise assesses straggler intensities and effects beyond those that arose naturally during the trials, and measures the attributable delays directly. The research uses several injected straggler simulations, including transient as well as persistent stragglers. These are enumerated in the table below:

Slow Worker Pattern: random transient delay within an iteration's work
Disrupted Machine Pattern: transient delay caused by a system disruptor process for a short period
Power-Law Pattern: random transient delay within an iteration's work, varying with execution time
Persistent Straggler: a particular node that persistently performs slowly
Ideal: the perfect case that assumes no worker ever waits

 

Straggler Mitigation Through FlexRR

 

FlexRR employs two mechanisms to mitigate the problem: temporary work reassignment through RapidReassignment (RR), and flexible consistency bounds through the Stale Synchronous Parallel (SSP) model. SSP allows individual worker threads to run ahead of the slowest worker by a specified slack-bound number of iterations. This flexibility mitigates the straggler problem to a certain degree but, more fundamentally, provides enough room for RapidReassignment to be highly efficient. RR employs peer-to-peer communication, enabling workers to identify themselves as stragglers. FlexRR also applies the increasingly popular parameter server approach to storing and distributing shared state among worker threads [1]. During execution, it comprises one process per node, launching one worker thread per core as well as a number of background threads for internal functionality.
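For intuition, the following minimal sketch shows how an SSP-style slack bound might be enforced; the SlackClock class and its fields are hypothetical names, not FlexRR's actual API.

# Minimal sketch of a stale-synchronous-parallel slack-bound check.
import threading

SLACK = 3  # a worker may run at most SLACK iterations ahead of the slowest worker

class SlackClock:
    def __init__(self, num_workers):
        self.clocks = [0] * num_workers
        self.cond = threading.Condition()

    def advance(self, worker_id):
        # Called at the end of each iteration; blocks while the slack bound is violated.
        with self.cond:
            self.clocks[worker_id] += 1
            self.cond.notify_all()
            while self.clocks[worker_id] > min(self.clocks) + SLACK:
                self.cond.wait()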

 

Straggler identification in FlexRR

Upon reaching the progress checkpoint within the current iteration (set to 75% completion by default), a worker generates and disseminates a progress report as a message to its assigned helpers. On receipt, each recipient computes its own progress relative to the progress made by the sender. The logic below shows how the difference in progress is calculated; the result tells the worker how far behind or ahead it is (in %) compared to the sender of the progress report [1]:
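A minimal sketch of this progress-comparison logic, reconstructed from the description above, is given below; the threshold value and function names are illustrative assumptions, not FlexRR's actual code.

CHECKPOINT = 0.75      # fraction of the iteration at which a progress report is sent
BEHIND_THRESHOLD = 0.2 # assumed cutoff: consider ourselves a straggler if we trail by >20%

def progress(iteration, fraction_done):
    # Overall progress measured in "iterations", e.g. 12.75 = 75% through iteration 12.
    return iteration + fraction_done

def handle_progress_report(my_iter, my_fraction, sender_iter, sender_fraction=CHECKPOINT):
    diff = progress(sender_iter, sender_fraction) - progress(my_iter, my_fraction)
    if diff > BEHIND_THRESHOLD:
        # We are significantly behind the sender: identify ourselves as a straggler
        # and ask eligible helpers to take over part of our remaining work.
        return "send_do_this_messages"
    return "keep_working"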

 

Helping with work

Workers look for do-this messages during each message check. Upon receipt, a prospective helper compares the message's timestamp with the timestamp of the most recent cancel-help message. If the do-this message is newer, the prospective helper then compares its current iteration with the iteration number carried in the do-this message. If that number is smaller than the helper's current iteration, the helper immediately transmits a begun-helping message and starts working on the reassigned portion. A sketch of this logic is given below:
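The following sketch reconstructs this message-handling logic from the prose above; the message fields and the helper.send/do_work calls are hypothetical placeholders rather than FlexRR's actual interfaces.

# Minimal sketch of do-this message handling; names are illustrative.
from dataclasses import dataclass

@dataclass
class DoThis:
    sender: int
    timestamp: float
    iteration: int
    work_range: tuple  # (start, end) indices of the reassigned work

def check_messages(helper, do_this_msgs, last_cancel_help_ts):
    for msg in do_this_msgs:
        # Ignore requests already superseded by a newer cancel-help message.
        if msg.timestamp <= last_cancel_help_ts.get(msg.sender, 0.0):
            continue
        # Help only with work from an iteration the helper has already reached.
        if msg.iteration < helper.current_iteration:
            helper.send(msg.sender, "begun-helping")     # hypothetical messaging call
            helper.do_work(msg.work_range)               # process the reassigned slice
            helper.send(msg.sender, "help-completed")
            # Then look for further do-this messages before resuming own work.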

 

On completion of the assigned work, the helper conveys a help-completed message to the original worker and checks for further do-this messages before resuming its own originally assigned work.

Multinomial Logistic regression (MLR)

Multinomial logistic regression is a regression model in which the dependent variable is categorical. In the binary case, the dependent variable takes only two values such as "0" and "1", representing outcomes such as boy/girl or win/lose; the output is then the likelihood of each class given one or more independent variables. MLR is a common model for multiway classification, such as the last layer of text classification or image classification models [1]. In this setting, the probability that a (d-dimensional) observation x ∈ R^d belongs to each of K classes is modeled by the softmax transformation p(class = k) = exp(w_k^T x) / Σ_j exp(w_j^T x), where {w_j}, j = 1..K, are the linear (d-dimensional) weights associated with the respective classes and are treated as the model parameters [1].
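For concreteness, a minimal softmax sketch in Python/NumPy is shown below; it simply evaluates the expression above and is not the parameter-server implementation used in the experiments.

# Minimal sketch of the softmax classification rule.
import numpy as np

def softmax_probabilities(W, x):
    # W: K x d weight matrix (one d-dimensional weight vector per class); x: d-dimensional observation.
    scores = W @ x
    scores -= scores.max()                 # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # p(class = k | x) for each k

# Toy example: 3 classes, 5 features
W = np.random.randn(3, 5)
x = np.random.randn(5)
print(softmax_probabilities(W, x))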

The MLR investigations use the Avazu dataset, containing 40,428,967 training samples with a feature dimension of 1,000,000 and two classes. The investigations also encompass the ImageNet dataset with LLC features, comprising 64k training observations with 1,000 classes and a feature dimension of 21,504.

Matrix factorization (MF)

Matrix factorization is a procedure employed in recommender systems, such as recommending movies to users on services like Netflix; it is also referred to as collaborative filtering. The idea is to break the rating matrix down into smaller factors that describe user preferences for different types of items, or characteristics of items, and the extent to which each item exhibits those characteristics. In numerical terms, given a partially complete matrix X, matrix factorization factorizes X into factor matrices L and R such that their product approximates X, i.e., X ≈ LR [3]. MF can therefore be executed through the stochastic gradient descent (SGD) algorithm, with the parameter server holding the shared factors and individual worker threads performing the updates.

Each worker is assigned a subset of the entries in X, and in every iteration it computes the gradient for its part of the matrix and updates the corresponding rows of L and R in the parameter server. The experiment employs the Netflix dataset, a 480k-by-18k sparse matrix with 100m known entries. The runs are configured to factor the dataset into the product of two matrices with rank 500 for Cluster-A and rank 1000 for Clusters B, C, D, and E [1]. An additional experiment on Cluster-F employs an artificially enlarged form of the Netflix dataset (256x the original); this version is a 7683k-by-284k sparse matrix with 4.24 billion known entries, factored with rank 100.
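To make the update rule concrete, the toy single-threaded SGD sketch below factorizes a small set of known entries; the learning rate, regularization, and rank are placeholder values, and the parameter-server distribution across workers is omitted.

# Minimal single-threaded sketch of SGD-based matrix factorization (X ≈ L @ R).
import numpy as np

def sgd_mf(known_entries, n_rows, n_cols, rank=10, lr=0.01, reg=0.05, epochs=20):
    # known_entries: list of (i, j, value) for the observed cells of X
    L = 0.1 * np.random.randn(n_rows, rank)
    R = 0.1 * np.random.randn(rank, n_cols)
    for _ in range(epochs):
        for i, j, x_ij in known_entries:
            Li = L[i, :].copy()                      # snapshot before updating
            err = x_ij - Li @ R[:, j]                # prediction error for this entry
            L[i, :] += lr * (err * R[:, j] - reg * Li)       # gradient step on row i of L
            R[:, j] += lr * (err * Li - reg * R[:, j])       # gradient step on column j of R
    return L, R

# Toy usage: a 4x3 matrix with five observed entries
entries = [(0, 0, 5.0), (0, 2, 1.0), (1, 1, 3.0), (2, 0, 4.0), (3, 2, 2.0)]
L, R = sgd_mf(entries, n_rows=4, n_cols=3, rank=2)
print(L @ R)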

Latent Dirichlet Allocation (LDA)

LDA is a generative probabilistic model of a corpus. The essence of this ML strategy is that documents are represented as random mixtures over latent topics, and each topic is characterized by a distribution over words. It is a generative statistical model that allows sets of observations to be explained by unobserved groups which capture the similarity between parts of the data [3]. The Reuters dataset, a collection of documents that appeared on the Reuters newswire in 1987 drawn from the Reuters-21578 text categorization collection, informed the design of the LDA experiments. The current research executes collapsed Gibbs sampling [1]. The experiment employs the NYTimes dataset, comprising 100m words in 300k documents with a vocabulary size of 100k, configured to categorize words and documents into 500 topics for Cluster-A and 1,000 topics for Clusters B, C, D, and E. The expected results from this experiment are demonstrated below:
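As a rough illustration of the sampling step, the minimal single-machine collapsed Gibbs sketch below operates on a toy corpus; the hyperparameters and corpus are placeholders, not the NYTimes configuration described above.

# Minimal single-machine sketch of collapsed Gibbs sampling for LDA.
import numpy as np

def gibbs_lda(docs, vocab_size, n_topics=5, alpha=0.1, beta=0.01, iters=100):
    # docs: list of documents, each a list of word ids
    rng = np.random.default_rng(0)
    doc_topic = np.zeros((len(docs), n_topics)) + alpha      # document-topic counts (smoothed)
    topic_word = np.zeros((n_topics, vocab_size)) + beta     # topic-word counts (smoothed)
    topic_total = np.zeros(n_topics) + beta * vocab_size
    assignments = []
    for d, doc in enumerate(docs):
        zs = rng.integers(0, n_topics, size=len(doc))
        assignments.append(zs)
        for w, z in zip(doc, zs):
            doc_topic[d, z] += 1
            topic_word[z, w] += 1
            topic_total[z] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                z = assignments[d][n]
                # Remove the current assignment before resampling.
                doc_topic[d, z] -= 1; topic_word[z, w] -= 1; topic_total[z] -= 1
                # Collapsed conditional: p(z=k) proportional to (n_dk + alpha)(n_kw + beta)/(n_k + V*beta)
                p = doc_topic[d] * topic_word[:, w] / topic_total
                z = rng.choice(n_topics, p=p / p.sum())
                assignments[d][n] = z
                doc_topic[d, z] += 1; topic_word[z, w] += 1; topic_total[z] += 1
    return doc_topic, topic_word

# Toy usage: three tiny documents over a vocabulary of 6 word ids
docs = [[0, 1, 2, 1], [3, 4, 5, 4], [0, 2, 5, 3]]
doc_topic, topic_word = gibbs_lda(docs, vocab_size=6, n_topics=2, iters=50)
print(doc_topic)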

Figure 4: EC2, LDA, no injected delay

 

Figure 5: Power Law Straggler (naturally collected)

 

Figure 6: Transient Straggler (naturally collected)

 

Figure 7: Injected Transient Straggler with proactive scheduling

Evaluation

Results for Naturally-Occurring Straggler

Experiments were executed on EC2, PRObE Nome, and Microsoft clusters to assess FlexRR under the effects of naturally-occurring stragglers observed during the tests. No synthetic straggler effects were injected in these experiments.

 EC2 results

The outcomes for LDA and MF are presented in figures 4 and 8, respectively, for Cluster-B (c4.2xlarge VMs) and Cluster-C (c4.xlarge VMs). With c4.2xlarge VMs, the results show that FlexRR decreases the time-per-iteration by 35% for Matrix Factorization and 34% for Latent Dirichlet Allocation relative to Bulk Synchronous Parallel (BSP), and by 25% and 34% relative to SSP, respectively. With c4.xlarge VMs, the decreases are 53% (BSP) and 39% (SSP) for MF, and 49% (BSP) and 32% (SSP) for LDA. As noted in the two experiments, the improvement is more significant for c4.xlarge VMs, since the less costly c4.xlarge VMs experience more transient straggler effects.

These improvements on EC2 were observed despite running comparatively short experiments on relatively expensive EC2 instances that were expected to exhibit minimal resource sharing with other tenants' activities, highlighting how real transient stragglers are within cloud infrastructures.

Figure 8: Comparison of MF performance on EC2

 

Microsoft Cluster Results

With the A4 VMs, FlexRR reduces the time-per-iteration by 43% relative to BSP and 32% relative to SSP. With the A3 VMs, FlexRR yields decreases of 56% relative to BSP and 38% relative to SSP. The experiment notes that the A4 VM instances are larger and costlier, yet their times-per-iteration are higher than those of the A3 instances, because the A3 CPUs perform better on the floating-point computation of Matrix Factorization. Noteworthy straggler effects were observed in both setups, and the outcomes were similar to those noted on the c4 VMs on EC2. The results are demonstrated in the figure below.

 

Figure 9: Microsoft Cluster, MF, no injected delay

PRObE Nome Large MF Experiment

The research sought to verify the effectiveness of FlexRR on a more significant workload. The associated experiment therefore employed the synthetically enlarged Netflix*256 dataset on Cluster-F, as shown in figure 10 below. The results indicate a 21% reduction in FlexRR's time-per-iteration relative to SSP and a 51% reduction relative to BSP on this dedicated cluster without injected delays. As the cluster size increased, more straggler effects were observed; however, the effectiveness of FlexRR is not hindered by the increase in problem size. Cluster-F exhibited fewer straggler effects overall because of its dedicated nature and superior network compared to the EC2 clusters.

Figure 10: PRObE Nome Large MF, no injected delay

Results for Slow Worker Pattern

This section details the results of the study on the speed and rate of convergence of the ML applications under the Slow Worker Pattern. This approach modeled transient stragglers by explicitly adding sleep to the workers, in the applications or the engine. Sleep could be injected at any possible delay point (any point that may incur waiting) within each iteration, with each worker deciding independently, with probability x%, whether to be delayed. The injected sleep ranged from 0 to 2 times the duration of the actual iteration time, so an iteration could have no stragglers or multiple stragglers. The transient delay % determines how much a worker is slowed (e.g., a 100% delay means the delayed run is two times slower than the original one) [1].
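A minimal sketch of this injection scheme is shown below; the delay probability, delay percentage, and estimated iteration time are illustrative values rather than the exact experimental settings.

# Minimal sketch of the Slow Worker Pattern sleep injection.
import random
import time

DELAY_PROB = 0.1       # x%: each worker independently decides whether to be delayed
DELAY_PCT = 100        # 100% delay -> the delayed run takes twice as long as normal
EST_ITER_SECS = 2.0    # estimated duration of one undelayed iteration

def maybe_inject_delay():
    # Each potential delay point within the iteration is taken with probability DELAY_PROB.
    if random.random() < DELAY_PROB:
        time.sleep(EST_ITER_SECS * DELAY_PCT / 100.0)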

            Speed Tests

For every application, the experiments assessed the time-per-iteration (overall runtime / number of iterations) for each of the four modes, varying the intensity of the transient delay %. Each experiment measured the time-per-iteration over 20 iterations; running more iterations yielded similar results, so longer runs were unnecessary.

            Cluster-A results

The results for LDA and MF are shown in figures 11a, 11b, and 11c below. The MLR results are similar to those of MF (a plot is not shown due to space limitations). The results show that BSP slows down linearly with the delay intensity. SSP can mask delays below its slack-bound, but beyond that it is also affected linearly, because the delays become too large for the slack to absorb. FlexRR, on the other hand, nearly matches the Ideal situation up to 400% delays, which is far more extreme than would be expected in practice.

The experiment also examined the percentage of work reassigned by FlexRR and established that it ranges from 8%-9% of the total work when the delay is 0% (i.e., no injected delays) to 19%-22% when the delays are at 400%. Even at 0% delay, FlexRR runs 18% faster than BSP for LDA and MF, and 13% faster than BSP for MLR.

Figure 11: Slow Worker Pattern injected speed tests

The figures above also show that the RapidReassignment technique on its own can be applied to reduce the straggler penalty in BSP. Nonetheless, it is the combination of RapidReassignment and the soft consistency bounds in FlexRR that closely matches Ideal.

At elevated % delay values for LDA, a slight divergence from Ideal is observed. This is because Latent Dirichlet Allocation employs a unique mechanism for handling its local state, which requires extra model updates for each work reassignment. FlexRR reassigns more work at higher delays, so these additional updates start to affect run-times and cause FlexRR to depart from Ideal.

            Cluster-B Results

Figure 11c above illustrates the Matrix Factorization results from the bigger EC2 cluster. The results are qualitatively akin to the Cluster-A outcomes. Just like Cluster-A, BSP slows down linearly with the intensity of the delay, SSP can moderate stragglers only up to its slack parameter, and FlexRR stays close to Ideal. At 0% delay, 21% of the work is reassigned by FlexRR, while 31% is reassigned at 400% delay. The primary distinction between the results of the two clusters (A and B) is that for Cluster-B there is an even wider separation between FlexRR and BSP with RapidReassignment (the next-best approach) [1]. For example, with a 400% delay, BSP with RapidReassignment is ten times slower than FlexRR. Similarly, at 0% delay, without injected delays, FlexRR is 35% faster than BSP and 25% faster than SSP, as shown in the figure above. Non-injected performance jitter accounts for these differences. Finally, the LDA results for Cluster-B are qualitatively similar to those of Cluster-A.

Convergence Tests

The experiments also measured the time to convergence when running the ML applications in each mode. The Ideal was calculated by multiplying the absolute time-per-iteration values obtained from the speed tests described above by the number of iterations the BSP experiment required to achieve convergence. The stopping criterion, established with the help of ML practitioners, is: if the objective value (for MF) or log-likelihood (for LDA) of the solution changes by less than 2% over the course of 10 iterations, convergence is considered to have been reached [1]. The experiment also confirmed that all modes reached similar objective values under this criterion. Since calculating the actual objective value is comparatively costly, and the researcher intended to make frequent observations, the value was computed offline from FlexRR checkpoints.
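A minimal sketch of this stopping check, assuming a list holding one objective value (or log-likelihood) per iteration, could look as follows:

# Minimal sketch of the convergence criterion described above.
def has_converged(objective_history, window=10, tolerance=0.02):
    # objective_history: one objective value (or log-likelihood) per completed iteration.
    if len(objective_history) <= window:
        return False
    old, new = objective_history[-window - 1], objective_history[-1]
    # Converged if the relative change over the last `window` iterations is under 2%.
    return abs(new - old) / abs(old) < tolerance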

            Results from the Convergence Test

Figure 12a illustrates the MF results. For all % delays, BSP needed 112 iterations to converge, while SSP needed one additional iteration. FlexRR required 114 iterations, and 115 for the 400% disruption. Although these extra iterations are needed to achieve convergence, at 0% injected delay FlexRR converged 10% faster than BSP. With injected delays, BSP suffered a linear increase in convergence time. Conversely, FlexRR closely matched the Ideal time to convergence even at a delay of 400%. As anticipated, BSP with RapidReassignment improves convergence time and is faster than SSP, but remains significantly slower than FlexRR.

In figure 12b, the LDA experiment shows that BSP needed 41 iterations to converge for all % delays. FlexRR required 42 iterations for 0%, 50%, 100%, and 200% delays, and 43 at 400% injected delay. Despite needing additional iterations, FlexRR converged considerably faster than BSP. Without injected delays, FlexRR converged 18% faster than BSP, and it maintained a near-Ideal convergence time as the delays grew. Conversely, BSP suffered a linear increase in convergence time as delays were introduced. LDA also diverges slightly from Ideal at high delays.

Figure 12: Convergence tests

Disrupted Machine Pattern

The experiments also compared the average time-per-iteration of FlexRR to the alternative modes under the Disrupted Machine Pattern, as shown in figure 13. The MF, MLR, and LDA results on Cluster-A are qualitatively the same. BSP with RapidReassignment and SSP individually reduce the disruption experienced by BSP by 49% and 42%, respectively. Combining both methods in FlexRR matches Ideal, yielding a cumulative 63% reduction in run-time.

Figure 13: Disrupted Machine Pattern

Results for Power-Law Pattern

The experiment compared the average time-per-iteration (over twenty iterations) of the alternative modes under the Power-Law Pattern. For each application, results from Cluster-A are presented for α set to 4, 7, and 11; α = 11 mirrors an actual cluster as measured above, and smaller values of α generate more severe delays. The results show that for α = 11, BSP with RapidReassignment and SSP are 40% and 39% faster than BSP, respectively. When FlexRR combines the two methods, the run-time is 48% faster than BSP. As in the experiments described in previous parts of this evaluation, with increasing disruption (i.e., smaller α), the alternative modes showed a significant increase in run-times, while FlexRR experienced only slight increases.

The LDA and MLR results show similar trends. For α = 11, BSP with RapidReassignment and SSP were 31% and 36% faster than BSP for MLR, and 42% and 37% faster than BSP for LDA, respectively. With increasing disruption (smaller α), the other three modes showed a significant increase in run-times for MLR, and somewhat larger increases for LDA. In all cases, FlexRR considerably outperforms the other modes of straggler mitigation.

Figure 14: MF Speed, Power-Law Pattern

Distribution of Uneven Workload

The study also incorporated the persistent straggler pattern, whereby some of the machines received 75% or even 50% of the work per iteration. Such unbalanced assignments can reproduce the case where processing skew in the data is correlated with data placement.

Logistic Regression (LR)

This is a regression model in which the dependent variable is categorical. Specifically, the model employed in this experiment has a binary dependent variable, meaning that it takes only two values such as "0" and "1", representing outcomes such as boy/girl or win/lose. The output is then the likelihood of each class given one or more independent variables.

The LR experiments use the university cluster dataset, containing 40,428,967 training samples with a feature dimension of 1,000,000 and two classes.

Matrix Factorization (MF)

This is a procedure used in recommender systems, such as recommending movies to users on services like Netflix. It is also referred to as collaborative filtering [1].

The idea is to break the rating matrix down into smaller factors that describe user preferences for different types of items, or characteristics of items, and the extent to which each item exhibits those characteristics. In numerical terms, given a partially complete matrix X, "matrix factorization factorizes X into factor matrices L and R" [3]. Consequently, their product approximates X as X ≈ LR (Henggang Cui, 2014). MF can therefore be executed through the stochastic gradient descent (SGD) algorithm, with individual worker threads assigned subsets of the entries in X; in each iteration, a worker computes the gradient for its part of the matrix and updates the corresponding rows of L and R in the parameter server. The experiment employs the Netflix dataset, a 480k-by-18k sparse matrix with 100m known entries.

While initially designed to moderate transient stragglers, FlexRR is also useful in addressing differences in long-term workload between workers.

In an experiment conducted on Cluster-A, 50% of the machines were assigned 75% of the total workload, while the remaining machines were allocated 25%. FlexRR successfully mitigated the straggler effects produced by this uneven distribution of workload, closely matching the ideal speed. This is shown in figure 15, where SSP suffers a 54% slowdown.

Figure 15: Uneven workload

Partial Replication

In all the experiments described so far, workers replicated 100% of the input data belonging to the workers they might help. The study established that FlexRR remains useful when workers replicate only a fraction of that data and are therefore only eligible to assist with that particular portion. Figure 16 below shows the MF application running on Cluster-A with SSP, without injected delays, for different proportions of input data replicated on helper workers.

Figure 16: Partial replication

Sensitivity Study

This part documents the findings of tests conducted to determine appropriate settings for FlexRR's parameters. Each parameter was varied across its range while using default values for the rest, as shown in table 1. For brevity, only the MF sensitivity results are shown, as average time-per-iteration (20 iterations) under the Slow Worker straggler pattern running on Cluster-A. Similar results hold for the other applications, for time-to-convergence, and for the other clusters. The same default settings were applied in all other experiments described above.

Table 1: FlexRR Parameter Settings

Helper Group Size Test

The helper group refers to the set of workers to which an individual worker is eligible to provide help [1]. The outcomes of changing the helper group size from 0 to 16 helpers are shown in figure 17a. The findings indicate that once the size is set to 3 or higher, near-Ideal performance is realized. Additional exploration reveals that using four helpers provides the best performance, though the variance across settings from 3 to 16 is minor.

Figure 17: Sensitivity tests

Work assignment size test

One of the most critical design decisions in the experiments was determining how much work to reassign. Work is reassigned in two different sizes: the initial assignment size and the follow-up assignment size. Figure 17b shows the outcomes of changing the follow-up work assignment size from 0% to 30%; in these experiments the initial work assignment size is always 50% of the follow-up work assignment size [1]. Near-ideal performance was attained across the range of non-zero sizes, although as the delays increased, larger work assignments performed poorly.

This behavior is due to an intermittent corner case in which slow-running workers reassign a portion of their tasks to a faster worker that begins the added work but is then itself significantly delayed. Since the current implementation of FlexRR does not re-reassign work that has already been reassigned and acknowledged, the other workers must wait in this case. The corner case does not pose a problem for smaller work assignments, because the helper never lags far behind.

            Message Check Frequency

FlexRR relies on messages exchanged between workers to continually track progress and reassign work. The message check frequency refers to the number of times an individual worker looks for inbound messages in each iteration. If these checks are not conducted often enough, the system risks not responding fast enough; on the other hand, checking for inbound messages too often incurs unnecessary overhead. Figure 17c shows that any frequency from 100 to 10K performs well, but performance degrades once the frequency surpasses 10K.

Conclusion

The experiments described herein were designed to address the straggler problem of the Parameter Server for important machine learning workloads, including Matrix Factorization (MF), Logistic Regression (LR), and Latent Dirichlet Allocation (LDA), among other learning algorithms that use iterative approaches to achieve convergence. As a straggler mitigation technique, FlexRR is more effective than methods such as cloning and speculative execution. It enables flexible-consistency parallel execution and shared-state management for input-data-parallel iterative convergent machine learning algorithms such as MF, LR, and LDA. The FlexRR technique is implemented as a C++ library linked into the different ML applications that use it to address stragglers. In the experiments described herein, FlexRR execution comprises one process on every node used by the ML application; each worker process launches a worker thread for every core on a node and several background threads driving internal functionality.

The results show that FlexRR decreases the time-per-iteration significantly, most notably for Matrix Factorization compared with the other ML algorithms. For example, the experiment on naturally-occurring stragglers shows that FlexRR decreases time-per-iteration by 35% for Matrix Factorization and 34% for Latent Dirichlet Allocation relative to Bulk Synchronous Parallel (BSP), and by 25% and 34% relative to SSP, respectively. With the c4.xlarge VMs, the decreases are 53% (BSP) and 39% (SSP) for MF, and 49% (BSP) and 32% (SSP) for LDA. For the Disrupted Machine Pattern, the experiments compared the average time-per-iteration of FlexRR to the alternative modes; the MF, MLR, and LDA results on Cluster-A are qualitatively the same, with FlexRR reducing the disruption by 49% for Matrix Factorization and 42% when the LDA algorithm is applied. The study thus confirms that Matrix Factorization benefits the most when using FlexRR as a straggler mitigation technique.

References

  1. Harlap, A., Cui, H., Dai, W., Ganger, G. R., Gibbons, P. B., Gibson, G. A., & Xing, E. P. (2016). Addressing the straggler problem for iterative convergent parallel ML. Proceedings of the ACM Symposium on Cloud Computing (pp. 98-111). Santa Clara, CA: ACM.
  2. Li, M. (2014). Scaling distributed machine learning with the parameter server. Proceedings of the 2014 International Conference on Big Data Science and Computing.
  3. Xu, H., & Lau, W. C. (2017). Optimization for speculative execution in big data processing clusters. IEEE Transactions on Parallel and Distributed Systems, 28(2), 530-545.

 

 
