Workflow optimization in distributed computing environment for stream-based data processing model / Saima Gulzar Ahmad

Saima Gulzar, Ahmad (2017) Workflow optimization in distributed computing environment for stream-based data processing model / Saima Gulzar Ahmad. PhD thesis, University of Malaya.

      Abstract

      Advances in science and technology allow numerous complex scientific applications to be executed in heterogeneous computing environments. However, efficient scheduling algorithms remain the bottleneck. Such complex applications can be expressed as workflows, and geographically distributed heterogeneous resources can execute these workflows in parallel, which speeds up workflow execution. In data-intensive workflows, large volumes of data move across the execution nodes, causing high communication overhead. Many techniques have been used to avoid such overheads; this thesis uses a stream-based data processing model, in which data is processed as continuous instances of data items. Data-intensive workflow optimization is an active research area because numerous applications produce huge amounts of data, and that volume is growing exponentially. This thesis proposes data-intensive workflow optimization algorithms. The first algorithm consists of two phases: a) workflow partitioning and b) partition mapping. Partitions are formed so that a minimum amount of data moves across partitions. Because each partition is mapped to one execution node, heavy data processing takes place locally on the same execution node, which avoids high communication costs. In the mapping phase, each partition is mapped to the execution node that offers the minimum execution time, and the workflow is then executed. The second algorithm is a variation of the first in which data parallelism is introduced in each partition: the most compute-intensive task in each partition is identified and data parallelism is applied to it, reducing the execution time of that task. The simulation results show that the proposed algorithms outperform state-of-the-art algorithms for a variety of workflows. The datasets used for performance evaluation include synthesized workflows as well as workflows derived from real-world applications.
The workflows derived from real-world applications include Montage and CyberShake. Synthesized workflows were generated with different sizes, shapes and densities to evaluate the proposed algorithms. The simulation results show 60% reduced latency with a 47% improvement in throughput. Similarly, when data parallelism is introduced, the performance of the algorithm improves further, by 12% in latency and 17% in throughput, when compared to the PDWA algorithm. In the real-time stream processing framework, experiments were performed using STORM with a use-case data-intensive workflow (EURExpressII). The experiments show that PDWA performs better in terms of workflow execution time across different input data sizes.
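
The two phases described in the abstract can be illustrated with a small Python sketch. It is not the thesis's algorithm: the toy workflow, the data volumes, the task work values, the node speeds, and the function names `cut_data`, `map_partitions` and `heaviest_task` are all hypothetical, and giving each partition its own node is an assumption made here for simplicity. The sketch only shows the quantities involved: the data that crosses partition boundaries, a greedy minimum-execution-time mapping, and the selection of the most compute-intensive task in a partition.

```python
from typing import Dict, List, Tuple

# Hypothetical toy workflow: each edge carries the data volume (MB) between two tasks.
EDGES: Dict[Tuple[str, str], float] = {
    ("t1", "t2"): 500.0,
    ("t2", "t3"): 20.0,
    ("t3", "t4"): 400.0,
}
# Hypothetical computational work per task (arbitrary work units).
WORK: Dict[str, float] = {"t1": 10.0, "t2": 30.0, "t3": 20.0, "t4": 40.0}
# Hypothetical execution nodes and their speeds (work units per second).
NODE_SPEED: Dict[str, float] = {"n1": 1.0, "n2": 2.0, "n3": 4.0}


def cut_data(partitions: List[List[str]], edges: Dict[Tuple[str, str], float]) -> float:
    """Return the total data volume that crosses partition boundaries."""
    owner = {task: i for i, part in enumerate(partitions) for task in part}
    return sum(size for (src, dst), size in edges.items() if owner[src] != owner[dst])


def map_partitions(
    partitions: List[List[str]],
    work: Dict[str, float],
    node_speed: Dict[str, float],
) -> Dict[int, str]:
    """Greedy mapping: heaviest partition first, each to the fastest free node.

    Assumption of this sketch: every partition gets its own node.
    """
    if len(partitions) > len(node_speed):
        raise ValueError("not enough nodes for one node per partition")
    loads = [sum(work[t] for t in part) for part in partitions]
    free = dict(node_speed)
    mapping: Dict[int, str] = {}
    for i in sorted(range(len(partitions)), key=lambda i: -loads[i]):
        # Execution time of partition i on node n is load / speed.
        best = min(free, key=lambda n: loads[i] / free[n])
        mapping[i] = best
        del free[best]
    return mapping


def heaviest_task(partition: List[str], work: Dict[str, float]) -> str:
    """Return the most compute-intensive task, the candidate for data parallelism."""
    return max(partition, key=lambda t: work[t])


if __name__ == "__main__":
    good = [["t1", "t2"], ["t3", "t4"]]    # keeps the heavy edges inside partitions
    bad = [["t1"], ["t2", "t3"], ["t4"]]   # cuts both heavy edges
    print(cut_data(good, EDGES), cut_data(bad, EDGES))
    print(map_partitions(good, WORK, NODE_SPEED))
    print([heaviest_task(p, WORK) for p in good])
```

In this toy case, the partitioning that keeps the two heavy edges inside partitions moves far less data between nodes than the one that cuts them, which is the effect the partitioning phase aims for.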

      Item Type: Thesis (PhD)
      Additional Information: Thesis (PhD) – Faculty of Computer Science & Information Technology, University of Malaya, 2017.
      Uncontrolled Keywords: Workflow optimization; Stream-based data processing model; Heterogeneous resources; Algorithms; Computing environment
      Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
      Divisions: Faculty of Computer Science & Information Technology
      Depositing User: Mr Mohd Safri Tahir
      Date Deposited: 19 Sep 2017 16:21
      Last Modified: 19 Sep 2017 16:21
      URI: http://studentsrepo.um.edu.my/id/eprint/7761
