Establishing Job Scheduling and Checkpointing in Multi-Cluster Systems
K. Akshitha1, B. V. S. S. R. S. Sastry2, M. V. Vijaya Saradhi3
1K. Akshitha, Department of IT, Aurora’s Engineering College, Bhuvanagiri, Andhra Pradesh, India.
2B. V. S. S. R. S. Sastry, Department of IT, Aurora’s Engineering College, Bhuvanagiri, Andhra Pradesh, India.
3Dr. M. V. Vijaya Saradhi, Department of IT, Aurora’s Engineering College, Bhuvanagiri, Andhra Pradesh, India.
Manuscript received on October 15, 2011. | Revised Manuscript received on October 24, 2011. | Manuscript published on November 05, 2011. | PP: 380-384 | Volume-1 Issue-5, November 2011. | Retrieval Number: E0248101511/2011©BEIESP
© The Authors. Published By: Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Multi-site parallel job schedulers can improve average job turnaround time by making use of fragmented node resources available throughout the grid. Mapping a job across potentially many clusters lets jobs that would otherwise wait in the queue for local resources begin execution much earlier, which improves system utilization and reduces average queue waiting time. Recent research in this area of scheduling leverages user-provided estimates of job communication characteristics to partition jobs more effectively across system resources. In this paper, we address the impact of inaccuracies in these estimates on system performance and show that multi-site scheduling techniques benefit from these estimates, even in the presence of considerable inaccuracy. While these results are encouraging, there are instances where estimation errors lead to poor job scheduling decisions that cause network over-subscription. This situation can significantly degrade application performance and turnaround time. Consequently, we explore the use of job checkpointing, termination, migration, and restart (CTMR) to selectively stop offending jobs, alleviating network congestion, and to restart them when (and where) sufficient network resources are available. We then characterize the conditions under which, and the extent to which, CTMR improves overall performance.
Keywords: Parallel job scheduling; checkpointing; migration; clusters; grid scheduling.