
A multi-factor monitoring fault tolerance model based on a GPU cluster for big data processing

Thread started by Hossein Ranjbaran on 30/1/20 in the Fault Management and Fault Tolerance forum

  1. Administrator
    Hossein Ranjbaran
    Special user
    Joined:
    3/10/13
    Posts:
    1,034
    Thanks received:
    197
    High-performance computing clusters are widely used in large-scale data mining applications, which place higher requirements on persistence, stability and real-time use and are therefore computationally intensive. To support large-scale data processing, we design a multi-factor real-time monitoring fault tolerance (MRMFT) model based on a GPU cluster. However, the higher clock frequency of GPU chips results in excessively high energy consumption in computing systems. Moreover, the ability to sustain long-lasting high-temperature operation varies greatly between GPUs owing to individual differences between the chips. In this paper, we design a GPU cluster energy consumption monitoring system based on wireless sensor networks (WSNs) and propose energy consumption aware checkpointing (ECAC) to address the high energy consumption problem. ECAC has two advantages: first, the system sets checkpoints according to actual energy consumption and device temperature, which improves the utilization of checkpoints and reduces time cost; second, it exploits the parallel computing features of the CPU and GPU to hide the CPU detection overhead in GPU parallel computation, further reducing the time and energy consumption overhead in the fault tolerance phase. Using ECAC as the constraint and aiming for persistent and reliable operation, a dynamic task migration mechanism is designed, and the reliability of the cluster is greatly improved. The theoretical analysis and experimental results show that the model improves the persistence and stability of the computing system while reducing checkpoint overhead.
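
    To make the checkpointing idea in the abstract more concrete, here is a minimal sketch of a policy that takes a checkpoint only when measured power draw or device temperature crosses a limit. This is not the paper's ECAC algorithm, which the abstract does not specify. The names Reading, should_checkpoint and plan_checkpoints, and the limits of 250 W and 80 °C, are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reading:
    """One monitoring sample for a GPU (hypothetical structure)."""
    power_watts: float
    temp_celsius: float


def should_checkpoint(reading: Reading,
                      power_limit: float = 250.0,
                      temp_limit: float = 80.0) -> bool:
    """Return True when either measurement reaches its limit.

    The limits are illustrative assumptions, not values from the paper.
    """
    return (reading.power_watts >= power_limit
            or reading.temp_celsius >= temp_limit)


def plan_checkpoints(readings: List[Reading],
                     power_limit: float = 250.0,
                     temp_limit: float = 80.0) -> List[int]:
    """Return the indices of the samples at which a checkpoint would be taken."""
    return [i for i, r in enumerate(readings)
            if should_checkpoint(r, power_limit, temp_limit)]


if __name__ == "__main__":
    samples = [Reading(180, 62), Reading(260, 70),
               Reading(200, 85), Reading(190, 60)]
    print(plan_checkpoints(samples))  # [1, 2]
```

    Compared with checkpointing at a fixed interval, a policy like this saves state only when the monitored values suggest elevated risk, which is the intuition behind setting checkpoints "according to actual energy consumption and the device temperature".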


    The download link is visible to members in the next post.
     
