Each day, semiconductor manufacturing companies (fabs) run distributed, compute-intensive post-tape-out flow (PTOF) jobs to apply various resolution enhancement technology (RET) techniques to incoming designs and convert those designs into photomask data that is ready for manufacturing. This process is performed on large compute clusters managed by a job scheduler. To minimize the compute cost of each PTOF job, various manual techniques are used to choose the compute setup that yields the best hardware utilization and an efficient runtime for that job. We introduce a machine learning (ML) solution that predicts CPU time for these PTOF jobs, which can be used to estimate compute cost, recommend resources, and feed scheduling models. ML training is based on job-specific features extracted from production data, such as layout size, hierarchy, and operations, as well as metadata such as job type, technology node, and layer. The set of input features correlated with the prediction target was evaluated, along with several ML techniques, across a wide variety of jobs. Integrating an ML-based CPU runtime prediction module into the production flow provides data that can be used to improve job priority decisions, raise runtime warnings due to hardware or other issues, and estimate the compute cost of each job (which is especially useful in a cloud environment). Given the wide variation in expected runtimes across different types of jobs, replacing manual monitoring of jobs in tape-out operations with an integrated ML-based solution can improve both the productivity and the efficiency of the PTOF process.
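As a rough illustration of the kind of model described above, the sketch below trains a regression model on historical PTOF job records to predict CPU time. The feature names, the CSV layout, and the choice of gradient boosting are assumptions for illustration; the abstract lists the feature categories but evaluates several ML techniques without prescribing one.

```python
# Minimal sketch of a CPU-time prediction model for PTOF jobs.
# Column names and "ptof_job_history.csv" are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Job-specific features extracted from production data, plus job metadata.
NUMERIC = ["layout_size_gb", "hierarchy_cell_count", "operation_count"]
CATEGORICAL = ["job_type", "technology_node", "layer"]
TARGET = "cpu_hours"

jobs = pd.read_csv("ptof_job_history.csv")  # assumed historical training set
X, y = jobs[NUMERIC + CATEGORICAL], jobs[TARGET]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = Pipeline([
    # One-hot encode categorical metadata, pass numeric features through.
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL)],
        remainder="passthrough")),
    ("regress", GradientBoostingRegressor(n_estimators=300, max_depth=4)),
])
model.fit(X_train, y_train)

# Report error as a percentage so it is comparable across jobs whose
# expected runtimes vary by orders of magnitude.
err = mean_absolute_percentage_error(y_test, model.predict(X_test))
print(f"hold-out MAPE: {err:.1%}")
```

A percentage-based error metric is used here because, as the abstract notes, expected runtimes vary widely across job types, so an absolute error in CPU hours would be dominated by the largest jobs.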
With CMOS technology nodes advancing further into the realm of sub-wavelength lithography, the need for compute power also increases to meet runtime requirements for reticle enhancement techniques and results validation. Expanding the mask data preparation (MDP) cluster size is an obvious way to increase compute power, but it can lead to unforeseen problems, such as network bottlenecks, that must be taken into account. Advanced scalable solutions provided by optical proximity correction (OPC)/mask process correction (MPC) software are clearly critical, but other optimizations, such as dynamic CPU allocation (DCA) based on real CPU needs, high-level job management, real-time resource monitoring, and bottleneck detection, are also important for improving cluster utilization, meeting runtime requirements, and handling post-tapeout (PTO) workloads efficiently. In this paper, we discuss how these efforts are tackled at various levels of the “cluster utilization stack,” from low-level CPU usage up to business-level decisions, with the goal of maximizing cluster utilization and maintaining lean computing.
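To make the dynamic CPU allocation idea concrete, the following sketch shows one possible rebalancing pass that compares each job's measured CPU usage against its current allocation, reclaims idle cores, and grants them to CPU-bound jobs. The Job fields, thresholds, and the rebalance policy are all assumptions for illustration; a real deployment would use the cluster scheduler's own monitoring and resizing APIs.

```python
# Illustrative sketch of a dynamic CPU allocation (DCA) pass, under assumed
# thresholds: shrink jobs using < 60% of their cores, grow jobs using > 90%.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    allocated_cpus: int   # cores currently reserved for the job
    used_cpus: float      # measured average number of busy cores

def rebalance(jobs: list[Job], free_pool: int,
              shrink_below: float = 0.6, grow_above: float = 0.9,
              min_cpus: int = 8) -> int:
    """Run one rebalance pass and return the updated free-core pool."""
    # Phase 1: reclaim cores from under-utilized jobs.
    for job in jobs:
        utilization = job.used_cpus / job.allocated_cpus
        if utilization < shrink_below and job.allocated_cpus > min_cpus:
            target = max(min_cpus, int(job.used_cpus / shrink_below))
            free_pool += job.allocated_cpus - target
            job.allocated_cpus = target
    # Phase 2: grant reclaimed cores to CPU-bound jobs.
    for job in jobs:
        if job.used_cpus / job.allocated_cpus > grow_above and free_pool > 0:
            grant = min(free_pool, job.allocated_cpus // 2)
            job.allocated_cpus += grant
            free_pool -= grant
    return free_pool

# Example: an OPC job running at nearly full utilization alongside a
# verification job that keeps most of its cores idle.
remaining = rebalance([Job("opc_metal1", 256, 250.0),
                       Job("verify_check", 128, 40.0)], free_pool=0)
print(remaining)
```

Separating the shrink and grow phases ensures that cores released by idle jobs in the same pass are immediately available to jobs that need them, which is the core intent of allocating CPUs "based on real CPU needs" rather than static reservations.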