With the growing speed gap between processor and memory, memory access has become an important bottleneck for program performance, and modern computers use a hierarchical storage system with multiple levels of cache to cope with this situation. However, some applications that use structure types as their core data structures do not fully consider data locality in their design, resulting in a low cache hit rate at run time and severely constrained performance. Structure layout optimization addresses this problem: it is a technique that optimizes a program's core data structures and requires changes to the entire program. Based on the GCC 10.3.0 compiler, this thesis implements structure layout optimization in four steps (recording, analysis, conversion, and rewriting), building on GCC's well-established interprocedural optimization and link-time optimization frameworks. Structures in the program are split as far as possible so that frequently accessed fields are separated from infrequently accessed ones, which improves the program's spatial locality, raises the run-time cache hit rate, and ultimately improves performance. Experimental validation and analysis show significant performance improvements for memory-intensive applications that use structure types as their core data structures and exhibit noticeable differences in access frequency among structure fields. Experiments on the SPEC CPU 2000 and SPEC CPU 2006 benchmark suites achieved a maximum speedup of 1.86.
KEYWORDS: Power consumption, Design and modelling, Matrices, Convolutional neural networks, Windows, Clocks, Computer architecture, Convolution, Image compression, Digital signal processing
A low-power RISC-V-based convolutional neural network (CNN) acceleration processor is proposed to address the problem that the growing resource requirements of hardware-accelerated CNNs are difficult to meet on embedded devices. The processor provides three custom instructions that configure the parameters of each CNN layer to accommodate different input data, multiplex computational resources to reduce power consumption, and execute heavily repeated operations in parallel to improve efficiency. Comparison experiments on the same data show that this custom instruction set is 20.93, 7.67, and 8.97 times faster than the base RISC-V instruction set on convolution, activation, and pooling, respectively. The experimental results show that the total power consumption of the processor with this custom instruction set is only 0.221 W at a 16 MHz operating frequency, giving it an advantageous performance-to-power ratio compared with other RISC-V acceleration processors, together with lower resource consumption and power consumption.