
Programming for Hybrid Multi/Manycore MPP Systems

John Levesque
Aaron Vose

CRC Press, Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2018 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
Version Date: 20170825
International Standard Book Number-13: 978-1-4398-7371-7 (Hardback)

Library of Congress Cataloging-in-Publication Data
Names: Levesque, John M., author. | Vose, Aaron, author.
Title: Programming for hybrid multi/manycore MPP systems / John Levesque, Aaron Vose.
Description: Boca Raton : CRC Press, Taylor & Francis, 2017. | Series: Chapman & Hall/CRC computational science | Includes index.
Identifiers: LCCN 2017018319 | ISBN 9781439873717 (hardback : alk. paper)
Subjects: LCSH: Parallel programming (Computer science) | Multiprocessors--Programming. | Coprocessors--Programming.
Classification: LCC QA76.642 .L475 2017 | DDC 005.2/75--dc23
LC record available at https://lccn.loc.gov/2017018319

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Contents

Preface
About the Authors
List of Figures
List of Tables
List of Excerpts

Chapter 1 • Introduction 1
1.1 INTRODUCTION 1
1.2 CHAPTER OVERVIEWS 3

Chapter 2 • Determining an Exaflop Strategy 7
2.1 FOREWORD BY JOHN LEVESQUE 7
2.2 INTRODUCTION 8
2.3 LOOKING AT THE APPLICATION 9
2.4 DEGREE OF HYBRIDIZATION REQUIRED 13
2.5 DECOMPOSITION AND I/O 15
2.6 PARALLEL AND VECTOR LENGTHS 15
2.7 PRODUCTIVITY AND PERFORMANCE PORTABILITY 15
2.8 CONCLUSION 19
2.9 EXERCISES 19

Chapter 3 • Target Hybrid Multi/Manycore System 21
3.1 FOREWORD BY JOHN LEVESQUE 21
3.2 UNDERSTANDING THE ARCHITECTURE 22
3.3 CACHE ARCHITECTURES 23
3.3.1 Xeon Cache 24
3.3.2 NVIDIA GPU Cache 25
3.4 MEMORY HIERARCHY 25
3.4.1 Knight's Landing Cache 27
3.5 KNL CLUSTERING MODES 28
3.6 KNL MCDRAM MODES 33
3.7 IMPORTANCE OF VECTORIZATION 38
3.8 ALIGNMENT FOR VECTORIZATION 40
3.9 EXERCISES 40

Chapter 4 • How Compilers Optimize Programs 43
4.1 FOREWORD BY JOHN LEVESQUE 43
4.2 INTRODUCTION 45
4.3 MEMORY ALLOCATION 45
4.4 MEMORY ALIGNMENT 47
4.5 COMMENT LINE DIRECTIVE 48
4.6 INTERPROCEDURAL ANALYSIS 49
4.7 COMPILER SWITCHES 49
4.8 FORTRAN 2003 AND INEFFICIENCIES 50
4.8.1 Array Syntax 51
4.8.2 Use Optimized Libraries 53
4.8.3 Passing Array Sections 53
4.8.4 Using Modules for Local Variables 54
4.8.5 Derived Types 54
4.9 C/C++ AND INEFFICIENCIES 55
4.10 COMPILER SCALAR OPTIMIZATIONS 61
4.10.1 Strength Reduction 61
4.10.2 Avoiding Floating Point Exponents 63
4.10.3 Common Subexpression Elimination 64
4.11 EXERCISES 65

Chapter 5 • Gathering Runtime Statistics for Optimizing 67
5.1 FOREWORD BY JOHN LEVESQUE 67
5.2 INTRODUCTION 68
5.3 WHAT'S IMPORTANT TO PROFILE 69
5.3.1 Profiling NAS BT 69
5.3.2 Profiling VH1 74
5.4 CONCLUSION 76
5.5 EXERCISES 77

Chapter 6 • Utilization of Available Memory Bandwidth 79
6.1 FOREWORD BY JOHN LEVESQUE 79
6.2 INTRODUCTION 80
6.3 IMPORTANCE OF CACHE OPTIMIZATION 80
6.4 VARIABLE ANALYSIS IN MULTIPLE LOOPS 81
6.5 OPTIMIZING FOR THE CACHE HIERARCHY 84
6.6 COMBINING MULTIPLE LOOPS 93
6.7 CONCLUSION 96
6.8 EXERCISES 96

Chapter 7 • Vectorization 97
7.1 FOREWORD BY JOHN LEVESQUE 97
7.2 INTRODUCTION 98
7.3 VECTORIZATION INHIBITORS 99
7.4 VECTORIZATION REJECTION FROM INEFFICIENCIES 101
7.4.1 Access Modes and Computational Intensity 101
7.4.2 Conditionals 104
7.5 STRIDING VERSUS CONTIGUOUS ACCESSING 107
7.6 WRAP AROUND SCALAR 111
7.7 LOOPS SAVING MAXIMA AND MINIMA 114
7.8 MULTINESTED LOOP STRUCTURES 116
7.9 THERE'S MATMUL AND THEN THERE'S MATMUL 119
7.10 DECISION PROCESSES IN LOOPS 122
7.10.1 Loop-Independent Conditionals 123
7.10.2 Conditionals Directly Testing Indices 125
7.10.3 Loop-Dependent Conditionals 130
7.10.4 Conditionals Causing Early Loop Exit 132
7.11 HANDLING FUNCTION CALLS WITHIN LOOPS 134
7.12 RANK EXPANSION 139
7.13 OUTER LOOP VECTORIZATION 143
7.14 EXERCISES 144

Chapter 8 • Hybridization of an Application 147
8.1 FOREWORD BY JOHN LEVESQUE 147
8.2 INTRODUCTION 147
8.3 THE NODE'S NUMA ARCHITECTURE 148
8.4 FIRST TOUCH IN THE HIMENO BENCHMARK 149
8.5 IDENTIFYING WHICH LOOPS TO THREAD 153
8.6 SPMD OPENMP 158
8.7 EXERCISES 167

Chapter 9 • Porting Entire Applications 169
9.1 FOREWORD BY JOHN LEVESQUE 169
9.2 INTRODUCTION 170
9.3 SPEC OPENMP BENCHMARKS 170
9.3.1 WUPWISE 170
9.3.2 MGRID 175
9.3.3 GALGEL 177
9.3.4 APSI 179
9.3.5 FMA3D 182
9.3.6 AMMP 184
9.3.7 SWIM 190
9.3.8 APPLU 192
9.3.9 EQUAKE 194
9.3.10 ART 201
9.4 NASA PARALLEL BENCHMARK (NPB) BT 208
9.5 REFACTORING VH1 218
9.6 REFACTORING LESLIE3D 223
9.7 REFACTORING S3D – 2016 PRODUCTION VERSION 226
9.8 PERFORMANCE PORTABLE – S3D ON TITAN 230
9.9 EXERCISES 241

Chapter 10 • Future Hardware Advancements 243
10.1 INTRODUCTION 243
10.2 FUTURE X86 CPUS 244
10.2.1 Intel Skylake 244
10.2.2 AMD Zen 244
10.3 FUTURE ARM CPUS 245
10.3.1 Scalable Vector Extension 245
10.3.2 Broadcom Vulcan 248
10.3.3 Cavium ThunderX 249
10.3.4 Fujitsu Post-K 249
10.3.5 Qualcomm Centriq 249
10.4 FUTURE MEMORY TECHNOLOGIES 250
10.4.1 Die-Stacking Technologies 250
10.4.2 Compute Near Data 251
10.5 FUTURE HARDWARE CONCLUSIONS 252
10.5.1 Increased Thread Counts 252
10.5.2 Wider Vectors 252
10.5.3 Increasingly Complex Memory Hierarchies 254

Appendix A • Supercomputer Cache Architectures 255
A.1 ASSOCIATIVITY 255

Appendix B • The Translation Look Aside Buffer 261
B.1 INTRODUCTION TO THE TLB 261

Appendix C • Command Line Options and Compiler Directives 263
C.1 COMMAND LINE OPTIONS AND COMPILER DIRECTIVES 263

Appendix D • Previously Used Optimizations 265
D.1 LOOP REORDERING 265
D.2 INDEX REORDERING 266
D.3 LOOP UNROLLING 266
D.4 LOOP FISSION 266
D.5 SCALAR PROMOTION 266
D.6 REMOVAL OF LOOP INDEPENDENT IFS 267
D.7 USE OF INTRINSICS TO REMOVE IFS 267
D.8 STRIP MINING 267
D.9 SUBROUTINE INLINING 267
D.10 PULLING LOOPS INTO SUBROUTINES 267
D.11 CACHE BLOCKING 268
D.12 LOOP FUSION 268
D.13 OUTER LOOP VECTORIZATION 268

Appendix E • I/O Optimization 269
E.1 INTRODUCTION 269
E.2 I/O STRATEGIES 269
E.2.1 Spokesperson 269
E.2.2 Multiple Writers – Multiple Files 270
E.2.3 Collective I/O to Single or Multiple Files 270
E.3 LUSTRE MECHANICS 270

Appendix F • Terminology 273
F.1 SELECTED DEFINITIONS 273

Appendix G • 12 Step Process 277
G.1 INTRODUCTION 277
G.2 PROCESS 277

Bibliography 279
Index 285

Preface

For the past 20 years, high performance computing has benefited from a significant reduction in the clock cycle time of the basic processor. Going forward, trends indicate the clock rate of the most powerful processors in the world may stay the same or decrease slightly. When the clock rate decreases, the chip runs at a slower speed. At the same time, the amount of physical space that a computing core occupies is still trending downward. This means more processing cores can be contained within the chip.

With this paradigm shift in chip technology, caused by the amount of electrical power required to run the device, additional performance is being delivered by increasing the number of processors on the chip and (re)introducing SIMD/vector processing. The goal is to deliver more floating-point operations per second per watt. Interestingly, these evolving chip technologies are being used on scientific systems as small as a single workstation and as large as the systems on the Top 500 list.

Within this book are techniques for effectively utilizing these new node architectures. Efficient threading on the node, vectorization to utilize the powerful SIMD units, and effective memory management will be covered, along with examples that allow the typical application developer to apply them to their programs. Performance-portable techniques will be shown that run efficiently on all HPC nodes.
The principal target systems will be the latest multicore Intel Xeon processor and the latest Intel Knight's Landing (KNL) chip, with discussion and comparison of the latest hybrid/accelerated systems using the NVIDIA Pascal accelerator.

The following QR code points to www.hybridmulticore.com, the book's companion website, which will contain solutions to the exercises in the book:

Figures

2.3.1 3D grid decomposition minimizing MPI surface area. 12
2.4.1 Performance of S3D on KNL with different MPI rank counts. 14
2.7.1 Performance increase of refactored COSMOS code. 17
2.7.2 Energy reduction of refactored COSMOS code. 18
3.4.1 NVLink (solid) and PCIe (dashed) connecting two CPUs and eight GPUs. 26
3.4.2 NVLink (solid) and PCIe (dashed) connecting one CPU and one to two GPUs. 26
3.4.3 KNL cache hierarchy. 27
3.5.1 All-to-all clustering mode on KNL: (1) level-2 miss, (2) directory access, (3) memory access, and (4) data return. 28
3.5.2 Quadrant clustering mode on KNL: (1) level-2 miss, (2) directory access, (3) memory access, and (4) data return. 29
3.5.3 SNC4 clustering mode on KNL: (1) level-2 miss, (2) directory access, (3) memory access, and (4) data return. 30
3.5.4 KNL clustering modes with Himeno and 1 thread per rank. 31
3.5.5 KNL clustering modes with Himeno and 4 threads per rank. 32
3.5.6 KNL clustering modes with S3D and 1 thread per rank. 33
3.6.1 MCDRAM as a direct-mapped cache for 128GB of DDR. 35
3.6.2 KNL cache modes with Himeno and 1 thread per rank. 36
3.7.1 Efficiency of vectorization on GPU with increasing vector lengths. 39
4.9.1 Speedup relative to extended DAXPY code with aliasing. 59
6.3.1 Haswell cache hierarchy. 80
6.4.1 Performance of strip mine example in quadrant/cache mode. 83
6.4.2 Performance of strip mine example in snc4/cache mode. 83
6.5.1 Vector speedup at increasing loop iteration counts. 84