A closed-cycle gasoline compression ignition (GCI) engine simulation near top dead center (TDC) was used to profile the performance of a parallel commercial engine computational fluid dynamics (CFD) code, as it was scaled on up to 4096 cores of an IBM Blue Gene/Q (BG/Q) supercomputer. The test case has 9 × 106 cells near TDC, with a fixed mesh size of 0.15 mm, and was run on configurations ranging from 128 to 4096 cores. Profiling was done for a small duration of 0.11 crank angle degrees near TDC during ignition. Optimization of input/output (I/O) performance resulted in a significant speedup in reading restart files, and in an over 100-times speedup in writing restart files and files for postprocessing. Improvements to communication resulted in a 1400-times speedup in the mesh load balancing operation during initialization, on 4096 cores. An improved, “stiffness-based” algorithm for load balancing chemical kinetics calculations was developed, which results in an over three-times faster runtime near ignition on 4096 cores relative to the original load balancing scheme. With this improvement to load balancing, the code achieves over 78% scaling efficiency on 2048 cores, and over 65% scaling efficiency on 4096 cores, relative to 256 cores.