Back in 2005 I was the Xbox 360 CPU guy. I lived and breathed that chip. I still have a 30-cm CPU wafer on my wall, and a four-foot poster of the CPU’s layout. I spent so much time understanding how that CPU’s pipelines worked that when I was asked to investigate some impossible crashes I was able to intuit how a design bug must be their cause. But first, some background…
The Xbox 360 CPU is a three-core PowerPC chip made by IBM. The three cores sit in three separate quadrants, with the fourth quadrant containing a 1-MB L2 cache – you can see the different components in the picture at right and on my CPU wafer. Each core has a 32-KB instruction cache and a 32-KB data cache.
Trivia: Core 0 was closer to the L2 cache and had measurably lower L2 latencies.
The Xbox 360 CPU had high latencies for everything, with memory latencies being particularly bad. And the 1-MB L2 cache (all that could fit) was pretty small for a three-core CPU. So, conserving space in the L2 cache in order to minimize cache misses was important.
CPU caches improve performance due to spatial and temporal locality. Spatial locality means that if you’ve used one byte of data then you’ll probably use other nearby bytes of data soon. Temporal locality means that if you’ve used some memory then you will probably use it again in the near future.
But sometimes temporal locality doesn’t actually happen. If you are processing a large array of data once per frame then it may be trivially provable that it will all be gone from the L2 cache by the time you need it again. You still want that data in the L1 cache so that you can benefit from spatial locality, but having it consume valuable space in the L2 cache just means it will evict other data, perhaps slowing down the other two cores.
Normally this is unavoidable. The memory coherency mechanism of our PowerPC CPU required that all data in the L1 caches also be in the L2 cache. The MESI protocol used for memory coherency requires that when one core writes to a cache line, any other cores with a copy of the same cache line must discard it – and the L2 cache was responsible for keeping track of which L1 caches were caching which addresses.
But the CPU was for a video game console and performance trumped all, so a new instruction was added – xdcbt. The normal PowerPC dcbt instruction was a typical prefetch instruction. The xdcbt instruction was an extended prefetch instruction that fetched straight from memory to the L1 d-cache, skipping L2. This meant that memory coherency was no longer guaranteed, but hey, we’re video game programmers, we know what we’re doing, it will be fine.
I wrote a widely-used Xbox 360 memory copy routine that optionally used xdcbt. Prefetching the source data was crucial for performance. Normally the routine would use dcbt, but pass in the PREFETCH_EX flag and it would prefetch with xdcbt. This was not well thought out. The prefetching was basically:
if (flags & PREFETCH_EX) __xdcbt(src+offset);
else __dcbt(src+offset);
A game developer who was using this function reported weird crashes – heap corruption crashes, but the heap structures in the memory dumps looked normal. After staring at the crash dumps for a while I realized what a mistake I had made.
Memory that is prefetched with xdcbt is toxic. If it is written by another core before being flushed from L1 then two cores have different views of memory and there is no guarantee their views will ever converge. The Xbox 360 cache lines were 128 bytes and my copy routine’s prefetching went right to the end of the source memory, meaning that xdcbt was applied to some cache lines whose latter portions were part of adjacent data structures. Typically this was heap metadata – at least that’s where we saw the crashes. The incoherent core saw stale data (despite careful use of locks) and crashed, but the crash dump wrote out the actual contents of RAM, so we couldn’t see what had happened.
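To make the failure mode concrete, here is a toy model in C. It is only a sketch under loud assumptions: one cache line holding a single value, write-through stores, and an L2 that tracks sharers with a bitmask. Every name in it (ToyLine, toy_dcbt, toy_xdcbt and so on) is made up for illustration and is not how the real hardware or any Xbox 360 API works.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_CORES 3

/* One cache line, modeled as a single value. All names are hypothetical. */
typedef struct {
    uint32_t ram;                  /* the value held in RAM / L2 */
    uint32_t l1_copy[NUM_CORES];   /* each core's private L1 copy */
    bool     l1_valid[NUM_CORES];  /* does that core's L1 hold the line? */
    uint8_t  l2_sharers;           /* bitmask of cores the L2 knows about */
} ToyLine;

/* Normal prefetch: fill L1 and register the core with the L2. */
static void toy_dcbt(ToyLine *line, int core) {
    assert(core >= 0 && core < NUM_CORES);
    line->l1_copy[core] = line->ram;
    line->l1_valid[core] = true;
    line->l2_sharers |= (uint8_t)(1u << core);
}

/* Extended prefetch: fill L1 but bypass L2, so the L2 never learns
   that this core holds a copy. */
static void toy_xdcbt(ToyLine *line, int core) {
    assert(core >= 0 && core < NUM_CORES);
    line->l1_copy[core] = line->ram;
    line->l1_valid[core] = true;
}

/* Write: update RAM and the writer's L1, then invalidate only the
   other cores that the L2 is tracking. */
static void toy_write(ToyLine *line, int core, uint32_t value) {
    assert(core >= 0 && core < NUM_CORES);
    line->ram = value;
    line->l1_copy[core] = value;
    line->l1_valid[core] = true;
    for (int other = 0; other < NUM_CORES; ++other) {
        if (other != core && (line->l2_sharers & (1u << other))) {
            line->l1_valid[other] = false;
            line->l2_sharers &= (uint8_t)~(1u << other);
        }
    }
    line->l2_sharers |= (uint8_t)(1u << core);
}

/* Read: hit in L1 if the copy is still marked valid, else refill. */
static uint32_t toy_read(ToyLine *line, int core) {
    assert(core >= 0 && core < NUM_CORES);
    if (!line->l1_valid[core]) toy_dcbt(line, core);
    return line->l1_copy[core];
}
```

In this model, if core 0 fills its L1 with toy_dcbt and core 1 then writes, core 0's next read refills and sees the new value. If core 0 used toy_xdcbt instead, the invalidation never reaches it and it keeps reading the old value while RAM holds the new one, which matches the stale-data-but-clean-crash-dump symptom.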
So, the only safe way to use xdcbt was to be very careful not to prefetch even a single byte beyond the end of the buffer. I fixed my memory copy routine to avoid prefetching too far, but while waiting for the fix the game developer stopped passing the PREFETCH_EX flag and the crashes went away.
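The bounds check can be sketched roughly as follows. This is not the original copy routine, just an illustration of the arithmetic with hypothetical helper names; the only number taken from the story is the 128-byte cache line. It is also slightly more conservative than described above, since it rejects a partial line at the start of the buffer as well as at the end (my assumption, on the grounds that a shared first line is no safer than a shared last line).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 128u  /* Xbox 360 cache line size, per the text */

/* True only if the whole cache line containing `addr` lies inside
   [buf_start, buf_end). Hypothetical helper for illustration. */
static bool line_is_inside_buffer(uintptr_t addr,
                                  uintptr_t buf_start,
                                  uintptr_t buf_end) {
    assert(buf_start <= buf_end);
    uintptr_t line_start = addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1u);
    uintptr_t line_end = line_start + CACHE_LINE_SIZE;
    return line_start >= buf_start && line_end <= buf_end;
}

/* Count the cache lines touched by the buffer that would be safe
   targets for an incoherent prefetch. Lines shared with a neighboring
   allocation are skipped; a real routine would fall back to a normal
   coherent prefetch for those. */
static size_t count_safe_lines(uintptr_t buf_start, size_t size) {
    uintptr_t buf_end = buf_start + size;
    uintptr_t line = buf_start & ~(uintptr_t)(CACHE_LINE_SIZE - 1u);
    size_t safe = 0;
    for (; line < buf_end; line += CACHE_LINE_SIZE) {
        if (line_is_inside_buffer(line, buf_start, buf_end)) ++safe;
    }
    return safe;
}
```

For example, a 300-byte buffer starting on a line boundary touches three lines, but only the first two are wholly owned by the buffer; the third also holds 84 bytes of whatever comes next in memory, which is exactly where the heap metadata lived.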
The real bug
So far so normal, right? Cocky game developers play with fire, fly too close to the sun, marry their mothers, and a game console almost misses Christmas.
But we caught it in time, we got away with it, and we were all set to ship the games and the console and go home happy.
And then the same game started crashing again.
The symptoms were identical. Except that the game was no longer using the xdcbt instruction. I could step through the code and see that. We had a serious problem.
I used the ancient debugging technique of staring at my screen with a blank mind, let the CPU pipelines fill my subconscious, and I suddenly realized the problem. A quick email to IBM confirmed my suspicion about a subtle internal CPU detail that I had never thought about before. And it’s the same culprit behind Meltdown and Spectre.
The Xbox 360 CPU is an in-order CPU. It’s pretty simple really, relying on its high frequency (not as high as hoped despite 10 FO4) for performance. But it does have a branch predictor – its very long pipelines make that necessary. Here’s a publicly shared CPU pipeline diagram I made (my cycle-accurate version is NDA only, but looky here) that shows all of the pipelines:
You can see the branch predictor, and you can see that the pipelines are very long (wide on the diagram) – plenty long enough for mispredicted instructions to get up to speed, even with in-order processing.
So, the branch predictor makes a prediction and the predicted instructions are fetched, decoded, and executed – but not retired until the prediction is known to be correct. Sound familiar? The realization I had – it was new to me at the time – was what it meant to speculatively execute a prefetch. The latencies were long, so it was important to get the prefetch transaction on the bus as soon as possible, and once a prefetch had been initiated there was no way to cancel it. So a speculatively-executed xdcbt was identical to a real xdcbt! (A speculatively-executed load instruction was just a prefetch, FWIW.)
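Here is a deliberately tiny C model of that asymmetry. It is a sketch, not a description of the real pipeline: the types and the function toy_run_predicted_path are invented for illustration, and the only behavior it encodes is the one described above, namely that register results wait for retirement while a prefetch goes out on the bus immediately and cannot be recalled.

```c
#include <assert.h>
#include <stdbool.h>

/* Two kinds of instruction are enough for the illustration. */
typedef enum { OP_ADD, OP_PREFETCH } ToyOp;

typedef struct {
    int committed_reg;      /* architectural (retired) register state */
    int prefetches_on_bus;  /* bus transactions issued; cannot be undone */
} ToyCore;

/* Execute `count` instructions down a predicted path. Register results
   are held in a speculative copy and retired only if the prediction
   turns out to be correct. Prefetches are issued right away. */
static void toy_run_predicted_path(ToyCore *core, const ToyOp *ops,
                                   int count, bool prediction_correct) {
    assert(core != 0 && ops != 0 && count >= 0);
    int pending_reg = core->committed_reg;
    for (int i = 0; i < count; ++i) {
        if (ops[i] == OP_ADD) {
            pending_reg += 1;              /* waits for retirement */
        } else {
            core->prefetches_on_bus += 1;  /* already out the door */
        }
    }
    if (prediction_correct) {
        core->committed_reg = pending_reg;
    }
    /* On a mispredict pending_reg is simply dropped, but the prefetch
       count stays incremented: the side effect survives. */
}
```

Run a path containing a prefetch with prediction_correct set to false and the register state is untouched, yet prefetches_on_bus has still gone up by one.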
And that was the problem – the branch predictor would sometimes cause xdcbt instructions to be speculatively executed and that was just as bad as really executing them. One of my coworkers (thanks Tracy!) suggested a clever test to verify this – replace every xdcbt in the game with a breakpoint. This achieved two things:
- The breakpoints were not hit, thus proving that the game was not executing xdcbt instructions.
- The crashes went away.
I knew that would be the result and yet it was still amazing. All these years later, and even after reading about Meltdown, it’s still nerdy cool to see solid proof that instructions that were not executed were causing crashes.
The branch predictor realization made it clear that this instruction was too dangerous to have anywhere in the code segment of any game – controlling when an instruction might be speculatively executed is too difficult. The branch predictor for indirect branches could, theoretically, predict any address, so there was no “safe place” to put an xdcbt instruction. And if speculatively executed it would happily do an extended prefetch of whatever memory the specified registers happened to randomly contain. It was possible to reduce the risk, but not eliminate it, and it just wasn’t worth it. While Xbox 360 architecture discussions continue to mention the instruction, I doubt that any games ever shipped with it.
I mentioned this once during a job interview – “describe the toughest bug you’ve had to investigate” – and the interviewer’s reaction was “yeah, we hit something similar on the Alpha processor”. The more things change…
Thanks to Michael for some editing.
How can a branch that is never taken be predicted to be taken? Easy. Branch predictors don’t maintain perfect history for every branch in the executable – that would be impractical. Instead, simple branch predictors typically squish together a bunch of address bits, maybe some branch history bits as well, and index into an array of two-bit entries. Thus, the branch prediction result is affected by other, unrelated branches, leading to sometimes spurious predictions. But it’s okay, because it’s “just a prediction” and it doesn’t need to be right.
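A minimal sketch of that kind of predictor, in C, shows the aliasing directly. The table size, the hashing (address bits only, no history bits), and all the names are assumptions chosen for illustration; I am not claiming this is the Xbox 360’s actual predictor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TABLE_BITS 10u                 /* assumed table size: 1024 entries */
#define TABLE_SIZE (1u << TABLE_BITS)

/* Each entry is a two-bit saturating counter: 0-1 predict not taken,
   2-3 predict taken. Hypothetical structure for illustration. */
typedef struct {
    uint8_t counters[TABLE_SIZE];
} ToyPredictor;

/* Index with a handful of address bits only. Instructions are 4 bytes,
   so the low two bits carry no information and are dropped. */
static unsigned toy_index(uint32_t branch_addr) {
    return (branch_addr >> 2) & (TABLE_SIZE - 1u);
}

static bool toy_predict(const ToyPredictor *p, uint32_t branch_addr) {
    return p->counters[toy_index(branch_addr)] >= 2;
}

/* Train the counter toward the actual outcome, saturating at 0 and 3. */
static void toy_update(ToyPredictor *p, uint32_t branch_addr, bool taken) {
    uint8_t *counter = &p->counters[toy_index(branch_addr)];
    assert(*counter <= 3);
    if (taken) {
        if (*counter < 3) ++*counter;
    } else {
        if (*counter > 0) --*counter;
    }
}
```

Two branches whose addresses differ only above the indexed bits share one counter. Train the entry with a hot, always-taken branch and a completely unrelated branch that has never been taken is now predicted taken, which is all it takes to send the pipeline speculatively toward code it will never really execute.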