nekrosoft13 01-19-08 04:29 PM

Inside Source Reveal the Truth About Xbox 360 "Red Ring of Death" Failures
Q: So what do you think the real failure rate of the Xbox 360 is? Some have estimated it as high as 30%. I got my Xbox in early 2007 and so far so good, but what do you think the chance is that it's going to die on me one day?

It's around 30%, and all will probably fail early. This quarter they are expecting 1 M failures, most of those Xenons. Some of those are repeat failures. Life expectancy is all over the map because the design has very little margin for most of the important parameters. That means it's not a fault tolerant design. So a good unit may last a couple of years, while a bad unit can fail in hours. I have a launch unit and have not had a single problem with it. And it's used a lot. But I don't know anyone else with a 360 that hasn't broken, except you now. There's no way to tell when yours might die. But the cooler you can keep it, the longer it will probably last. So stand it up, keep it in free air, etc.
Note: Xenon was the code name for the first Xbox 360 mother board.

Q: Of all five videogame systems on the market now (PS3, PSP, PS2, Wii and 360), only the Xbox 360 has had such major hardware failure problems. Microsoft is the only company based in the US making a videogame system. What part of Microsoft's way of doing things do you think caused this situation to happen?
First, MS has under resourced that product unit in all engineering areas since the very beginning. Especially in engineering support functions like test, quality, manufacturing, and supplier management. There just weren't enough people to do the job that needed to be done. The leadership in many of those areas was also lopsided in essential skills and experience. But I hear they are really trying to staff up now based on what has happened, and how cheap staff is compared to a couple of billion in cost of quality.

Second, MS was so focused on beating Sony this cycle that the 360 was rushed to market when all indications were that it had serious flaws. The design qual testing was insufficient and incomplete when the product was released to production. The manufacturing test equipment had major gaps in test coverage and wasn't reliable or repeatable. Manufacturing processes at all levels of suppliers were immature and not in control. Initial end to end yields were in the mid 30% range. Low yields always indicate serious design and manufacturing defects. Management chose to continue to ship anyway, and keep the lines running while trying to solve problems and bring the yields up. Whenever something failed and there was a question about whether the test result was false, they would remove that test, retest and ship, or see if the unit would boot a game and run briefly and then ship. The 360 is too complex a machine to get away with that.
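To put the "mid 30%" figure in perspective: an end to end yield is the product of the yields of every stage a unit has to pass, so a handful of mediocre stages compounds quickly. The sketch below is only an illustration. The stage values are hypothetical (they are not figures from the interview) and the stages are assumed to fail independently.

```python
from math import prod


def end_to_end_yield(stage_yields):
    """Fraction of units passing every stage, assuming independent stages."""
    for y in stage_yields:
        if not 0.0 <= y <= 1.0:
            raise ValueError("each stage yield must be between 0 and 1")
    return prod(stage_yields)


# Hypothetical per-stage yields, chosen only for illustration.
stages = [0.90, 0.80, 0.70, 0.70]
print(f"end-to-end yield: {end_to_end_yield(stages):.1%}")
```

With these made-up numbers no single stage looks catastrophic, yet only about a third of the units get through all four.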

In the end I think it was fear of failure, ambition to beat Sony, and the arrogance that they could figure anything out, that led to the decision to keep shipping. That management team had made some pretty bad decisions in the past and had never had to pay a proportional consequence. I'm sure they thought that somehow they would figure it out and everything would end up ok. Plus, they tend to make big decisions like that in terms of dollars. They would rationalize that if the first few million boxes had a high failure rate, a few tens of millions of dollars would cover it. And weighing that cost against a big lead on Sony, they would pay it in a heartbeat. They weren't even thinking about Nintendo.

Compare that to Sony, who delayed their launch, even though they were behind, when their box wasn't ready.

Q: In your opinion, what do you think the main cause of the Red Ring of Death failures has been?
RROD is caused by anything that fails in the "digital backbone" on the mother board. Also known as a core digital error. CPU, GPU, memory, etc. Bad parts, incompatible parts (timing problems) bad manufacturing process (like solder joints), misapplied heat sinks or thermal interface material, missing parts, broken parts, parts of the wrong value, missed test coverage. Any one or more, on any chip, or many other discrete components, would cause this. And many of the failures were obviously infant mortality, where they work when they leave the factory and fail early in use. The main design flaw was the excessive heat on the GPU warping the mother board around it. This would stress the solder joints on the GPU and any bad joints would then fail in early life.

There are also other significantly high failure rates in other areas, like the DVD.

Q: Can some games cause hardware failure more than others? Gears of War and Dead Rising were thought to be system killers when they came out.
Of course. Infant mortality cases, which involve a weakened mechanical "thing" like a solder joint with a void in it, are exercised to failure by cyclic stress. The number of cycles and the amplitude of temperature change from low to high determine how quickly it will fail. Certain games will consume more bandwidth on the GPU, which has the most substandard thermal solution on the mother board, making it a lot hotter, warping the mobo and flexing the solder joints. Weak joints fail quickly. The better the game, the more often it will be played, again accelerating failures.
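The dependence on temperature swing described above is often approximated with a Coffin-Manson style power law, where cycles to failure scale as the temperature swing raised to a negative exponent. The sketch below is a rough illustration only. The exponent of 2.0 and the temperature swings are assumptions picked for the example, not values from the interview or measurements of any console.

```python
def relative_cycles_to_failure(delta_t_c, ref_delta_t_c, exponent=2.0):
    """Cycles to failure relative to a reference temperature swing.

    Uses a Coffin-Manson style power law: N is proportional to
    delta_T ** -exponent. The default exponent is an assumed value.
    """
    if delta_t_c <= 0 or ref_delta_t_c <= 0:
        raise ValueError("temperature swings must be positive")
    return (ref_delta_t_c / delta_t_c) ** exponent


# Hypothetical swings: a light game cycles the GPU by 30 C, a heavy one by 60 C.
ratio = relative_cycles_to_failure(delta_t_c=60.0, ref_delta_t_c=30.0)
print(f"heavy game survives {ratio:.2f}x as many cycles as the light game")
```

Under these assumptions, doubling the swing cuts the number of survivable cycles to a quarter, which fits the point that hotter-running games wear out weak joints sooner.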

Q: Let's go over some of the rumored reasons RROD. Could you tell how close each theory is?
Over heating CPU/GPU due to the lead free solder?
They don't overheat due to PB Free. They over heat due to too much power dissipated in too small of an area, w/o a sufficient thermal management design to take the heat away from the junction of the transistors on the chips, the packages themselves, and the mobo. And the over heating is on the GPU. When the CPU heatsink is applied right, it does not over heat.

Defective parts due to overseas subcontractors?
Some defective parts, like BGAs where the solder balls are not of sufficient and uniform size, so they don't solder down evenly, or the substrate is warped, causing some joints to have insufficient solder. Bad chips from marginal or under tested wafers. Others are deficient processes, like misaligning the solder paste to the circuit board, or same on the parts, or not having the thermal profile right in the reflow oven during soldering. Manufacturers new to PB free tend to err on the low temp side thinking they are saving the parts reliability wise from a large thermal load. What they are really doing is not reflowing the PB free solder enough to make a good joint. PB free solder is non eutectic, which means the different metals in the solder alloy melt at different temperatures, unlike leaded solder where everything melts at the same temperature. If you under heat it, it won't bond well to the board or parts, won't form a good joint, leaving voids and other defects in the joints that lead to early failure under normal circumstances. But when you add the extraordinary heat and mother board warpage that goes with it, well you get a catastrophic failure rate like we've all seen on 360.

Defective or insufficient heat sinks?
A heat sink like the one they eventually put on the GPU would have helped a lot, since it stops the GPU heat from warping the mobo and breaking the solder joints. The CPU heatsink was fine. I've heard the memory was running hot too, and contributing to these failures. Not sure if they were heated by contact with the GPU heatsink, proximity on the mother board, or both. But with the new GPU heatsink the failure rate probably would have still been double digits overall. Way too high still.

Corrupt BIOS or OS bricking the system?
Maybe. But haven't heard of this outside of the periodic dash updates bricking boxes.

Is humidity a factor? Is an Xbox 360 in Florida just as likely to fail as a 360 in Seattle?
Humidity is a co-factor with temperature for many failure modes. The hotter the room ambient conditions, the more likely a 360 is to fail, all else being equal. Same for humidity.

Is keeping the 360 horizontal safer than keeping it vertical?
I don't think so. Vertical exposes more surface area and volume to heat exchange with cooler room air. And I think opens more vent holes. Just don't let it fall over.

System wide design problems due to a production schedule that shipped a full year before the competition's systems?
Yes. It just wasn't mature enough. Too many design defects, lack of design margins, immature test processes and equipment, insufficient PB free manufacturing expertise at partner manufacturers who made the mother board.

Or is there no one specific problem but a bunch of possible problems for each console?
Yes. See above.

Q: How have IBM and ATI dealt with the Xbox 360 problems?
Sorry, I don't know. But they were contracted to design and help launch the chips. After that, MS owned the design and tooling. So they didn't have to worry about it. Although I'm sure they were pulled in.

Q: Just what is up with the RROD "Towel Trick" fix?
My best guess is that it somewhat reflows the solder joints on the GPU while it's under a high compressive load from the heatsink clip, causing any open solder joints to make contact again. I don't think it's going to fully reflow them because PB free solder melts above 300 degrees C, and if that happened the GPU would be pulled flat to the mother board with a big puddle of solder under it shorting everything out.

Q: One of the problems that I have run into with my 360 is that the disk tray will fail to eject and not let me swap disks. Have any ideas?

LOL. Reboot and try it again! Sorry, couldn't help myself. You didn't give me enough info. How often does it happen? Notice any conditions that tend to make it happen more repeatably (after long play, unit standing up, right after a previous eject, etc.)? Can you recover and get the tray open at some other time after it fails? What did you have to do? It might be as simple as a bad connection somewhere in the circuit for the eject button. Usually I'd recommend percussive maintenance (hit it) but that would probably damage the disc and could damage the console. So don't. Maybe the disc is jammed in there. Does the tray try to come out and then stop? Maybe there is a misalignment with the box case. See if you can find a place where it might be catching. If you can't find the problem, bring it with you when we meet and I'll look at it.

Q: What do you think of Karla Starr's Seattle Weekly article about video game hardware testing?
I read that when it came out. It's pretty accurate. I've been to VMC a few times where that testing is done. It's kinda brute force last stage game qual testing, after a lot of other testing has been done at the developer and MS. Funny, but you can only automate so much. And then you need to have people touch it and use it to find the unlikely bugs.

Q: How much more reliable is the current generation of Xbox 360 than the previous designs? Original Xenon, Zephyr and Falcon.
I've heard that the failure rate for the current design is sub 10%. Much much better, but still too high IMHO. And those designs haven't seen much life yet, so no one knows if that failure rate will hold.

Q: Do you think that the "Falcon" Xbox 360 design is the final Xbox 360 hardware iteration or will they come out with a redesigned Xbox 360?
They will come out with new hardware at least once a year until they retire this design. That's the console financial model. Keep the features and functionality the same, reduce cost and price, and improve quality if needed. The 360 roadmap always called for SI die shrink and integration, since that's where most of the cost is. Right now they are working to get the GPU and CPU on the same BGA package for the next mobo. Could lower cost, heat, number of heat sinks, mother board size (maybe squeeze the PS inside too), etc. Too bad that they screwed up and forgot to retain the JTAG IEEE 488 test functionality, at least what little they had. Now it will be almost impossible for them to tell if that chip is bad if the unit won't boot in the factory. So they will have to trouble shoot by replacing the most expensive part in the system blindly. They keep repeating bad decisions, and everyone is afraid to push issues considered to be bad news.

Q: Do you think that third party fans like the Nyko Intercooler will make things worse? Are they snake oil? I personally have plastic Tiki figures around my Xbox to ward off any evil spirits and so far they have done better in protecting than some of the fan coolers that you see at Gamestop.
I don't know, I'd have to test them. But I'll give you some thoughts. In order for those fans to do any good, they would have to increase the volume of air coming through the box w/o adding heat. I think those things are powered through the USB hub, which is specced at 5 volts, 1/2 an amp. So very little heat added. But the piggybacked fan would have to run at a higher volume than the box fan in order to unload it and make it spin faster, pulling more air over the heatsinks. Would be an easy test to run. Just tape a dry cleaning bag to the back with and w/o the extra fan and time how long it takes to fill. Or if you have access to one, an anemometer is a test instrument that measures airflow and would give a more accurate reading.
Note: the Nyko Intercooler draws power from the 360's power source, which looks like a surefire way to potentially make things worse.
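For anyone who wants to try the bag test from the answer above, the arithmetic can be sketched as follows. This is a minimal illustration. The bag volume and fill times are hypothetical placeholders, and the 5 V, 0.5 A figures are the USB numbers quoted in the answer.

```python
def usb_power_watts(volts=5.0, amps=0.5):
    """Upper bound on power, and so heat, a USB-powered fan could add."""
    return volts * amps


def airflow_liters_per_minute(bag_volume_liters, fill_seconds):
    """Average airflow implied by how long the bag takes to fill."""
    if fill_seconds <= 0:
        raise ValueError("fill time must be positive")
    return bag_volume_liters / fill_seconds * 60.0


def airflow_gain(fill_seconds_without, fill_seconds_with):
    """Ratio of airflow with the add-on fan to airflow without it."""
    return fill_seconds_without / fill_seconds_with


# Hypothetical measurements: a 100 liter bag fills in 20 s stock, 16 s with the fan.
print(usb_power_watts(), "W of added power at most")
print(airflow_liters_per_minute(100.0, 20.0), "L/min stock")
print(airflow_gain(20.0, 16.0), "x airflow with the add-on fan")
```

A gain close to 1.0 would suggest the add-on fan is not moving meaningfully more air than the stock fan does on its own.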

Q: How many times does an Xbox 360 unit have to be sent in and repaired before they will replace it with a completely new unit?
That's not how it works. You send in a broken box, you get back a working box (hopefully). So there is a rotating stock of the original units that get repaired and returned to service. Plus, they keep finding these caches of launch units here and there and using them too. Didn't you hear during the holidays that bundles were found with units made in 06? Those were pulled back from the retail channel last spring when the new heatsink was done, and had the new heatsink placed on them and then put into the shipping flow like any other box.

Back to the rotating inventory of launch units. You risk getting one of those back until the last one is out of the system. I imagine the next big outrage will be when some of the folks who waited till Falcon to buy a console for reliability reasons, and have to send it in for service, get a Xenon back! Even when all of the Xenons are gone, you will likely get a newer gen repaired one back rather than new. Unless the fail rate gets so low there are none available. I'm holding my breath...


ENU291 01-19-08 06:11 PM

Re: Inside Source Reveal the Truth About Xbox 360 "Red Ring of Death" Failures
That article is pretty much what was being leaked all along. The Xbox 360 was rushed to market. Bad management decisions plus the massive amounts of heat that the internal components produce created the conditions that caused the 30% failure rate (IMO I think 30% is on the low end. 40 to 50% seems more realistic).

