October 9th, 2008
WhereWolf | Senior Member | Location: Oregon

Strange Errors, Chasing Ghosts

I thought I would share this, because someone else might run into the same situation. You just never know.

(This is a 61C, R4.5, with properly grounded AC power, plus power run through a properly installed UPS, all per the NTP bible.)

Sunday morning, 00:03 (right after switchover): started taking XMI001 and XMI002 errors out the wazoo, all pointing to shelf 12 1. CPU 1 was active, and cards were bouncing like a superball. The switch monitor was texting my phone with the XMI errors every few seconds, and I had a couple of hundred texts within a short period. Dragged my sleepy arse out of bed and logi'd in. Errors were limited to 12 1.

Since I was remote, I couldn't do anything physically, but I downed the controller for the shelf and re-enabled it. Still taking errors. Because of problems I've experienced in the past right after switchover, I decided to swap CPUs: switched back to CPU 0 and the problem cleared. Voilà! After 30 more minutes of running (apparently) clean, back to bed. Zzzzzzz.

Dreaming of CNI / Per Sig / Xnet errors got me back up about 5 am, and off to work I go. Logi'd in at work and dinked around trying to re-create the problem. No such luck. I could swap CPUs back and forth, and down the controller and re-enable it while running on either CPU, and there was no problem to be found. Hmmm.....

All's well for two days, then the gremlins rampaged through the switch once again. Just after midnight, I performed the same tricks as above, which solved it for about 2 hours. Then it started all over again, and again 30 minutes later. My poor phone was inundated with text messages, all XMI001 and XMI002. Bouncie, bouncie, bouncie...

This night it was happening on either CPU, still limited to shelf 12 1. Threw some clothes on and headed to the campus. Decided to work methodically from the shelf back towards the CPUs: downed the controller and swapped it with a spare.
Cured the problem immediately. Wahoo!!! Oh, wait... there it went again. CRAP! Swapped CPUs... still happening. CRAP! CRAP!

By this time it's working hours, and the trouble tickets are a-flowin'. I start to have the help desk send out an email saying that I need to change a card, and that it will affect many users, since the XNET change will take out shelves 12 1 and 12 0, both packed shelves with a mix of 60% DLC and 40% analog.

But wait! I see one analog card that hasn't come up since the controller change. Hmmmm... that's odd. Disable/re-enable... light still won't go out... card isn't responding. AAAHHHaaa! Jack the card out and notice something very strange (see attached). Put a spare analog card in place... WOW. I love it when it works the way it's supposed to.

Let me just say that one failed card can cause all sorts of symptoms. Always attempt to work methodically from one end of the communications path to the other. You just never know what one little issue can affect. I don't know how in the world this could have happened; there has been no lightning in this area for a very long time. The only other time I've seen a card smoked like this was when another tech re-inserted a 201i and bumped the adjacent analog card with the heatsink on the 201i. That was a crispy critter. That was not the situation in this case.

Thanks for reading. I hope this helps someone else in the future and saves a lot of headache.

Wherewolf

Last edited by WhereWolf; September 25th, 2009 at 08:01 AM.