Author Topic: Diagnosing hangs on busy network - advice?  (Read 16452 times)

jgilbert

  • Newbie
  • *
  • Posts: 18
    • View Profile
Diagnosing hangs on busy network - advice?
« on: November 09, 2014, 10:53:40 PM »
Like many of you, I'm on a path to develop a wireless ecosystem built around Felix's moteinos to measure everything in my house. (In fact, I've even designed a few boards that fit the Moteino footprint to make it that much easier to add sensors.) I've recently stumbled into a difficult reliability issue and I'm wondering if anyone has any advice. My gateway moteino R3 (that echos packets via a serial port to the raspberry pi) is periodically hanging, requiring a hard reset. My network is busy -- I have about 5 nodes operating, sending packets every 5-10 seconds. This glitch is hard to reproduce, sometimes it occurs right away, other times it takes hours or even a day to emerge.

After attaching some additional LEDs and instrumenting my code, it appears that the hang is occurring in my call to radio.sendAck(). (E.g. the LED that goes on only before this function is called is stuck "high" when the gateway freezes.)

I spent the afternoon tracing through the RFM69 code (amazing work Felix -- not a simple chip to talk to) and I'm puzzled by the following. During ::send(), there is some obvious protection for deadlocks and for timeouts. I frankly don't understand the first line at all (must be something deep in the internals of the RFM chip at work) but the second part is straightforward -- the radio waits until the channel is "clear" or a certain amount of time has passed.

Code: [Select]
void RFM69::send(byte toAddress, const void* buffer, byte bufferSize, bool requestACK)
{
  writeReg(REG_PACKETCONFIG2, (readReg(REG_PACKETCONFIG2) & 0xFB) | RF_PACKET2_RXRESTART); // avoid RX deadlocks
  long now = millis();
  while (!canSend() && millis()-now < RF69_CSMA_LIMIT_MS) receiveDone();
  sendFrame(toAddress, buffer, bufferSize, requestACK, false);
}

This same protection does not seem to take place for ACKs:

Code: [Select]
/// Should be called immediately after reception in case sender wants ACK
void RFM69::sendACK(const void* buffer, byte bufferSize) {
  byte sender = SENDERID;
  while (!canSend()) receiveDone();
  sendFrame(sender, buffer, bufferSize, false, true);
}

Theoretically, while(!canSend()) could spin forever if the radio is not already in RX mode, the CSMA limit is not reached, or something else goes wrong in receiveDone(). However, receiveDone() seems guaranteed to eventually put the radio in RX mode, and sooner or later the channel will clear. During the hang, the other nodes (even those very close to the gateway) continue to transmit and report low RSSIs suggesting that the gateway's while loop should also eventually realize the channel is clear. I've also verified this by adding code to nodes to passively wake up and report on the background RSSI they receive (thanks to an earlier post for how to do this.)

Has anyone hit this glitch before or am I in brand new territory? Should I patch sendAck to look like send()?

Another (possibly) related issue: I've observed some isolated cases of a node hanging while holding the channel open (my debugging nodes report a readRSSI() of -22-30, well above the CSSA limit) -- the problem fixes itself after the one particular node is reset. Again, reading the code very carefully I can't figure out any situation where the transmitter would stick in the "on" position unless something was tampering with the interrupt pin which my designs leave untouched. Through battery instrumentation I've mostly ruled out brownouts as a potential factor. Still, it might be safer to have some form of guard on the amount of time that the radio is allowed to stay in RF69_MODE_TX.

Thanks for all help and tips.
« Last Edit: November 14, 2014, 04:48:07 PM by Felix »

TomWS

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1688
    • View Profile
Re: Diagnosing hangs on busy network - advice?
« Reply #1 on: November 09, 2014, 11:46:53 PM »
One problem I did run into with sendAck is that the SENDERID is not restored if receiveDone is called.  Most times this isn't an issue, but, in the right circumstances, ie, an attempt to send an Ack before a previous ack had completed, then the SENDERID value is toasted...

I found that restoring the SENDERID from 'sender' after the return from sendFrame, fixed my issue. 

I am using Moteino R4 with RFM69HW, but I don't think that is a distinguishing factor in this case.

Tom

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 5160
  • Country: us
    • View Profile
    • LowPowerLab
Re: Diagnosing hangs on busy network - advice?
« Reply #2 on: November 10, 2014, 08:56:04 AM »
Hey jgilbert,
I think you have a valid point and something I may have missed. ACKs happen far less often than regular packets. I did notice that kind of glitch myself and I eliminated it by adding the time limit seen in the send() function. I will probably add this in the sendACK() as well, it makes a lot of sense, but I need to test this myself. In the meantime you can add the same code and try it out, it won't hurt, and my gut tells me it will solve your issue.

And Tom - the SENDERID is just the ID of the sender of the last packet. You should not try to send an ACK, then in the middle of that send something else. In fact the pipeline is a FIFO and unless I'm missing something, it makes no sense to interrupt a transmission for another transmission because there's no way to resume the previous. It's like cutting a word in half, saying another word, then completing the first word. If i completely misunderstood then my apologies ... let me know.

ieris

  • Newbie
  • *
  • Posts: 38
  • Country: lv
    • View Profile
Re: Diagnosing hangs on busy network - advice?
« Reply #3 on: November 10, 2014, 09:14:22 AM »
I noticed similar things. Having network of several nodes and one node which sends info in burst packets (every minute sends 8 messages with 1 second delay).
Till now two times (one was today) :) all home network was hanged, even nodes which are communicating each other (not to gateway) was halted till time I restarted gateway node (MoteinoMega).

TomWS

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1688
    • View Profile
Re: Diagnosing hangs on busy network - advice?
« Reply #4 on: November 10, 2014, 10:27:44 AM »
<snip..>
And Tom - the SENDERID is just the ID of the sender of the last packet. You should not try to send an ACK, then in the middle of that send something else. In fact the pipeline is a FIFO and unless I'm missing something, it makes no sense to interrupt a transmission for another transmission because there's no way to resume the previous. It's like cutting a word in half, saying another word, then completing the first word. If i completely misunderstood then my apologies ... let me know.
The problem, IIRC, was that I was just about to send an Ack to a received packet when a new packet started coming in causing canSend to temporarily fail and then receiveDone was called, and, on return from all that, SENDERID had been set to 0 and my subsequent code, which had dutifully sent the ack, no longer had the original sender's address.  As I said, the problem was solved (or at least, masked  ;) when I added the line:
Code: [Select]
  
  while (!canSend()) receiveDone();
  SENDERID = sender; // TWS: Restore SenderID after it gets wiped out by receiveDone()
  sendFrame(sender, buffer, bufferSize, false, true, sendRSSI, lastRSSI);
in sendAck()

Tom

jgilbert

  • Newbie
  • *
  • Posts: 18
    • View Profile
Re: Diagnosing hangs on busy network - advice?
« Reply #5 on: November 10, 2014, 01:19:48 PM »
Thanks for the replies. I tried a version last night with the patch in place and I did get a hang overnight which resolved itself several hours later. Not a clean test because I haven't ruled out that something else went wrong. Will try again tonight. In the meantime, I'm going to stick a LED debugging statement ahead of all of the while() loops in the RFM code. (I would go with Serial but I fear I'd be tampering with the timing of the code too much.)

@TomWS, the way I read the code right now the original sender is saved off into a locally scoped variable so I don't think there is a race condition. However, it does seem possible that between the point in time you get a packet and send an ACK, another packet may have come in which you could potentially miss because the radio goes into a standby mode as soon as a packet is received. I'd have to defer to Felix if that packet is actually picked up again on the next call to receiveDone().

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 5160
  • Country: us
    • View Profile
    • LowPowerLab
Re: Diagnosing hangs on busy network - advice?
« Reply #6 on: November 10, 2014, 03:57:31 PM »
By the way, while we figure this out, a way out of it would be to set up a watchdog reset and call it periodically in the main loop. If there is a hang that will cause a watchdog reset.

ieris

  • Newbie
  • *
  • Posts: 38
  • Country: lv
    • View Profile
Re: Diagnosing hangs on busy network - advice?
« Reply #7 on: November 11, 2014, 01:53:35 AM »
Did a modification in SendACK, and had a hangout for all network this night. Was thinking about watchdog, but not so urgent system at the moment, will try to diagnose.
Before I was running Moteino R4 as gateway for ~2 months I didn't noticed such things. It has attached to another hardware via serial line.
Now running Mega both serials has attached hardware and it just forward info what was received. (so, gateway code is a little different from previous). Added also memory monitoring - it is stable all time (no overflows). Will try to find out what causes my network hanging.

Started trace:
When this 'hanged network' happens, starting new nodes or reseting others in this network (not a hanged gateway node) nothing changes.

While gateway was hanged, started trace with test node adding some trace lines in RFM69 library and noticed that receiveDone() returns false:
Code: [Select]
bool RFM69::receiveDone() {
...
  else if (_mode == RF69_MODE_RX)  //already in RX no payload yet
  {
    interrupts(); //explicitly re-enable interrupts
    return false;

Added some more lines, got:
Code: [Select]
exec receiveDone
receiveBegin()...
exec receiveDone
Already RX - 0,0 (payloadlen=0, senderid=0)

After reseting hanged gateway on test node got:
Code: [Select]
exec receiveDone
STANDBY
[31] TEMP;2225;3362  [RX_RSSI:-48]  <- printing radio data & rssi
exec receiveDone
Starting receive...
exec receiveDone
Already RX - 0,0
 - ACK sent.
exec receiveDone
Already RX - 0,0
exec receiveDone
Already RX - 0,0
...

And after some time of tests, I noticed that node hangs on noInterrupts(); in receiveDone() beginning;
« Last Edit: November 11, 2014, 09:00:19 AM by ieris »

ieris

  • Newbie
  • *
  • Posts: 38
  • Country: lv
    • View Profile
Re: Diagnosing hangs on busy network - advice?
« Reply #8 on: November 11, 2014, 02:28:55 AM »
The problem, IIRC, was that I was just about to send an Ack to a received packet when a new packet started coming in causing canSend to temporarily fail and then receiveDone was called, and, on return from all that, SENDERID had been set to 0 and my subsequent code, which had dutifully sent the ack, no longer had the original sender's address.  As I said, the problem was solved (or at least, masked  ;) when I added the line:
Code: [Select]
  
  while (!canSend()) receiveDone();
  SENDERID = sender; // TWS: Restore SenderID after it gets wiped out by receiveDone()
  sendFrame(sender, buffer, bufferSize, false, true, sendRSSI, lastRSSI);
in sendAck()

Tom
As I see from code and your comment, SENDERID can be changed/reseted executing receiveDone, but SendACK should proceed without fault because it uses incoming parameter sender in sendFrame procedure.
But your fix will avoid problem if SENDERID must be used somewhere else in code after SendACK.

I noticed strange things with SENDERID before when I had 2 nodes with the same ID, and one of them sending payload requests ACK.

ieris

  • Newbie
  • *
  • Posts: 38
  • Country: lv
    • View Profile
Re: Diagnosing hangs on busy network - advice?
« Reply #9 on: November 11, 2014, 09:35:33 AM »
At the moment I don't understood when and where interrupts are restored?

Code: [Select]
bool RFM69::receiveDone() {
// ATOMIC_BLOCK(ATOMIC_FORCEON)
// {
  noInterrupts(); //re-enabled in unselect() via setMode() or via receiveBegin()
Comment says that it is enabled in unselect(), but setMode() do not have it, and receiveBegin just have a setMode() call.

Currently I see that only SendACK->SendFrame->unselect() restores it. What I missed?

jgilbert

  • Newbie
  • *
  • Posts: 18
    • View Profile
Re: Diagnosing hangs on busy network - advice?
« Reply #10 on: November 11, 2014, 02:48:42 PM »
Just a quick update to Felix and Ieris and others watching this thread.

I instrumented the ___ out of the library, and discovered hangs had occured in two places in the last 12 hours. For context, I have already implemented the following change to sendACK so that is presumably no longer a source of crashes and these two new crashes occur for other reasons.

Code: [Select]
void RFM69::sendACK(const void* buffer, byte bufferSize) {
  byte sender = SENDERID;
  writeReg(REG_PACKETCONFIG2, (readReg(REG_PACKETCONFIG2) & 0xFB) | RF_PACKET2_RXRESTART); // avoid RX deadlocks
  long now = millis();
  DEBUGWRITE( 3 );
  while (!canSend() && millis()-now < RF69_CSMA_LIMIT_MS) receiveDone();
  if( millis()-now >= RF69_CSMA_LIMIT_MS )
      ERROR_CONDITION &= ERR_CSMA_LIMIT_REACHED;
  sendFrame(sender, buffer, bufferSize, false, true);

}

New issue #1: The gateway had received packets that were larger than RF69_MAX_DATA_LEN. How? I have no idea. One of them originated from node 45, which is not a node I've ever assigned on my network. The contents was a stream of "---" characters. I can't imagine how this happened given that I am using an encryption key, etc. DATALEN with a value of 254 presumably caused memory locations to be overwritten in both the RFM library and the calling code. I was able to identify this by chance, since I had changed the logging code to report on the packet length in an attempt to reduce serial bandwidth.

I've put in the following quick patch a second ago to the interrupt handler. I also added a static variable called "ERROR_CONDITION" so I can log this situation out in a cleaner way.

Code: [Select]
    for (byte i= 0; i < DATALEN; i++)
    {
      byte inByte = SPI.transfer(0);
      if( i < RF69_MAX_DATA_LEN )
        DATA[i] = inByte;
    }

    if( DATALEN > RF69_MAX_DATA_LEN )
      ERROR_CONDITION &= ERR_DATA_LEN_EXCEEDED;

The second crash occured on one of the transmitting low power nodes (runs at 6uA, thank you Felix!). In this case, the glitch occured here in the last part of the send() code:

Code: [Select]
	/* no need to wait for transmit mode to be ready since its handled by the radio */
setMode(RF69_MODE_TX);
  DEBUGWRITE( 8 );
while (digitalRead(_interruptPin) == 0 ); //wait for DIO0 to turn HIGH signalling transmission finish
  //while (readReg(REG_IRQFLAGS2) & RF_IRQFLAGS2_PACKETSENT == 0x00); // Wait for ModeReady
  DEBUGWRITE( 9 );
  setMode(RF69_MODE_STANDBY);

I have now patched it to read:

Code: [Select]
  
    unsigned long txStart = millis();
/* no need to wait for transmit mode to be ready since its handled by the radio */
setMode(RF69_MODE_TX);
  DEBUGWRITE( 8 );
while (digitalRead(_interruptPin) == 0 && millis()-txStart < RF69_TX_LIMIT_GUARD_MS); //wait for DIO0 to turn HIGH signalling transmission finish
  //while (readReg(REG_IRQFLAGS2) & RF_IRQFLAGS2_PACKETSENT == 0x00); // Wait for ModeReady
  DEBUGWRITE( 9 );
  if( millis() - txStart >= RF69_TX_LIMIT_GUARD_MS )
      ERROR_CONDITION &= ERR_TX_GUARD_REACHED;
  setMode(RF69_MODE_STANDBY);





ieris

  • Newbie
  • *
  • Posts: 38
  • Country: lv
    • View Profile
Re: Diagnosing hangs on busy network - advice?
« Reply #11 on: November 11, 2014, 06:24:06 PM »
Added logging in gateway node and it hanged.....   on SendACK :)
jgilbert you are some steps further than me!

And very strange that DATALEN can be more than allowed, because some lines before DATALEN is calculated there is precaution:
Code: [Select]
PAYLOADLEN = PAYLOADLEN > 66 ? 66 : PAYLOADLEN; //precaution

jgilbert

  • Newbie
  • *
  • Posts: 18
    • View Profile
Re: Diagnosing hangs on busy network - advice?
« Reply #12 on: November 11, 2014, 08:50:35 PM »
@ieris - you're right, there does appear to be a check like that on PAYLOADLEN. I think happened is that the PAYLOADLEN in this circumstance was a low value -- like 2, and then when 3 was subtracted to calculate the DATALEN it wrapped around to 254.  I imagine GCC is assuming a byte is behaving like a unsigned 8 bit int.

Perhaps a simpler fix would simply be to abort the receive if the payloadlen < 3.



ieris

  • Newbie
  • *
  • Posts: 38
  • Country: lv
    • View Profile
Re: Diagnosing hangs on busy network - advice?
« Reply #13 on: November 12, 2014, 01:18:32 AM »
@jgilbert - very good point! +1

Added your provided solution in sendACK - working w/o hangup already 9h  8)
EDIT: After 15h again hanged... most probably should add payloadlen<3 fix
« Last Edit: November 12, 2014, 04:19:32 PM by ieris »

jgilbert

  • Newbie
  • *
  • Posts: 18
    • View Profile
Re: Diagnosing hangs on busy network - advice?
« Reply #14 on: November 12, 2014, 04:50:11 PM »
I've been running almost 24 hours now with 3 fixes in place:
1) The payload length check
2) sendACK
3) transmit modetimeout

In that period of time, I've had two crashes and both are very unusual. In both cases, all of the nodes stayed up (e.g. no crashes) but all traffic halted. The measured RSSI on the network in one case was -29DB flat, and in the other case -59DB flat as measured by the gateway. In both cases, resetting the gateway fixed the problem.

@Felix and others, perhaps you could weigh in here since I'm a bit of a NF neophyte. My basic mental model of what might be happening: The transmitter somehow powered up, emitting the carrier signal, sent the packet, and then for unknown reasons never shut down. From everything I can determine, the culprit in both instances was the gateway itself.

I've scoured the data sheet but I don't see any mention of a requirement to explicitly power down the radio, and the RFM code swiches to MODE_STANDBY in any case as soon as the interrupt pulse arrives indicating the packet has been sent. The only funny quirk I can see -- setMode() never actually wait for the mode change to occur.

I added some code this morning to dump the radio's registers every 30 seconds. Maybe this will show a pattern? Honestly i'm just fishing at this point with no compelling theory of what is missing in the code.

Any other ideas?