Topic: Diagnosing hangs on busy network - advice?

ieris

  • Newbie
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #15 on: November 13, 2014, 03:59:09 AM »
At the moment I don't understand when and where interrupts are restored.

Code: [Select]
bool RFM69::receiveDone() {
// ATOMIC_BLOCK(ATOMIC_FORCEON)
// {
  noInterrupts(); //re-enabled in unselect() via setMode() or via receiveBegin()
The comment says interrupts are re-enabled in unselect(), but setMode() does not have it, and receiveBegin() just has a setMode() call.

Currently I see that only sendACK()->sendFrame()->unselect() restores them. What did I miss?
Found it: interrupts are restored after the data is read. receiveBegin() cleans everything up, and on the 'first visit' receiveDone() restores interrupts.

ieris

Re: Diagnosing hangs on busy network - advice?
« Reply #16 on: November 13, 2014, 07:01:56 AM »
Examined the code. Some of our things, like the RX deadlocks, were in place. For now I have rewritten the sendACK logic. This evening I will flash the new version and start testing. Fingers crossed  ;)

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 5934
  • Country: us
    • LowPowerLab
Re: Diagnosing hangs on busy network - advice?
« Reply #17 on: November 13, 2014, 09:13:35 AM »
I am keeping an eye on this thread, don't think I'm ignoring it. I appreciate your persistence, and if you find a solution I am eager to know. Thank you.

ieris

Re: Diagnosing hangs on busy network - advice?
« Reply #18 on: November 13, 2014, 09:32:13 AM »
Yep, of course I will inform you. I have now flashed a new version and am debugging/improving it. I will write later in any case about what I changed and why, and whether it helped or not.
EDIT: Felix, do you have any numbers, as a percentage for example, on how often sending back the ACK fails and the node has to resend the info? Is it very close to 100%?
« Last Edit: November 13, 2014, 09:50:49 AM by ieris »

TomWS

  • Hero Member
  • *****
  • Posts: 1888
Re: Diagnosing hangs on busy network - advice?
« Reply #19 on: November 13, 2014, 10:17:19 AM »
<...snip>
As I see from the code and your comment, SENDERID can be changed/reset while executing receiveDone(), but sendACK() should proceed without fault because it uses the incoming sender parameter in the sendFrame procedure.
But your fix will avoid the problem if SENDERID must be used somewhere else in the code after sendACK().

I noticed strange things with SENDERID before, when I had 2 nodes with the same ID and one of them was sending a payload that requests an ACK.
For what it's worth, my failures with sendAck occurred when I called CheckForWirelessHex() immediately after getting a positive response from receiveDone().  Note that CheckForWirelessHex will immediately send an Ack with data ("FLX?") if it sees a wireless programming poll.  Given some of your observations, with the immediate Ack, I suspect that the RSSI value didn't have a chance to disappear and canSend() subsequently returned false causing the call to receiveDone(), which wiped out SENDERID, which CheckForWirelessHex needed for subsequent operations...  Again, this may be unrelated to your issues, but provided as an FYI...

Tom

ieris

Re: Diagnosing hangs on busy network - advice?
« Reply #20 on: November 13, 2014, 10:54:16 AM »
<...snip>
Thank you for the info. For now it seems related to our problem as well. If the idea I am trying to implement is right, this should be fixed too.

Felix

Re: Diagnosing hangs on busy network - advice?
« Reply #21 on: November 13, 2014, 12:27:04 PM »
If I understand your Q right, the resending should not fail under normal circumstances. I coded the sendWithRetry() function so that it will retry sending and wait for an ACK up to 3 times by default (3 total sends). How often does a first send/ack fail? How often does a second fail? A third? I am really not sure. But from what I can observe, 3 is way more than enough to ensure a packet goes through.
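The retry behaviour described here can be sketched on a desktop compiler. This is a simplified model, not the library's implementation: `sendWithRetrySketch` and the `trySendAndWaitForAck` callback are hypothetical names, and the callback stands in for "transmit once, then wait for an ACK until the timeout expires". The default of 2 retries (3 total sends) is taken from the declaration quoted later in this thread.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical host-side model of a send-with-retry loop.
// Returns the number of sends that were needed (1..retries+1),
// or 0 if no attempt was acknowledged.
int sendWithRetrySketch(const std::function<bool(int attempt)>& trySendAndWaitForAck,
                        uint8_t retries = 2) {
    for (int attempt = 0; attempt <= retries; ++attempt) {
        if (trySendAndWaitForAck(attempt)) {
            return attempt + 1;  // ACK arrived for this attempt
        }
    }
    return 0;  // gave up after retries + 1 total sends
}
```

Logging the return value per packet on a real node would give the first-send, second-send and third-send failure rates asked about above.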

jgilbert

  • Newbie
  • *
  • Posts: 18
Re: Diagnosing hangs on busy network - advice?
« Reply #22 on: November 13, 2014, 08:59:27 PM »
Guys, an interesting update on my end. I am still running with the 3 adjustments that I mentioned before. This time the freeze doesn't match the earlier pattern. The gateway indicated a very high RSSI (-5 dBm), but all other nodes were measuring low (-90 dBm) RSSIs. No nodes had crashed but the gateway wasn't receiving anything, and wasn't showing any interrupts across the wire. I spent the last hour going through register by register checking to be sure that the RFM69 was configured the way I'd expect, and generally everything seemed reasonable. But what caught my attention is that the packetready flag was high.

I began to suspect that the interrupt was not firing, or perhaps had not recognized that a packet was ready. I pulled the D2 pin low using a spare wire manually, and immediately the network came back online.

I'll have to read up on exactly what is going on with the interrupts, but perhaps when the interrupt originally fired to pick up the packet it somehow didn't meet its conditions? At this point I'm not sure whether the interrupt for RxReady is edge or level based, but this has presented a new theory to check and I'll look into it tomorrow AM.

Felix, I wanted to ask you something. The interrupt code puts the radio into standby mode, presumably to prevent any other crap from happening while the interrupt reads out the registers. When the interrupt decides that the packet is not acceptable (e.g. packet not for me, etc) it immediately unselects and returns, and does not ever set RF69_MODE_RX. However, if the packet is fine, the mode switch back to RF69_MODE_RX does occur after the packet is clocked out of the chip. Is it possible this is an oversight?

Felix

Re: Diagnosing hangs on busy network - advice?
« Reply #23 on: November 13, 2014, 10:11:40 PM »
It's possibly possible :)
I can't remember if there was a specific reason for that, but I do remember a big red mental note that I wrote down in my brain after bringing this library to a stable state: not to mess with it lightly, and to make sure any future changes are really well tested. This is of course very hard to do, because so many things can happen, often in uncontrollable situations. That's why I try to suggest to people to give it a go if they want to try a new feature or a modification, and to report back with results after some time when they are confident the new stuff works.
Switching to STANDBY instead of RX is obviously different. However it's somewhat assumed that your main loop code will keep calling receiveDone() to check if any packets were received and buffered from the radio chip. This function will also make sure the radio is in RX mode. So even if you're in standby, calling receiveDone() soon after will switch it back to RX. So I don't really think this could be an issue (and I have seen no evidence of one in my own practice), even if it perhaps is an oversight and a lack of consistency between the two states you mentioned.
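The two paths discussed here can be modelled in a few lines. This is only a sketch of the behaviour as described in this thread, not the library source: `RadioModel` and its members are hypothetical, and the real handler also reads the FIFO over SPI.

```cpp
enum class Mode { Standby, Rx, Tx };

// Hypothetical model of the mode transitions described in the thread.
struct RadioModel {
    Mode mode = Mode::Rx;
    int payloadLen = 0;

    // Interrupt handler: always drops to standby first.
    void onPacketInterrupt(bool packetIsForMe, int len) {
        if (mode != Mode::Rx) return;
        mode = Mode::Standby;        // stop receiving while the FIFO is read
        if (!packetIsForMe) return;  // rejected packet: radio is left in standby
        payloadLen = len;
        mode = Mode::Rx;             // accepted packet: back to RX after readout
    }

    // Polling call from the main loop.
    bool receiveDone() {
        if (mode == Mode::Rx && payloadLen > 0) {
            mode = Mode::Standby;    // hold the packet for the caller
            return true;
        }
        if (mode != Mode::Rx) {      // e.g. left in standby: re-arm the receiver
            payloadLen = 0;
            mode = Mode::Rx;
        }
        return false;
    }
};
```

In this model a rejected packet leaves the radio in standby only until the next receiveDone() call, which matches the argument above as long as the main loop keeps polling.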

ieris

Re: Diagnosing hangs on busy network - advice?
« Reply #24 on: November 14, 2014, 02:34:06 AM »
My question was how often the first send/ack fails. But now it is under control! My initial version had ~10% failures on the first send, but now it is close to 0%. :)
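As a rough sanity check on those numbers: if each attempt fails independently with the same probability p (an assumption, since real interference is often bursty), the chance that all three sends allowed by the default sendWithRetry() fail is

```latex
P(\text{all 3 sends fail}) = p^{3}, \qquad p = 0.10 \;\Rightarrow\; p^{3} = 0.001 = 0.1\%
```

So even a 10% first-send failure rate would lose only about one packet in a thousand under that assumption.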

ieris

Re: Diagnosing hangs on busy network - advice?
« Reply #25 on: November 14, 2014, 02:42:49 AM »
@jgilbert you are close to what I found! There is a problem with interrupts. But wait a little bit and keep calm :) I have just fixed the code and need to test it a bit before I share it.

@Felix can you explain this about modes:
a) this works: RX(got data)->STANDBY->RX->STANDBY->TX(sending),
b) this does not: RX(got data)->STANDBY->TX(sending)?
« Last Edit: November 14, 2014, 02:46:16 AM by ieris »

jgilbert

Re: Diagnosing hangs on busy network - advice?
« Reply #26 on: November 14, 2014, 08:59:20 AM »
@ieris  - looking forward to hearing your solution

@felix - During the observed crashes, the radio is clearly in a receive state with a packet waiting and receiveDone is being called many times a second but no packets are coming through. Here is one way I think this could happen:

Normally the interrupt fires, and if the radio is in RX mode and there is a payload, it moves the radio to standby, copies out the data, and puts the radio back in RX mode. During the time before receiveDone is next called, more packets can come in. From what I can tell, you only get ONE pulse, telling you that a new packet is waiting, then the radio holds. (Again, I am new to all of this RF stuff so treat me gently! :)) I think that the receiver normally waits for the FIFO to be emptied before going back into RSSI mode to listen for more packets. For instance, at the time of my hangs, REG:3D - 13 - 10011 - AutoRxRestartOn = 1, AES on. As far as I can tell that means the radio does not enter RSSI reading mode waiting for the next packet and instead is waiting for the FIFO to empty.
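For reference, the 0x13 value read from register 0x3D can be unpacked as below. This is a minimal host-side sketch, assuming the RegPacketConfig2 bit layout from the RFM69/SX1231 datasheet (InterPacketRxDelay in bits 7-4, RestartRx in bit 2, AutoRxRestartOn in bit 1, AesOn in bit 0). The struct and function names are made up for illustration.

```cpp
#include <cstdint>

// Assumed layout of RegPacketConfig2 (address 0x3D).
struct PacketConfig2 {
    uint8_t interPacketRxDelay;  // bits 7..4
    bool restartRx;              // bit 2
    bool autoRxRestartOn;        // bit 1
    bool aesOn;                  // bit 0
};

// Split a raw register byte into its named fields.
PacketConfig2 decodePacketConfig2(uint8_t value) {
    PacketConfig2 cfg;
    cfg.interPacketRxDelay = static_cast<uint8_t>(value >> 4);
    cfg.restartRx = (value & 0x04) != 0;
    cfg.autoRxRestartOn = (value & 0x02) != 0;
    cfg.aesOn = (value & 0x01) != 0;
    return cfg;
}
```

With that layout, 0x13 decodes to an inter-packet delay field of 1 with AutoRxRestartOn and AES both set, which agrees with the reading given above.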

If the last packet read was "good" (e.g. destined for this node), the interrupt handler clocks out the message, emptying the FIFO. If it's spurious, no such FIFO emptying occurs. However, since the radio is in standby mode, the FIFO will be cleared as soon as the mode is switched back to RX (see page 46 of the data sheet). That means that the next interrupt could fire again very quickly after the last interrupt on a "good" packet, thereby overwriting the last packet, UNLESS the last packet was spurious, in which case it will wait until the receiveDone loop transitions the radio back to RX. If a good packet and a bad packet arrive one right after another and the last packet received is spurious, the reading code will see a mishmash of the previous packet's ack_requested, sender_id, and target_id, and will be halted (e.g. the radio stays in standby mode with no new packets inbound). This is somewhat inconsistent behavior but I'm not sure it actually causes any harm.

Anyway, I agree that whenever receiveDone is next called, it sees the radio in RX mode and the payload > 0, puts the radio in standby and returns true. The calling code handles the packet, and then on the next receiveDone call, because the radio is in standby, receiveBegin is called, clearing out the receive variables, priming the interrupt and making sure the next interrupt will fire.

So here is where we get to the race condition:

Imagine that an interrupt for the last inbound packet was somehow missed by the AVR a single time, perhaps arriving before the next call to receiveBegin(). There is theoretically a small window of time, after the interrupt handler sets the mode back to RX but before the handler ends, where this pulse could be ignored. Ignored pulses could also occur during receiveDone's various checks that suspend interrupts.

What happens when the edge of packetReady is missed? The RFM will not issue another packetRx interrupt because it thinks it already has done so. It's waiting for someone to clock out the FIFO or clear it.  The interrupt code never gets called. The code is in a loop calling receiveDone() -- the radio is in RX mode, but PAYLOADLEN > 0 has not been set by the interrupt. At this point, the deadlock will occur. There are already some protections in place for this scenario: RestartRX is set by sendACK(), send(), and receiveBegin(). But in my code, there is no periodic message from the gateway, so the missed pulse is an issue.
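The hypothesis can be made concrete with a small host-side model. It assumes, as the post does, that a rising edge arriving while interrupts are masked is simply lost; that is the theory under test here, not verified AVR behaviour. `EdgeIrqModel` and its members are hypothetical names.

```cpp
// Hypothetical model of an edge-triggered "payload ready" line.
struct EdgeIrqModel {
    bool payloadReady = false;  // level: stays high until the FIFO is read
    bool irqEnabled = true;     // stands in for noInterrupts()/interrupts()
    bool packetTaken = false;   // set by the handler, like PAYLOADLEN > 0

    // Reading the FIFO takes the packet and drops the flag.
    void handler() {
        if (payloadReady) {
            packetTaken = true;
            payloadReady = false;
        }
    }

    // The radio raises the line; only a low-to-high transition is an edge.
    void packetArrives() {
        bool risingEdge = !payloadReady;
        payloadReady = true;
        if (risingEdge && irqEnabled) handler();  // edge is lost if masked
    }

    // Polling fallback: service the level if the edge was missed.
    void pollForMissedEdge() {
        if (irqEnabled && payloadReady && !packetTaken) handler();
    }
};
```

Once the flag is stuck high, further packets produce no new edge in this model, so nothing recovers until something checks the level, for example from the main loop.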

Anyway, it was helpful to write all of this down because it's clarified some of my thinking. I'm going to see if I can engineer a fix and will report back.


ieris

Re: Diagnosing hangs on busy network - advice?
« Reply #27 on: November 14, 2014, 10:09:05 AM »
So, here is what I found and what I did.

In the last two posts from jgilbert I saw the same problem and the same thinking that I had arrived at.
After examining the code and putting all the RFM69 logic on paper as a diagram :) I found that receiveDone() does not work as expected.

The main problem that causes the network to hang is enabling interrupts (for a very short time, of course).
While waiting for a free network, receiveDone() is called (with interrupts enabled!)
Code: [Select]
while (!canSend()) receiveDone();
and if during that time another message comes in to the gateway node, the problem happens because the radio goes into the receive state (also, the last received message is not cleared correctly by the next receiveDone()->receiveBegin() execution).

There are 2 ways to solve this:
a) improve receiveDone() and think about interrupts
b) change/improve sendACK().

I went for b), as I didn't like that sendACK() clears the incoming message and can destroy SENDERID.
With the improved code below, sendACK() can be executed immediately after receiveDone() without destroying the message, and it gains a few extra ms, which is important for sleeping low-power nodes waiting for an ACK (no need to read the data out of the radio before executing sendACK(), as is required now).

RFM69.cpp
Code: [Select]
void RFM69::sendACK(const void* buffer, byte bufferSize) {
  setMode(RF69_MODE_RX); // switch from STANDBY to RX before TX
  int savedRSSI = RSSI; // save the RSSI of the received payload
  bool canSendACK = false;
  unsigned long now = millis();
  while (millis() - now < ACK_CSMA_LIMIT_MS) // wait for a free network only as long as the sender waits for the ACK
  {
    if (readRSSI() < CSMA_LIMIT) // a signal weaker than -90dBm (CSMA_LIMIT) means the channel should be free
    {
      canSendACK = true;
      break;
    }
  }
  if (canSendACK) // channel is free, let's send the ACK
  {
//    Serial.print("ACK sent:");Serial.print(millis()-now, DEC); Serial.print("ms;RSSI:");Serial.println(readRSSI(), DEC); Serial.flush();
    sendFrame(SENDERID, buffer, bufferSize, false, true);
  }
  RSSI = savedRSSI; // restore the payload RSSI
}

RFM69.h (add first line, change second)
Code: [Select]
#define ACK_CSMA_LIMIT_MS    40
    bool sendWithRetry(byte toAddress, const void* buffer, byte bufferSize, byte retries=2, byte retryWaitTime=ACK_CSMA_LIMIT_MS); //40ms roundtrip req for  61byte packets

I will explain what I did and why:
- in RFM69.h, added a new defined value ACK_CSMA_LIMIT_MS, which is the time in ms to wait before sending the next message if an ACK wasn't received.
- used this value instead of the hard-coded number in the sendWithRetry() declaration, because it will be used later in another place

RFM69.cpp
- setting the mode back to RX (without this, sending does not work). Why changing the state from STANDBY to TX in sendFrame() does not allow sending, I haven't figured out. It seems it wants RX first, or there is some timing issue. The correct place for it would be one line before sendFrame(), but in that case it does not work (it seems some time is needed to switch); I tried adding "Wait for ModeReady" without success
- keeping RSSI, as we will measure the current network level and the RSSI of the received message should be kept
- in the while loop we check for a free network ONLY for ACK_CSMA_LIMIT_MS, as there is no need to do it longer: sendWithRetry() only waits for the ACK for this amount of time. (In the current version the wait is too long as well; there is no need to wait for a free network to send the ACK if the sender waited only 40ms for the answer)
- sending the ACK
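The elapsed-time test in the while loop relies on unsigned arithmetic, which keeps working when the 32-bit millis() counter wraps around (after roughly 49.7 days). A small host-side sketch of that comparison, where `withinWindow` is a hypothetical helper and not part of the library:

```cpp
#include <cstdint>

// True while fewer than limitMs milliseconds have passed since startMs.
// Unsigned subtraction wraps modulo 2^32, so the result stays correct
// even if the counter rolled over between startMs and nowMs.
bool withinWindow(uint32_t nowMs, uint32_t startMs, uint32_t limitMs) {
    return static_cast<uint32_t>(nowMs - startMs) < limitMs;
}
```

For example, with a 40 ms limit a start time of 0xFFFFFFF0 and a current time of 10 gives an elapsed time of 26 ms, which is still inside the window.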

This version works very well for me. ACK times are improved, and tracing sendACK() I observed that in most cases the while loop finishes immediately (0ms). Super! :)
You can get full changed code and look at changes in github: https://github.com/openminihub/RFM69

EDIT: Bad news, it hung again...
I added writeReg(REG_PACKETCONFIG2, (readReg(REG_PACKETCONFIG2) & 0xFB) | RF_PACKET2_RXRESTART); // avoid RX deadlocks
before sendFrame. Going to sleep now, let's see what the morning brings.
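The read-modify-write in that line can be checked in isolation. A minimal sketch, assuming RF_PACKET2_RXRESTART is 0x04 (bit 2, which is the bit the 0xFB mask clears); `kRestartRxBit` and `withRestartRx` are hypothetical names:

```cpp
#include <cstdint>

constexpr uint8_t kRestartRxBit = 0x04;  // assumed value of RF_PACKET2_RXRESTART

// Clear bit 2, then set it again, leaving every other bit untouched.
constexpr uint8_t withRestartRx(uint8_t regValue) {
    return static_cast<uint8_t>((regValue & 0xFB) | kRestartRxBit);
}
```

Applied to the 0x13 value reported earlier in the thread, this yields 0x17, so the delay, AutoRxRestartOn and AES bits are preserved.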
« Last Edit: November 15, 2014, 03:45:05 PM by ieris »

Felix

Re: Diagnosing hangs on busy network - advice?
« Reply #28 on: November 14, 2014, 04:47:41 PM »
Thanks ieris for the pull request. Unfortunately I cannot merge it because of some things I noticed: there are some Serial.writes and a lot of lines that have not changed except for whitespace, and I like to keep the merges lean and clean. I also saw your edit about it failing again.
I think I am going to start to do my own testing and see if/when there's a hang. There are now several threads reporting hangs so solving this issue is very important.
Thanks for your continued persistence and feedback.

jgilbert

Re: Diagnosing hangs on busy network - advice?
« Reply #29 on: November 14, 2014, 05:44:25 PM »
Great stuff ieris. Based on the work I've been doing, the addition of the CSMA limit in my code seems to have eliminated ACK issues, and at least one major form of deadlock. I might consider adding the RX restart code too as per send().

I have now been crash free on a very congested network (6+ nodes, each transmitting quite a lot of traffic continuously) for almost 12 hours. If this continues I can submit a patch to you, Felix.

The changes that have worked best for me in my setup:

1) Ensuring that sendAck can timeout like send()
2) Putting RXrestart in sendAck (not sure if this is definitively a fix but it should mirror what send() does to avoid more vague errors)
3) Placing a time guard in transmit -- this was definitively the source of at least one hang although I cannot trace why in the data sheet a send should hang at all
4) Fixing the interrupt handler to handle cases where DATALEN = 0, which can apparently occur in high-traffic scenarios; this also improves the hygiene of the code
5) Adding a call to the interrupt function in some situations during receiveDone to catch the missed interrupt scenario I outlined earlier
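The time guards in items 1) and 3) follow a common pattern: poll a condition, but give up after a deadline instead of spinning forever. A host-side sketch of that pattern, not the actual patch; `waitWithTimeout` and its injectable clock are hypothetical, and on the AVR the clock would be millis() and the condition a register read.

```cpp
#include <cstdint>
#include <functional>

// Poll `done` until it returns true or timeoutMs has elapsed.
// Returns false on timeout so the caller can recover instead of hanging.
bool waitWithTimeout(const std::function<bool()>& done,
                     const std::function<uint32_t()>& nowMs,
                     uint32_t timeoutMs) {
    const uint32_t start = nowMs();
    while (static_cast<uint32_t>(nowMs() - start) < timeoutMs) {
        if (done()) return true;
    }
    return false;
}
```

A caller that gets false back can then reset the radio state or restart RX rather than blocking the main loop.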

If this works well, I'll remove my debugging code, and retest. If that works well, I can send the patch.

Jeremy