Author Topic: Diagnosing hangs on busy network - advice?  (Read 26992 times)

jgilbert

  • NewMember
  • *
  • Posts: 18
Diagnosing hangs on busy network - advice?
« on: November 09, 2014, 10:53:40 PM »
Like many of you, I'm on a path to develop a wireless ecosystem built around Felix's Moteinos to measure everything in my house. (In fact, I've even designed a few boards that fit the Moteino footprint to make it that much easier to add sensors.) I've recently stumbled into a difficult reliability issue and I'm wondering if anyone has any advice. My gateway Moteino R3 (which echoes packets via a serial port to the Raspberry Pi) is periodically hanging, requiring a hard reset. My network is busy -- I have about 5 nodes operating, sending packets every 5-10 seconds. This glitch is hard to reproduce: sometimes it occurs right away, other times it takes hours or even a day to emerge.

After attaching some additional LEDs and instrumenting my code, it appears that the hang is occurring in my call to radio.sendACK(). (E.g. the LED that goes on only before this function is called is stuck "high" when the gateway freezes.)

I spent the afternoon tracing through the RFM69 code (amazing work Felix -- not a simple chip to talk to) and I'm puzzled by the following. During ::send(), there is some obvious protection for deadlocks and for timeouts. I frankly don't understand the first line at all (must be something deep in the internals of the RFM chip at work) but the second part is straightforward -- the radio waits until the channel is "clear" or a certain amount of time has passed.

Code: [Select]
void RFM69::send(byte toAddress, const void* buffer, byte bufferSize, bool requestACK)
{
  writeReg(REG_PACKETCONFIG2, (readReg(REG_PACKETCONFIG2) & 0xFB) | RF_PACKET2_RXRESTART); // avoid RX deadlocks
  long now = millis();
  while (!canSend() && millis()-now < RF69_CSMA_LIMIT_MS) receiveDone();
  sendFrame(toAddress, buffer, bufferSize, requestACK, false);
}

This same protection does not seem to take place for ACKs:

Code: [Select]
/// Should be called immediately after reception in case sender wants ACK
void RFM69::sendACK(const void* buffer, byte bufferSize) {
  byte sender = SENDERID;
  while (!canSend()) receiveDone();
  sendFrame(sender, buffer, bufferSize, false, true);
}

Theoretically, while(!canSend()) could spin forever if the radio never gets into RX mode, the measured RSSI never drops below the CSMA limit, or something else goes wrong in receiveDone(). However, receiveDone() seems guaranteed to eventually put the radio in RX mode, and sooner or later the channel will clear. During the hang, the other nodes (even those very close to the gateway) continue to transmit and report low RSSIs, suggesting that the gateway's while loop should also eventually realize the channel is clear. I've also verified this by adding code to nodes to passively wake up and report on the background RSSI they receive (thanks to an earlier post for how to do this).

Has anyone hit this glitch before or am I in brand new territory? Should I patch sendACK() to look like send()?

Another (possibly) related issue: I've observed some isolated cases of a node hanging while holding the channel open (my debugging nodes report a readRSSI() of -22 to -30, well above the CSMA limit) -- the problem fixes itself after the one particular node is reset. Again, reading the code very carefully I can't figure out any situation where the transmitter would stick in the "on" position unless something was tampering with the interrupt pin, which my designs leave untouched. Through battery instrumentation I've mostly ruled out brownouts as a potential factor. Still, it might be safer to have some form of guard on the amount of time that the radio is allowed to stay in RF69_MODE_TX.

Thanks for all help and tips.
« Last Edit: November 14, 2014, 04:48:07 PM by Felix »

TomWS

  • Hero Member
  • *****
  • Posts: 1930
Re: Diagnosing hangs on busy network - advice?
« Reply #1 on: November 09, 2014, 11:46:53 PM »
One problem I did run into with sendAck is that the SENDERID is not restored if receiveDone is called.  Most times this isn't an issue, but, in the right circumstances, ie, an attempt to send an Ack before a previous ack had completed, then the SENDERID value is toasted...

I found that restoring the SENDERID from 'sender' after the return from sendFrame, fixed my issue. 

I am using Moteino R4 with RFM69HW, but I don't think that is a distinguishing factor in this case.

Tom

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 6867
  • Country: us
    • LowPowerLab
Re: Diagnosing hangs on busy network - advice?
« Reply #2 on: November 10, 2014, 08:56:04 AM »
Hey jgilbert,
I think you have a valid point and something I may have missed. ACKs happen far less often than regular packets. I did notice that kind of glitch myself and I eliminated it by adding the time limit seen in the send() function. I will probably add this in the sendACK() as well, it makes a lot of sense, but I need to test this myself. In the meantime you can add the same code and try it out, it won't hurt, and my gut tells me it will solve your issue.

And Tom - the SENDERID is just the ID of the sender of the last packet. You should not try to send an ACK, then in the middle of that send something else. In fact the pipeline is a FIFO and unless I'm missing something, it makes no sense to interrupt a transmission for another transmission because there's no way to resume the previous. It's like cutting a word in half, saying another word, then completing the first word. If i completely misunderstood then my apologies ... let me know.

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #3 on: November 10, 2014, 09:14:22 AM »
I noticed similar things. I have a network of several nodes, one of which sends info in bursts (every minute it sends 8 messages with a 1 second delay between them).
So far, twice (once today) :) the whole home network has hung; even nodes that communicate with each other (not with the gateway) were halted until I restarted the gateway node (MoteinoMega).

TomWS

  • Hero Member
  • *****
  • Posts: 1930
Re: Diagnosing hangs on busy network - advice?
« Reply #4 on: November 10, 2014, 10:27:44 AM »
<snip..>
And Tom - the SENDERID is just the ID of the sender of the last packet. You should not try to send an ACK, then in the middle of that send something else. In fact the pipeline is a FIFO and unless I'm missing something, it makes no sense to interrupt a transmission for another transmission because there's no way to resume the previous. It's like cutting a word in half, saying another word, then completing the first word. If i completely misunderstood then my apologies ... let me know.
The problem, IIRC, was that I was just about to send an Ack to a received packet when a new packet started coming in causing canSend to temporarily fail and then receiveDone was called, and, on return from all that, SENDERID had been set to 0 and my subsequent code, which had dutifully sent the ack, no longer had the original sender's address.  As I said, the problem was solved (or at least, masked  ;) when I added the line:
Code: [Select]
  
  while (!canSend()) receiveDone();
  SENDERID = sender; // TWS: Restore SenderID after it gets wiped out by receiveDone()
  sendFrame(sender, buffer, bufferSize, false, true, sendRSSI, lastRSSI);
in sendAck()

Tom

jgilbert

  • NewMember
  • *
  • Posts: 18
Re: Diagnosing hangs on busy network - advice?
« Reply #5 on: November 10, 2014, 01:19:48 PM »
Thanks for the replies. I tried a version last night with the patch in place and I did get a hang overnight which resolved itself several hours later. Not a clean test because I haven't ruled out that something else went wrong. Will try again tonight. In the meantime, I'm going to stick a LED debugging statement ahead of all of the while() loops in the RFM code. (I would go with Serial but I fear I'd be tampering with the timing of the code too much.)

@TomWS, the way I read the code right now the original sender is saved off into a locally scoped variable so I don't think there is a race condition. However, it does seem possible that between the point in time you get a packet and send an ACK, another packet may have come in which you could potentially miss because the radio goes into a standby mode as soon as a packet is received. I'd have to defer to Felix if that packet is actually picked up again on the next call to receiveDone().

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 6867
  • Country: us
    • LowPowerLab
Re: Diagnosing hangs on busy network - advice?
« Reply #6 on: November 10, 2014, 03:57:31 PM »
By the way, while we figure this out, a way out of it would be to set up a watchdog reset and call it periodically in the main loop. If there is a hang that will cause a watchdog reset.
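
A rough illustration of the watchdog idea, written as plain C++ so it can run off-target. SoftWatchdog and its methods are hypothetical names, not part of the RFM69 library, and the 8 second timeout used in the note below is only an example. On an AVR-based Moteino the hardware equivalent would typically be wdt_enable() once in setup() and wdt_reset() on each pass through loop(), both from <avr/wdt.h>.

```cpp
#include <cstdint>

// Hypothetical host-side model of a watchdog timer (not RFM69 library code).
class SoftWatchdog {
public:
  explicit SoftWatchdog(uint32_t timeoutMs)
      : timeoutMs_(timeoutMs), lastKickMs_(0) {}

  // Call once per completed pass of the main loop.
  void kick(uint32_t nowMs) { lastKickMs_ = nowMs; }

  // True once the main loop has been silent for at least the timeout.
  // Unsigned subtraction keeps the comparison valid across millis() rollover.
  bool expired(uint32_t nowMs) const {
    return static_cast<uint32_t>(nowMs - lastKickMs_) >= timeoutMs_;
  }

private:
  uint32_t timeoutMs_;
  uint32_t lastKickMs_;
};
```

With an 8000 ms timeout, a loop that last kicked at t=1000 is still considered alive at t=5000 and is flagged as hung at t=9000. A hang inside a library while() loop stops the kicks, which is what lets the reset fire.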

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #7 on: November 11, 2014, 01:53:35 AM »
I made the modification in sendACK, and the whole network hung again last night. I was thinking about a watchdog, but this isn't an urgent system at the moment, so I will try to diagnose.
Before, I was running a Moteino R4 as the gateway for ~2 months and didn't notice anything like this. It was attached to other hardware via a serial line.
Now I'm running a Mega with hardware attached to both serial ports, and it just forwards whatever was received (so the gateway code is a little different from before). I also added memory monitoring - it is stable all the time (no overflows). I will try to find out what causes my network to hang.

Started trace:
When this 'hung network' happens, starting new nodes or resetting others in this network (not the hung gateway node) changes nothing.

While the gateway was hung, I started a trace on a test node, adding some trace lines to the RFM69 library, and noticed that receiveDone() returns false:
Code: [Select]
bool RFM69::receiveDone() {
...
  else if (_mode == RF69_MODE_RX)  //already in RX no payload yet
  {
    interrupts(); //explicitly re-enable interrupts
    return false;

Added some more lines, got:
Code: [Select]
exec receiveDone
receiveBegin()...
exec receiveDone
Already RX - 0,0 (payloadlen=0, senderid=0)

After resetting the hung gateway, I got this on the test node:
Code: [Select]
exec receiveDone
STANDBY
[31] TEMP;2225;3362  [RX_RSSI:-48]  <- printing radio data & rssi
exec receiveDone
Starting receive...
exec receiveDone
Already RX - 0,0
 - ACK sent.
exec receiveDone
Already RX - 0,0
exec receiveDone
Already RX - 0,0
...

And after some time testing, I noticed that the node hangs on noInterrupts(); at the beginning of receiveDone();
« Last Edit: November 11, 2014, 09:00:19 AM by ieris »

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #8 on: November 11, 2014, 02:28:55 AM »
The problem, IIRC, was that I was just about to send an Ack to a received packet when a new packet started coming in causing canSend to temporarily fail and then receiveDone was called, and, on return from all that, SENDERID had been set to 0 and my subsequent code, which had dutifully sent the ack, no longer had the original sender's address.  As I said, the problem was solved (or at least, masked  ;) when I added the line:
Code: [Select]
  
  while (!canSend()) receiveDone();
  SENDERID = sender; // TWS: Restore SenderID after it gets wiped out by receiveDone()
  sendFrame(sender, buffer, bufferSize, false, true, sendRSSI, lastRSSI);
in sendAck()

Tom
As I see from code and your comment, SENDERID can be changed/reseted executing receiveDone, but SendACK should proceed without fault because it uses incoming parameter sender in sendFrame procedure.
But your fix will avoid problem if SENDERID must be used somewhere else in code after SendACK.

I noticed strange things with SENDERID before when I had 2 nodes with the same ID, and one of them sending payload requests ACK.

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #9 on: November 11, 2014, 09:35:33 AM »
At the moment I don't understood when and where interrupts are restored?

Code: [Select]
bool RFM69::receiveDone() {
// ATOMIC_BLOCK(ATOMIC_FORCEON)
// {
  noInterrupts(); //re-enabled in unselect() via setMode() or via receiveBegin()
Comment says that it is enabled in unselect(), but setMode() do not have it, and receiveBegin just have a setMode() call.

Currently I see that only SendACK->SendFrame->unselect() restores it. What I missed?

jgilbert

  • NewMember
  • *
  • Posts: 18
Re: Diagnosing hangs on busy network - advice?
« Reply #10 on: November 11, 2014, 02:48:42 PM »
Just a quick update to Felix and Ieris and others watching this thread.

I instrumented the ___ out of the library, and discovered hangs had occurred in two places in the last 12 hours. For context, I have already implemented the following change to sendACK so that is presumably no longer a source of crashes and these two new crashes occur for other reasons.

Code: [Select]
void RFM69::sendACK(const void* buffer, byte bufferSize) {
  byte sender = SENDERID;
  writeReg(REG_PACKETCONFIG2, (readReg(REG_PACKETCONFIG2) & 0xFB) | RF_PACKET2_RXRESTART); // avoid RX deadlocks
  long now = millis();
  DEBUGWRITE( 3 );
  while (!canSend() && millis()-now < RF69_CSMA_LIMIT_MS) receiveDone();
  if( millis()-now >= RF69_CSMA_LIMIT_MS )
      ERROR_CONDITION |= ERR_CSMA_LIMIT_REACHED; // OR the flag in; &= would clear bits instead of setting one
  sendFrame(sender, buffer, bufferSize, false, true);

}

New issue #1: The gateway had received packets that were larger than RF69_MAX_DATA_LEN. How? I have no idea. One of them originated from node 45, which is not a node I've ever assigned on my network. The contents was a stream of "---" characters. I can't imagine how this happened given that I am using an encryption key, etc. DATALEN with a value of 254 presumably caused memory locations to be overwritten in both the RFM library and the calling code. I was able to identify this by chance, since I had changed the logging code to report on the packet length in an attempt to reduce serial bandwidth.

I've put in the following quick patch a second ago to the interrupt handler. I also added a static variable called "ERROR_CONDITION" so I can log this situation out in a cleaner way.

Code: [Select]
    for (byte i= 0; i < DATALEN; i++)
    {
      byte inByte = SPI.transfer(0);
      if( i < RF69_MAX_DATA_LEN )
        DATA[i] = inByte;
    }

    if( DATALEN > RF69_MAX_DATA_LEN )
      ERROR_CONDITION |= ERR_DATA_LEN_EXCEEDED; // OR the flag in; &= would clear bits instead of setting one
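
A side note on the ERROR_CONDITION bookkeeping, as a minimal sketch. It assumes each ERR_* constant is a single-bit mask; the bit values and the helper names setFlag/hasFlag/clearFlag are hypothetical, since the thread never shows the definitions. Recording a fault needs a bitwise OR so earlier faults survive; a bitwise AND against a variable that starts at zero can never set anything.

```cpp
#include <cstdint>

// Hypothetical bit values; the thread names these flags but not their definitions.
const uint8_t ERR_CSMA_LIMIT_REACHED = 0x01;
const uint8_t ERR_DATA_LEN_EXCEEDED  = 0x02;
const uint8_t ERR_TX_GUARD_REACHED   = 0x04;

// Record a fault without disturbing flags that are already set.
inline uint8_t setFlag(uint8_t flags, uint8_t mask) {
  return static_cast<uint8_t>(flags | mask);
}

// Check whether a particular fault has been recorded.
inline bool hasFlag(uint8_t flags, uint8_t mask) {
  return (flags & mask) != 0;
}

// Clear one fault after it has been logged, leaving the others in place.
inline uint8_t clearFlag(uint8_t flags, uint8_t mask) {
  return static_cast<uint8_t>(flags & static_cast<uint8_t>(~mask));
}
```

Keeping the flags in one byte means the gateway can report them in a single field of its log line and then clear only the ones it has reported.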

The second crash occurred on one of the transmitting low power nodes (runs at 6uA, thank you Felix!). In this case, the glitch occurred here in the last part of the send() code:

Code: [Select]
  /* no need to wait for transmit mode to be ready since its handled by the radio */
  setMode(RF69_MODE_TX);
  DEBUGWRITE( 8 );
  while (digitalRead(_interruptPin) == 0 ); //wait for DIO0 to turn HIGH signalling transmission finish
  //while (readReg(REG_IRQFLAGS2) & RF_IRQFLAGS2_PACKETSENT == 0x00); // Wait for ModeReady
  DEBUGWRITE( 9 );
  setMode(RF69_MODE_STANDBY);

I have now patched it to read:

Code: [Select]
  
  unsigned long txStart = millis();
  /* no need to wait for transmit mode to be ready since its handled by the radio */
  setMode(RF69_MODE_TX);
  DEBUGWRITE( 8 );
  while (digitalRead(_interruptPin) == 0 && millis()-txStart < RF69_TX_LIMIT_GUARD_MS); //wait for DIO0 to turn HIGH signalling transmission finish, or give up at the guard limit
  //while (readReg(REG_IRQFLAGS2) & RF_IRQFLAGS2_PACKETSENT == 0x00); // Wait for ModeReady
  DEBUGWRITE( 9 );
  if( millis() - txStart >= RF69_TX_LIMIT_GUARD_MS )
      ERROR_CONDITION |= ERR_TX_GUARD_REACHED; // OR the flag in; &= would clear bits instead of setting one
  setMode(RF69_MODE_STANDBY);





ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #11 on: November 11, 2014, 06:24:06 PM »
Added logging in gateway node and it hanged.....   on SendACK :)
jgilbert you are some steps further than me!

And it's very strange that DATALEN can be more than allowed, because a few lines before DATALEN is calculated there is a precaution:
Code: [Select]
PAYLOADLEN = PAYLOADLEN > 66 ? 66 : PAYLOADLEN; //precaution

jgilbert

  • NewMember
  • *
  • Posts: 18
Re: Diagnosing hangs on busy network - advice?
« Reply #12 on: November 11, 2014, 08:50:35 PM »
@ieris - you're right, there does appear to be a check like that on PAYLOADLEN. I think what happened is that the PAYLOADLEN in this circumstance was a low value -- like 1 -- and then when 3 was subtracted to calculate the DATALEN it wrapped around to 254 (a PAYLOADLEN of 2 would wrap to 255). Arduino's byte is an unsigned 8-bit int, so the result wraps instead of going negative.

Perhaps a simpler fix would simply be to abort the receive if the payloadlen < 3.
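
To make the wraparound concrete, here is a small host-side sketch. It assumes, as described above, that DATALEN is PAYLOADLEN minus 3 computed in unsigned 8-bit arithmetic; the function names are hypothetical, and the maximum data length is passed in rather than assumed.

```cpp
#include <cstdint>

const uint8_t kHeaderBytes = 3;  // bytes subtracted from PAYLOADLEN to get DATALEN, per the thread

// Unguarded: an unsigned 8-bit subtraction wraps modulo 256 instead of going negative.
uint8_t dataLenUnguarded(uint8_t payloadLen) {
  return static_cast<uint8_t>(payloadLen - kHeaderBytes);
}

// Guarded: returns false (the caller should abort the receive) when the payload
// is too short to hold the header or the data would not fit in the buffer.
bool dataLenChecked(uint8_t payloadLen, uint8_t maxDataLen, uint8_t* dataLen) {
  if (payloadLen < kHeaderBytes) return false;
  uint8_t len = static_cast<uint8_t>(payloadLen - kHeaderBytes);
  if (len > maxDataLen) return false;
  *dataLen = len;
  return true;
}
```

With the unguarded version a PAYLOADLEN of 1 yields 254 and a PAYLOADLEN of 2 yields 255, which matches the oversized DATALEN seen on the gateway. The guarded version rejects both before any copy into DATA happens.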



ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #13 on: November 12, 2014, 01:18:32 AM »
@jgilbert - very good point! +1

Added your provided solution in sendACK - working w/o hangup already 9h  8)
EDIT: After 15h again hanged... most probably should add payloadlen<3 fix
« Last Edit: November 12, 2014, 04:19:32 PM by ieris »

jgilbert

  • NewMember
  • *
  • Posts: 18
Re: Diagnosing hangs on busy network - advice?
« Reply #14 on: November 12, 2014, 04:50:11 PM »
I've been running almost 24 hours now with 3 fixes in place:
1) The payload length check
2) sendACK
3) transmit mode timeout

In that period of time, I've had two crashes and both are very unusual. In both cases, all of the nodes stayed up (e.g. no crashes) but all traffic halted. The measured RSSI on the network in one case was -29DB flat, and in the other case -59DB flat as measured by the gateway. In both cases, resetting the gateway fixed the problem.

@Felix and others, perhaps you could weigh in here since I'm a bit of an RF neophyte. My basic mental model of what might be happening: The transmitter somehow powered up, emitting the carrier signal, sent the packet, and then for unknown reasons never shut down. From everything I can determine, the culprit in both instances was the gateway itself.

I've scoured the data sheet but I don't see any mention of a requirement to explicitly power down the radio, and the RFM code switches to MODE_STANDBY in any case as soon as the interrupt pulse arrives indicating the packet has been sent. The only funny quirk I can see -- setMode() never actually waits for the mode change to occur.

I added some code this morning to dump the radio's registers every 30 seconds. Maybe this will show a pattern? Honestly I'm just fishing at this point with no compelling theory of what is missing in the code.

Any other ideas?

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #15 on: November 13, 2014, 03:59:09 AM »
At the moment I don't understood when and where interrupts are restored?

Code: [Select]
bool RFM69::receiveDone() {
// ATOMIC_BLOCK(ATOMIC_FORCEON)
// {
  noInterrupts(); //re-enabled in unselect() via setMode() or via receiveBegin()
Comment says that it is enabled in unselect(), but setMode() do not have it, and receiveBegin just have a setMode() call.

Currently I see that only SendACK->SendFrame->unselect() restores it. What I missed?
Found it: interrupts are restored after the data is read. receiveBegin() cleans everything up, and on the 'first visit' after that receiveDone() calls interrupts().

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #16 on: November 13, 2014, 07:01:56 AM »
Examined the code. Some of the things we discussed, like the RX deadlock protection, were already in place. For now I have rewritten the sendACK logic. This evening I will flash the new version and start testing. Fingers crossed  ;)

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 6867
  • Country: us
    • LowPowerLab
Re: Diagnosing hangs on busy network - advice?
« Reply #17 on: November 13, 2014, 09:13:35 AM »
I am keeping an eye on this thread, don't think I'm ignorant. I appreciate your persistence and if you find a solution I am eager to know. Thank you

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #18 on: November 13, 2014, 09:32:13 AM »
Yep, of course I will inform you. Now flashed a new version and debugging/improving. Will write later in any case - what I changed and why and does it helped or not.
EDIT: Felix, do you have any numbers in percentage by example, how often sending back ACK failing and node should resend info? Is it very close to 100%?
« Last Edit: November 13, 2014, 09:50:49 AM by ieris »

TomWS

  • Hero Member
  • *****
  • Posts: 1930
Re: Diagnosing hangs on busy network - advice?
« Reply #19 on: November 13, 2014, 10:17:19 AM »
<...snip>
As I see from code and your comment, SENDERID can be changed/reseted executing receiveDone, but SendACK should proceed without fault because it uses incoming parameter sender in sendFrame procedure.
But your fix will avoid problem if SENDERID must be used somewhere else in code after SendACK.

I noticed strange things with SENDERID before when I had 2 nodes with the same ID, and one of them sending payload requests ACK.
For what it's worth, my failures with sendAck occurred when I called CheckForWirelessHex() immediately after getting a positive response from receiveDone().  Note that CheckForWirelessHex will immediately send an Ack with data ("FLX?") if it sees a wireless programming poll.  Given some of your observations, with the immediate Ack, I suspect that the RSSI value didn't have a chance to disappear and canSend() subsequently returned false causing the call to receiveDone(), which wiped out SENDERID, which CheckForWirelessHex needed for subsequent operations...  Again, this may be unrelated to your issues, but provided as an FYI...

Tom

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #20 on: November 13, 2014, 10:54:16 AM »
<...snip>
As I see from code and your comment, SENDERID can be changed/reseted executing receiveDone, but SendACK should proceed without fault because it uses incoming parameter sender in sendFrame procedure.
But your fix will avoid problem if SENDERID must be used somewhere else in code after SendACK.

I noticed strange things with SENDERID before when I had 2 nodes with the same ID, and one of them sending payload requests ACK.
For what it's worth, my failures with sendAck occurred when I called CheckForWirelessHex() immediately after getting a positive response from receiveDone().  Note that CheckForWirelessHex will immediately send an Ack with data ("FLX?") if it sees a wireless programming poll.  Given some of your observations, with the immediate Ack, I suspect that the RSSI value didn't have a chance to disappear and canSend() subsequently returned false causing the call to receiveDone(), which wiped out SENDERID, which CheckForWirelessHex needed for subsequent operations...  Again, this may be unrelated to your issues, but provided as an FYI...

Tom
Thank you for info. For now it seems related to our problem also. If my idea which I try to implement will be right this should be fixed also.

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 6867
  • Country: us
    • LowPowerLab
Re: Diagnosing hangs on busy network - advice?
« Reply #21 on: November 13, 2014, 12:27:04 PM »
Yep, of course I will inform you. Now flashed a new version and debugging/improving. Will write later in any case - what I changed and why and does it helped or not.
EDIT: Felix, do you have any numbers in percentage by example, how often sending back ACK failing and node should resend info? Is it very close to 100%?
If I understand your Q right, the resending should not fail under normal circumstances. I coded the sendWithRetry() function that will retry to send and wait for an ACK up to 3 times by default (3 total sends). How often does a first send/ack fail? How often does a second fail? A third? I am really not sure. But from what I can observe 3 is way more than enough to ensure a packet goes through.
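
For readers who want to measure first-send failures themselves, a minimal sketch of the retry pattern described above. This is not the library's sendWithRetry() code: sendWithRetrySketch, RetryResult and the sendAndWaitForAck callback are hypothetical stand-ins, with the callback representing one send followed by one bounded wait for the ACK.

```cpp
#include <cstdint>
#include <functional>

// Outcome of one retried transmission: whether an ACK arrived and how many sends it took.
struct RetryResult {
  bool acked;
  unsigned sends;
};

// Sends once, then retries up to `retries` more times until the callback reports an ACK.
// retries = 2 therefore gives 3 total sends, matching the default described in the post.
RetryResult sendWithRetrySketch(uint8_t retries,
                                const std::function<bool()>& sendAndWaitForAck) {
  RetryResult result{false, 0};
  for (unsigned attempt = 0; attempt <= retries; ++attempt) {
    ++result.sends;
    if (sendAndWaitForAck()) {
      result.acked = true;
      break;
    }
  }
  return result;
}
```

Logging result.sends per packet is one way to put a number on how often the first send/ACK exchange fails versus the second or third.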

jgilbert

  • NewMember
  • *
  • Posts: 18
Re: Diagnosing hangs on busy network - advice?
« Reply #22 on: November 13, 2014, 08:59:27 PM »
Guys, an interesting update on my end. I am still running with the 3 adjustments that I mentioned before. This time the freeze doesn't match the earlier pattern. The gateway indicated a RSSI very high (-5db), but all other nodes were measuring low (-90db) RSSIs. No nodes had crashed but the gateway wasn't receiving anything, and wasn't showing any interrupts across the wire. I spent the last hour going through register by register checking to be sure that the RFM69 was configured the way I'd expect, and generally everything seemed reasonable. But what caught my attention is that the packetready flag was high.

I began to suspect that the interrupt was not firing, or perhaps had not recognized that a packet was ready. I pulled the D2 pin low using a spare wire manually, and immediately the network came back online.

I'll have to read up on exactly what is going on with the interrupts but perhaps somehow when the interrupt originally fired to pick up the packet it didn't meet its conditions? I'm not sure I know at this point if the interrupt for RxReady is based on edge or level, but this has presented a new theory to check and I'll look into tomorrow AM.

Felix, I wanted to ask you something. The interrupt code puts the radio into standby mode, presumably to prevent any other crap from happening while the interrupt reads out the registers. When the interrupt decides that the packet is not acceptable (e.g. packet not for me, etc) it immediately unselects and returns, and does not ever set RF69_MODE_RX. However, if the packet is fine, the mode switch back to RF69_MODE_RX does occur after the packet is clocked out of the chip. Is it possible this is an oversight?

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 6867
  • Country: us
    • LowPowerLab
Re: Diagnosing hangs on busy network - advice?
« Reply #23 on: November 13, 2014, 10:11:40 PM »
Felix, I wanted to ask you something. The interrupt code puts the radio into standby mode, presumably to prevent any other crap from happening while the interrupt reads out the registers. When the interrupt decides that the packet is not acceptable (e.g. packet not for me, etc) it immediately unselects and returns, and does not ever set RF69_MODE_RX. However, if the packet is fine, the mode switch back to RF69_MODE_RX does occur after the packet is clocked out of the chip. Is it possible this is an oversight?
It's possibly possible :)
I can't remember if there was a specific reason for that but I do remember a big red mental note that I wrote down in my brain after bringing this library to a stable state: not to mess with it very easily and make sure any future changes are really well tested. This is of course very hard to do. Because so many things can happen and a lot of times in uncontrollable situations. That's why I try to suggest to people to give it a go if they want to try a new feature or a modification and report back with results after some time when they are confident the new stuff works.
Switching to STANDBY instead of RX is obviously different. However it's somewhat assumed that your main loop code will keep calling receiveDone() to check if any packets were received and buffered from the radio chip. This function will also make sure the radio is in RX mode. So even if you're in standby, calling receiveDone() soon after will switch it back to RX. So I don't really think this could be an issue (and I don't have proof in my own practice) even if it perhaps is an oversight and lack of consistency between the two states you mentioned.

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #24 on: November 14, 2014, 02:34:06 AM »
Yep, of course I will inform you. Now flashed a new version and debugging/improving. Will write later in any case - what I changed and why and does it helped or not.
EDIT: Felix, do you have any numbers in percentage by example, how often sending back ACK failing and node should resend info? Is it very close to 100%?
If I understand your Q right, the resending should not fail under normal circumstances. I coded the sendWithRetry() function that will retry to send and wait for an ACK up to 3 times by default (3 total sends). How often does a first send/ack fail? How often does a second fail? A third? I am really not sure. But from what I can observe 3 is about way more than enough to ensure a packet goes through.
My question was how often the first send/ack fail. But now it is under control! My initial version had ~10% fail on first sending, but now it is close to 0%. :)

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #25 on: November 14, 2014, 02:42:49 AM »
@jgilbert you are close to what I found! There is a problem with interrupts. But wait a little bit and keep calm :) Fixed a code right now and should give some testing to provide it.

@Felix can you explain about modes:
a) this works: RX(got data)->STANDBY->RX->STANDBY->TX(sending),
b) this not: RX(got data)->STANDBY->TX(sending)?
« Last Edit: November 14, 2014, 02:46:16 AM by ieris »

jgilbert

  • NewMember
  • *
  • Posts: 18
Re: Diagnosing hangs on busy network - advice?
« Reply #26 on: November 14, 2014, 08:59:20 AM »
@ieris  - looking forward to hearing your solution

@felix - During the observed crashes, the radio is clearly in a receive state with a packet waiting and receiveDone is being called many times a second but no packets are coming through. Here is one way I think this could happen:

Normally the interrupt fires, and if the radio is in RX mode and there is a payload, it moves the radio to standby, copies out the data, and puts the radio back in RX mode. During the time before receiveDone is next called, more packets can come in. From what I can tell, you only get ONE pulse, telling you that a new packet is waiting, then the radio holds. (Again, I am new to all of this RF stuff so treat me gently! :)) I think that the receiver normally waits for the FIFO to be emptied before going back into RSSI mode to listen for more packets. For instance, at the time of my hangs, REG:3D - 13 - 10011 - AutoRxRestartOn = 1, AEs on. As far as I can tell that means the radio does not enter RSSI reading mode waiting for the next packet and instead is waiting for the fifo to empty.

If the last packet read was "good" (e.g. destined for this node), the interrupt handler clocks out the message, emptying the FIFO. If it's spurious, no such FIFO empty occurs. However, since the radio is in standby mode, the FIFO will clear again as soon as the mode is switched back to RX (see page 46 of the data sheet). That means that the next interrupt could fire again very quickly after the last interrupt on a "good" packet, thereby overwriting the last packet, UNLESS the last packet was spurious, in which case it will wait until the receiveDone loop transitions the radio back to RX. If a good packet and a bad packet arrive one right after another and the last packet received is spurious, the reading code will see the mishmash of the previous packet's ack_requested, sender_id, target_id, and will be halted (e.g. radio stays in standby mode with no new packets inbound). This is somewhat inconsistent behavior but I'm not sure it actually causes any harm.

Anyway, I agree that whenever receiveDone is next called, it sees the radio in RX mode and the payload > 0, puts the radio in standby and returns true. The calling code handles the packet, and then on the next receiveDone call, because the radio is in standby, receiveBegin is called, clearing out the receive variables, priming the interrupt, and making sure the next interrupt will fire.

So here is where we get to the race condition:

Imagine that an interrupt for the last inbound packet was somehow missed by the AVR a single time, perhaps arriving before the next call to receiveBegin(). There is theoretically a small window of time, after the interrupt handler sets the mode back to RX but before the handler ends, in which this pulse could be ignored. Ignored pulses could also occur during receiveDone's various checks that suspend interrupts.

What happens when the edge of packetReady is missed? The RFM will not issue another packetRx interrupt because it thinks it has already done so. It's waiting for someone to clock out the FIFO or clear it. The interrupt code never gets called. The code is in a loop calling receiveDone() -- the radio is in RX mode, but PAYLOADLEN > 0 has not been set by the interrupt. At this point, the deadlock will occur. There are already some protections in place for this scenario: RestartRX is set by sendACK(), send(), and receiveBegin(). But in my code, there is no periodic message from the gateway, so the missed pulse is an issue.

Anyway, it was helpful to write all of this down because it's clarified some of my thinking. I'm going to see if I can engineer a fix and will report back.
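
The missed-edge scenario described above can be replayed off-target with a tiny model. This is only a sketch of the hypothesis in this post, not RFM69 library code: `EdgeModel`, `radioRaisesPin`, `pollFallback`, and `demoMissedEdge` are hypothetical names, and the model assumes that an edge arriving while the MCU is not listening is simply lost, which has not been verified on real hardware here.

```cpp
#include <cassert>

// Hypothetical model: the radio holds its PayloadReady pin high until the
// FIFO is read, while the MCU only reacts to rising edges.
struct EdgeModel {
    bool pinHigh = false;    // PayloadReady level driven by the radio
    bool listening = true;   // false while the MCU would ignore an edge
    int handlerRuns = 0;     // how often the "interrupt handler" ran

    // Reading the FIFO in the handler is what lets the pin drop again.
    void handler() { ++handlerRuns; pinHigh = false; }

    // The radio signals a packet. Only a low-to-high transition counts.
    void radioRaisesPin() {
        const bool rising = !pinHigh;
        pinHigh = true;
        if (rising && listening) handler();
    }

    // Fallback: look at the level instead of waiting for an edge.
    bool pollFallback() {
        if (!pinHigh) return false;
        handler();
        return true;
    }
};

// Returns how many times the handler ran after one edge was lost.
int demoMissedEdge(bool usePolling) {
    EdgeModel m;
    m.listening = false;
    m.radioRaisesPin();      // this edge is lost
    m.listening = true;
    m.radioRaisesPin();      // pin is already high, so no new edge appears
    assert(m.pinHigh);       // the level is still asserted, nothing reacted
    if (usePolling) m.pollFallback();
    return m.handlerRuns;
}
```

In this model `demoMissedEdge(false)` never runs the handler, which is the stuck state, while `demoMissedEdge(true)` recovers by checking the pin level once.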


ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #27 on: November 14, 2014, 10:09:05 AM »
So, here is what I found and what I did.

In the last two posts from jgilbert I saw the same problem and the same thinking that I had arrived at.
After examining the code and putting all the RFM69 logic on paper as a diagram :) I found that receiveDone does not work as expected.

The main problem that causes the network to hang is the enabling of interrupts (for a very short time, of course).
While waiting for a free network, receiveDone() is called (with interrupts enabled!)
Code: [Select]
while (!canSend()) receiveDone();
and if some other message arrives at the gateway node during that time, the problem happens because the radio enters the receive state (and the last received message is also not cleared correctly by the next receiveDone->receiveBegin execution).

To solve this there are 2 ways:
a) improve receiveDone and think about interrupts
b) change/improve sendACK.

I went for b), as I didn't like that sendACK clears the incoming message and can destroy SENDERID.
With the improved code below, sendACK can be executed immediately after receiveDone without destroying the message, which saves some extra ms; that is important for sleeping low-power nodes waiting for an ACK (it avoids having to read the data out of the radio before executing sendACK, as is needed now).

RFM69.cpp
Code: [Select]
void RFM69::sendACK(const void* buffer, byte bufferSize) {
  setMode(RF69_MODE_RX); //Switching from STANDBY to RX before TX
  int _RSSI = RSSI; //save payload received RSSI value
  bool canSendACK = false;
  long now = millis();
  while (millis()-now < ACK_CSMA_LIMIT_MS) //wait for free network the same time as sender waits for ACK
  {
    if (readRSSI() < CSMA_LIMIT) //if signal weaker than -90dBm(CSMA_LIMIT) is detected channel should be free
    {
      canSendACK = true;
      break;
    }
  }
  if (canSendACK) // channel is free let's send ACK
  {
//    Serial.print("ACK sent:");Serial.print(millis()-now, DEC); Serial.print("ms;RSSI:");Serial.println(readRSSI(), DEC); Serial.flush();
    sendFrame(SENDERID, buffer, bufferSize, false, true);
  }
  RSSI = _RSSI; //restore payload RSSI
}
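
The bounded wait at the heart of this sendACK can be tried on a PC with a fake clock and a fake RSSI source. This is a minimal sketch, not the library code: `waitForClearChannel`, `FakeChannel`, and `demoClearChannel` are hypothetical names, the -90 limit and the 40 ms timeout are taken from the post, and the fake clock simply advances 1 ms per read.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical helper: poll RSSI until it drops below the limit or the
// timeout elapses. Returns true only if the channel was seen clear in time.
bool waitForClearChannel(const std::function<uint32_t()>& nowMs,
                         const std::function<int()>& readRssi,
                         int rssiLimitDbm, uint32_t timeoutMs) {
    const uint32_t start = nowMs();
    // Unsigned subtraction keeps the comparison valid across counter rollover.
    while (static_cast<uint32_t>(nowMs() - start) < timeoutMs) {
        if (readRssi() < rssiLimitDbm) return true;
    }
    return false;
}

// Hypothetical test double: the channel is busy until `busyUntil` ms.
struct FakeChannel {
    uint32_t clock = 0;
    uint32_t busyUntil;
    explicit FakeChannel(uint32_t b) : busyUntil(b) {}
    uint32_t now() { return clock++; }                        // 1 ms per read
    int rssi() const { return clock < busyUntil ? -60 : -100; }
};

// Runs the wait against a channel that stays busy for `busyMs`.
bool demoClearChannel(uint32_t busyMs, uint32_t timeoutMs) {
    FakeChannel ch(busyMs);
    return waitForClearChannel([&] { return ch.now(); },
                               [&] { return ch.rssi(); }, -90, timeoutMs);
}
```

With these assumptions `demoClearChannel(5, 40)` reports a clear channel, while `demoClearChannel(1000, 40)` gives up after the timeout instead of spinning forever, which is the property the original unbounded loop lacked.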

RFM69.h (add first line, change second)
Code: [Select]
#define ACK_CSMA_LIMIT_MS    40
    bool sendWithRetry(byte toAddress, const void* buffer, byte bufferSize, byte retries=2, byte retryWaitTime=ACK_CSMA_LIMIT_MS); //40ms roundtrip req for  61byte packets

I will explain what I did and why:
- in RFM69.h, added a new defined value ACK_CSMA_LIMIT_MS, which is the time in ms to wait before sending the next message if an ACK wasn't received.
- used this parameter instead of a hard-coded value in sendWithRetry(), because it will be used later in another place

RFM69.cpp
- setting the mode back to RX (without this, sending does not work). I haven't figured out why changing the mode from STANDBY to TX in sendFrame does not allow sending. It seems it wants RX first, or there is some timing issue. The correct place for it would be one line before sendFrame, but in that case it does not work (it seems some time is needed to switch); I tried adding "Wait for ModeReady" without success
- keeping RSSI, since we measure the current network RSSI and the RSSI of the received message should be preserved
- in the while loop we check for a free network for ACK_CSMA_LIMIT_MS ONLY; we do not need to do it longer because sendWithRetry only waits for the ACK for this amount of time. (In the current version the wait is also too long; there is no need to keep waiting for a free network to send the ACK if the sender only waited 40ms for an answer)
- sending ACK

This version works very well for me. ACK times are improved, and tracing sendACK showed that in most cases the while loop exits immediately (0ms). Super! :)
You can get full changed code and look at changes in github: https://github.com/openminihub/RFM69

EDIT: Bad news, it hung again...
I added writeReg(REG_PACKETCONFIG2, (readReg(REG_PACKETCONFIG2) & 0xFB) | RF_PACKET2_RXRESTART); // avoid RX deadlocks
before sendFrame. Now going to sleep, let's see what the morning will tell.
« Last Edit: November 15, 2014, 03:45:05 PM by ieris »

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 6867
  • Country: us
    • LowPowerLab
Re: Diagnosing hangs on busy network - advice?
« Reply #28 on: November 14, 2014, 04:47:41 PM »
Thanks ieris for the pull request, unfortunately I cannot merge it because of some things I noticed. There are some Serial.writes and a lot of lines that have not changed except for blank spaces, I like to keep the merges lean and clean. Also saw your edit about it failing again.
I think I am going to start to do my own testing and see if/when there's a hang. There are now several threads reporting hangs so solving this issue is very important.
Thanks for your continued persistence and feedback.

jgilbert

  • NewMember
  • *
  • Posts: 18
Re: Diagnosing hangs on busy network - advice?
« Reply #29 on: November 14, 2014, 05:44:25 PM »
Great stuff Ieris. Based on the work I've been doing, the addition of the CSMA limit in my code seems to have eliminated ACK issues, and at least one major form of deadlock. I might consider adding the RX restart code too as per send().

I have now been crash free on a very congested network (6+ nodes, each transmitting quite a lot of traffic continuously.) for almost 12 hours. If this continues I can submit a patch to you, Felix.

 The changes that have worked best for me in my setup:

1) Ensuring that sendAck can timeout like send()
2) Putting RXrestart in sendAck (not sure if this is definitively a fix but it should mirror what send() does to avoid more vague errors)
3) Placing a time guard in transmit -- this was definitively the source of at least one hang although I cannot trace why in the data sheet a send should hang at all
4) Fixing interrupt to handle cases where DATALEN = 0, which can apparently occur in high traffic scenarios and improves the hygiene of the code
5) Adding a call to the interrupt function in some situations during receiveDone to catch the missed interrupt scenario I outlined earlier

If this works well, I'll remove my debugging code, and retest. If that works well, I can send the patch.

Jeremy

 


Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 6867
  • Country: us
    • LowPowerLab
Re: Diagnosing hangs on busy network - advice?
« Reply #30 on: November 14, 2014, 05:52:25 PM »
Jeremy, good work. I have setup a gateway Moteino and 2 Moteino nodes that are loaded with the Node example with TRANSMITPERIOD=100ms which generates a LOT of traffic. With the latest code I've been running for a while but will likely leave it overnight. Also I added some code to count the packets received.
I think 100ms might be getting close to the limit of congestion. There are basically 2 packets that try to get through every 100ms, and every 3rd received packet the gateway sends a ping back and asks for an ACK. This is most obvious when packets are getting long and ACKs sometimes don't make it, see attached. With all this, I have not seen a hang, and my main network (same settings, just a different encryption key) is working fine. There is significantly less traffic on that one, but it's on the same frequency, so I think it's demodulating the signal, finding that the AES doesn't pass, and not raising the interrupt. I do however see that ACKed packets are repeated a lot more often, indicating collisions, so the nodes are trying hard to get them through. BTW the interrupt is RISING triggered.

I'd like at some point to look at your code and compare. I have also added the time guard in sendACK() today.
« Last Edit: November 14, 2014, 05:56:30 PM by Felix »

TomWS

  • Hero Member
  • *****
  • Posts: 1930
Re: Diagnosing hangs on busy network - advice?
« Reply #31 on: November 14, 2014, 10:24:27 PM »
@felix, if you're using the original Node code for your test, try swapping the sendAck and serial output:
Original:
Code: [Select]
  if (radio.receiveDone())
  {
    Serial.print('[');Serial.print(radio.SENDERID, DEC);Serial.print("] ");
    for (byte i = 0; i < radio.DATALEN; i++)
      Serial.print((char)radio.DATA[i]);
    Serial.print("   [RX_RSSI:");Serial.print(radio.RSSI);Serial.print("]");

    if (radio.ACKRequested())
    {
      radio.sendACK();
      Serial.print(" - ACK sent");
    }
    Blink(LED,5);
    Serial.println();
  }
to...
Code: [Select]
  //check for any received packets
  if (radio.receiveDone())
  {
    if (radio.ACKRequested())
    {
      radio.sendACK();
      Serial.print(" - ACK sent");
    }

    Serial.print('[');Serial.print(radio.SENDERID, DEC);Serial.print("] ");
    for (byte i = 0; i < radio.DATALEN; i++)
      Serial.print((char)radio.DATA[i]);
    Serial.print("   [RX_RSSI:");Serial.print(radio.RSSI);Serial.print("]");

    Blink(LED,5);
    Serial.println();
  }
and see what happens.  I think there is a timing relationship with how soon you ack after receive that factors into this.

EDIT: It just occurred to me that I may have missed the fact that you shouldn't Ack until you 'consume' the data.  So... in this case, before the Ack, simply memcpy the incoming data for later processing.  The point is that shortening the timing of Ack WRT receiving data, makes potential race conditions more apparent.
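
A minimal sketch of the "copy first, then ACK" idea from the EDIT above. `FakeRadio` is a hypothetical stand-in that only models the one behaviour that matters here, namely that sending the ACK wipes the receive buffer; `Packet`, `receiveAndAck`, and `demoCopyThenAck` are illustrative names as well, and the 61-byte buffer size follows the payload size mentioned elsewhere in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the radio object. It only models that
// sendACK() clears the received data, as discussed in this thread.
struct FakeRadio {
    uint8_t DATA[61];
    uint8_t DATALEN;
    uint8_t SENDERID;
    void sendACK() {
        DATALEN = 0;
        SENDERID = 0;
        std::memset(DATA, 0, sizeof DATA);
    }
};

// Private copy of one received packet, owned by the application.
struct Packet {
    uint8_t sender;
    uint8_t len;
    uint8_t data[61];
};

// Copy the payload out first, then ACK as early as possible.
Packet receiveAndAck(FakeRadio& radio) {
    Packet p{};
    p.sender = radio.SENDERID;
    p.len = radio.DATALEN;
    std::memcpy(p.data, radio.DATA, p.len);
    radio.sendACK();                 // radio buffers may be gone after this
    assert(radio.DATALEN == 0);      // the model wiped them, the copy survives
    return p;
}

// Feeds one fake packet through receiveAndAck().
Packet demoCopyThenAck() {
    FakeRadio radio{};
    const char msg[] = "temp=21";
    radio.SENDERID = 7;
    radio.DATALEN = static_cast<uint8_t>(sizeof msg - 1);
    std::memcpy(radio.DATA, msg, radio.DATALEN);
    return receiveAndAck(radio);
}
```

The slow work (serial printing, blinking) can then operate on the returned `Packet` after the ACK has already gone out.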


Tom
« Last Edit: November 14, 2014, 10:37:50 PM by TomWS »

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #32 on: November 15, 2014, 03:45:05 AM »
Thanks ieris for the pull request, unfortunately I cannot merge it because of some things I noticed. There are some Serial.writes and a lot of lines that have not changed except for blank spaces, I like to keep the merges lean and clean. Also saw your edit about it failing again.
I think I am going to start to do my own testing and see if/when there's a hang. There are now several threads reporting hangs so solving this issue is very important.
Thanks for your continued persistence and feedback.
Yep, no need to merge, as it failed.
Hmmmm... Yes, there is a Serial, but it is commented out and left for debugging, as I saw you also had such commented Serial lines in the code :)
The blank-space changes were many because I also like clean, formatted code, which is why they appear. And one typo in a comment was fixed.

@Felix, good that you are also testing. My setup also worked for 2-4 weeks w/o a gateway restart and then at some point it started to hang; it seems that was when I switched to the MEGA.
« Last Edit: November 15, 2014, 04:03:35 AM by ieris »

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #33 on: November 15, 2014, 03:53:51 AM »
1) Ensuring that sendAck can timeout like send()
2) Putting RXrestart in sendAck (not sure if this is definitively a fix but it should mirror what send() does to avoid more vague errors)
3) Placing a time guard in transmit -- this was definitively the source of at least one hang although I cannot trace why in the data sheet a send should hang at all
4) Fixing interrupt to handle cases where DATALEN = 0, which can apparently occur in high traffic scenarios and improves the hygiene of the code
5) Adding a call to the interrupt function in some situations during receiveDone to catch the missed interrupt scenario I outlined earlier
1) This was fixed in my example, and this timeout should be very short.
2) I have now added it after the first crash
3) good point, will look at how you did it
4) The interrupt must be fixed. Actually, I just got an idea: the interrupt code where data is read from the rfm69 chip could be skipped if the previous message has not been processed and cleaned out. What do you think?
5) This can be done as follows: if the interrupt was called and skipped as I mentioned in point 4, a flag is set, and the interrupt code is called after the message is cleaned out.

P.S. After adding the RX deadlock line, my code has been running for 12+ hours.
« Last Edit: November 15, 2014, 03:55:26 AM by ieris »

Tomme

  • NewMember
  • *
  • Posts: 24
Re: Diagnosing hangs on busy network - advice?
« Reply #34 on: November 15, 2014, 05:08:00 AM »
This may be an entirely different issue but in an effort to reduce the number of interrupts...

Currently if a node transmits then any node capable of receiving (in range, in rx mode) will generate an interrupt as the address filtering is done in software. I was playing with 5-6 Moteinos transmitting very quickly and this started to be a problem. More so as I was hoping to scale up to 30-40 nodes. I turned on hardware address filtering and things improved quite a bit for me. Hope I'm not confusing the issue  :P
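
For readers who want to try hardware address filtering, here is a rough sketch of the register setup. It runs against an in-memory register file (`FakeRegs`, hypothetical) rather than real SPI, and `enableAddressFilter` and `demoFilterConfig` are illustrative names, not library functions. The register addresses and bit values are the ones I read from the RFM69/SX1231 datasheet, which also lists a mode that matches either the node address or the broadcast address; the broadcast address 255 and the starting value 0x90 are assumptions, so check everything against your copy of RFM69registers.h.

```cpp
#include <cassert>
#include <cstdint>

// Values as read from the RFM69/SX1231 datasheet; verify before use.
constexpr uint8_t REG_PACKETCONFIG1        = 0x37;
constexpr uint8_t REG_NODEADRS             = 0x39;
constexpr uint8_t REG_BROADCASTADRS        = 0x3A;
constexpr uint8_t ADRSFILTER_MASK          = 0x06;  // AddressFiltering bits 2-1
constexpr uint8_t ADRSFILTER_NODE          = 0x02;  // match node address only
constexpr uint8_t ADRSFILTER_NODEBROADCAST = 0x04;  // match node or broadcast

// Hypothetical in-memory register file standing in for SPI reads/writes.
struct FakeRegs {
    uint8_t reg[0x80] = {};
    uint8_t read(uint8_t a) const { return reg[a & 0x7F]; }
    void write(uint8_t a, uint8_t v) { reg[a & 0x7F] = v; }
};

// Turn on hardware filtering without disturbing the other config bits.
void enableAddressFilter(FakeRegs& r, uint8_t nodeId, bool keepBroadcast,
                         uint8_t broadcastId = 255 /* assumed */) {
    r.write(REG_NODEADRS, nodeId);
    r.write(REG_BROADCASTADRS, broadcastId);
    uint8_t cfg = r.read(REG_PACKETCONFIG1) &
                  static_cast<uint8_t>(~ADRSFILTER_MASK);
    cfg |= keepBroadcast ? ADRSFILTER_NODEBROADCAST : ADRSFILTER_NODE;
    r.write(REG_PACKETCONFIG1, cfg);
}

// Starts from an assumed config (variable length, CRC on, filtering off).
uint8_t demoFilterConfig(bool keepBroadcast) {
    FakeRegs r;
    r.write(REG_PACKETCONFIG1, 0x90);
    enableAddressFilter(r, 12, keepBroadcast);
    assert(r.read(REG_NODEADRS) == 12);
    return r.read(REG_PACKETCONFIG1);
}
```

On real hardware the two `FakeRegs` methods would be replaced by the driver's own register accessors; whether filtering reduces interrupts as much as reported above depends on the packet layout placing the target address first.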

TomWS

  • Hero Member
  • *****
  • Posts: 1930
Re: Diagnosing hangs on busy network - advice?
« Reply #35 on: November 15, 2014, 10:13:31 AM »
<...snip> I was hoping to scale up to 30-40 nodes. <snip...>
That's all?   :)

Seriously, in my case I already have about 20 'allocated' to about 6 different functions, but waiting for PCBs to come in so I can deploy. 

I think you may be on to something as I can see that only a Repeater would need to see 'all' addresses (a Gateway might, but not in my case).   Also, ability to broadcast is generally useful, but not necessary in my case.  I'd rather have efficiency and use logic to get to all the devices I need than consume available run time dealing with unnecessary interrupts.

Tom

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #36 on: November 15, 2014, 11:31:32 AM »
@Felix, if you want to test a setup like mine - I have a gateway, nodes which send to the gateway, and some subnodes which send to nodes.
And for some nodes the interval between transmissions has a random length.

Now, with the RX deadlock line added to the code, it has already run 20+ hours without a hang...
« Last Edit: November 15, 2014, 11:50:18 AM by ieris »

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 6867
  • Country: us
    • LowPowerLab
Re: Diagnosing hangs on busy network - advice?
« Reply #37 on: November 15, 2014, 02:43:53 PM »
I agree that hardware addressing might be better, however that will remove the possibility of broadcast which I think is an important and useful feature. It's a tradeoff and I am open to switching to hardware addressing if that brings a very significant improvement. I need to test this myself. Perhaps a switch can be added to tell the node whether it cares for broadcasts or not, and whether addressing is done in hardware or software. Then things like MotionOLEDMote can still receive from any node.

ieris - i still see your old pull request, can you cancel that and post a cleaned pull request with your latest code please (only the lines that actually changed)?

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #38 on: November 15, 2014, 03:59:34 PM »
ieris - i still see your old pull request, can you cancel that and post a cleaned pull request with your latest code please (only the lines that actually changed)?
Done. Please check github.
P.S. 24h+ without hanging.

jgilbert

  • NewMember
  • *
  • Posts: 18
Re: Diagnosing hangs on busy network - advice?
« Reply #39 on: November 15, 2014, 09:30:47 PM »
I am crash free all day. I just sent you a pull request.

jgilbert

  • NewMember
  • *
  • Posts: 18
Re: Diagnosing hangs on busy network - advice?
« Reply #40 on: November 15, 2014, 09:34:50 PM »
One thing I forgot to mention - I also added a change to store millis() values in unsigned longs, which fixes any issues with the roughly 49.7-day millis() rollover according to posts I've seen on arduino.cc.
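
A short illustration of why unsigned storage makes the rollover harmless. A 32-bit millisecond counter wraps after 2^32 ms, which is about 49.7 days; with unsigned arithmetic the difference `now - start` is still the true elapsed time across the wrap. `timedOut` is a hypothetical helper name, not something from the library.

```cpp
#include <cassert>
#include <cstdint>

// Rollover-safe timeout check: unsigned subtraction wraps modulo 2^32,
// so the elapsed time stays correct even if the counter wrapped between
// startMs and nowMs (as long as less than ~49.7 days really passed).
bool timedOut(uint32_t nowMs, uint32_t startMs, uint32_t limitMs) {
    return static_cast<uint32_t>(nowMs - startMs) >= limitMs;
}
```

For example, with `startMs = 0xFFFFFFF0` (16 ms before the wrap) and `nowMs = 0x10` (16 ms after it), the elapsed time is 32 ms, so a 40 ms limit has not yet expired; at `nowMs = 0x20` it has.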

These fixes have been crash free for more than 48 hours. I am running about 8 nodes, that send ~5-15 packets every 10 seconds with gateway acknowledgment. Happy to share my codebase for the node/gateway logic if you want to test my exact setup.

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #41 on: November 16, 2014, 03:08:25 AM »
EDIT: It just occurred to me that I may have missed the fact that you shouldn't Ack until you 'consume' the data.  So... in this case, before the Ack, simply memcpy the incoming data for later processing.  The point is that shortening the timing of Ack WRT receiving data, makes potential race conditions more apparent.
You can try my modifications to the sendACK procedure (https://github.com/openminihub/RFM69); with those you can send the ACK on the next line after receiveDone.

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #42 on: November 16, 2014, 03:43:20 AM »
One thing I forgot to mention - I also added a change to store millis() values in unsigned longs, which fixes any issues with the roughly 49.7-day millis() rollover according to posts I've seen on arduino.cc.

These fixes have been crash free for more than 48 hours. I am running about 8 nodes, that send ~5-15 packets every 10 seconds with gateway acknowledgment. Happy to share my codebase for the node/gateway logic if you want to test my exact setup.
Glad to hear that you are running crash free! I've also been running for 36h now.
I was thinking of modifying millis too, nice that you did it.
Some comments about other changes:
- I like the place where you added "interruptHandler();"
- DATALEN=0 really can be moved out
- good to see RF69_TX_LIMIT_GUARD_MS implemented
- "PAYLOADLEN < 3" can be included in the previous IF statement without duplicating all this logic
- in sendACK, "avoid RX deadlocks" is better moved to one line before sendFrame, because the logic in receiveDone enables interrupts and can trigger RX. Also, the ACK is sent anyway if the timeout is triggered and the network is still busy (given those weaknesses, my opinion is that the crash-free result was achieved by adding RF69_TX_LIMIT_GUARD_MS, not by fixing sendACK)

Overall I'm so happy that I wasn't alone with that problem and we have good team work to fix it! ;)

jgilbert

  • NewMember
  • *
  • Posts: 18
Re: Diagnosing hangs on busy network - advice?
« Reply #43 on: November 16, 2014, 03:40:44 PM »
@felix, @ieris --

Still great uptime here -- running more than 48 hours at this point. It was great to work on this at the same time as other people -- made the process a lot more enjoyable.

For anyone who's curious, my changes are here: https://github.com/jgilbert20/RFM69

Jeremy

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 6867
  • Country: us
    • LowPowerLab
Re: Diagnosing hangs on busy network - advice?
« Reply #44 on: November 16, 2014, 07:36:10 PM »
Guys, I am testing your changes, but with the Node example I am getting very bad/inconsistent transmits and ACKs.
Basically huge degradation. Not sure if I'm doing something else wrong, but I don't think so.
UPDATE: I think the calling of interruptHandler() in receiveDone() was the culprit of that.
@ieris - you said you have a different node/gateway example code base? the one in your repo is the same as mine.
« Last Edit: November 16, 2014, 07:43:40 PM by Felix »

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #45 on: November 16, 2014, 11:36:08 PM »
I haven't yet tried everything @jgilbert implemented; I'm still testing my code. For every improvement I run a node/gateway test.
With the modified sendACK (as it is in my repo) and sendACK moved right after receiveDone() in the gateway code, transmits are good and fast; check the attached picture (the picture shows slightly modified output, with back-pinging removed).

So, you are saying interruptHandler() causes bad transmissions.. hmm.. most probably, need to test, but the idea in general was good.

What I mentioned before was that I have another gateway which is my own code, not just a modification of yours. While still debugging and testing, I seem to have found a main problem which triggered the RFM69 library hangup - I have one slower serial device, and I read the serial line with a 100ms timeout:
Code: [Select]
byte readSerialLine(char* input, uint16_t timeout=100, byte maxLength=64, char endOfLineChar=10);
This creates bigger pauses between receiveDone() calls, which changes states and interrupts and could be a root cause.
« Last Edit: November 16, 2014, 11:39:02 PM by ieris »

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 6867
  • Country: us
    • LowPowerLab
Re: Diagnosing hangs on busy network - advice?
« Reply #46 on: November 17, 2014, 07:01:02 PM »
Hey guys,
I have posted a more major patch to the RFM69 library and the examples, including portions of your submitted pull requests. I have tested this patch with a run of about 1million ACKed packets for a total of probably around 2million total transmissions, without a hang. I feel confident this is a pretty stable release.
Also wireless programming channel shifting has been fixed, and examples have been updated.
Thanks ieris and jgilbert for your hard work and persistence. This effort would not have been possible without your contribution.
The RFM69 library patch can be seen here: https://github.com/LowPowerLab/RFM69/commit/ed2fd5b8d55d011ed8164d9d517f364cc7841a0c
Let's keep this thread for reference and of course continue to monitor the performance of the RFM69 library and radios.

TomWS

  • Hero Member
  • *****
  • Posts: 1930
Re: Diagnosing hangs on busy network - advice?
« Reply #47 on: November 18, 2014, 08:51:22 AM »
@Felix, I'm wondering about the call to receiveBegin() that has been added in the interrupt Handler (line 299).  It is called after unselect(), which disables the SPI interface to the RFM69...

Tom

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 6867
  • Country: us
    • LowPowerLab
Re: Diagnosing hangs on busy network - advice?
« Reply #48 on: November 18, 2014, 10:13:31 AM »
@Felix, I'm wondering about the call to receiveBegin() that has been added in the interrupt Handler (line 299).  It is called after unselect(), which disables the SPI interface to the RFM69...
The select() is called again in readReg() inside receiveBegin() so no harm there. But I agree unselect() should be called after receiveBegin() to complete the transaction correctly and not leave the CS signal asserted. However even that being said, unselect will be called very soon after in another transaction that follows triggered by a call to receiveBegin. Anyhow I should move the receiveBegin() before unselect().

jgilbert

  • NewMember
  • *
  • Posts: 18
Re: Diagnosing hangs on busy network - advice?
« Reply #49 on: November 18, 2014, 05:11:19 PM »
@Felix -- glad to help and glad to contribute to the effort.


I looked through your edits and they make sense. I made a similar set of changes earlier in my debugging process and still ran into an issue where there could be a "missed" interrupt that would deadlock (not crash) the gateway. I was able to unstick things by manually pulling D2 through an edge. For that reason, I suggest you continue to test, especially with gateway code that is "ack only" (as is mine - e.g. it doesn't send out packets except to acknowledge those that come in.)


I saw your note suggesting that the addition of the interrupt call into receiveDone caused problems in your tests, so clearly that isn't necessarily the best solution, but I think we should come up with some other answer. Perhaps the reason that fix bombs is that the packet received by the extra call is then immediately replaced with a new packet on the next incoming message and the radio is never moved to standby? We need to add some periodic check for PAYLOAD > 0 that either discards the packet, retriggers the interrupt, or handles the packet if the interrupt doesn't get around to it in a certain amount of time.
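
The periodic check suggested here could look roughly like the following. It is only a sketch under assumptions: `RxWatchdog` and its `check` method are hypothetical, the caller is assumed to be able to sample the PayloadReady level (for example from the interrupt pin or an IRQ flags register), and what to do when the watchdog fires (discard, re-run the handler, restart RX) is left to the caller.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical watchdog: fires when PayloadReady has been pending for
// longer than limitMs without anyone servicing the radio.
struct RxWatchdog {
    bool pending = false;
    uint32_t pendingSinceMs = 0;

    // Call periodically from the main loop. Returns true when the caller
    // should service the radio by hand.
    bool check(bool payloadReadyLevel, uint32_t nowMs, uint32_t limitMs) {
        if (!payloadReadyLevel) {      // nothing waiting: reset the timer
            pending = false;
            return false;
        }
        if (!pending) {                // first time we see the level high
            pending = true;
            pendingSinceMs = nowMs;
            return false;
        }
        // Unsigned subtraction keeps this correct across millis() rollover.
        if (static_cast<uint32_t>(nowMs - pendingSinceMs) >= limitMs) {
            pending = false;           // restart timing after firing
            return true;
        }
        return false;
    }
};
```

In normal operation the interrupt handler empties the FIFO long before the limit, so the level drops and the watchdog never fires; it only triggers in the stuck case where the edge was missed.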







ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #50 on: November 27, 2014, 03:51:59 AM »
@Felix - I'm also glad to help. For me improvements till now working without problems.

What about the comment I added earlier:
- in sendACK, "avoid RX deadlocks" is better moved to one line before sendFrame, because the logic in receiveDone enables interrupts and can trigger RX. Also, the ACK is sent anyway if the timeout is triggered and the network is still busy.

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 6867
  • Country: us
    • LowPowerLab
Re: Diagnosing hangs on busy network - advice?
« Reply #51 on: November 28, 2014, 09:52:58 AM »
Right but receiveDone is called anyway and I think the chance of such RX received in that time period is about the same. If it does happen then the packet is read in the buffers and then the ACK is sent. Do you see any issue with that?

TomWS

  • Hero Member
  • *****
  • Posts: 1930
Re: Diagnosing hangs on busy network - advice?
« Reply #52 on: November 28, 2014, 06:05:37 PM »
@all,
do we think this is 'done' now?

I'm a bit unnerved by the added call to receiveBegin() in the interrupt handler.  Sorry to ask about this, but it just doesn't 'feel' right to me.  I thought I was keeping track of the progress on this thread, but somehow missed the purpose/need for this particular addition.  Can someone explain why this needed to be added (or, at least, point me to the posting that explained it)?

Thanks in advance,
Tom

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #53 on: November 30, 2014, 11:13:04 AM »
Right but receiveDone is called anyway and I think the chance of such RX received in that time period is about the same. If it does happen then the packet is read in the buffers and then the ACK is sent. Do you see any issue with that?
No, I do not see an issue currently, just thinking about the right way it should be done... ;)

@TomWS, as I understand it Felix added this line, but I do not see anything wrong with it; maybe it is a little too much to reset/initialize everything, but that's all.

TomWS

  • Hero Member
  • *****
  • Posts: 1930
Re: Diagnosing hangs on busy network - advice?
« Reply #54 on: November 30, 2014, 08:39:33 PM »
<...snip>
@TomWS, as I understand it Felix added this line, but I do not see anything wrong with it; maybe it is a little too much to reset/initialize everything, but that's all.
Thanks for replying.

Have you been testing with this version of code?  I've looked at it, but haven't merged into my code yet.   I'm sort of waiting for you guys to decide that this works and no further changes are expected.

I'm in the process of getting my various Motes deployed and stabilizing my infrastructure code so I probably won't be in a position to 'challenge' this code for a few weeks.  Fortunately all of my Motes will have the ability to do Wireless Program updates  :)

Tom



ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #55 on: December 02, 2014, 05:02:44 AM »
<...snip>
Have you been testing with this version of code?  I've looked at it, but haven't merged into my code yet.   I'm sort of waiting for you guys to decide that this works and no further changes are expected.
Hi, nope, currently using my code, but I'm going to switch to Felix code soon. Must try! ;)

WhiteHare

  • Hero Member
  • *****
  • Posts: 1300
  • Country: us
Re: Diagnosing hangs on busy network - advice?
« Reply #56 on: January 16, 2016, 02:47:35 PM »
I've also verified this by adding code to nodes to passively wake up and report on the background RSSI they receive (thanks to an earlier post for how to do this.)


Does anyone happen to know which earlier post jgilbert is referring to?  For several different reasons, it would be helpful to know what method jgilbert was using for measuring background RSSI. 

JojoS

  • NewMember
  • *
  • Posts: 2
  • Country: de
Re: Diagnosing hangs on busy network - advice?
« Reply #57 on: December 21, 2016, 04:42:01 PM »
UPDATE - solved, see post below this.

sorry for opening this old topic, but I'm trying to understand the suggested modifications in sendACK().
I found this thread because I called sendACK directly after receiveDone and got no data. I've read that you don't recommend this, but I think there is something wrong.
In my main, I measured the time for the canSend in the sendACK:
Code: [Select]
  while (1) {
    if (rfm.receiveDone()) {
      if (rfm.ACKRequested()) {
        //ackMinRSSI = rfm.NewSendACK();
        csma_time = rfm.sendACK();
      }

      // now rfm.DATA is zero
and the modified sendACK:
Code: [Select]
int RFM69::sendACK(const void* buffer, uint8_t bufferSize) {
  uint8_t sender = SENDERID;
  int16_t _RSSI = RSSI; // save payload received RSSI value
  writeReg(REG_PACKETCONFIG2, (readReg(REG_PACKETCONFIG2) & 0xFB) | RF_PACKET2_RXRESTART); // avoid RX deadlocks
  uint32_t now = t.read_ms();
  while (!canSend() && t.read_ms() - now < RF69_CSMA_LIMIT_MS)
    receiveDone();

  int csma_time = t.read_ms() - now;

  sendFrame(sender, buffer, bufferSize, false, true);
  RSSI = _RSSI; // restore payload RSSI
  return csma_time;
}
The result is always a time <= 1ms. But with the other code suggested here I run into timeouts because the background rssi is too high. Neither case looks plausible to me...
Code: [Select]
int RFM69::NewSendACK(const void* buffer, uint8_t bufferSize) {
  setMode(RF69_MODE_TX);

  int16_t _RSSI = RSSI; // save payload received RSSI value
  bool canSendACK = false;

  int rssiMin = 1000;
  uint32_t now = t.read_ms();
  while ((canSendACK == false) && (t.read_ms() - now < 100 /*ACK_CSMA_LIMIT_MS*/))
  {
    int rssi = readRSSI();
    canSendACK = (rssi < CSMA_LIMIT); // CSMA_LIMIT = -90
    rssiMin = (rssi < rssiMin) ? rssi : rssiMin;
    wait_us(50);
  }

  if (canSendACK) {
    writeReg(REG_PACKETCONFIG2,
             (readReg(REG_PACKETCONFIG2) & 0xFB) | RF_PACKET2_RXRESTART); // avoid RX deadlocks
    sendFrame(SENDERID, buffer, bufferSize, false, true);
  }
  RSSI = _RSSI; // restore payload RSSI
  return (canSendACK ? 0 : rssiMin);
}

Trying to follow the flow in the original code:
- receive Data, want ACK
- call sendACK, call canSend()
- canSend is in RX mode, payload > 0, rssi > threshold
- receiveDone in sendACK is called
- RX mode, payload > 0   ->  mode is set to standby!
- next call to canSend in sendACK loop
- mode is standby, canSend returns false
- next call to receiveDone
- mode is standby, receiveBegin is called and everything is reset
- next call to canSend in sendACK loop
- mode is RX, payload is 0, now rssi is (now always?) low
- ACK is sent

So this looks reliable because rssi is always measured low and the ACK is sent. In contrast, the NewSendACK version reports timeouts because rssi is too high. But I think this rssi measurement gives a wrong value, I just don't know why. From the program flow, the new version should work without destroying the previously received payload.
Can someone follow me?
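
The flow traced in the list above can be replayed with a small state model. This is a simplified sketch of the behaviour as described in this post, not the library source: `DriverModel` and `demoAckWait` are hypothetical, the channel is assumed to be always quiet so only mode and payload length matter, and the starting state (RX mode with a payload present) follows the first steps of the list.

```cpp
#include <cassert>
#include <cstdint>

enum Mode { STANDBY, RX };

// Simplified model of the driver state used by the CSMA wait loop.
struct DriverModel {
    Mode mode = RX;
    uint8_t payloadLen = 0;
    int receiveDoneCalls = 0;

    // Sending is allowed only in RX mode with no payload pending.
    bool canSend() {
        if (mode == RX && payloadLen == 0) { mode = STANDBY; return true; }
        return false;
    }
    // Clears the receive state and re-arms the receiver.
    void receiveBegin() { payloadLen = 0; mode = RX; }

    bool receiveDone() {
        ++receiveDoneCalls;
        if (mode == RX && payloadLen > 0) { mode = STANDBY; return true; }
        if (mode == RX) return false;
        receiveBegin();               // standby: everything is reset here
        return false;
    }
};

// Replays the wait loop from sendACK on a freshly received packet.
DriverModel demoAckWait(uint8_t receivedLen) {
    DriverModel d;
    d.payloadLen = receivedLen;
    int guard = 0;                    // stands in for the CSMA time limit
    while (!d.canSend() && guard++ < 100) d.receiveDone();
    assert(guard < 100);              // the loop ended because canSend() passed
    return d;
}
```

For a non-empty packet the model needs exactly two `receiveDone()` calls before `canSend()` succeeds, and by then `payloadLen` is 0, which matches the observation that the received data is gone after `sendACK()`.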


JojoS

  • NewMember
  • *
  • Posts: 2
  • Country: de
Re: Diagnosing hangs on busy network - advice?
« Reply #58 on: December 21, 2016, 05:59:23 PM »
aargh... in my NewSendAck() test I set the mode to TX instead of RX. Don't know where I copied it from, but this was wrong of course and gave the wrong rssi values.
Now this version works fine, every packet gets ACKed and the received data is not destroyed by the receiveBegin() call. This also means that I now need a fifo for the received data, for when new data is received while the current data is not processed yet. Fortunately, this is a port for mbed and my Cortex-M3 has some free RAM for this.
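
A receive FIFO of the kind mentioned above could be a small fixed-size ring buffer. This is a sketch, not code from the port: `PacketQueue`, `Packet`, and the depth of 4 slots are hypothetical, and the 61-byte slot size follows the payload size mentioned earlier in this thread. If `push` is called from an interrupt handler and `pop` from the main loop, the calls would additionally need to be protected against each other, which this sketch does not do.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr uint8_t MAX_DATA_LEN = 61;  // payload size mentioned in the thread
constexpr uint8_t QUEUE_SLOTS  = 4;   // hypothetical depth

struct Packet {
    uint8_t sender;
    uint8_t len;
    uint8_t data[MAX_DATA_LEN];
};

// Fixed-size ring buffer, no dynamic allocation.
class PacketQueue {
public:
    // Copies one packet in. Returns false if the queue is full or the
    // payload is too long; the caller decides whether to drop it.
    bool push(uint8_t sender, const uint8_t* data, uint8_t len) {
        if (count_ == QUEUE_SLOTS || len > MAX_DATA_LEN) return false;
        Packet& p = slots_[(head_ + count_) % QUEUE_SLOTS];
        p.sender = sender;
        p.len = len;
        std::memcpy(p.data, data, len);
        ++count_;
        return true;
    }
    // Removes the oldest packet. Returns false if the queue is empty.
    bool pop(Packet& out) {
        if (count_ == 0) return false;
        out = slots_[head_];
        head_ = static_cast<uint8_t>((head_ + 1) % QUEUE_SLOTS);
        --count_;
        return true;
    }
    uint8_t size() const { return count_; }

private:
    Packet slots_[QUEUE_SLOTS] = {};
    uint8_t head_ = 0;
    uint8_t count_ = 0;
};
```

The receive path would `push` a copy of each packet right before sending the ACK, and the slower processing code would `pop` packets in arrival order.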