Author Topic: Diagnosing hangs on busy network - advice?

Felix

« Reply #30 on: November 14, 2014, 05:52:25 PM »
Jeremy, good work. I have set up a gateway Moteino and 2 Moteino nodes loaded with the Node example with TRANSMITPERIOD=100ms, which generates a LOT of traffic. With the latest code it has been running for a while, and I will likely leave it overnight. I also added some code to count the packets received.
I think 100ms might be getting close to the congestion limit. Basically 2 packets try to get through every 100ms, and on every 3rd received packet the gateway sends a ping back and asks for an ACK. The congestion is most obvious when packets get long and ACKs sometimes don't make it, see attached. With all this, I have not seen a hang, and my main network is working fine (same settings, just a different encryption key, and significantly less traffic; it is on the same frequency though, so I think the radio is demodulating the signal, finding that the AES doesn't pass, and not raising the interrupt). I do however see ACKed packets repeated a lot more often, which indicates collisions, so the nodes are trying hard to get them through. BTW the interrupt is RISING triggered.

I'd like at some point to look at your code and compare. I have also added the time guard in sendACK() today.
« Last Edit: November 14, 2014, 05:56:30 PM by Felix »

TomWS

« Reply #31 on: November 14, 2014, 10:24:27 PM »
@felix, if you're using the original Node code for your test, try swapping the sendACK and the serial output:
Original:
Code:
  if (radio.receiveDone())
  {
    Serial.print('[');Serial.print(radio.SENDERID, DEC);Serial.print("] ");
    for (byte i = 0; i < radio.DATALEN; i++)
      Serial.print((char)radio.DATA[i]);
    Serial.print("   [RX_RSSI:");Serial.print(radio.RSSI);Serial.print("]");

    if (radio.ACKRequested())
    {
      radio.sendACK();
      Serial.print(" - ACK sent");
    }
    Blink(LED,5);
    Serial.println();
  }
to...
Code:
  //check for any received packets
  if (radio.receiveDone())
  {
    if (radio.ACKRequested())
    {
      radio.sendACK();
      Serial.print(" - ACK sent");
    }

    Serial.print('[');Serial.print(radio.SENDERID, DEC);Serial.print("] ");
    for (byte i = 0; i < radio.DATALEN; i++)
      Serial.print((char)radio.DATA[i]);
    Serial.print("   [RX_RSSI:");Serial.print(radio.RSSI);Serial.print("]");

    Blink(LED,5);
    Serial.println();
  }
and see what happens.  I think there is a timing relationship with how soon you ack after receive that factors into this.

EDIT: It just occurred to me that I may have missed the fact that you shouldn't Ack until you 'consume' the data.  So... in this case, before the Ack, simply memcpy the incoming data for later processing.  The point is that shortening the timing of Ack WRT receiving data, makes potential race conditions more apparent.


Tom
« Last Edit: November 14, 2014, 10:37:50 PM by TomWS »
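
For reference, a minimal sketch of the "copy, then Ack" idea from the EDIT above. It is written to compile off-target, so `FakeRadio`, `PacketCopy`, `consumeThenAck` and `MAX_LEN` are hypothetical names invented for the illustration, not part of the RFM69 library. The fake `sendACK()` wipes the receive buffer on purpose, on the assumption that the buffer can no longer be trusted once the ACK has gone out.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumed maximum payload size for this sketch only.
constexpr uint8_t MAX_LEN = 61;

// Hypothetical stand-in for the radio object so the sketch builds on a PC.
// Field names mirror the ones used in the code posted above.
struct FakeRadio {
    uint8_t DATA[MAX_LEN] = {0};
    uint8_t DATALEN = 0;
    uint8_t SENDERID = 0;
    bool ackWanted = false;

    bool ACKRequested() const { return ackWanted; }

    // Assumption for illustration: after the ACK the receive buffer may be
    // reused, so this fake clears it.
    void sendACK() {
        std::memset(DATA, 0, sizeof(DATA));
        DATALEN = 0;
        ackWanted = false;
    }
};

// Application-owned snapshot of one received packet.
struct PacketCopy {
    uint8_t sender = 0;
    uint8_t len = 0;
    uint8_t data[MAX_LEN] = {0};
};

// Copy the packet first, ACK second. Slow work (Serial output, LED blink)
// can then run on the copy after the ACK has been sent.
PacketCopy consumeThenAck(FakeRadio &radio) {
    PacketCopy copy;
    copy.sender = radio.SENDERID;
    uint8_t len = radio.DATALEN;
    if (len > MAX_LEN) len = MAX_LEN;  // never copy past the local buffer
    copy.len = len;
    std::memcpy(copy.data, radio.DATA, len);
    if (radio.ACKRequested()) {
        radio.sendACK();
    }
    return copy;
}
```

In a real sketch the same ordering would sit inside the `if (radio.receiveDone())` block, with the Serial printing done from the copy rather than from `radio.DATA`.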

ieris

« Reply #32 on: November 15, 2014, 03:45:05 AM »
Quote from: Felix
Thanks ieris for the pull request, unfortunately I cannot merge it because of some things I noticed. There are some Serial.writes and a lot of lines that have not changed except for blank spaces; I like to keep the merges lean and clean. Also saw your edit about it failing again.
I think I am going to start to do my own testing and see if/when there's a hang. There are now several threads reporting hangs so solving this issue is very important.
Thanks for your continued persistence and feedback.

Yep, no need to merge, as it failed.
Hmmmm... Yes, there is a Serial call, but it is commented out and left in for debugging; I saw you also have such commented-out Serial lines in your code :)
There were a lot of whitespace changes because I also like clean, formatted code; that is why they appear. One typo in a comment was fixed as well.

@Felix, good that you are also testing. My setup also worked for 2-4 weeks without a gateway restart and then at some point it started to hang; it seems that was when I switched to the MEGA.
« Last Edit: November 15, 2014, 04:03:35 AM by ieris »

ieris

« Reply #33 on: November 15, 2014, 03:53:51 AM »
Quote:
1) Ensuring that sendAck can timeout like send()
2) Putting RXrestart in sendAck (not sure if this is definitively a fix but it should mirror what send() does to avoid more vague errors)
3) Placing a time guard in transmit -- this was definitively the source of at least one hang although I cannot trace why in the data sheet a send should hang at all
4) Fixing interrupt to handle cases where DATALEN = 0, which can apparently occur in high traffic scenarios; this also improves the hygiene of the code
5) Adding a call to the interrupt function in some situations during receiveDone to catch the missed interrupt scenario I outlined earlier

1) This was fixed in my example, and this timeout should be very short.
2) I have now added it, after the first crash.
3) Good point, I will look at how you did it.
4) The interrupt must be fixed. Actually, I just got an idea: the interrupt code that reads data from the RFM69 chip could be skipped if the previous message has not yet been processed and cleared. What do you think?
5) This could be done like this: if the interrupt fired but was skipped as described in point 4, a flag is set, and the interrupt code is called once the message has been cleared.

P.S. After adding the RX deadlock line, my code has been running for 12+ hours.
« Last Edit: November 15, 2014, 03:55:26 AM by ieris »
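
For reference, a minimal sketch of the time guard from points 1 and 3 above: a polling loop that gives up after a fixed number of milliseconds instead of spinning forever. `waitWithGuard` is a hypothetical helper written for this illustration, not the code from either fork; the condition and the clock are passed in so the logic can be checked without hardware.

```cpp
#include <cassert>
#include <cstdint>

// Poll ready() until it returns true or limitMs milliseconds have passed
// according to nowMs(). Returns true if the condition was met in time.
// The unsigned subtraction keeps the comparison valid if the millisecond
// counter wraps around while waiting.
template <typename ReadyFn, typename ClockFn>
bool waitWithGuard(ReadyFn ready, ClockFn nowMs, uint32_t limitMs) {
    const uint32_t start = static_cast<uint32_t>(nowMs());
    while (!ready()) {
        const uint32_t waited = static_cast<uint32_t>(nowMs()) - start;
        if (waited >= limitMs) {
            return false;  // guard tripped: the caller can reset or retry
        }
    }
    return true;
}
```

On a Moteino the call might look like `waitWithGuard([] { return digitalRead(irqPin) == HIGH; }, millis, RF69_TX_LIMIT_GUARD_MS)`, where `irqPin` is an assumed global holding the interrupt pin number and `RF69_TX_LIMIT_GUARD_MS` is the constant named later in this thread. A `false` return means the wait was abandoned, so the caller still has to decide what to do with the radio.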

Tomme

« Reply #34 on: November 15, 2014, 05:08:00 AM »
This may be an entirely different issue but in an effort to reduce the number of interrupts...

Currently if a node transmits then any node capable of receiving (in range, in rx mode) will generate an interrupt as the address filtering is done in software. I was playing with 5-6 Moteinos transmitting very quickly and this started to be a problem. More so as I was hoping to scale up to 30-40 nodes. I turned on hardware address filtering and things improved quite a bit for me. Hope I'm not confusing the issue  :P

TomWS

« Reply #35 on: November 15, 2014, 10:13:31 AM »
Quote from: Tomme
<...snip> I was hoping to scale up to 30-40 nodes. <snip...>

That's all?   :)

Seriously, in my case I already have about 20 'allocated' to about 6 different functions, but waiting for PCBs to come in so I can deploy. 

I think you may be on to something as I can see that only a Repeater would need to see 'all' addresses (a Gateway might, but not in my case).   Also, ability to broadcast is generally useful, but not necessary in my case.  I'd rather have efficiency and use logic to get to all the devices I need than consume available run time dealing with unnecessary interrupts.

Tom

ieris

« Reply #36 on: November 15, 2014, 11:31:32 AM »
@Felix, if you want to test a setup like mine: I have a gateway, nodes which send to the gateway, and some subnodes which send to the nodes.
For some nodes, the interval between sends is random.

Now, with the RX deadlock line added to the code, it has already run 20+ hours without a hang...
« Last Edit: November 15, 2014, 11:50:18 AM by ieris »
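
The "RX deadlock line" itself is not shown in the thread. Assuming it is the RestartRx write that point 2 of the list in Reply #33 refers to ("Putting RXrestart in sendAck"), the bit arithmetic can be sketched as below. `withRxRestart` is a hypothetical helper for illustration; the bit position (bit 2 of RegPacketConfig2, address 0x3D) is taken from the RFM69 datasheet.

```cpp
#include <cassert>
#include <cstdint>

// RestartRx is bit 2 of RegPacketConfig2 (register address 0x3D).
constexpr uint8_t RF_PACKET2_RXRESTART = 0x04;

// Given the current register value, return the value to write back so that
// RestartRx is set and every other bit is left unchanged.
constexpr uint8_t withRxRestart(uint8_t packetConfig2) {
    return static_cast<uint8_t>((packetConfig2 & 0xFB) | RF_PACKET2_RXRESTART);
}
```

On the device this would be a read-modify-write of that register; where exactly the write belongs inside sendACK is the open question discussed further down the thread.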

Felix

« Reply #37 on: November 15, 2014, 02:43:53 PM »
I agree that hardware addressing might be better, however that will remove the possibility of broadcast which I think is an important and useful feature. It's a tradeoff and I am open to switching to hardware addressing if that brings a very significant improvement. I need to test this myself. Perhaps a switch can be added to tell the node whether it cares for broadcasts or not, and whether addressing is done in hardware or software. Then things like MotionOLEDMote can still receive from any node.

ieris - i still see your old pull request, can you cancel that and post a cleaned pull request with your latest code please (only the lines that actually changed)?
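
For reference, a sketch of what such a switch could look like at the register level. According to the RFM69 datasheet, the AddressFiltering field (bits 2:1 of RegPacketConfig1) has three settings: off, node address only, and node address or broadcast address, so hardware filtering does not necessarily rule out broadcasts. The `AddressFilter` enum, `RegWrite` struct and `addressFilterWrites` function are hypothetical names for this illustration and are not part of the library.

```cpp
#include <cassert>
#include <cstdint>

// Register addresses from the RFM69 datasheet.
constexpr uint8_t REG_PACKETCONFIG1 = 0x37;
constexpr uint8_t REG_NODEADRS      = 0x39;
constexpr uint8_t REG_BROADCASTADRS = 0x3A;

// AddressFiltering field, bits 2:1 of RegPacketConfig1.
constexpr uint8_t ADDRESS_FILTER_MASK = 0x06;

enum class AddressFilter : uint8_t {
    Off           = 0x00,  // no hardware filtering; filter in software
    Node          = 0x02,  // accept only this node's address
    NodeBroadcast = 0x04   // accept this node's address or the broadcast address
};

// One register write: which register, which value.
struct RegWrite {
    uint8_t reg;
    uint8_t value;
};

// Fill out[0..2] with the writes needed to select a filter mode while
// leaving the other bits of RegPacketConfig1 as they were.
void addressFilterWrites(uint8_t currentConfig1, AddressFilter mode,
                         uint8_t nodeId, uint8_t broadcastId, RegWrite out[3]) {
    out[0] = {REG_NODEADRS, nodeId};
    out[1] = {REG_BROADCASTADRS, broadcastId};
    out[2] = {REG_PACKETCONFIG1,
              static_cast<uint8_t>((currentConfig1 & ~ADDRESS_FILTER_MASK) |
                                   static_cast<uint8_t>(mode))};
}
```

This only works if the target address is the first byte after the length byte in the frame, which should be checked against the library's frame layout before relying on it. A node such as MotionOLEDMote that needs to hear every sender would stay on `AddressFilter::Off`.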

ieris

« Reply #38 on: November 15, 2014, 03:59:34 PM »
Quote from: Felix
ieris - i still see your old pull request, can you cancel that and post a cleaned pull request with your latest code please (only the lines that actually changed)?

Done. Please check github.
P.S. 24h+ without hanging.

jgilbert

« Reply #39 on: November 15, 2014, 09:30:47 PM »
I am crash free all day. I just sent you a pull request.

jgilbert

« Reply #40 on: November 15, 2014, 09:34:50 PM »
One thing I forgot to mention - I also changed the code to store millis() values in unsigned longs, which fixes any issues with the roughly 49.7-day millis() rollover according to posts I've seen on arduino.cc.

These fixes have been crash free for more than 48 hours. I am running about 8 nodes, that send ~5-15 packets every 10 seconds with gateway acknowledgment. Happy to share my codebase for the node/gateway logic if you want to test my exact setup.
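
For reference, a minimal sketch of why unsigned storage matters here. A 32-bit millisecond counter wraps after 2^32 ms, which is about 49.7 days; as long as both timestamps are unsigned 32-bit values and the elapsed time is computed by subtraction, the result stays correct across one wrap. `hasElapsed` is a hypothetical helper for illustration, and `uint32_t` stands in for `unsigned long` on the AVR-based boards, where the two are the same width.

```cpp
#include <cassert>
#include <cstdint>

// True once at least intervalMs milliseconds separate startMs from nowMs.
// Unsigned subtraction is modulo 2^32, so the difference is still right
// when nowMs has wrapped past zero and startMs has not.
constexpr bool hasElapsed(uint32_t nowMs, uint32_t startMs, uint32_t intervalMs) {
    return static_cast<uint32_t>(nowMs - startMs) >= intervalMs;
}
```

The pattern to avoid is comparing absolute times, such as `millis() > start + interval`, which gives the wrong answer when `start + interval` overflows.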

ieris

« Reply #41 on: November 16, 2014, 03:08:25 AM »
Quote from: TomWS
EDIT: It just occurred to me that I may have missed the fact that you shouldn't Ack until you 'consume' the data.  So... in this case, before the Ack, simply memcpy the incoming data for later processing.  The point is that shortening the timing of Ack WRT receiving data, makes potential race conditions more apparent.

You can try my modifications to the sendACK procedure (https://github.com/openminihub/RFM69); with those you can send the ACK on the very next line after receiveDone.

ieris

« Reply #42 on: November 16, 2014, 03:43:20 AM »
Quote from: jgilbert
One thing I forgot to mention - I also changed the code to store millis() values in unsigned longs, which fixes any issues with the roughly 49.7-day millis() rollover according to posts I've seen on arduino.cc.

These fixes have been crash free for more than 48 hours. I am running about 8 nodes, that send ~5-15 packets every 10 seconds with gateway acknowledgment. Happy to share my codebase for the node/gateway logic if you want to test my exact setup.

Glad to hear that you are running crash free! I'm also at 36h right now.
I was thinking of modifying millis too, nice that you did it.
Some comments about other changes:
- I like the place where you added "interruptHandler();"
- DATALEN=0 really can be moved out
- good to see RF69_TX_LIMIT_GUARD_MS implemented
- "PAYLOADLEN < 3" can be folded into the previous IF statement so that the logic is not duplicated
- in sendACK, it is better to move the "avoid RX deadlocks" line to just before sendFrame, because the logic in receiveDone enables interrupts and can trigger RX. Also, the ACK is sent anyway if the timeout is triggered and the network is still busy (given those weaknesses, I think the crash-free running comes from adding RF69_TX_LIMIT_GUARD_MS, not from fixing sendACK)

Overall I'm so happy that I wasn't alone with that problem and we have good team work to fix it! ;)

jgilbert

« Reply #43 on: November 16, 2014, 03:40:44 PM »
@felix, @ieris --

Still great uptime here -- running more than 48 hours at this point. It was great to work on this at the same time as other people -- made the process a lot more enjoyable.

For anyone who's curious, my changes are here: https://github.com/jgilbert20/RFM69

Jeremy

Felix

« Reply #44 on: November 16, 2014, 07:36:10 PM »
Guys, I am testing your changes, but with the Node example I am getting very bad/inconsistent transmits and ACKs.
Basically huge degradation. Not sure if I'm doing something else wrong, but I don't think so.
UPDATE: I think calling interruptHandler() in receiveDone() was the culprit.
@ieris - you said you have a different node/gateway example code base? The one in your repo is the same as mine.
« Last Edit: November 16, 2014, 07:43:40 PM by Felix »