Author Topic: Diagnosing hangs on busy network - advice?  (Read 26816 times)

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #45 on: November 16, 2014, 11:36:08 PM »
I haven't tried all of @jgilbert's changes yet; I'm still testing my own code. For every improvement I make, I run a node/gateway test.
With the modified sendACK (as it is in my repo) and with sendACK moved right after receiveDone() in the gateway code, transmissions are good and fast. See the attached picture (the output in the picture is very slightly modified and back-pinging is removed).

So, you are saying interruptHandler() causes the bad transmissions... hmm, most probably. I need to test that, but the idea in general was good.

What I mentioned before was that I have another gateway that runs my own code, not just a modified version of yours. While still debugging and testing it, I seem to have found the main problem that triggered the RFM69 library hang: I have one slower serial device, and I read the serial line with a 100 ms timeout:
Code: [Select]
byte readSerialLine(char* input, uint16_t timeout=100, byte maxLength=64, char endOfLineChar=10);
This creates longer pauses between receiveDone() calls, which changes statuses and interrupts, and that may be the root cause.
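One way to avoid long gaps between receiveDone() calls is to collect serial bytes incrementally instead of blocking for up to 100 ms. The following is only a minimal sketch: `LineAccumulator` is a hypothetical helper (not part of the RFM69 library), and the 64-byte limit simply mirrors the maxLength default of readSerialLine above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, NOT part of the RFM69 library: it is fed one serial
// byte at a time, so the main loop never blocks waiting for a full line.
class LineAccumulator {
public:
    static const std::size_t kMaxLength = 64;  // assumption: same limit as readSerialLine

    explicit LineAccumulator(char endOfLineChar = 10)
        : eol_(endOfLineChar), len_(0), readyLen_(0) { buf_[0] = '\0'; }

    // Feed one byte. Returns true when a complete line is available.
    // The line stays valid until the next call to feed().
    bool feed(char c) {
        assert(len_ <= kMaxLength);  // internal invariant
        if (c == eol_) {
            buf_[len_] = '\0';
            readyLen_ = len_;
            len_ = 0;
            return true;
        }
        if (len_ < kMaxLength) {
            buf_[len_++] = c;        // bytes beyond the limit are dropped
        }
        return false;
    }

    const char* line() const { return buf_; }
    std::size_t length() const { return readyLen_; }

private:
    char eol_;
    std::size_t len_;
    std::size_t readyLen_;
    char buf_[kMaxLength + 1];
};

// Possible use in an Arduino-style loop():
//   while (Serial.available()) {
//     if (acc.feed((char)Serial.read())) { /* handle acc.line() */ }
//   }
//   if (radio.receiveDone()) { /* handle packet */ }
```

With this shape, receiveDone() is polled on every pass through the loop, regardless of how slowly the serial device delivers its line.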
« Last Edit: November 16, 2014, 11:39:02 PM by ieris »

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 6866
  • Country: us
    • LowPowerLab
Re: Diagnosing hangs on busy network - advice?
« Reply #46 on: November 17, 2014, 07:01:02 PM »
Hey guys,
I have posted a larger patch to the RFM69 library and the examples, including portions of your submitted pull requests. I have tested this patch with a run of about 1 million ACKed packets, for a total of probably around 2 million transmissions, without a hang. I feel confident this is a pretty stable release.
Also wireless programming channel shifting has been fixed, and examples have been updated.
Thanks ieris and jgilbert for your hard work and persistence. This effort would not have been possible without your contribution.
The RFM69 library patch can be seen here: https://github.com/LowPowerLab/RFM69/commit/ed2fd5b8d55d011ed8164d9d517f364cc7841a0c
Let's keep this thread for reference and of course continue to monitor the performance of the RFM69 library and radios.

TomWS

  • Hero Member
  • *****
  • Posts: 1930
Re: Diagnosing hangs on busy network - advice?
« Reply #47 on: November 18, 2014, 08:51:22 AM »
@Felix, I'm wondering about the call to receiveBegin() that has been added in the interrupt Handler (line 299).  It is called after unselect(), which disables the SPI interface to the RFM69...

Tom

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 6866
  • Country: us
    • LowPowerLab
Re: Diagnosing hangs on busy network - advice?
« Reply #48 on: November 18, 2014, 10:13:31 AM »
@Felix, I'm wondering about the call to receiveBegin() that has been added in the interrupt Handler (line 299).  It is called after unselect(), which disables the SPI interface to the RFM69...
select() is called again in readReg() inside receiveBegin(), so no harm there. But I agree unselect() should be called after receiveBegin() to complete the transaction correctly and not leave the CS signal asserted. That said, unselect() is called very soon afterwards anyway, in the next transaction triggered by the call to receiveBegin(). Anyhow, I should move the receiveBegin() before unselect().

jgilbert

  • NewMember
  • *
  • Posts: 18
Re: Diagnosing hangs on busy network - advice?
« Reply #49 on: November 18, 2014, 05:11:19 PM »
@Felix -- glad to help and glad to contribute to the effort.


I looked through your edits and they make sense. I made a similar set of changes earlier in my debugging process and still ran into an issue where a "missed" interrupt could deadlock (not crash) the gateway. I was able to unstick things by manually pulling D2 through an edge. For that reason, I suggest you continue to test, especially with gateway code that is "ack only" (as mine is, i.e. it doesn't send out packets except to acknowledge those that come in).


I saw your note suggesting that adding the interrupt call to receiveDone caused problems in your tests, so clearly that isn't necessarily the best solution, but I think we should come up with some other answer. Perhaps the reason that fix bombs is that the packet received by the extra call is then immediately replaced with a new packet on the next incoming message, and the radio is never moved to standby? We need to add a periodic check for PAYLOAD > 0 and then either discard the packet or retrigger the interrupt to handle it, if the interrupt doesn't get around to it within a certain amount of time.
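The periodic check proposed here could look roughly like the sketch below. It rests on assumptions: `MissedIrqWatchdog`, its method names, and the 50 ms limit in the usage note are all hypothetical and not part of the RFM69 library. The caller supplies the "payload waiting" state and a millisecond clock.

```cpp
#include <cassert>
#include <stdint.h>

// Hypothetical helper, NOT part of the RFM69 library: reports when a payload
// has been waiting longer than limitMs without having been serviced.
class MissedIrqWatchdog {
public:
    explicit MissedIrqWatchdog(uint32_t limitMs)
        : limitMs_(limitMs), pending_(false), since_(0) {
        assert(limitMs > 0);
    }

    // Call periodically from the main loop.
    //   payloadReady: true while the radio reports an unread packet
    //   nowMs:        current time in ms (e.g. millis() on Arduino)
    // Returns true when the caller should recover, i.e. discard the packet
    // or re-run the interrupt handler by hand.
    bool poll(bool payloadReady, uint32_t nowMs) {
        if (!payloadReady) {          // nothing waiting: disarm
            pending_ = false;
            return false;
        }
        if (!pending_) {              // first sighting: start timing
            pending_ = true;
            since_ = nowMs;
            return false;
        }
        // Unsigned subtraction stays correct across clock wraparound.
        if (nowMs - since_ >= limitMs_) {
            pending_ = false;         // re-arm after reporting
            return true;
        }
        return false;
    }

private:
    uint32_t limitMs_;
    bool pending_;
    uint32_t since_;
};
```

A gateway loop might construct `MissedIrqWatchdog wd(50);` and call `wd.poll(payloadWaiting, millis())` on each pass, where `payloadWaiting` is however the sketch chooses to read the radio's payload-ready state.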

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #50 on: November 27, 2014, 03:51:59 AM »
@Felix - I'm also glad to help. For me, the improvements have been working without problems so far.

What about the comment I added earlier:
- in sendACK, the "avoid RX deadlocks" line is better moved to just before sendFrame, because the logic in receiveDone enables interrupts and can trigger RX. And the ACK is sent anyway if the timeout is triggered and the network is still busy.

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 6866
  • Country: us
    • LowPowerLab
Re: Diagnosing hangs on busy network - advice?
« Reply #51 on: November 28, 2014, 09:52:58 AM »
Right, but receiveDone is called anyway, and I think the chance of a packet being received in that time period is about the same. If it does happen, then the packet is read into the buffers and then the ACK is sent. Do you see any issue with that?

TomWS

  • Hero Member
  • *****
  • Posts: 1930
Re: Diagnosing hangs on busy network - advice?
« Reply #52 on: November 28, 2014, 06:05:37 PM »
@all,
do we think this is 'done' now?

I'm a bit unnerved by the added call to receiveBegin() in the interrupt handler.  Sorry to ask about this, but it just doesn't 'feel' right to me.  I thought I was keeping track of the progress on this thread, but somehow missed the purpose/need for this particular addition.  Can someone explain why this needed to be added (or, at least, point me to the posting that explained it)?

Thanks in advance,
Tom

ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #53 on: November 30, 2014, 11:13:04 AM »
Right but receiveDone is called anyway and I think the chance of such RX received in that time period is about the same. If it does happen then the packet is read in the buffers and then the ACK is sent. Do you see any issue with that?
No, I don't see an issue currently; I'm just thinking about the right way to do it... ;)

@TomWS, as I understand it, Felix added this line. I don't see anything wrong with it; maybe it's a little too much to reset/initialize everything, but that's all.

TomWS

  • Hero Member
  • *****
  • Posts: 1930
Re: Diagnosing hangs on busy network - advice?
« Reply #54 on: November 30, 2014, 08:39:33 PM »
<...snip>
@TomWS, as I understand it, Felix added this line. I don't see anything wrong with it; maybe it's a little too much to reset/initialize everything, but that's all.
Thanks for replying.

Have you been testing with this version of code?  I've looked at it, but haven't merged into my code yet.   I'm sort of waiting for you guys to decide that this works and no further changes are expected.

I'm in the process of getting my various Motes deployed and stabilizing my infrastructure code so I probably won't be in a position to 'challenge' this code for a few weeks.  Fortunately all of my Motes will have the ability to do Wireless Program updates  :)

Tom



ieris

  • NewMember
  • *
  • Posts: 38
  • Country: lv
Re: Diagnosing hangs on busy network - advice?
« Reply #55 on: December 02, 2014, 05:02:44 AM »
<...snip>
Have you been testing with this version of code?  I've looked at it, but haven't merged into my code yet.   I'm sort of waiting for you guys to decide that this works and no further changes are expected.
Hi, nope, I'm currently using my own code, but I'm going to switch to Felix's code soon. Must try! ;)

WhiteHare

  • Hero Member
  • *****
  • Posts: 1300
  • Country: us
Re: Diagnosing hangs on busy network - advice?
« Reply #56 on: January 16, 2016, 02:47:35 PM »
I've also verified this by adding code to nodes to passively wake up and report on the background RSSI they receive (thanks to an earlier post for how to do this.)


Does anyone happen to know which earlier post jgilbert is referring to?  For several different reasons, it would be helpful to know what method jgilbert was using for measuring background RSSI. 

JojoS

  • NewMember
  • *
  • Posts: 2
  • Country: de
Re: Diagnosing hangs on busy network - advice?
« Reply #57 on: December 21, 2016, 04:42:01 PM »
UPDATE - solved, see post below this.

Sorry for reopening this old topic, but I'm trying to understand the suggested modifications in sendACK().
I found this thread because I called sendACK directly after receiveDone and got no data. I've read that you don't recommend this, but I think there is something wrong.
In my main loop, I measured the time spent waiting for canSend in sendACK:
Code: [Select]
while (1) {
    if (rfm.receiveDone()) {
        if (rfm.ACKRequested()) {
            //ackMinRSSI = rfm.NewSendACK();
            csma_time = rfm.sendACK();
        }

        // now rfm.DATA is zero
    }
}
// now rfm.DATA is zero
and the modified sendACK:
Code: [Select]
int RFM69::sendACK(const void* buffer, uint8_t bufferSize) {
  uint8_t sender = SENDERID;
  int16_t _RSSI = RSSI; // save payload received RSSI value
  writeReg(REG_PACKETCONFIG2, (readReg(REG_PACKETCONFIG2) & 0xFB) | RF_PACKET2_RXRESTART); // avoid RX deadlocks
  uint32_t now = t.read_ms();
  while (!canSend() && t.read_ms() - now < RF69_CSMA_LIMIT_MS)
    receiveDone();

  int csma_time = t.read_ms() - now;

  sendFrame(sender, buffer, bufferSize, false, true);
  RSSI = _RSSI; // restore payload RSSI
  return csma_time;
}
The result is always a time <= 1 ms. But with the other code suggested here, I run into timeouts because the background RSSI is too high. Neither result looks plausible to me...
Code: [Select]
int RFM69::NewSendACK(const void* buffer, uint8_t bufferSize) {
  setMode(RF69_MODE_TX);

  int16_t _RSSI = RSSI; // save payload received RSSI value
  bool canSendACK = false;

  int rssiMin = 1000;
  uint32_t now = t.read_ms();
  while ((canSendACK == false) && (t.read_ms() - now < 100 /*ACK_CSMA_LIMIT_MS*/))
  {
    int rssi = readRSSI();
    canSendACK = (rssi < CSMA_LIMIT); // CSMA_LIMIT = -90
    rssiMin = (rssi < rssiMin) ? rssi : rssiMin;
    wait_us(50);
  }

  if (canSendACK) {
    writeReg(REG_PACKETCONFIG2,
             (readReg(REG_PACKETCONFIG2) & 0xFB) | RF_PACKET2_RXRESTART); // avoid RX deadlocks
    sendFrame(SENDERID, buffer, bufferSize, false, true);
  }
  RSSI = _RSSI; // restore payload RSSI
  return (canSendACK ? 0 : rssiMin);
}

Trying to follow the flow in the original code:
- receive Data, want ACK
- call sendACK, call canSend()
- canSend is in RX mode, payload > 0, rssi > threshold
- receiveDone in sendACK is called
- RX mode, payload > 0   ->  mode is set to standby!
- next call to canSend in sendACK loop
- mode is standby, canSend returns false
- next call to receiveDone
- mode is standby, receiveBegin is called and everything is reset
- next call to canSend in sendACK loop
- mode is RX, payload is 0, now rssi is (now always?) low
- ACK is sent
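
The flow traced above can be reproduced with a tiny state model. This is a simplified sketch built only from the steps listed in this post, not the library source: `RadioModel` and `csmaLoop` are hypothetical names, the RSSI is held constant, and interrupts and timing are ignored.

```cpp
#include <cassert>

enum Mode { MODE_STANDBY, MODE_RX };

// Simplified model of the radio state that the traced steps depend on.
struct RadioModel {
    Mode mode;
    int payloadLen;  // bytes of the last received packet still buffered
    int rssi;        // simulated channel RSSI in dBm (held constant here)
    int csmaLimit;   // CSMA threshold in dBm, e.g. -90

    // Resets the receive state and returns to RX.
    void receiveBegin() {
        payloadLen = 0;
        mode = MODE_RX;
    }

    // True only in RX mode, with no pending payload and a quiet channel.
    bool canSend() {
        if (mode == MODE_RX && payloadLen == 0 && rssi < csmaLimit) {
            mode = MODE_STANDBY;
            return true;
        }
        return false;
    }

    bool receiveDone() {
        if (mode == MODE_RX && payloadLen > 0) {
            mode = MODE_STANDBY;  // packet available, radio parked
            return true;
        }
        if (mode == MODE_RX) return false;  // still listening
        receiveBegin();                     // standby: restart RX, state is reset
        return false;
    }
};

// Mirrors the CSMA wait loop of sendACK(); returns the number of canSend() calls.
int csmaLoop(RadioModel& r, int maxTries) {
    assert(maxTries > 0);
    int calls = 1;
    while (!r.canSend() && calls < maxTries) {
        r.receiveDone();
        ++calls;
    }
    return calls;
}
```

Starting from RX mode with a 5-byte payload pending and a quiet channel (`RadioModel r = { MODE_RX, 5, -100, -90 };`), the loop exits on the third canSend() call, and by then payloadLen has been reset to 0, which matches the observation that the received data is gone after sendACK().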

So this looks reliable, because the RSSI is always measured as low and the ACK is sent. In contrast, the NewSendACK version reports timeouts because the RSSI is too high. But I think this RSSI measurement gives a wrong value; I just don't know why. Judging from the program flow, the new version should work without destroying the previously received payload.
Can someone follow me?


JojoS

  • NewMember
  • *
  • Posts: 2
  • Country: de
Re: Diagnosing hangs on busy network - advice?
« Reply #58 on: December 21, 2016, 05:59:23 PM »
aargh... in my NewSendAck() test I set the mode to TX instead of RX. I don't know where I copied that from, but it was of course wrong and gave the wrong RSSI values.
Now this version works fine: every packet gets ACKed and the received data is not destroyed by the receiveBegin() call. This also means that I now need a FIFO for the received data, for when new data arrives while the current packet has not been processed yet. Fortunately, this is a port for mbed and my Cortex-M3 has some free RAM for this.
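
A FIFO of the kind mentioned here could be sketched as a small fixed-size ring buffer. Everything in it is an assumption for illustration: `Packet`, `PacketFifo`, and the 61-byte payload size are not taken from the actual port, and protection against concurrent access from an interrupt handler is left out.

```cpp
#include <cassert>
#include <stdint.h>
#include <string.h>

// Hypothetical packet record; the 61-byte payload size is an assumption.
struct Packet {
    uint8_t senderId;
    uint8_t length;
    uint8_t data[61];
};

// Fixed-capacity ring buffer with no heap use, suitable for a small MCU.
// Not interrupt-safe as written: guard push()/pop() if one side runs in an ISR.
template <unsigned N>
class PacketFifo {
public:
    PacketFifo() : head_(0), count_(0) {}

    // Copies one received packet into the queue; returns false if the queue
    // is full or the payload does not fit.
    bool push(uint8_t senderId, const uint8_t* data, uint8_t length) {
        assert(data != 0 || length == 0);
        if (count_ == N || length > sizeof(slots_[0].data)) return false;
        Packet& p = slots_[(head_ + count_) % N];
        p.senderId = senderId;
        p.length = length;
        if (length > 0) memcpy(p.data, data, length);
        ++count_;
        return true;
    }

    // Removes the oldest packet; returns false if the queue is empty.
    bool pop(Packet& out) {
        if (count_ == 0) return false;
        out = slots_[head_];
        head_ = (head_ + 1) % N;
        --count_;
        return true;
    }

    unsigned size() const { return count_; }

private:
    Packet slots_[N];
    unsigned head_;
    unsigned count_;
};
```

The receive path would push each packet as soon as receiveDone() reports it and send the ACK afterwards, while the application pops packets at its own pace.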