Like many of you, I'm on a path to develop a wireless ecosystem built around Felix's Moteinos to measure everything in my house. (In fact, I've even designed a few boards that fit the Moteino footprint to make it that much easier to add sensors.) I've recently run into a difficult reliability issue and I'm wondering if anyone has any advice. My gateway Moteino R3 (which echoes packets via a serial port to the Raspberry Pi) is periodically hanging, requiring a hard reset. My network is busy -- I have about 5 nodes operating, sending packets every 5-10 seconds. The glitch is hard to reproduce: sometimes it occurs right away, other times it takes hours or even a day to emerge.
After attaching some additional LEDs and instrumenting my code, it appears that the hang is occurring in my call to radio.sendACK(). (That is, the LED that I turn on just before this function is called is stuck "high" when the gateway freezes.)
I spent the afternoon tracing through the RFM69 code (amazing work Felix -- not a simple chip to talk to) and I'm puzzled by the following. During ::send(), there is some obvious protection for deadlocks and for timeouts. I frankly don't understand the first line at all (must be something deep in the internals of the RFM chip at work) but the second part is straightforward -- the radio waits until the channel is "clear" or a certain amount of time has passed.
void RFM69::send(byte toAddress, const void* buffer, byte bufferSize, bool requestACK)
{
  writeReg(REG_PACKETCONFIG2, (readReg(REG_PACKETCONFIG2) & 0xFB) | RF_PACKET2_RXRESTART); // avoid RX deadlocks
  long now = millis();
  while (!canSend() && millis()-now < RF69_CSMA_LIMIT_MS) receiveDone();
  sendFrame(toAddress, buffer, bufferSize, requestACK, false);
}
This same protection does not seem to take place for ACKs:
/// Should be called immediately after reception in case sender wants ACK
void RFM69::sendACK(const void* buffer, byte bufferSize) {
  byte sender = SENDERID;
  while (!canSend()) receiveDone();
  sendFrame(sender, buffer, bufferSize, false, true);
}
In theory, while (!canSend()) could spin forever if the radio never ends up in RX mode, the measured RSSI never drops below the CSMA limit, or something else goes wrong in receiveDone(). However, receiveDone() seems guaranteed to eventually put the radio in RX mode, and sooner or later the channel will clear. During the hang, the other nodes (even those very close to the gateway) continue to transmit and report low RSSIs, suggesting that the gateway's while loop should also eventually see that the channel is clear. I've also verified this by adding code to nodes to passively wake up and report the background RSSI they receive (thanks to an earlier post for how to do this).
Has anyone hit this glitch before, or am I in brand new territory? Should I patch sendACK() to look like send()?
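A minimal sketch of what such a patch might look like, for discussion only. It is written against a hypothetical FakeRadio stand-in rather than the real RFM69 class so the loop logic can be run on a desktop; FakeRadio, sendAckBounded and the CSMA_LIMIT_MS value are all assumptions for illustration, not part of the library.

```cpp
#include <cstdint>

// Hypothetical stand-in for the radio, used only to exercise the loop logic.
// It is NOT the real RFM69 class; time advances 1 ms per receiveDone() poll.
struct FakeRadio {
    uint32_t clockMs;           // simulated millis()
    bool     channelEverClears; // false models a channel that stays busy forever
    uint32_t clearAtMs;         // simulated time at which the channel becomes clear
    int      framesSent;

    FakeRadio(bool everClears, uint32_t clearAt)
        : clockMs(0), channelEverClears(everClears), clearAtMs(clearAt), framesSent(0) {}

    uint32_t millis() const { return clockMs; }
    bool canSend() const { return channelEverClears && clockMs >= clearAtMs; }
    void receiveDone() { ++clockMs; }
    void sendFrame() { ++framesSent; }
};

// Assumed to play the role of RF69_CSMA_LIMIT_MS; the value is illustrative.
const uint32_t CSMA_LIMIT_MS = 1000;

// Bounded version of the wait in sendACK(): stop carrier sensing after
// CSMA_LIMIT_MS and send anyway, as send() does. Returns true if the channel
// was seen clear before the deadline, false if the wait timed out.
template <typename Radio>
bool sendAckBounded(Radio& radio) {
    const uint32_t start = radio.millis();
    bool clear = radio.canSend();
    // Unsigned subtraction keeps the elapsed-time test valid across millis() rollover.
    while (!clear && radio.millis() - start < CSMA_LIMIT_MS) {
        radio.receiveDone();
        clear = radio.canSend();
    }
    radio.sendFrame();
    return clear;
}
```

In the real library the equivalent change would presumably be just the extra millis() condition on the while loop in sendACK(). The return value is only there so a gateway could log how often the wait times out, which might help show whether this loop really is where the hang occurs.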
Another (possibly) related issue: I've observed some isolated cases of a node hanging while holding the channel open (my debugging nodes report a readRSSI() of -22 to -30, well above the CSMA limit) -- the problem goes away once that one particular node is reset. Again, reading the code very carefully, I can't find any situation where the transmitter would stick in the "on" position unless something was tampering with the interrupt pin, which my designs leave untouched. Through battery instrumentation I've mostly ruled out brownouts as a potential factor. Still, it might be safer to have some form of guard on the amount of time that the radio is allowed to stay in RF69_MODE_TX.
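One possible shape for such a guard, as a rough sketch: a small helper that records when the radio enters TX and reports when it has stayed there too long. The TxWatchdog class, its method names and the limit passed to it are hypothetical and not part of the RFM69 library; the caller would have to feed it mode changes and the current millis() value.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the RFM69 library): tracks how long the
// radio has been in TX mode and flags when a caller-supplied limit is exceeded.
class TxWatchdog {
public:
    explicit TxWatchdog(uint32_t limitMs)
        : limitMs_(limitMs), inTx_(false), txStartMs_(0) {}

    // Call whenever the radio mode changes; nowMs would come from millis().
    void onModeChange(bool nowInTx, uint32_t nowMs) {
        if (nowInTx && !inTx_) txStartMs_ = nowMs; // remember when TX began
        inTx_ = nowInTx;
    }

    // True if the radio is in TX and has been for more than limitMs_.
    // Unsigned subtraction keeps this correct across millis() rollover.
    bool stuck(uint32_t nowMs) const {
        return inTx_ && (nowMs - txStartMs_ > limitMs_);
    }

private:
    uint32_t limitMs_;
    bool     inTx_;
    uint32_t txStartMs_;
};
```

A node's main loop could poll stuck(millis()) and, when it returns true, take some recovery action such as re-initializing the radio. This only helps if the MCU itself is still running; a hardware watchdog would be needed to cover a full processor hang.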
Thanks for all help and tips.