Author Topic: MoteinoM0 with RFM69HCW - losing 1 in 100 messages with receiveDone() as trigger  (Read 1809 times)

ianraustralia

  • NewMember
  • *
  • Posts: 9
My application uses a MoteinoM0 with RFM69HCW module mounted(as purchased). I must not lose any event messages so I have been testing high volume messages delivery with what I believe is the simplest possible sender (my SENSOR) and receiver (my PLUGPACK) code using Felix's RFM69 library (as at 29-1-2019). I am monitoring the hardware timing on both devices using an oscilloscope writing an output pin HIGH and LOW around the code lines of current particular interest.

For days now  ( !!!   :(   ) with a wide variety of message lengths (1-50 characters), devices separations (300mm, 1 metre and 5 metres), and frequency of sending messages  (10 per second up to 1 every 2 seconds) ... the PLUGPACK reliably loses 1 in 100 "chirped" messages, whatever the test setup. This behaviour occurs whether I use send() or sendWithRetry() (with zero retries )as the SENSOR sending event.

I am looking for a fresh view on my using Felix's RFM69 libray function receiveDone() boolean as the trigger for processing an inbound (received) message.

I would appreciate any alternative suggestions/methods for the receive trigger, from the RFM69 library, to avoid using receiveDone(), as my belief is that something in that library code is the cause of the 1/100 lost message .

I "believe" the PLUGPACK (receiver) is missing the 1/100 messages as its reported timing data reliably shows a receipt timing gap of 2 chirp spacings with EVERY lost message occurrence.

I "believe" the SENSOR is reliably sending sequential chirped messages, with no timing gaps, up to test batches of 3000 messages.

The SENSOR repeatedly sends "chirp" messages controlled by a delay. The SENSOR reports the chirp number and millis() report of before and after sending each:

Code: [Select]
Format: MILLIS()      MSG NO    MILLIS()

201 1 205
305 2 309
409 3 413
513 4 517
617 5 621
721 6 725
825 7 829
929 8 933
SENSOR (sender) code:
Code: [Select]
void loop() {

  delay(100);

  digitalWrite(OSCILLOSCOPE, HIGH);
 
  Serial.print(millis());Serial.print(" ");
 
  chirp_message_count++;
  send_string = chirp_message_count; 
  send_length = send_string.length() + 1;
  send_buffer[send_length];
  send_string.toCharArray(send_buffer, send_length);
  Serial.print(send_string); Serial.print(" ");
 
  radio.send(PLUGPACK, send_buffer, send_length, false); for production ???
 
  Serial.println(millis());
 
  digitalWrite(OSCILLOSCOPE, LOW);

}

PLUGPACK (receiver) code:
Code: [Select]
void loop() {

  if (radio.receiveDone())
  {
    digitalWrite(OSCILLOSCOPE, HIGH);

    Serial.print(millis()); Serial.print(" ");
    rx_count++;
    Serial.print(rx_count); Serial.print(" ");

    digitalWrite(OSCILLOSCOPE, LOW);

    digitalWrite(OSCILLOSCOPE, HIGH);

    Serial.print((char*)radio.DATA); Serial.print(" ");
    Serial.print(millis()); Serial.println();

    digitalWrite(OSCILLOSCOPE, LOW);

  }

The PLUGPACK reports, when  receiveDone() == true , the received message count, the received message number and the millis() value before and after the receive event. A typical PLUGPACK report, showing a typical "lost" 1/100 message , is shown below:
Code: [Select]
MILLIS()      RECVD COUNT     RCD MSG NUMBER     MILLIS()

23204 159 159 23204  - note 100 mS chirp timing spacings
23308 160 160 23308
23412 161 161 23412
23516 162 162 23517
23620 163 163 23621
23829 164 165 23829  - inbound message 164 was lost here , note 200mS timing gap before 165 was received
23933 165 166 23933
24037 166 167 24037
24141 167 168 24141
24245 168 169 24245

After 4 days of failure, I need a fresh view.

ianraustralia

  • NewMember
  • *
  • Posts: 9
aha!! looking down the RF rabbit hole ...

I decided to rebuild the SENSOR and PLUGPACK code back up to the "production" length messages.
In that process I added Felix's wonderful idea to report RFM69HCW  RSSI.

Lo and behold, in a test of 1500 messages chirped 100mS apart the lost messages still happened ... but ..
.. in every case of a lost message the message immediately after having a HIGH RSSI value !

This implies to me the PLUGPACK (receiver) is seeing an "RF field strength" anomaly in the case of the lost messages.

The above test was with only 300mm SENSOR-PLUGPACK separation.
The same loss level has been reliably observed at 5 metres separation

The evidence follows ... the RSSI reported in 99% of messages was a very stable -26 or -27 with both devices remaining steady and untouched on a table top.

Code: [Select]
367	S000001	190129	53	ABCDEFGHIJKLMNOPQRS	50	367	RX_RSSI=	-26	.
368 S000001 190129 54 ABCDEFGHIJKLMNOPQRS 50 368 RX_RSSI= -26 .
369 S000001 190129 54 ABCDEFGHIJKLMNOPQRS 50 370 RX_RSSI= -12 why strong?   message 369 just lost
370 S000001 190129 54 ABCDEFGHIJKLMNOPQRS 50 371 RX_RSSI= -26 .
371 S000001 190129 54 ABCDEFGHIJKLMNOPQRS 50 372 RX_RSSI= -26 .

380 S000001 190129 55 ABCDEFGHIJKLMNOPQRS 50 381 RX_RSSI= -26 .
381 S000001 190129 56 ABCDEFGHIJKLMNOPQRS 50 382 RX_RSSI= -26 .
382 S000001 190129 56 ABCDEFGHIJKLMNOPQRS 50 383 RX_RSSI= -26 .
383 S000001 190129 56 ABCDEFGHIJKLMNOPQRS 50 385 RX_RSSI= -12 why strong?  message 384 just lost
384 S000001 190129 56 ABCDEFGHIJKLMNOPQRS 50 386 RX_RSSI= -27 .
385 S000001 190129 56 ABCDEFGHIJKLMNOPQRS 50 387 RX_RSSI= -26 .

746 S000001 190129 149 ABCDEFGHIJKLMNOPQRS 50 748 RX_RSSI= -26 .
747 S000001 190129 149 ABCDEFGHIJKLMNOPQRS 50 749 RX_RSSI= -26 .
748 S000001 190129 150 ABCDEFGHIJKLMNOPQRS 50 750 RX_RSSI= -26 .
749 S000001 190129 150 ABCDEFGHIJKLMNOPQRS 50 751 RX_RSSI= -27 .
750 S000001 190129 150 ABCDEFGHIJKLMNOPQRS 50 753 RX_RSSI= -13 why strong?  message 752 just lost
751 S000001 190129 150 ABCDEFGHIJKLMNOPQRS 50 754 RX_RSSI= -26 .
752 S000001 190129 150 ABCDEFGHIJKLMNOPQRS 50 755 RX_RSSI= -26 .
753 S000001 190129 150 ABCDEFGHIJKLMNOPQRS 50 756 RX_RSSI= -26 .

1305 S000001 190129 312 ABCDEFGHIJKLMNOPQRS 50 1311 RX_RSSI= -26 .
1306 S000001 190129 312 ABCDEFGHIJKLMNOPQRS 50 1312 RX_RSSI= -26 .
1307 S000001 190129 312 ABCDEFGHIJKLMNOPQRS 50 1313 RX_RSSI= -26 .
1308 S000001 190129 312 ABCDEFGHIJKLMNOPQRS 50 1314 RX_RSSI= -26 .
1309 S000001 190129 313 ABCDEFGHIJKLMNOPQRS 50 1316 RX_RSSI= -13 why strong?  message 1315 just lost
1310 S000001 190129 313 ABCDEFGHIJKLMNOPQRS 50 1317 RX_RSSI= -26 .
1311 S000001 190129 313 ABCDEFGHIJKLMNOPQRS 50 1318 RX_RSSI= -26 .

1418 S000001 190129 329 ABCDEFGHIJKLMNOPQRS 50 1425 RX_RSSI= -26 .
1419 S000001 190129 329 ABCDEFGHIJKLMNOPQRS 50 1426 RX_RSSI= -26 .
1420 S000001 190129 329 ABCDEFGHIJKLMNOPQRS 50 1427 RX_RSSI= -26 .
1421 S000001 190129 329 ABCDEFGHIJKLMNOPQRS 50 1429 RX_RSSI= -12 why strong?  message 1428 just lost
1422 S000001 190129 330 ABCDEFGHIJKLMNOPQRS 50 1430 RX_RSSI= -26 .
1423 S000001 190129 330 ABCDEFGHIJKLMNOPQRS 50 1431 RX_RSSI= -26 .
1424 S000001 190129 330 ABCDEFGHIJKLMNOPQRS 50 1432 RX_RSSI= -26 .

Note this setup has no ACKs and is simply observing the "inner core" of a send event followed by a receive event.

So ... since I am running ATC firmware on both devices which automatically adjusts the sender transmit field strength ... I am starting to look more closely at how the SENSOR (sender) could behave in such a radical fast way to increase the transmit field strength suddenly but briefly ...

So reviewing the ATC library code is my next adventure ...

So now I am thinking the receiveDone() code is OK and doing exactly what it is supposed to. (Relax Felix! :) )

There's no escaping any RF application has to consider the hardware behaviour and programming along with the associated "setup and run" code.

Any suggestions for more peering into my RF rabbit hole?

ianraustralia

  • NewMember
  • *
  • Posts: 9
So I have worked through the radio power controls options and settings commenting out TX (higher) power options
... aiming to achieve absolute minimum and constant TX power ...
... sender and receiver are 300mm apart on a table ...

Code: [Select]
//#define ENABLE_ATC      //comment out this line to disable AUTO TRANSMISSION CONTROL
//#define ATC_RSSI        -75

Then in setup():

Code: [Select]
 
radio.setPowerLevel(RF_PALEVEL_PA0_OFF); //see HopeRF datasheet Table 10 p21/79
radio.setPowerLevel(RF_PALEVEL_PA1_ON);
radio.setPowerLevel(RF_PALEVEL_PA2_ON);
  //radio.setHighPower();     // ??? used for max RF power tests ONLY not production
 // radio.enableAutoPower(targetRSSI);

radio.encrypt(ENCRYPTKEY);
//radio.enableAutoPower(ATC_RSSI);

And the result in a test of 1031 messages at 100mS chirp rate was 7 lost 
... a much lower but stable RX_RSSI value ... around -69 to -71 always
... BUT with now no change in RSSI at the lost message event !!
Code: [Select]
55	S000001	190129	8	ABCDEFGHIJKLMNOPQRS	50	55	RX_RSSI=	-68	0	
56 S000001 190129 8 ABCDEFGHIJKLMNOPQRS 50 56 RX_RSSI= -68 0
57 S000001 190129 8 ABCDEFGHIJKLMNOPQRS 50 58 RX_RSSI= -68 1 57 lost
58 S000001 190129 8 ABCDEFGHIJKLMNOPQRS 50 59 RX_RSSI= -68 1
59 S000001 190129 8 ABCDEFGHIJKLMNOPQRS 50 60 RX_RSSI= -68 1

216 S000001 190129 31 ABCDEFGHIJKLMNOPQRS 50 217 RX_RSSI= -69 1
217 S000001 190129 32 ABCDEFGHIJKLMNOPQRS 50 218 RX_RSSI= -70 1
218 S000001 190129 32 ABCDEFGHIJKLMNOPQRS 50 220 RX_RSSI= -71 2 219 lost
219 S000001 190129 32 ABCDEFGHIJKLMNOPQRS 50 221 RX_RSSI= -70 2
220 S000001 190129 32 ABCDEFGHIJKLMNOPQRS 50 222 RX_RSSI= -70 2

231 S000001 190129 34 ABCDEFGHIJKLMNOPQRS 50 233 RX_RSSI= -70 2
232 S000001 190129 34 ABCDEFGHIJKLMNOPQRS 50 234 RX_RSSI= -69 2
233 S000001 190129 34 ABCDEFGHIJKLMNOPQRS 50 236 RX_RSSI= -69 3 235 lost
234 S000001 190129 34 ABCDEFGHIJKLMNOPQRS 50 237 RX_RSSI= -70 3
235 S000001 190129 35 ABCDEFGHIJKLMNOPQRS 50 238 RX_RSSI= -69 3

... similar for the 4 others


Not sure what to try next ...??

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 6866
  • Country: us
    • LowPowerLab
Ian,
Just a random thought, I wonder if anything changes if you disable the encryption. That takes a few ms more time to process each packet (proportionally more for each 16byte section of a packet).

ianraustralia

  • NewMember
  • *
  • Posts: 9
Thanks Felix... tried with encryption turned OFF and loss rate went worse to 1 in 70 messages !

I then tried shifting the reporting if the received count to AFTER reporting the content of the 50 character "production" message.

This brought the loss rate down to and AMAZINGLY DIFFERENT and BETTER 1 in 1000 !!  :) :) :)

I use the EXCEL data import wizard with space delimiters, operating on the TeraTerm copied buffer, to analyse message losses.
An EXCEL line count is a fast way to check against a received message number, and a simple formula in an adjacent colum shows exactly where the messages were lost.
All the data has to be reported on one line to use this technique. I recommend it.

Here is my PLUGPACK code for that result:
Code: [Select]
// PLUGPACK receive messages

void receive_message_service() {

  if (state_m0 == M0_STANDBY) {
    //check if something was received (could be an interrupt from the radio)

    if (radio.receiveDone())
    {
      digitalWrite(OSCILLOSCOPE, HIGH);
      state_m0 = M0_REPORT_MESSAGE;
      digitalWrite(OSCILLOSCOPE, LOW);
    }
  }
  if (state_m0 == M0_REPORT_MESSAGE) {
    digitalWrite(OSCILLOSCOPE, HIGH);
    report_message_just_received();
    state_m0 = M0_STANDBY;
    digitalWrite(OSCILLOSCOPE, LOW);
  }
}

void report_message_just_received() {

  //print message just received and received messages count

  Serial.print((char*)radio.DATA); Serial.print(" ");
  Serial.print("RX_RSSI= "); Serial.print(radio.RSSI); Serial.print(" ");
  rx_count++;
  Serial.print(rx_count);
  Serial.println();

}

So it is becoming very clear to me after a lot of tests that the RFM69 interrupt code can be "interfered with" or can "interfere with" the Arduino library Serial interrupt code.
I have NEVER lost any Serial.print(...) content in many months of Arduino debugging work.
So it seems to me they have given it a very high execution priority.
In any microprocessor application, competing interrupts will always have to be managed very carefully.

It appears to me we have your RFM69 interrupt maybe being over-ridden(?) briefly under certain timing circumstances in relation to Serial.print activity.
My tests have proved message delivery is in the range 99.0% when I started this work, improving to 99.9% with the code above.

In my "production" code the PLUGPACK will never report serial data unless I am trying to fix a bug/issue.
But ... there is other decision making code, yet to be written, which MUST execute immediately after a message is received, to maximise battery life.
And as Arduino debugging does not use an emulator, I will be very careful in using the shortest possible Serial.print() debug messages.
Printing a dot (".") under very specific circumstances has been my minimal Serial.print report ! It stands out from characters!
I also could not do any of this work without the OSCILLOSCOPE output pin to check timing of related events.

In the above case there was a significant execution difference because of my before/after placement of the interrupt related functions.

If you could give us any advice on how to "stay clear of Arduino Serial.print()" interrupts with MotenoM0 and RFM69 library that would be much appreciated.

ianraustralia

  • NewMember
  • *
  • Posts: 9
Finally I have found a solution.  :)

This was only possible using my trusty  oscilloscope.
I could see infrequent, but comparable to the loss rate, timing jitter as the PLUGPACK processed the inbound messages.
It was clear the "shifting Arduino system timing interrupts (supporting millis())" rather than anything to do with Serial processing was the likely culprit.
We can't and don't want to change the Arduino system baseline code.

I can now achieve 3 messages lost in 6000 sent, spaced at 100mS,  by adding a "duplicate send message" line of code to my SENSOR.  :) :) :)

SENSOR code:
Code: [Select]
void radio_send_chirp_message() {
  //
  digitalWrite(OSCILLOSCOPE, HIGH);
 
  Serial.print(send_string); Serial.print(" ");
  Serial.print("tx at "); Serial.print(millis());
  Serial.print(" chars="); Serial.print(send_length);
  Serial.println();
 
  //radio.sendWithRetry(PLUGPACK, send_buffer, send_length, 0); // 0 retries for dev+test, maybe for production ???
  radio.send(PLUGPACK, send_buffer, send_length, false); // 0 retries for dev+test, maybe for production ???
  radio.send(PLUGPACK, send_buffer, send_length, false); // 0 retries for dev+test, maybe for production ???
 
  state_led_green = LED_GREEN_FLASH;
  led_green_flash_start_time = millis();  // confirm chirp msg with green flash
 
  digitalWrite(OSCILLOSCOPE, LOW);
}

Felix I looked at the RFM69 library , RFM69.cpp line 135, to find you had written:
Code: [Select]
attachInterrupt(_interruptNum, RFM69::isr0, RISING);
Arduino attachInterrupt(...) definition is here:  https://www.arduino.cc/reference/en/language/functions/external-interrupts/attachinterrupt/
---------------------------------
Parameters
interrupt: the number of the interrupt (int)
pin: the pin number (Arduino Due, Zero, MKR1000 only)
ISR: the ISR to call when the interrupt occurs; this function must take no parameters and return nothing. This function is sometimes referred to as an interrupt service routine.
mode: defines when the interrupt should be triggered. Four constants are predefined as valid values:

LOW to trigger the interrupt whenever the pin is low,

CHANGE to trigger the interrupt whenever the pin changes value

RISING to trigger when the pin goes from low to high,

FALLING for when the pin goes from high to low.

The Due, Zero and MKR1000 boards allows also:

HIGH to trigger the interrupt whenever the pin is high.
--------------------------------------

Felix, would you consider changing your isr0() code in the library to use a LOW or CHANGE  to "widen" the capture range of the receiveDone() trigger?.
It seems this would avoid missing the transient RISING trigger at the moment.

The application watching loop would catch the LOW/CHANGE soon after the Arduino system interrupt service was completed.

Looks to me like anyone's code could be a victim of what I have discovered but can now rectify ...  with 1 loss in 2000 messages now acceptable/tolerable for my application.

SENSOR transmission time is an extra (duplicated) 10mS which my calulcations tell me is not a huge battery life price to pay for this "interrupt collision avoidance" strategy.
Moteino M0 event/message wakeup time is around 80 mS.

My preference would be to have the library avoid the collisions, rather than this workaround.

Code: [Select]
attachInterrupt(_interruptNum, RFM69::isr0, LOW or CHANGE);

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 6866
  • Country: us
    • LowPowerLab
Felix, would you consider changing your isr0() code in the library to use a LOW or CHANGE  to "widen" the capture range of the receiveDone() trigger?.
It seems this would avoid missing the transient RISING trigger at the moment.

The application watching loop would catch the LOW/CHANGE soon after the Arduino system interrupt service was completed.

Have you tried using LOW or CHANGE in your local copy of the library?

If I understand correctly,you're reporting that RISING is missed, sometimes, because of some Arduino interrupt processing routines?

I don't have enough data to really be able to give an accurate assessment and answer to this question. My gut tells me the answer is no, and if this is indeed a repeatable problem then the fix would be somewhere else. I see your data and I wonder how could RISING be missed, especially in a fast processor.

LOW would/could trigger the interrupt "much later" than when the radio issues the READY edge interrupt, I don't think that's a good shift, because we want to respond to an interrupt ASAP.
CHANGE would trigger the interrupt twice every time (of which one is in vain), that sounds like a duct-tape fix (sometimes necessary but I'd like other options first).
BTW 10ms is significant in battery powered applications.

I'm not sure I fully understand the statement about M0:
Quote
Moteino M0 event/message wakeup time is around 80 mS.

ianraustralia

  • NewMember
  • *
  • Posts: 9
Thanks for your suggestions Felix.

I'm sorry ... should have said "My application MotenoM0 event/message wakeup time is around 80 ms using the duplicate send() code."

With only one send() line, I tried LOW in the isr0() code in my RFM69.cpp local file and the interrupt would not fire, and HIGH is not an option,  so no go there.

I then tried CHANGE in my RFM69.cpp local file and the loss rate went back to 1 in 300.

I then tried sendWithRetry(with 1 retry), which caused the transmission time to go up to 140mS, a lot more than the 80mS for the duplicate send() code, and the loss rate at 1 in 300 messages.

I then went back to the duplicate send() code and got a loss rate of 1 in 3000 messages on several tests, and the shorter TX time ... just so much better!

I also temporarily disabled my application RTC (real time clock calendar) which I assume is updating year, month, day, hour, minutes, seconds registers at high priority.
I thought this might be the culprit ... but with no RTC active, the loss rate with a single send() was around 1 in 100/200 .. so not guilty.

My understanding is the Arduino millis() system timer interrupt is happening at 1mS intervals at a higher priority than the application RFM69 isr0() execution.
The observed osciiloscope timing jitter in the PLUGPACK (receiver) code, at about the same rate as the loss rate, to me, points a finger that something in the system code is active.

I don't have any other coded interrupts active in my application. The huge improvement, and lower battery consumption, using the short 80mS duplicate send() code seems like my best option.
If anybody else can suggest a possibly active Arduino system interrupt as the "other player" in the interrupts collisions,  that would be a big step forward.

Felix/anyone ... if I could understand why 1999 in 2000 duplicate sequential  send() cases ONLY detects ONE of the TWO transmissions 10mS apart, detected by receiveDone() == true, that would be helpful.

See very repeatable OSCILLOSCOPE image below.

SENSOR (sender) code to produce spreadsheet file at PLUGPACK (receiver) attached below:
Code: [Select]
void radio_send_chirp_message() {
  //
  digitalWrite(OSCILLOSCOPE, HIGH);

  radio.send(PLUGPACK, send_buffer, send_length, false);
  radio.send(PLUGPACK, send_buffer, send_length, false);

  state_led_green = LED_GREEN_FLASH;
  led_green_flash_start_time = millis();  // confirm 2x chirp msgs sent with green flash

  digitalWrite(OSCILLOSCOPE, LOW);
}
PLUGPACK (receiver) code:
Code: [Select]
void loop() {

  led_red_flash_service();

  if (radio.receiveDone())
  {
    digitalWrite(OSCILLOSCOPE, HIGH);
   
    Serial.print(millis()); Serial.print(" ");
    Serial.print((char*)radio.DATA); Serial.print(" ");
    rx_count++;
    Serial.print(rx_count);
    Serial.print(" ");
    Serial.print("RSSI= "); Serial.print(radio.RSSI);
    Serial.println();
    state_led_red = LED_RED_FLASH;
    led_red_flash_start_time = millis();

    digitalWrite(OSCILLOSCOPE, LOW);
  }
}


« Last Edit: January 31, 2019, 10:00:59 PM by ianraustralia »

Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 6866
  • Country: us
    • LowPowerLab
ian,
I think you have a solution that is "reasonable". Each implementation will differ and since we're dealing with wireless it is actually an expectation that things can and will go wrong and you have to have retry mechanisms that get you around those cases.
The sendWithRetry() is obviously slower than just using multiple consecutive send()'s. But it's what others use in applications where the retry time is tolerable.
It sounds like you're at a point where you have a workable solution.