Topic: Whole network stops transmitting.

nerdkingdan

  • NewMember
  • *
  • Posts: 18
Whole network stops transmitting.
« on: August 20, 2019, 03:24:46 PM »
We have about 50 Moteinos in our building running a variety of projects. We have about 9 SwitchMotes that control the power to the different projects.

Sometimes when we switch on the relays in the SwitchMotes to turn on the other Moteinos, something happens that causes all transmissions to stop. When we power cycle the whole building, the problem goes away for a random amount of time, sometimes a few days, sometimes a month. Once the whole network is up, it works perfectly. When we shut down at the end of the day, the only Moteinos still on are 4 in battery-powered handheld devices and the 9 SwitchMotes. When we power on the next day, most of the time everything comes up and works fine, but sometimes all transmissions lock up until we power cycle the building by killing main power.

I ran code for a month that logged the transmissions from all Moteinos. When it locked up, there was no pattern as to what the last transmission was, and there were no active transmissions.

When transmissions stop, our serial monitors show we are attempting transmissions but not sending them. My guess is that a Moteino is stuck transmitting, so none of the other Moteinos actually attempt to transmit. Is that possible? Could this be a failing Moteino? An outdated library (we installed in 2016)? Or something else?

We have 8 communicating over serial with Raspberry Pis, some communicating over I2C with Arduinos, and some just handling relays and/or reed sensors.



Felix

  • Administrator
  • Hero Member
  • *****
  • Posts: 6866
  • Country: us
    • LowPowerLab
Re: Whole network stops transmitting.
« Reply #1 on: August 20, 2019, 05:09:33 PM »
Has this always happened or just started happening recently?
Is the code running on all of them from back in 2016?
If so, an upgrade to the latest RFM69 code might be the solution.
The SwitchMotes can be OTA updated without the need to remove them from the walls.

Before getting there, it could also be worthwhile to try to pinpoint the problem or get as close to a conclusion as possible.
You seem to indicate that one of the nodes may be stuck in transmission mode; that condition fits the behavior you describe.

Do you have any spectrum analyzer like a cheap SDR kit to look at the spectrum when the network exhibits the locked behavior?
That would help see if a node is stuck in TX.

TomWS

  • Hero Member
  • *****
  • Posts: 1930
Re: Whole network stops transmitting.
« Reply #2 on: August 20, 2019, 06:22:39 PM »
Quote from: Felix
"Do you have any spectrum analyzer like a cheap SDR kit to look at the spectrum when the network exhibits the locked behavior?
That would help see if a node is stuck in TX."

I second this suggestion.  I've seen RFM69 radios, under the right conditions (AKA probably a bug in the code) go rogue and go to continuous TX mode, essentially jamming the network.  A restart of the faulty node is the only solution in this case.

Felix

Re: Whole network stops transmitting.
« Reply #3 on: August 21, 2019, 09:18:06 AM »
Quote from: TomWS
"I've seen RFM69 radios, under the right conditions (AKA probably a bug in the code) go rogue and go to continuous TX mode, essentially jamming the network.  A restart of the faulty node is the only solution in this case."

It would be interesting to pinpoint which one that is. If it's the gateway, pinging it via serial would reveal whether it's stuck (no response) or not, which is a relatively easy check. It would require reprogramming if there is no serial ping/echo mechanism in your gateway sketch, and that would also mean using your local RFM69 library (from 2016?) to "keep the bug".
Reprogramming with the latest version could actually change the behavior, if indeed the gateway is doing this.
But I think there's a good chance any node can enter this condition and block the whole network.
In such cases isolating the problem is the only way to find it.
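
A serial ping/echo hook of the kind mentioned above could be as small as the following sketch. It is only an illustration: the `PING`/`PONG` strings and the function name `handleGatewayLine` are hypothetical and not taken from any gateway sketch. On a Moteino the line would be read from `Serial` and the reply written back with `Serial.println`.

```cpp
#include <string>

// Hypothetical liveness check for a gateway sketch: the host sends "PING"
// over serial and expects "PONG" back. No reply suggests the MCU is hung.
// The command names are assumptions for illustration only.
std::string handleGatewayLine(const std::string& line) {
    // Strip the trailing CR/LF that serial terminals usually append.
    std::string cmd = line;
    while (!cmd.empty() && (cmd.back() == '\r' || cmd.back() == '\n')) {
        cmd.pop_back();
    }
    if (cmd == "PING") {
        return "PONG";
    }
    return "";  // Not a ping: leave it to the sketch's normal command handling.
}
```

A reply only proves that the gateway's MCU is still running its main loop; it says nothing about the state of the radio itself.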

nerdkingdan

Re: Whole network stops transmitting.
« Reply #4 on: August 21, 2019, 05:18:42 PM »
We have it narrowed down to 2 Moteinos now. We will be able to look into it fully later this week.

My theory, now that we have it down to those two, is an I2C timeout. I'll know more when we know which of the 2 it is. One is a SwitchMote, and the other is connected over I2C to run an Arduino Mega.

TomWS

Re: Whole network stops transmitting.
« Reply #5 on: August 21, 2019, 09:45:39 PM »
Quote from: Felix
"It would be interesting to pinpoint which one that is. If it's the gateway, pinging it via serial would reveal whether it's stuck (no response) or not, which is a relatively easy check. It would require reprogramming if there is no serial ping/echo mechanism in your gateway sketch, and that would also mean using your local RFM69 library (from 2016?) to 'keep the bug'.
Reprogramming with the latest version could actually change the behavior, if indeed the gateway is doing this.
But I think there's a good chance any node can enter this condition and block the whole network.
In such cases isolating the problem is the only way to find it."

In all the cases I've seen, it's a node that takes out the network. It's one of the reasons I've added a HW watchdog (TPL5010) to all of my Motes. I've never been able to capture an event that would directly cause the hang (because it's so rare and random), but after adding the HW watchdog, I've never had the problem for more than the WD timeout period. I've obviously been modifying my code over the years, so I have no idea if the bug persists, but if it does, it doesn't cause a network shutdown anymore.

For those who have a penchant for chasing these kinds of things, I do think that the problem occurs when there is a transmission collision, but I don't recall the specific information that leads me to believe that...
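
The reason a hardware watchdog bounds the outage to one timeout period can be shown with a simplified model. This is a sketch of the general idea only, not a pin- or timing-accurate model of the TPL5010; the struct and member names are hypothetical.

```cpp
#include <cstdint>

// Simplified model of an external hardware watchdog. The firmware must call
// kick() at least once per timeout period; if it stops (for example because
// it hung), resetDue() becomes true and real hardware would reset the MCU.
struct ExternalWatchdog {
    uint32_t timeoutMs;   // watchdog period
    uint32_t lastKickMs;  // time of the most recent kick

    void kick(uint32_t nowMs) { lastKickMs = nowMs; }

    // Unsigned subtraction keeps the comparison correct across a
    // millisecond-counter rollover.
    bool resetDue(uint32_t nowMs) const {
        return (nowMs - lastKickMs) >= timeoutMs;
    }
};
```

In this model a hung node stays hung for at most `timeoutMs` before it is restarted, which matches the behavior described above.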

Felix

Re: Whole network stops transmitting.
« Reply #6 on: August 22, 2019, 10:12:31 AM »
Quote from: nerdkingdan
"We have it narrowed down to 2 Moteinos, now."

Interesting, how did you determine them?
When you mention I2C, can you give more detail (if possible and relevant) about that part?

mattm00700

  • NewMember
  • *
  • Posts: 11
  • Country: us
Re: Whole network stops transmitting.
« Reply #7 on: September 09, 2020, 08:41:04 PM »
I have also seen this and can trigger it reliably. Feed a Moteino just over its brownout voltage. When it starts to draw heavy TX current and the Atmel goes into reset, the RFM69 lets out a continuous howl centered on the TX frequency but across a much wider swath of bandwidth and in sharp peaks. FCC part whichever is blown out of the water, I'm sure. It goes well out of the ISM band if transmitting near the edges. For me this has wrecked all traffic on the network until that node goes dead, which doesn't take too long on battery with continuous TX. The best solution may be a pull (down/up?) resistor on the radio's reset unless Atmel is competent, but a software workaround seems to help a lot. I point the ADC at the 1.1V internal source. If that reading shifts much from its initial value, you know the rail voltage is dropping out of regulation. If you check and don't start a TX under those conditions, it seems to behave. I'm curious about what changed since the 2016 library that might cover this. If it is indeed caused at least sometimes by the Atmel resetting during a TX or an SPI transaction with the radio, it seems like the only thing that can be done is to see the condition looming and refuse to TX.
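
The check described above can be sketched as follows. This is a minimal host-side illustration, not code from the thread: `RailGuard`, `maxShiftCounts` and the numbers in the comments are hypothetical. It assumes the ADC uses the supply rail as its reference while measuring the 1.1V internal source, so the reading rises as the rail sags.

```cpp
#include <cstdint>

// Remembers the first reading of the 1.1V internal source (taken with a
// healthy supply) and refuses TX once later readings drift too far from it.
// Example: on a 10-bit ADC a 3.3V rail gives a reading of about 341.
struct RailGuard {
    uint16_t baseline;        // reading captured at startup
    uint16_t maxShiftCounts;  // hypothetical tolerance, in ADC counts

    bool txAllowed(uint16_t reading) const {
        // Absolute difference between the current reading and the baseline.
        const uint16_t shift = static_cast<uint16_t>(
            reading > baseline ? reading - baseline : baseline - reading);
        return shift <= maxShiftCounts;
    }
};
```

A sketch would call `txAllowed()` with a fresh reading immediately before each send and skip the transmission when it returns false.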

Felix

Re: Whole network stops transmitting.
« Reply #8 on: September 09, 2020, 09:31:14 PM »
The MCU (call it Atmel if you like) will brown out much sooner (at 2.7V on a Moteino) than the radio module. The 8MHz Moteino browns out at 1.8V, but in both cases the MCU is seriously overclocked at that voltage and should not be run there. So the case you describe is ... not something you would want to do for real.
The MCU will stop when the TX starts, so the radio will be left in TX mode while the MCU might be in brownout/unstable.
In that case I think you need a better power supply. The library can't help you at that point.

mattm00700

Re: Whole network stops transmitting.
« Reply #9 on: September 09, 2020, 11:14:09 PM »
Thanks Felix. I get that it's not something you want to do for real, but if you are running on a battery and don't change it before it falls through that voltage range, you're liable to have a radio siren until the battery drops below 1.8V or wherever the radio dies, unless you check in code before TX. I can't be the only person that has let their Moteino battery die. It has happened many times with Moteinos in the field running on 6V packs with 4 AAs. I lose all the data on the network until the battery in that node dies many hours later.

I guess a "better power supply" is a battery with a low-current power supervisor that supplies either plenty of voltage or none at all? I doubt that's what is attached to many Moteinos' Vin. I'm also not saying it happens every time a battery dies, but often enough that I needed to fly to the field site with an SDR to figure out why our networks would drop out completely and then come back just fine many hours later.

At least for me, the problem is solved by checking for the failing battery using the shift in the ADC reading of the 1.1V reference and not starting a TX on a shady supply voltage. It seems like a fix that might benefit anybody running on a battery if they can spare the time for the ADC read. The options I see are: make sure you never supply voltage in the danger range (that is, never let the battery run down), implement a software or hardware solution, or not care about potentially splattering out radio power until the battery dies completely. Always wanted to meet the local FCC person, surprise visits a bonus.

Felix

Re: Whole network stops transmitting.
« Reply #10 on: September 10, 2020, 12:55:25 PM »
4 AAs will be discharged at around 1V each; that's still 4V, way above the 2.7V brownout for 16MHz Moteinos and 1.8V for 8MHz Moteinos. And you don't need a supervisor or anything fancy to read the battery voltage under load (take a reading when it makes sense, e.g. after a transmission).

I have not observed this problem though. I have many Moteinos that are battery powered; when they die, they die, and everything else works.
There are many thousands of Moteinos and custom boards I made over the years running at 16/8MHz with the RFM69, on all kinds of batteries, that die and don't take hundreds of other sibling nodes with them; they don't cause the problem described here. So I'm not sure how to prevent this from happening. Even if you add a nano-current watchdog, it will reset your device, and when it wakes it cycles into brownout again.
Since you ask ... how do you expect the library to fix the pseudo problem you describe after a mote is brain dead?

mattm00700

Re: Whole network stops transmitting.
« Reply #11 on: September 10, 2020, 02:14:19 PM »
I think you stopped too soon. Keep discharging those batteries and you will find they don't have a magic drop to 0V after they pass 1V, so they will pass through the brownout range. I can also make it happen reliably on a nice bench power supply set to the right voltage on every connection. It seems like quite a claim that no discharging battery will hit that voltage.

How much you see of it may be related to how much transmitting you do. I'm sending a lot of data, so I'm not sure. It's probably also related to how much looking you do and how. This was wrecking things for quite a while before the SDR revealed what was actually happening. If you only have Moteinos to work with, it's a lot harder to figure out what is going on: just no packets getting through.

I don't think this is a great feature, and at least for me I'm sure how to prevent it: checking for rail voltage sag in code before starting a TX.

I don't expect there is a library solution to this problem; I think it's a resistor and a pin connection to the RFM69 reset. I was asking about the library because of your response to nerdkingdan, who is/was experiencing the "stuck in TX" thing crashing the network, which implied that some bug that might lead to a similar condition was fixed since the 2016 library. For me at least it looks like there is a possible library solution if I can fix it with the ADC read, but sure, if it ain't broke don't fix it. It was broke enough for me. If you would like to observe the problem, it's not too hard to drum up: TX some full packets rapidly, plug into a source just over brownout, presto.

It seems like weak sauce to say "just never provide it with a voltage between brownout and 1.8V" on a device that most users run on unregulated batteries. I get that the odds are low, but this is a real thing, and the result is an illegal transmitter. I'd prefer to have the RFM's reset held by a resistor if the voltage is in a range that has it brain dead but mouth wide open; otherwise you're just playing the odds as far as I can tell.


Felix

Re: Whole network stops transmitting.
« Reply #12 on: September 10, 2020, 02:59:34 PM »
Fair enough, if a resistor solves this, it's a good solution. I will have to look into it, try to reproduce it, and draw conclusions from there.
Meanwhile it's easy enough to wire up on a Moteino, if you can solder a pin to a resistor to the RFM RST. Did you already try that?

Quote from: mattm00700
"It seems like quite a claim that no discharging battery will hit that voltage.

...

It seems like weak sauce to say 'just never provide it with a voltage between brownout and 1.8V' on a device that most users run on unregulated batteries."

I didn't quite make such claims. Of course you can deplete a battery. I just said that it's a good idea to change batteries based on a discharge curve, if you are watching the voltage at a reasonable point in the wake cycle. I can show you graphs of how certain types of batteries discharge, and it's very predictable. And I also said I didn't see this happen. Otherwise my home network, for instance, would go down every time or most times a mote is discharged; I have some that discharge often because they are mostly awake. And we'd probably see dozens of threads about being stuck in this forum. So ... I'm not denying anything; for now we have 1 thread, and I am interested to find out and observe more and fix whatever will make things better. Please try the resistor method in the meantime and let me know your conclusions.


mattm00700

Re: Whole network stops transmitting.
« Reply #13 on: September 11, 2020, 03:37:19 AM »
Thanks again Felix. I think several things may need to line up to see this happen, but it's easy enough to trigger intentionally and was happening enough to cause major headaches for me. If the battery is small enough, maybe it just won't provide enough current when discharged this far to actually TX, so it doesn't happen. AA batteries seem to work just fine to cause it. The other details are: stock 300kbps radio settings at power 20, a 915MHz HCW radio, sending (without retry) packets with 60 bytes of payload, and a 12ms delay between packets.

For the major headache, there are hundreds of these guys potted in urethane sitting in a field that takes a passport to get to, so it's not exactly a quick trip down to the sewer sump pump to change the battery if one happens to die. I do have remote monitoring of the battery but not someone who could always catch it in time, and no big deal anyway, I thought: they just go dead and you lose one damaged unit, right? The AA pack is good for a year or more in this app if everything works as it should, and I only need a couple of months, but that only happens in theory. "This type of thing is very predictable," a wise man once told me with the help of charts and graphs. A rodent or field worker will damage something, rain gets in, the pack drains faster than expected, 200 units in the dark, plane ticket purchased; it's Murphy's law.

For anything I do in the future, I will either have control of the power to the radio (best?), or at a minimum a pull-up resistor on the radio reset so that the MCU has to assert a pin for the radio to be operational. My understanding is that if a brownout reset happens the pin goes high impedance and the pull-up should hold the radio quiet, but the pull-up will always be consuming power, so a switch on radio power may be best.

I assumed a Moteino would die quietly; not always so. If it is in TX when the MCU browns out, it can stink up many MHz of the band with plenty of power. If you transmit infrequently or send short packets it may be impossible to trigger, I don't know, but when it happens it happens hard. The lesson for me is that having the radio (that can break the law) fully operational when its controller can go out to lunch in the middle of the transaction is probably a poor choice. Checking for the lack of regulated voltage before transmitting seems to solve the issue for me, but an auto-reset or RFM power-down in hardware would be way more solid; I just don't have the option anymore without replacing a bunch of stuff. It's slightly cumbersome to check your shoes to make sure you have enough leather each time you plan to cross the street, but it fixes the problem. Maybe there is a better way in code; this is what I call before TX:

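// Cutoff for the ADC reading of the 1.1V source; the reading rises as the rail sags.
#define CUTOFF_BATLEVEL 370

void antiDeathRattle() {
  ADMUX = B01011110;                   // supply rail (AVcc) as reference, measure the 1.1V internal source
  delay(3);                            // let the reference settle for 3 milliseconds
  ADCSRA |= _BV(ADSC);                 // start the ADC conversion
  while (bit_is_set(ADCSRA, ADSC));    // wait until the conversion completes
  uint16_t batLevel = ADCL;            // read the low byte first (always low before high)
  batLevel |= ADCH << 8;               // then add the high byte
  while (batLevel > CUTOFF_BATLEVEL);  // rail too low: hang here, DO NOT TX
}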

CUTOFF_BATLEVEL of 370 is what I'm using; anything decently over brownout to cover the TX drop should do. Hopefully this helps somebody. The hard part, at least for me, was realizing this is what was happening. If it happened a lot, it would have been discovered in 2 years of development. When the numbers went up, it happened plenty. If it can happen at all and the result is accidentally putting power continuously on someone else's commercial band, it's probably a pseudo problem best avoided. I could be comparing notes about how radio authorities in 2 countries ring you up, explaining how it's not actually hosing the band for hours, it just seems like it is.

It would be nice to know if there are other things that might trigger this condition. Sounds like Tom and Dan have some idea what I'm describing here. Maybe anything that locks the Atmel (call it MCU if you like) during TX could cause this? Arduino's I2C libraries are good for hangs, so that maybe checks out, but the regular watchdog will catch that as far as I have seen, possibly hanging the RFM. If that happens, the RFM is likely unrecoverable without a reset/power cycle, and wailing the whole time. In brownout you have no control, so it's either see it coming or hardware failsafe the radio. I don't think you can talk to the radio without a reset/power cycle after it starts wailing, so on something permanently powered, if it can be triggered, you might just need to cycle the power to the whole building every few days to a month; that's normal.

You know how the 'fruit folks break out and use the RFM reset? A waste of a pin and precious time, I thought; well, no more. Oddly, that board won't init if you don't tickle the reset, dunno. I think most or all of this is solved if the RFM is hard reset when the MCU resets, either from brownout or watchdog. Not having full authority over the radio may work OK if you send a byte if the mail comes today; still not good, but if every I2C hang requires a power cycle to restore the network, ugh.

As long as you don't let your battery get too low, or never watchdog reset, maybe you never see it. If you live in the real world, it happens at least to some of us, can't be recovered without a kludge wire to reset or a power cycle, and I don't think what I see on the radio spectrum could be anything close to legal. Can't miss it; it's impressive and will swamp at least 2MHz of the band either side, drowning traffic until something intervenes. For me anticipating brownouts in code works; maybe for somebody else an I2C library with timeouts would prevent resets during TX, but relying on software for this is a bandage. With all the grey hairs this has given me, I would happily lose a pin to gain a radio enable that must be asserted by a happy MCU, but my hardware is locked in at least for a while. If yours isn't, my advice is to have hardware that will keep the radio quiet if the MCU is sad. Otherwise, it will probably work, probably be legal if you tell it to be, and probably not wreck the whole network from time to time. Since I will definitely need work in the future to keep living indoors, I'm definitely never counting on this particular probably again if I can help it.
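
To relate a `CUTOFF_BATLEVEL` value to a rail voltage: assuming a 10-bit ADC referenced to the supply rail and a nominal 1.1V internal source (the real reference varies from chip to chip), the reading is about 1.1 × 1024 / Vcc. The helper names below are hypothetical, and the sketch is a host-side calculation rather than code for the Moteino.

```cpp
#include <cstdint>

// Nominal values; the actual internal reference differs per chip, so treat
// the results as estimates unless the reference has been calibrated.
constexpr double kBandgapVolts = 1.1;
constexpr double kAdcSteps = 1024.0;  // 10-bit ADC

// Approximate rail voltage implied by a (non-zero) reading of the 1.1V source.
double railVoltsFromReading(uint16_t reading) {
    return kBandgapVolts * kAdcSteps / reading;
}

// Approximate reading expected at a given rail voltage, rounded to nearest.
uint16_t readingFromRailVolts(double volts) {
    return static_cast<uint16_t>(kBandgapVolts * kAdcSteps / volts + 0.5);
}
```

Under these assumptions a cutoff of 370 corresponds to a rail of roughly 3.04V, and a 2.7V rail to a reading of about 417, so the cutoff trips some way above the 2.7V brownout level mentioned earlier in the thread.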



                             

Neko

  • NewMember
  • *
  • Posts: 37
  • Country: us
Re: Whole network stops transmitting.
« Reply #14 on: September 11, 2020, 11:48:00 AM »
This thread is certainly interesting. I've got 1000 nodes in the field on AA batteries currently, so the possibility of one node taking down a whole network and calling in an airstrike from the Feds is not a comforting one.

On the strength of this thread, I am doing a quick test. I have 30 nodes on fresh batteries churning away on the bench, sending 15-byte packets every 6 seconds. Within each 6-second cycle, the 30 nodes are transmitting in 100-ms "slots" relative to one another.

I then put two nodes on a power supply and have been repeatedly ratcheting down the voltage in 5-mV steps until failure. Here "failure" means that their signals fail to show up on the Gateway. The two nodes drop out at the same point every time (1.84V and 1.83V, respectively) without any effect on the remaining nodes.

I realize that this is not as strenuous a test as matt's. Fewer bytes, fewer nodes, fewer cycles of testing, a power supply rather than a battery. But so far I have not seen a problem. I will continue testing, but it seems that the problem is not easy to reproduce. I also have not had any indications from users that this is happening in any of my 50 installations. However, the batteries in the field have not yet fallen toward brownout levels (they are lasting for several years), and I do have TPL5010's on all my nodes. (The TPL is not being activated in my current tests.)

Still, on matt's recommendation, no reason not to check battery voltage and shut down transmission if it falls below 2V or so. An easy thing to add in future systems.
« Last Edit: September 11, 2020, 11:53:18 AM by Neko »
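
The slot arithmetic in the bench test above (30 nodes, 100-ms slots, 6-second cycle) can be checked with a short sketch. The mapping "slot number = node index" is an assumption, since the post does not say how slots were assigned, and the function names are hypothetical.

```cpp
#include <cstdint>

// Values taken from the test description: 100 ms slots in a 6 s cycle.
constexpr uint32_t kSlotMs = 100;
constexpr uint32_t kCycleMs = 6000;

// Offset into each cycle at which a node may transmit, assuming node i
// simply uses slot i (an assumption, see above).
constexpr uint32_t slotOffsetMs(uint32_t nodeIndex) {
    return nodeIndex * kSlotMs;
}

// True if nodeCount back-to-back slots fit inside one cycle.
constexpr bool slotsFit(uint32_t nodeCount) {
    return nodeCount * kSlotMs <= kCycleMs;
}
```

With these numbers the 30 slots occupy the first 3 seconds of each 6-second cycle, so the nodes on the bench should not collide with each other as long as their clocks stay aligned.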