Recently I was able to fix the "stuck in the bootloader" issue in #322 that had been plaguing us for quite some time. I want to go over what the issue was, how the issue was diagnosed, and how it was fixed.
What exactly is the "stuck in the bootloader" issue? Seemingly randomly, users' keyboards would suddenly stop working and when they would reset their keyboard they would get put into the bootloader instead of back into the firmware. This would require the user to re-flash the firmware again to get into the firmware. That wouldn't be so bad except for the fact that once this occurs, every reset would require the user to re-flash the firmware again. The only way to really fix this issue was to re-flash the bootloader itself, which is a huge pain.
Going into this, all we knew was that this issue was most likely introduced somewhere in the #133, which added Bluetooth profile management. We've had quite a few attempts at trying to recreate the issue, but we never were able to get it to happen consistently.
Diagnosing the issue
This issue had been happening sporadically for the past month, and I finally decided to dig in to see what was going on. We started in the Discord and discussed what was common between all of the people who have experienced this issue. Everyone who had this issue reported that they did quite a bit of profile switching. This lined up with the possible connection to the Bluetooth profile management pull request.
Pinpointing the cause
I had a hunch that this was related to the settings system. The settings system is used by profile Bluetooth switching, and the settings system works directly with the system flash. Based on this hunch, I tried spamming the RGB underglow cycle behavior on my main keyboard. Sure enough after a couple minutes, I got stuck in the bootloader. I was even able to reproduce it again.
This was an important discovery for two reasons. First, I was able to recreate the issue consistently, which meant I could set up logging and more closely monitor what the board was doing. Second, this more or less proved that it was specifically the settings system at fault. Both Bluetooth profile switching and RGB underglow cycling trigger it, and the one common piece is they save their state to settings.
Settings system overview
To understand what's going wrong, we first need to understand how the settings system works. Here's a diagram to explain the flash space that the settings system holds for our nRF52840 based boards (nice!nano, nRFMicro, BlueMicro).
The settings flash space lives at the end of the flash of the chip. In this case
it starts at
0xF8000 and is
0x8000 bytes long, which is 32KB in more
comprehensible units. Then due to the chip's architecture, this flash space is
broken into pages, which are
0x1000 bytes in size (4KB).
The backend that carries out the settings save and read operation in ZMK is called NVS. NVS calls these pages sectors. Due to how flash works, you can't write to the same bytes multiple times without erasing them first, and to erase bytes, you need to erase the entire sector of flash. This means when NVS writes to the settings flash if there's no erased space available for the new value, it will need to erase a sector.
So first I enabled logging of the NVS module by adding
CONFIG_NVS_LOG_LEVEL_DBG=y to my
.conf file. I repeated the same test of
spamming RGB underglow effect cycle and the resulting logs I got were this:
So at start up, we can see that the 8 sectors of 4KB are found by NVS properly,
however, I wasn't sure what the second and third lines meant, but we'll get back
to that. Nonetheless the next two logs from NVS showed erasing the sector at
0xFD000 and then erasing the
It's really odd that the third to last sector and the last sector are erased, and then shortly after the bootloader issue is hit. I really had no explanation for this behavior.
Reaching out to Zephyr
At this point, I nor anyone else working on the ZMK project knew enough about
NVS to explain what was going on here. Pete
Johanson, project founder, reached out on the
Zephyr Project's Slack (ZMK is built on top of Zephyr if you weren't aware).
Justin B and Laczen assisted by first explaining that those
alloc wra and
data wra logs from earlier are showing what data NVS found at startup.
data wra should be
0 when it first starts up on a clean
flash. As we can see from my earlier logging on a clean flash I was instead
f40. NVS is finding data in our settings sectors when they should be
blank! We were then given the advice to double check our bootloader.
The Adafruit nRF52 Bootloader
Most of the boards the contributors of ZMK use have the Adafruit nRF52
Bootloader, which allows
for extremely easy flashing by dragging and dropping
.uf2 files onto the board
as a USB drive. Every bootloader takes up a portion of the flash, and in the
README explains that the first
0x26000 is reserved for the bootloader with the
nRF52840, and we've properly allocated that.
However, there isn't a full explanation of the flash allocation of the bootloader in the README. There's a possibility that the bootloader is using part of the same flash area we're using. I reached out on the Adafruit Discord, and Dan Halbert pointed me towards the linker map of the nRF52840. Let's take a look.
Here's a diagram to show this a bit better.
We've found the issue! As you can see from the red bar (representing our settings flash area), we've put the settings flash area right on top of the Adafruit bootloader's flash space. Oops!
This also shines some light on why NVS erased
It's possible there was no flash written to
0xFD000 because the bootloader
didn't use up all of that space it has, and then there possibly weren't any
bootloader settings set yet, so
0xFF000 could be used and erased by NVS too.
0xFF000, NVS probably next erased a rather important part of the
bootloader that resulted in this issue at hand. In my opinion, we're pretty
lucky that it didn't delete an even more vital part of the bootloader. At least
we could still get to it, so that we could re-flash the bootloader easily!
Now that we've found the issue, we can pretty easily fix this. We'll need to move the settings flash area back so that it doesn't overlap with the bootloader. First we calculate the size of the of flash area the bootloader is using.
So the bootloader is using the last 48KB of the flash, this means all we need to
do is shift back the settings area and code space
0xC000 bytes. We'll apply
this to all of the
.dts files for the boards that were affected by this issue.
And with those changes, we should no longer run into this issue! In the process of these changes, we lost 48KB of space for application code, but we're only using around 20% of it anyways. 🎉