DDR3 Corruption on Wandboard

Postby tdefeo23 » Mon Jul 07, 2014 5:13 pm

Has anyone else experienced DDR3 memory corruption when using the Wandboard?

I have confirmed that DDR3 memory is getting randomly corrupted by 1 bit. This happens very infrequently, and may or may not cause a crash depending on what the corrupted memory is used for.
I have been able to reproduce this on at least 5 Wandboard EDM modules. Some modules are worse than others.
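
For anyone who wants to check for this kind of single-bit corruption themselves, here is a minimal sketch in Python. It is not the Freescale tool and not the method used in this thread; the function names `make_pattern` and `diff_bits` are made up for illustration. The idea is to write a known pattern, read it back later, and XOR the two buffers to locate any flipped bits.

```python
def make_pattern(size: int, seed: int = 0xA5) -> bytes:
    """Build a deterministic test pattern (hypothetical helper)."""
    return bytes((seed + i) & 0xFF for i in range(size))


def diff_bits(expected: bytes, actual: bytes) -> list:
    """Return (byte_offset, bit_index) for every bit that differs."""
    if len(expected) != len(actual):
        raise ValueError("buffers must be the same length")
    flips = []
    for offset, (e, a) in enumerate(zip(expected, actual)):
        delta = e ^ a  # set bits mark the positions that changed
        for bit in range(8):
            if delta & (1 << bit):
                flips.append((offset, bit))
    return flips


# Simulate a single flipped bit in byte 5 and locate it.
pattern = make_pattern(16)
corrupted = bytearray(pattern)
corrupted[5] ^= 0x08  # flip bit 3
flips = diff_bits(pattern, bytes(corrupted))
```

A result with exactly one entry would match the "corrupted by 1 bit" symptom described above; many entries would point at something other than isolated bit flips.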

Furthermore, I have been running the Freescale DDR stress test tool,
https://community.freescale.com/docs/DOC-96412
and I can get it to fail on all of the boards at 475MHz.

I have been trying to use this tool to generate better DDR3 timing values to plug into U-Boot and hopefully fix the problem, but the tool always
generates an unusable value for MMDC_MPWLDECTRL1 after doing write level calibration:
MMDC_MPWLDECTRL1 ch1 after write level cal: 0x017A0013
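
To make a value like this easier to read, here is a rough Python sketch that splits the word into per-lane fields. The bit layout used (absolute offset in bits 6:0, half-cycle delay in bit 8, cycle delay in bits 10:9, with the same fields repeated 16 bits higher for the second lane) is an assumption from memory of the i.MX6 reference manual and should be checked against it before relying on the result. The function name `decode_wl_ctrl` is hypothetical.

```python
def decode_lane(half: int) -> dict:
    """Decode one 16-bit half of the register (assumed field layout)."""
    return {
        "abs_offset": half & 0x7F,        # assumed bits 6:0
        "half_cycle": (half >> 8) & 0x1,  # assumed bit 8
        "cycle_del": (half >> 9) & 0x3,   # assumed bits 10:9
    }


def decode_wl_ctrl(value: int) -> dict:
    """Split a 32-bit write-leveling delay word into its two lanes."""
    return {
        "lo": decode_lane(value & 0xFFFF),
        "hi": decode_lane((value >> 16) & 0xFFFF),
    }


fields = decode_wl_ctrl(0x017A0013)
for lane, info in fields.items():
    print(lane, info)
```

Under that assumed layout, the two lanes in 0x017A0013 come out very different from each other (offset 0x13 versus offset 0x7A with the half-cycle bit set).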

I think this may be indicative of a problem with the EDM module layout.

Does anyone have any insight into this, or has anyone else seen this behavior?

FYI, we have been running both Android 4.2.2 and Android 4.3, and both have the same problem.

Thanks,
Tony
tdefeo23
 
Posts: 13
Joined: Mon Jan 20, 2014 10:34 pm

Re: DDR3 Corruption on Wandboard

Postby tdefeo23 » Mon Jul 07, 2014 5:20 pm

Forgot to mention this is a Wandboard Quad with 2GB DDR3.
I don't know whether the other Wandboard configurations have this problem.
tdefeo23
 
Posts: 13
Joined: Mon Jan 20, 2014 10:34 pm

Re: DDR3 Corruption on Wandboard

Postby Tapani » Tue Jul 08, 2014 3:49 am

Which WB revision are you using? (Which Wi-Fi chip? Do you have a white SATA power connector next to the SATA connector?)

We did not see problems when we ran the mentioned Freescale tool, but admittedly, we only used the 528--600MHz band (and not 475MHz).

There is a bug in the i.MX6 CPU that can cause 1-bit errors when DVFS switches frequency (it can change the memory clock as well), which might explain runtime errors. The 3.0.35-4.1.0 kernel has patches to remedy this (which, admittedly, won't help you on Android).
Write level calibration is not applicable on the WB. It only applies when the memory uses a fly-by layout.

Regardless, we take this report seriously.
Tapani
 
Posts: 712
Joined: Tue Aug 27, 2013 8:32 am

Re: DDR3 Corruption on Wandboard

Postby tdefeo23 » Tue Jul 08, 2014 9:05 pm

Thanks for taking this seriously!

Some more information:
When running the stress test, I was using the MMDC numbers from the U-Boot in Android 4.2.2.
I realized that the numbers in U-Boot for Android 4.3 were different, so I switched to them and the stress test behaves better, but I still had a failure (this time at 396MHz).

I have backported the MMDC numbers to Android 4.2, and the boards do behave much better. However, I STILL get an occasional random kernel crash.
FYI, I have disabled all CPU frequency scaling governors except performance, so the CPU should not be changing frequency.

The Wandboard carrier I am using DOES have the white SATA power connector. I have seen the issue on multiple Wandboard modules, including several Rev B1's and Rev C1's, and also a TechNexion module.
Some modules behave better than others.

Can you post the DDR script that you are using when you run the stress test so I can compare the numbers?

Thanks,
Tony
tdefeo23
 
Posts: 13
Joined: Mon Jan 20, 2014 10:34 pm

Re: DDR3 Corruption on Wandboard

Postby Richard » Thu Jul 10, 2014 9:59 am

Hello, tdefeo23:


Attached is the DDR script for the Wandboard i.MX6Q (with 2GB memory).
https://dl.dropboxusercontent.com/u/11860830/EDM_MX6Q_2G_AFR.inc
These settings are used in the Android-4.3-wandboard-20140516 release.


Run the Freescale DDR Stress Tester with this command:
DDR_Stress_Tester.exe -t mx6x -df EDM_MX6Q_2G_AFR.inc


Could I ask which CPU module (i.MX6Q, i.MX6DL, or i.MX6Solo) shows this problem?

Richard
Richard
Site Admin
 
Posts: 138
Joined: Tue Dec 17, 2013 6:57 am

Re: DDR3 Corruption on Wandboard

Postby tdefeo23 » Thu Jul 10, 2014 11:33 pm

Hi Richard,

Thanks for the update.

So far three of the boards have passed the DDR_Stress_Tester using the new settings, over a frequency range of 475MHz to 580MHz.

I have just done a clean build of Android 4.3 from the 20140625 source tarball.
I will run our app overnight on the boards that have passed the memory test, and see if we have any issues.

To answer your question, all of the boards are i.MX6Q 2GB.

I'll follow up with the results of the overnight tests tomorrow.

Thanks,
Tony
tdefeo23
 
Posts: 13
Joined: Mon Jan 20, 2014 10:34 pm

Re: DDR3 Corruption on Wandboard

Postby tdefeo23 » Fri Jul 11, 2014 12:03 am

One of the boards has already crashed running our app with the new build of Android 4.3 from the 20140625 source tarball. The mediaserver aborted with a "heap corruption detected by dlfree" error:

pid: 1449, tid: 4527, name: OMXCallbackDisp >>> /system/bin/mediaserver <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr deadbaad
Abort message: '@@@ ABORTING: heap corruption detected by dlfree'

Note that this board has passed the DDR memory stress test.

I'll reboot and try again to see if it fails in the same manner or not.

FYI, I should mention that our app is a video game that makes extensive use of OpenGL and audio, and decodes fullscreen 1080p videos. In other words, it is exercising the board pretty heavily.
tdefeo23
 
Posts: 13
Joined: Mon Jan 20, 2014 10:34 pm

Re: DDR3 Corruption on Wandboard

Postby tdefeo23 » Fri Jul 11, 2014 4:56 pm

After upgrading to the Android 4.3 build, I let three games run overnight (two Wandboard Quads and one TechNexion quad).

As mentioned in my earlier post, one of the Wandboards failed after about an hour with a libc heap corruption error in the mediaserver.

The TechNexion board also failed after about 4 hours with a SIGSEGV in the Dalvik VM process running the game:

F/libc ( 2211): Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1), thread 2232
I/DEBUG ( 1434): *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** *
I/DEBUG ( 1434): Build fingerprint: 'Freescale/wandboard/wandboard:4.3/1.1.0-rc4/20131206:userdebug/dev-keys
I/DEBUG ( 1434): Revision: '405522'
I/DEBUG ( 1434): pid: 2211, tid: 2232, name: Thread-65 >>> com.itsgames.coino
I/DEBUG ( 1434): signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0000000

So this is now two different boards failing in different ways. This still feels like a memory corruption issue to me.

Any help on this would be appreciated! We are developing a commercial product based on the EDM standard, but due to these ongoing
instability issues, we have to start looking at alternatives (which I would rather not do!).

Thanks,
Tony
tdefeo23
 
Posts: 13
Joined: Mon Jan 20, 2014 10:34 pm

Re: DDR3 Corruption on Wandboard

Postby Tapani » Tue Jul 15, 2014 2:13 am

Tony,

Before jumping to conclusions, there are many other items that can go wrong.
We are fairly convinced that the hardware is ok.

Software... is a mixture of bits and pieces from Freescale, Google, Vivante, and who knows where.
Also, errors like that can have any number of explanations other than corrupted memory. Heat?
Tapani
 
Posts: 712
Joined: Tue Aug 27, 2013 8:32 am

Re: DDR3 Corruption on Wandboard

Postby tdefeo23 » Tue Jul 15, 2014 11:43 pm

Hi Tapani,

I completely understand; I did not intend to jump to any conclusions!
I understand that there are any number of causes for the kernel crash I posted.
My point was that the system was crashing in multiple places (app crash, kernel oops, etc) with no discernible pattern.

That being said, I have now figured out a way to consistently reproduce the problem, and you're right, it is just as likely a software or thermal problem as a hardware problem.

The key to consistently reproducing the problem was to allow the CPU to run at 1.2GHz
(either by blowing the SPEED_GRADING fuse: "imxotp blow --force 4 0x2b0302", or by hacking the kernel in cpu_op-mx6.c).

When running at 1.2GHz, I can reproduce the failure on every board I have tried (3 Wandboards and 2 TechNexion boards so far).

The easiest way to get it to fail is to go to the Android Settings -> Data usage screen and drag the data usage sliders around.

Eventually the system will crash with a kernel oops, or sometimes the Settings app will crash, and sometimes even zygote will restart. I can also reproduce this behavior from our proprietary app.

Note that at 1.2GHz, all the boards I have tried will crash pretty quickly. At 996MHz, some will crash quickly, while others will run for days without issue. At 792MHz, NONE of the boards have crashed.
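
For anyone who wants to repeat this kind of frequency bisection, one option is to cap the maximum CPU frequency through the standard Linux cpufreq sysfs file `scaling_max_freq`, which takes a value in kHz. The Python sketch below is only an illustration under some assumptions: it needs root on the target, it assumes cpu0's policy covers all cores, and the exact kHz values should be checked against `scaling_available_frequencies` on the board. The function name `cap_cpu_freq` and the `root` parameter are hypothetical.

```python
from pathlib import Path

CPUFREQ_ROOT = Path("/sys/devices/system/cpu")  # standard cpufreq sysfs location

# Operating points discussed in this thread, in kHz as cpufreq expects.
# Assumed values; verify against scaling_available_frequencies.
OPERATING_POINTS_KHZ = {
    "792MHz": 792_000,
    "996MHz": 996_000,
    "1.2GHz": 1_200_000,
}


def cap_cpu_freq(max_khz: int, cpu: int = 0, root: Path = CPUFREQ_ROOT) -> Path:
    """Write max_khz to scaling_max_freq for one CPU and return the file path."""
    if max_khz <= 0:
        raise ValueError("frequency must be positive")
    target = Path(root) / f"cpu{cpu}" / "cpufreq" / "scaling_max_freq"
    target.write_text(f"{max_khz}\n")
    return target
```

On the board this would be called as `cap_cpu_freq(OPERATING_POINTS_KHZ["792MHz"])`; the same effect can be had from a shell by echoing the value into that file. The `root` parameter exists only so the function can be tried against a scratch directory first.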

Any ideas or help would be appreciated. I realize it is just as likely that this is a Freescale or Android problem and not a Wandboard-specific problem, but it is obviously in all of our best interest to at least understand what the problem is (hardware, software, thermal), and hopefully come up with a solution or workaround.

Thanks!

Tony
tdefeo23
 
Posts: 13
Joined: Mon Jan 20, 2014 10:34 pm
