MCA details seldom change inbetween the models of a family so don't
be too conservative and enable decoding on everything starting from
K8 onwards. Minor adjustments can come in later but most importantly,
we have some decoding infrastructure in place for upcoming models by
default.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
This is just an aesthetic change but it was silly to say TILEPro
when booting up on the tilegx architecture.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
What it is pointed by a csrow/channel vector is a rank information, and
not a channel information.
On a traditional architecture, the memory controller directly access the
memory ranks, via chip select rows. Different ranks at the same DIMM is
selected via different chip select rows. So, typically, one
csrow/channel pair means one different DIMM.
On FB-DIMMs, there's a microcontroller chip at the DIMM, called Advanced
Memory Buffer (AMB) that serves as the interface between the memory
controller and the memory chips.
The AMB selection is via the DIMM slot, and not via a csrow.
It is up to the AMB to talk with the csrows of the DRAM chips.
So, the FB-DIMM memory controllers see the DIMM slot, and not the DIMM
rank. RAMBUS is similar.
Newer memory controllers, like the ones found on Intel Sandy Bridge and
Nehalem, even working with normal DDR3 DIMM's, don't use the usual
channel A/channel B interleaving schema to provide 128 bits data access.
Instead, they have more channels (3 or 4 channels), and they can use
several interleaving schemas. Such memory controllers see the DIMMs
directly on their registers, instead of the ranks, which is better for
the driver, as its main usageis to point to a broken DIMM stick (the
Field Repleceable Unit), and not to point to a broken DRAM chip.
The drivers that support such such newer memory architecture models
currently need to fake information and to abuse on EDAC structures, as
the subsystem was conceived with the idea that the csrow would always be
visible by the CPU.
To make things a little worse, those drivers don't currently fake
csrows/channels on a consistent way, as the concepts there don't apply
to the memory controllers they're talking with. So, each driver author
interpreted the concepts using a different logic.
In order to fix it, let's rename the data structure that points into a
DIMM rank to "rank_info", in order to be clearer about what's stored
there.
Latter patches will provide a better way to represent the memory
hierarchy for the other types of memory controller.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
When i5400_edac driver is removed and re-loaded a few times, it causes
an OOPS, as it is currently decrementing some PCI device usage two
times.
When called inside a loop, pci_get_device() will call
pci_put_device(). That mangles the error count. In this specific
case, it seems easier to just duplicate the call.
Also fixes the error logic when pci_get_device fails.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
If I only ack the detection register after a error have been detected
I'm unable to reliably detect errors. I have verified this behavior
using both an error injection DIMM and software to inject errors.
I can't find any documentation supporting this behavior in Intel 5100
Memory Controller Hub Chipset, see 1. So this is all based on
experimentation.
[1] Intel® 5100 Memory Controller Hub Chipset
http://www.intel.com/content/dam/doc/datasheet/5100-
memory-controller-hub-chipset-datasheet.pdf
Signed-off-by: Niklas Söderlund <niklas.soderlund@ericsson.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
According to [1] the define for M1Err in the FERR_NF_MEM register is
wrong. It should be at position 1 not 0.
[1] Intel 5100 Memory Controller Hub Chipset Doc.Nr: 318378
http://www.intel.com/content/dam/doc/datasheet/5100-
memory-controller-hub-chipset-datasheet.pdf
Reported-by: Ba Thang Nguyen <thang.b.nguyen@dektech.com.au>
Signed-off-by: Niklas Söderlund <niklas.soderlund@ericsson.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
>From the driver design, the variable limit wants to compare with its
previous value, we should set the value of limit instead of the value
of tmp_mb to the variable prev.
Signed-off-by: Hui Wang <jason77.wang@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
We can identify dram interleave mode from the Dram Rule register
rather than Dram Interleave list register.
In this context, the reg of INTERLEAVE_MODE(reg) contains the Dram
Interleave list register, we can't get interleave mode from the reg,
while the variable interleave_mode saves the the mode got from the
Dram Rule register, so we use the variable to replace
INTERLEAVE_MDDE(reg) here.
Signed-off-by: Hui Wang <jason77.wang@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
This driver needs to access PCIe Extended Configuration Space
Registers (0x100~0xfff), to correctly access those registers, we need
to enable PCI_MMCONFIG option. Since this option is not enabled for
X86_64 by default, we let the driver depend on it to prevent users
forgetting to enable this option.
Signed-off-by: Hui Wang <jason77.wang@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
It seems that nobody is cross-compiling for this arch anymore...
drivers/edac/ppc4xx_edac.c: In function 'ppc4xx_edac_probe':
drivers/edac/ppc4xx_edac.c:188:12: error: storage class specified for parameter 'ppc4xx_edac_remove'
...
drivers/edac/ppc4xx_edac.c:1068:19: error: 'match' undeclared (first use in this function)
drivers/edac/ppc4xx_edac.c:1068:19: note: each undeclared identifier is reported only once for each function it appears in
drivers/edac/ppc4xx_edac.c:1068:36: warning: left-hand operand of comma expression has no effect [-Wunused-value]
Acked-by: Josh Boyer <jwboyer@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
... so that checkpatch can chill out.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Reviewed-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Correct their formulation, replace per-family functions with a single,
unified lookup table.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Reviewed-by: Andreas Herrmann <andreas.herrmann3@amd.com>
This MC1 error signature is called differently now, fix it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Reviewed-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Use "System Read Data Error" as a more general name for MC0 bus errors
on F15h and update some error definitions.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Reviewed-by: Andreas Herrmann <andreas.herrmann3@amd.com>
These const tables are currently marked __devinitdata, but
Documentation/PCI/pci.txt says:
"o The ID table array should be marked __devinitconst; this is done
automatically if the table is declared with DEFINE_PCI_DEVICE_TABLE()."
So use DEFINE_PCI_DEVICE_TABLE(x).
Based on PaX and earlier work by Andi Kleen.
Signed-off-by: Lionel Debroux <lionel_debroux@yahoo.fr>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The original scrub rate API definition states that if scrub rate
accessors are not implemented, a negative value (-1) should be written
to the sysfs file (/sys/devices/system/edac/mc/mc<N>/sdram_scrub_rate,
where N is the memory controller number on the system). This is
counter-intuitive and awkward at the very least because, when setting
the scrub rate, userspace has to write to sysfs and then read it back to
check error status of the operation.
As Tony notes, best it would be to not have the sdram_scrub_rate in
sysfs if scrub rate support is not implemented. It is too late about
that and a bunch of drivers on a bunch of arches would need to be
changed and tested which is not a trivial task ATM.
Instead, settle for the next best thing of returning -ENODEV when
implementation is missing and -EINVAL when there was an error
encountered while setting the scrub rate.
Reported-by: Han Pingtian <phan@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/20110916105856.GA13253@hpt.nay.redhat.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
While initializing the array of csrow attribute instances, a few csrows
were uninitialized. This happened because the module only performed a
check for DRAM base ctl register0's and not DRAM base ctl register1's
chip select enable bit. There could be systems with DIMMs populated
on only single memory channel whereas the module also assumed that a
dual channel dimm had double the memory size of a single memory channel
instead of checking the memory on each channel.
This patch fixes these above issues.
Signed-off-by: Ashish Shenoy <ashenoy@riverbed.com>
Signed-off-by: Prasanna S. Panchamukhi <ppanchamukhi@riverbed.com>
Link: http://lkml.kernel.org/r/4F459CFA.5090604@riverbed.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
For files that are actively using linux/device.h, make sure
that they call it out. This will allow us to clean up some
of the implicit uses of linux/device.h within include/*
without introducing build regressions.
Yes, this was created by "cheating" -- i.e. the headers were
cleaned up, and then the fallout was found and fixed, and then
the two commits were reordered. This ensures we don't introduce
build regressions into the git history.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
This provides unified readq()/writeq() helper functions for 32-bit
drivers.
For some cases, readq/writeq without atomicity is harmful, and order of
io access has to be specified explicitly. So in this patch, new two
header files which contain non-atomic readq/writeq are added.
- <asm-generic/io-64-nonatomic-lo-hi.h> provides non-atomic readq/
writeq with the order of lower address -> higher address
- <asm-generic/io-64-nonatomic-hi-lo.h> provides non-atomic readq/
writeq with reversed order
This allows us to remove some readq()s that were added drivers when the
default non-atomic ones were removed in commit dbee8a0aff ("x86:
remove 32-bit versions of readq()/writeq()")
The drivers which need readq/writeq but can do with the non-atomic ones
must add the line:
#include <asm-generic/io-64-nonatomic-lo-hi.h> /* or hi-lo.h */
But this will be nop in 64-bit environments, and no other #ifdefs are
required. So I believe that this patch can solve the problem of
1. driver-specific readq/writeq
2. atomicity and order of io access
This patch is tested with building allyesconfig and allmodconfig as
ARCH=x86 and ARCH=i386 on top of tip/master.
Cc: Kashyap Desai <Kashyap.Desai@lsi.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Ravi Anand <ravi.anand@qlogic.com>
Cc: Vikas Chaudhary <vikas.chaudhary@qlogic.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Jason Uhlenkott <juhlenko@akamai.com>
Cc: James Bottomley <James.Bottomley@parallels.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Roland Dreier <roland@purestorage.com>
Cc: James Bottomley <jbottomley@parallels.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Hitoshi Mitake <h.mitake@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
module_param(bool) used to counter-intuitively take an int. In
fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy
trick.
It's time to remove the int/unsigned int option. For this version
it'll simply give a warning, but it'll break next kernel version.
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Several fields in struct cpuinfo_x86 were not defined for the
!SMP case, likely to save space. However, those fields still
have some meaning for UP, and keeping them allows some #ifdef
removal from other files. The additional size of the UP kernel
from this change is not significant enough to worry about
keeping up the distinction:
text data bss dec hex filename
4737168 506459 972040 6215667 5ed7f3 vmlinux.o.before
4737444 506459 972040 6215943 5ed907 vmlinux.o.after
for a difference of 276 bytes for an example UP config.
If someone wants those 276 bytes back badly then it should
be implemented in a cleaner way.
Signed-off-by: Kevin Winchester <kjwinchester@gmail.com>
Cc: Steffen Persvold <sp@numascale.com>
Link: http://lkml.kernel.org/r/1324428742-12498-1-git-send-email-kjwinchester@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
After all sysdev classes are ported to regular driver core entities, the
sysdev implementation will be entirely removed from the kernel.
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
No functionality change, this is done so that in a follow-on patch all
queued-up MCEs can be decoded after registering on the chain.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The below patch fixes some typos in various parts of the kernel, as well as fixes some comments.
Please let me know if I missed anything, and I will try to get it changed and resent.
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
compatible in dts has been changed, so the driver needs to be updated
accordingly.
Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
The sb_edac driver is marginally useful on a 32-bit kernel, and
currently has 64-bit divide compile errors when building that config.
For now, make this build on only for 64-bit kernels.
Signed-off-by: Josh Boyer <jwboyer@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The edac driver for Sandy Bridge was found to be reporting "FPM"
for edac_mode, which clearly doesn't make sense. It was found that
sb_edac.c:get_dimm_config was reusing a variable for both mem_type
and edac_type, and thus was overwriting the value after setting
it correctly. This patch fixes that issue.
Before the patch:
/sys/devices/system/edac/mc/mc0/csrow0/edac_mode:FPM
/sys/devices/system/edac/mc/mc0/csrow1/edac_mode:FPM
/sys/devices/system/edac/mc/mc0/csrow2/edac_mode:FPM
/sys/devices/system/edac/mc/mc0/csrow3/edac_mode:FPM
After:
/sys/devices/system/edac/mc/mc0/csrow0/edac_mode:S4ECD4ED
/sys/devices/system/edac/mc/mc0/csrow1/edac_mode:S4ECD4ED
/sys/devices/system/edac/mc/mc0/csrow2/edac_mode:S4ECD4ED
/sys/devices/system/edac/mc/mc0/csrow3/edac_mode:S4ECD4ED
Signed-off-by: Mark A. Grondona <mgrondona@llnl.gov>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Some changes on it were required due to changeset cd90cc84c6bf0, that
changed the glue with the MCE logic.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
This driver is known to work on mine and Tony's test environments,
using software error injection, and a partial hardware/software
error injection tool.
There's no broader range test yet to double check if the error decoding
logic will actually point to the right DIMM, so use it with care.
More tests are required to be sure that the driver will work on all
different types of memory configurations.
If you're willing to risk using it, I suggest you to enable EDAC debugs
for your test machines, as the debug logs helps to track what's going
inside the driver.
Please feed me with bug reports, if you notice that the driver
is miss-behaving.
Tested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
The error cleanup logic was broken. Due to that, one error is generated for
every error polling.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
on i386:
ERROR: "__udivdi3" [drivers/edac/i7core_edac.ko] undefined!\
In both get_sdram_scrub_rate() and set_sdram_scrub_rate()
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Get a more reliable DCLK value from DMI, name the SCRUBINTERVAL mask
and guard against potential overflow in the scrub rate computations.
Signed-off-by: Nils Carlson <nils.carlson@ericsson.com>
Both AMD and Intel i7 EDAC drivers use MCE features and are thus
dependent of this functionality present in the kernel. Express this in
Kconfig so that randconfig builds don't break.
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Nehalem-EX uses a different memory controller. However, as the
memory controller is not visible on some Nehalem/Nehalem-EP, we
need to indirectly probe via a X58 PCI device. The same devices
are found on (some) Nehalem-EX. So, on those machines, the
probe routine needs to return -ENODEV, as the actual Memory
Controller registers won't be detected.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Remove edac_mce pieces and use the normal MCE decoder notifier chain by
retaining the same functionality with considerably less code.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
This file really needs the full module.h header file present, but
was just getting it implicitly before. Fix it up in advance so we
avoid build failures once the cleanup commit is present.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
mce->socketid and cpu_data(mce->cpu).phys_proc_id are the same,
compare with mce_setup (in mce.c):
m->cpu = m->extcpu = smp_processor_id();
...
m->socketid = cpu_data(m->extcpu).phys_proc_id;
This makes it easier for example for XEN patches to hook into
the MCE subsystem.
Compile tested on x86_64.
Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: JBeulich@novell.com
CC: linux-edac@vger.kernel.org
CC: Mauro Carvalho Chehab <mchehab@redhat.com>
Add scrubbing support to i7core_edac, tested on intel Xeon L5638.
Signed-off-by: Samuel Gabrielsson <samuel.gabrielsson@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
As we'll need to use those structs for trace functions, they should
be on a more public place. So, move struct mem_ctl_info & friends
to edac.h.
No functional changes on this patch.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Error injection needs the pci device 0:0. So, we need to revert
this changeset: 79daef2099.
Tests need to be made to be sure that refcount won't be wrong
as noticed before.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>