Compare commits

10 Commits

Author SHA1 Message Date
openeuler-ci-bot
506357cddc
!157 Fix cpu isolate errors when some cpus are offline before the service started
From: @slim6882 
Reviewed-by: @Lostwayzxc, @znzjugod 
Signed-off-by: @znzjugod
2024-04-25 12:10:27 +00:00
Shengwei Luo
12b4bb0fd0 Fix cpu isolate errors when some cpus are offline before the service started
Signed-off-by: slim6882 <yangjunshuo@huawei.com>
2024-04-25 17:09:08 +08:00
openeuler-ci-bot
31c5bc676d
!148 rasdaemon: Fix for vendor errors are not recorded in the SQLite database if some cpus are offline
From: @xia-bing1 
Reviewed-by: @hunan4222, @znzjugod 
Signed-off-by: @znzjugod
2024-04-25 06:58:17 +00:00
Bing Xia
4cdf0a2c6b rasdaemon: Fix for vendor errors are not recorded in the SQLite database if some cpus are offline
Fix vendor errors not being recorded in the SQLite database if some CPUs
are offline at system start.

Signed-off-by: Bing Xia <xiabing12@h-partners.com>
2024-04-23 15:20:14 +08:00
openeuler-ci-bot
5f9abe58c8
!141 add dynamic switch of ras events support and disable block_rq_complete
From: @pshysimon 
Reviewed-by: @gaoruoshu 
Signed-off-by: @gaoruoshu
2024-04-09 08:23:33 +00:00
caixiaomeng
544fd1a7d7 add dynamic switch of ras events support and disable block_rq_complete
2024-04-08 17:30:57 +08:00
openeuler-ci-bot
8daed7ec36
!132 [22.03-LTS-Next]backport upstream patches
From: @zhangruifang2020 
Reviewed-by: @zhuofeng6, @gaoruoshu 
Signed-off-by: @gaoruoshu
2024-03-26 01:30:27 +00:00
zhangruifang2020
a5a053ef71 backport upstream patches
2024-03-25 14:27:40 +08:00
openeuler-ci-bot
6c63ac4744
!120 fix rasdaemon disable service after upgrade
From: @pshysimon 
Reviewed-by: @lvying6 
Signed-off-by: @lvying6
2023-12-28 12:39:37 +00:00
caixiaomeng
639a2e6a2b fix rasdaemon disable service after upgrade
2023-12-28 16:37:23 +08:00
6 changed files with 801 additions and 2 deletions

@ -0,0 +1,103 @@
From 370ac83b39f09eda0fb8a5cfa40ecfc71846eb0d Mon Sep 17 00:00:00 2001
From: Shiju Jose <shiju.jose@huawei.com>
Date: Wed, 20 Mar 2024 12:16:05 +0000
Subject: [PATCH] rasdaemon: Fix for vendor errors are not recorded in the
SQLite database if some cpus are offline
Fix vendor errors not being recorded in the SQLite database if some CPUs
are offline at system start.
Issue:
The issue can be reproduced by taking some CPUs offline, running
./rasdaemon -f --record &
and injecting a vendor-specific error that rasdaemon supports.
Reason:
When the system starts with some CPUs offline and rasdaemon is then run,
read_ras_event_all_cpus() exits with an error and rasdaemon switches to
the multi-threaded path. However, read() in read_ras_event() returns an
error in the thread for each offline CPU, and that thread cleans up,
including a call to ras_ns_finalize_vendor_tables(), which invokes
sqlite3_finalize() on the vendor tables that were created. As a result,
vendor error data is not stored in the SQLite database the next time such
an error is reported.
Solution:
Use a reference count in ras_ns_add_vendor_tables() and
ras_ns_finalize_vendor_tables(), and close the vendor tables created in
ras_ns_add_vendor_tables() only when the count drops to zero.
Reported-by: Junhao He <hejunhao3@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Junhao He <hejunhao3@huawei.com>
Signed-off-by: Bing Xia <xiabing12@h-partners.com>
---
ras-non-standard-handler.c | 16 ++++++++++++++++
ras-non-standard-handler.h | 1 +
2 files changed, 17 insertions(+)
diff --git a/ras-non-standard-handler.c b/ras-non-standard-handler.c
index 20d514b..13e2acf 100644
--- a/ras-non-standard-handler.c
+++ b/ras-non-standard-handler.c
@@ -65,6 +65,7 @@ int register_ns_ev_decoder(struct ras_ns_ev_decoder *ns_ev_decoder)
#endif
if (!ras_ns_ev_dec_list) {
ras_ns_ev_dec_list = ns_ev_decoder;
+ ras_ns_ev_dec_list->ref_count = 0;
} else {
list = ras_ns_ev_dec_list;
while (list->next)
@@ -85,6 +86,8 @@ int ras_ns_add_vendor_tables(struct ras_events *ras)
return -1;
ns_ev_decoder = ras_ns_ev_dec_list;
+ if (ras_ns_ev_dec_list)
+ ras_ns_ev_dec_list->ref_count++;
while (ns_ev_decoder) {
if (ns_ev_decoder->add_table && !ns_ev_decoder->stmt_dec_record) {
error = ns_ev_decoder->add_table(ras, ns_ev_decoder);
@@ -127,6 +130,16 @@ void ras_ns_finalize_vendor_tables(void)
#ifdef HAVE_SQLITE3
struct ras_ns_ev_decoder *ns_ev_decoder = ras_ns_ev_dec_list;
+ if (!ras_ns_ev_dec_list)
+ return;
+
+ if (ras_ns_ev_dec_list->ref_count > 0)
+ ras_ns_ev_dec_list->ref_count--;
+ else
+ return;
+ if (ras_ns_ev_dec_list->ref_count > 0)
+ return;
+
while (ns_ev_decoder) {
if (ns_ev_decoder->stmt_dec_record) {
ras_mc_finalize_vendor_table(ns_ev_decoder->stmt_dec_record);
@@ -140,6 +153,9 @@ void ras_ns_finalize_vendor_tables(void)
static void unregister_ns_ev_decoder(void)
{
#ifdef HAVE_SQLITE3
+ if (!ras_ns_ev_dec_list)
+ return;
+ ras_ns_ev_dec_list->ref_count = 1;
ras_ns_finalize_vendor_tables();
#endif
ras_ns_ev_dec_list = NULL;
diff --git a/ras-non-standard-handler.h b/ras-non-standard-handler.h
index 341206a..2777584 100644
--- a/ras-non-standard-handler.h
+++ b/ras-non-standard-handler.h
@@ -22,6 +22,7 @@
struct ras_ns_ev_decoder {
struct ras_ns_ev_decoder *next;
+ uint16_t ref_count;
const char *sec_type;
int (*add_table)(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder);
int (*decode)(struct ras_events *ras, struct ras_ns_ev_decoder *ev_decoder,
--
2.30.0
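
The reference-counting idea in the patch above can be sketched in isolation. This is a minimal illustration, not rasdaemon code: struct vendor_tables, tables_get() and tables_put() are hypothetical names, the "open" flag stands in for the prepared SQLite statements, and locking between threads is deliberately left out.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the shared vendor-table state. */
struct vendor_tables {
	unsigned int ref_count;  /* number of readers currently using the tables */
	int open;                /* 1 while the (simulated) statements exist */
	int finalize_calls;      /* how many times the real cleanup ran */
};

/* Called by each reader before it starts; creates the tables on first use. */
static void tables_get(struct vendor_tables *t)
{
	t->ref_count++;
	if (!t->open)
		t->open = 1;     /* stands in for the add_table() callback */
}

/* Called by each reader on exit; only the last one out finalizes. */
static void tables_put(struct vendor_tables *t)
{
	if (t->ref_count == 0)
		return;          /* unbalanced put: nothing to release */
	if (--t->ref_count > 0)
		return;          /* other readers still need the tables */
	t->open = 0;             /* stands in for sqlite3_finalize() */
	t->finalize_calls++;
}
```

With two readers, the first tables_put() leaves the tables open and only the second one finalizes them, which is the behaviour the patch relies on when the thread for an offline CPU exits early.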

@ -0,0 +1,450 @@
From b26f624fbe12203b12b65e0674fea60c70e48a21 Mon Sep 17 00:00:00 2001
From: caixiaomeng 00662745 <caixiaomeng2@huawei.com>
Date: Wed, 21 Feb 2024 15:25:11 +0800
Subject: [PATCH] BACKPORT-Add-Dynamic-Switch
---
misc/rasdaemon.env | 5 +-
ras-disabled-events.h | 10 ++
ras-events.c | 247 +++++++++++++++++++++++++++---------------
rasdaemon.c | 36 ++++++
4 files changed, 208 insertions(+), 90 deletions(-)
create mode 100644 ras-disabled-events.h
diff --git a/misc/rasdaemon.env b/misc/rasdaemon.env
index dc40af8..6780eb0 100644
--- a/misc/rasdaemon.env
+++ b/misc/rasdaemon.env
@@ -51,4 +51,7 @@ CPU_CE_THRESHOLD="18"
CPU_ISOLATION_CYCLE="24h"
# Prevent excessive isolation from causing an avalanche effect
-CPU_ISOLATION_LIMIT="10"
\ No newline at end of file
+CPU_ISOLATION_LIMIT="10"
+
+# Disable specified events by config
+DISABLE="block:block_rq_complete"
\ No newline at end of file
diff --git a/ras-disabled-events.h b/ras-disabled-events.h
new file mode 100644
index 0000000..298a5f3
--- /dev/null
+++ b/ras-disabled-events.h
@@ -0,0 +1,10 @@
+#ifndef __RAS_DISABLED_EVENTS_H
+#define __RAS_DISABLED_EVENTS_H
+#define DISABLE "DISABLE"
+#define MAX_DISABLED_TRACEPOINTS_NUM 50
+#define MAX_DISABLED_TRACEPOINTS_STR_LENGTH 255
+#define MAX_TRACEPOINTS_STR_LENGTH 50
+
+extern char choices_disable[MAX_DISABLED_TRACEPOINTS_NUM][MAX_TRACEPOINTS_STR_LENGTH];
+extern int disabled_tracepoints_num;
+#endif
\ No newline at end of file
diff --git a/ras-events.c b/ras-events.c
index bc7da34..675d020 100644
--- a/ras-events.c
+++ b/ras-events.c
@@ -43,6 +43,7 @@
#include "ras-logger.h"
#include "ras-page-isolation.h"
#include "ras-cpu-isolation.h"
+#include "ras-disabled-events.h"
/*
* Polling time, if read() doesn't block. Currently, trace_pipe_raw never
@@ -172,6 +173,23 @@ static int get_tracing_dir(struct ras_events *ras)
return 0;
}
+static bool is_disabled_event(char *group, char *event) {
+ char ras_event_name[MAX_PATH + 1];
+
+ snprintf(ras_event_name, sizeof(ras_event_name), "%s:%s",
+ group, event);
+
+ if (disabled_tracepoints_num == 0) {
+ return false;
+ }
+ for (int i = 0; i < disabled_tracepoints_num; ++i) {
+ if (strcmp(choices_disable[i], ras_event_name) == 0) {
+ return true;
+ }
+ }
+ return false;
+}
+
/*
* Tracing enable/disable code
*/
@@ -228,40 +246,41 @@ int toggle_ras_mc_event(int enable)
goto free_ras;
}
- rc = __toggle_ras_mc_event(ras, "ras", "mc_event", enable);
+ rc = __toggle_ras_mc_event(ras, "ras", "mc_event", enable > 0 ? (is_disabled_event("ras", "mc_event") ? 0 : 1) : enable);
#ifdef HAVE_AER
- rc |= __toggle_ras_mc_event(ras, "ras", "aer_event", enable);
+ rc |= __toggle_ras_mc_event(ras, "ras", "aer_event", enable > 0 ? (is_disabled_event("ras", "aer_event") ? 0 : 1) : enable);
#endif
#ifdef HAVE_MCE
- rc |= __toggle_ras_mc_event(ras, "mce", "mce_record", enable);
+ rc |= __toggle_ras_mc_event(ras, "mce", "mce_record", enable > 0 ? (is_disabled_event("mce", "mce_record") ? 0 : 1) : enable);
#endif
#ifdef HAVE_EXTLOG
- rc |= __toggle_ras_mc_event(ras, "ras", "extlog_mem_event", enable);
+ rc |= __toggle_ras_mc_event(ras, "ras", "extlog_mem_event", enable > 0 ? (is_disabled_event("ras", "extlog_mem_event") ? 0 : 1) : enable);
#endif
#ifdef HAVE_NON_STANDARD
- rc |= __toggle_ras_mc_event(ras, "ras", "non_standard_event", enable);
+ rc |= __toggle_ras_mc_event(ras, "ras", "non_standard_event", enable > 0 ? (is_disabled_event("ras", "non_standard_event") ? 0 : 1) : enable);
#endif
#ifdef HAVE_ARM
- rc |= __toggle_ras_mc_event(ras, "ras", "arm_event", enable);
+ rc |= __toggle_ras_mc_event(ras, "ras", "arm_event", enable > 0 ? (is_disabled_event("ras", "arm_event") ? 0 : 1) : enable);
#endif
#ifdef HAVE_DEVLINK
- rc |= __toggle_ras_mc_event(ras, "devlink", "devlink_health_report", enable);
+ rc |= __toggle_ras_mc_event(ras, "devlink", "devlink_health_report", enable > 0 ? (is_disabled_event("devlink", "devlink_health_report") ? 0 : 1) : enable);
#endif
#ifdef HAVE_DISKERROR
- rc |= __toggle_ras_mc_event(ras, "block", "block_rq_complete", enable);
+ rc |= __toggle_ras_mc_event(ras, "block", "block_rq_complete", enable > 0 ? (is_disabled_event("block", "block_rq_complete") ? 0 : 1) : enable);
#endif
#ifdef HAVE_MEMORY_FAILURE
- rc |= __toggle_ras_mc_event(ras, "ras", "memory_failure_event", enable);
+ rc |= __toggle_ras_mc_event(ras, "ras", "memory_failure_event", enable > 0 ? (is_disabled_event("ras", "memory_failure_event") ? 0 : 1) : enable);
#endif
+
free_ras:
free(ras);
return rc;
@@ -870,42 +889,62 @@ int handle_ras_events(int record_events)
ras_page_account_init();
#endif
- rc = add_event_handler(ras, pevent, page_size, "ras", "mc_event",
- ras_mc_event_handler, NULL, MC_EVENT);
- if (!rc)
- num_events++;
- else
- log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
- "ras", "mc_event");
+ if (is_disabled_event("ras", "mc_event")) {
+ log(ALL, LOG_INFO, "Disabled %s:%s tracing from config\n",
+ "ras", "mc_event");
+ } else {
+ rc = add_event_handler(ras, pevent, page_size, "ras", "mc_event",
+ ras_mc_event_handler, NULL, MC_EVENT);
+ if (!rc)
+ num_events++;
+ else
+ log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
+ "ras", "mc_event");
+ }
#ifdef HAVE_AER
- rc = add_event_handler(ras, pevent, page_size, "ras", "aer_event",
- ras_aer_event_handler, NULL, AER_EVENT);
- if (!rc)
- num_events++;
- else
- log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
- "ras", "aer_event");
+ if (is_disabled_event("ras", "aer_event")) {
+ log(ALL, LOG_INFO, "Disabled %s:%s tracing from config\n",
+ "ras", "aer_event");
+ } else {
+ rc = add_event_handler(ras, pevent, page_size, "ras", "aer_event",
+ ras_aer_event_handler, NULL, AER_EVENT);
+ if (!rc)
+ num_events++;
+ else
+ log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
+ "ras", "aer_event");
+ }
#endif
#ifdef HAVE_NON_STANDARD
- rc = add_event_handler(ras, pevent, page_size, "ras", "non_standard_event",
- ras_non_standard_event_handler, NULL, NON_STANDARD_EVENT);
- if (!rc)
- num_events++;
- else
- log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
- "ras", "non_standard_event");
+ if (is_disabled_event("ras", "non_standard_event")) {
+ log(ALL, LOG_INFO, "Disabled %s:%s tracing from config\n",
+ "ras", "non_standard_event");
+ } else {
+ rc = add_event_handler(ras, pevent, page_size, "ras", "non_standard_event",
+ ras_non_standard_event_handler, NULL, NON_STANDARD_EVENT);
+ if (!rc)
+ num_events++;
+ else
+ log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
+ "ras", "non_standard_event");
+ }
#endif
#ifdef HAVE_ARM
- rc = add_event_handler(ras, pevent, page_size, "ras", "arm_event",
- ras_arm_event_handler, NULL, ARM_EVENT);
- if (!rc)
- num_events++;
- else
- log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
- "ras", "arm_event");
+ if (is_disabled_event("ras", "arm_event")) {
+ log(ALL, LOG_INFO, "Disabled %s:%s tracing from config\n",
+ "ras", "arm_event");
+ } else {
+ rc = add_event_handler(ras, pevent, page_size, "ras", "arm_event",
+ ras_arm_event_handler, NULL, ARM_EVENT);
+ if (!rc)
+ num_events++;
+ else
+ log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
+ "ras", "arm_event");
+ }
#endif
cpus = get_num_cpus(ras);
@@ -915,72 +954,102 @@ int handle_ras_events(int record_events)
#endif
#ifdef HAVE_MCE
- rc = register_mce_handler(ras, cpus);
- if (rc)
- log(ALL, LOG_INFO, "Can't register mce handler\n");
- if (ras->mce_priv) {
- rc = add_event_handler(ras, pevent, page_size,
- "mce", "mce_record",
- ras_mce_event_handler, NULL, MCE_EVENT);
- if (!rc)
- num_events++;
- else
- log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
- "mce", "mce_record");
+ if (is_disabled_event("mce", "mce_record")) {
+ log(ALL, LOG_INFO, "Disabled %s:%s tracing from config\n",
+ "mce", "mce_record");
+ } else {
+ rc = register_mce_handler(ras, cpus);
+ if (rc)
+ log(ALL, LOG_INFO, "Can't register mce handler\n");
+ if (ras->mce_priv) {
+ rc = add_event_handler(ras, pevent, page_size,
+ "mce", "mce_record",
+ ras_mce_event_handler, NULL, MCE_EVENT);
+ if (!rc)
+ num_events++;
+ else
+ log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
+ "mce", "mce_record");
+ }
}
#endif
#ifdef HAVE_EXTLOG
- rc = add_event_handler(ras, pevent, page_size, "ras", "extlog_mem_event",
- ras_extlog_mem_event_handler, NULL, EXTLOG_EVENT);
- if (!rc) {
- /* tell kernel we are listening, so don't printk to console */
- (void)open("/sys/kernel/debug/ras/daemon_active", 0);
- num_events++;
- } else
- log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
- "ras", "extlog_mem_event");
+ if (is_disabled_event("ras", "extlog_mem_event")) {
+ log(ALL, LOG_INFO, "Disabled %s:%s tracing from config\n",
+ "ras", "extlog_mem_event");
+ } else {
+ rc = add_event_handler(ras, pevent, page_size, "ras", "extlog_mem_event",
+ ras_extlog_mem_event_handler, NULL, EXTLOG_EVENT);
+ if (!rc) {
+ /* tell kernel we are listening, so don't printk to console */
+ (void)open("/sys/kernel/debug/ras/daemon_active", 0);
+ num_events++;
+ } else
+ log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
+ "ras", "extlog_mem_event");
+ }
#endif
#ifdef HAVE_DEVLINK
- rc = add_event_handler(ras, pevent, page_size, "net",
- "net_dev_xmit_timeout",
- ras_net_xmit_timeout_handler, NULL, DEVLINK_EVENT);
- if (!rc)
- filter_str = "devlink/devlink_health_report:msg=~\'TX timeout*\'";
-
- rc = add_event_handler(ras, pevent, page_size, "devlink",
- "devlink_health_report",
- ras_devlink_event_handler, filter_str, DEVLINK_EVENT);
- if (!rc)
- num_events++;
- else
- log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
- "devlink", "devlink_health_report");
+ if (is_disabled_event("net", "net_dev_xmit_timeout")) {
+ log(ALL, LOG_INFO, "Disabled %s:%s tracing from config\n",
+ "net", "net_dev_xmit_timeout");
+ } else {
+ rc = add_event_handler(ras, pevent, page_size, "net",
+ "net_dev_xmit_timeout",
+ ras_net_xmit_timeout_handler, NULL, DEVLINK_EVENT);
+ if (!rc)
+ filter_str = "devlink/devlink_health_report:msg=~\'TX timeout*\'";
+
+ if (is_disabled_event("devlink", "devlink_health_report")) {
+ log(ALL, LOG_INFO, "Disabled %s:%s tracing from config\n",
+ "devlink", "devlink_health_report");
+ } else {
+ rc = add_event_handler(ras, pevent, page_size, "devlink",
+ "devlink_health_report",
+ ras_devlink_event_handler, filter_str, DEVLINK_EVENT);
+ if (!rc)
+ num_events++;
+ else
+ log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
+ "devlink", "devlink_health_report");
+ }
+ }
#endif
#ifdef HAVE_DISKERROR
- rc = filter_ras_mc_event(ras, "block", "block_rq_complete", "error != 0");
- if (!rc) {
- rc = add_event_handler(ras, pevent, page_size, "block",
- "block_rq_complete", ras_diskerror_event_handler,
- NULL, DISKERROR_EVENT);
- if (!rc)
- num_events++;
- else
- log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
- "block", "block_rq_complete");
+ if (is_disabled_event("block", "block_rq_complete")) {
+ log(ALL, LOG_INFO, "Disabled %s:%s tracing from config\n",
+ "block", "block_rq_complete");
+ } else {
+ rc = filter_ras_mc_event(ras, "block", "block_rq_complete", "error != 0");
+ if (!rc) {
+ rc = add_event_handler(ras, pevent, page_size, "block",
+ "block_rq_complete", ras_diskerror_event_handler,
+ NULL, DISKERROR_EVENT);
+ if (!rc)
+ num_events++;
+ else
+ log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
+ "block", "block_rq_complete");
+ }
}
#endif
#ifdef HAVE_MEMORY_FAILURE
- rc = add_event_handler(ras, pevent, page_size, "ras", "memory_failure_event",
- ras_memory_failure_event_handler, NULL, MF_EVENT);
- if (!rc)
- num_events++;
- else
- log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
- "ras", "memory_failure_event");
+ if (is_disabled_event("ras", "memory_failure_event")) {
+ log(ALL, LOG_INFO, "Disabled %s:%s tracing from config\n",
+ "ras", "memory_failure_event");
+ } else {
+ rc = add_event_handler(ras, pevent, page_size, "ras", "memory_failure_event",
+ ras_memory_failure_event_handler, NULL, MF_EVENT);
+ if (!rc)
+ num_events++;
+ else
+ log(ALL, LOG_ERR, "Can't get traces from %s:%s\n",
+ "ras", "memory_failure_event");
+ }
#endif
if (!num_events) {
diff --git a/rasdaemon.c b/rasdaemon.c
index 66f4dea..0437662 100644
--- a/rasdaemon.c
+++ b/rasdaemon.c
@@ -25,6 +25,7 @@
#include "ras-record.h"
#include "ras-logger.h"
#include "ras-events.h"
+#include "ras-disabled-events.h"
/*
* Arguments(argp) handling logic and main
@@ -34,6 +35,9 @@
#define TOOL_DESCRIPTION "RAS daemon to log the RAS events."
#define ARGS_DOC "<options>"
+char choices_disable[MAX_DISABLED_TRACEPOINTS_NUM][MAX_TRACEPOINTS_STR_LENGTH];
+int disabled_tracepoints_num;
+
const char *argp_program_version = TOOL_NAME " " VERSION;
const char *argp_program_bug_address = "Mauro Carvalho Chehab <mchehab@kernel.org>";
@@ -43,6 +47,36 @@ struct arguments {
int foreground;
};
+static void parse_disabled_choices() {
+ char disabled_tracepoints_str[MAX_DISABLED_TRACEPOINTS_STR_LENGTH];
+ const char* sep = ";";
+ char* tracepoint_str;
+ char* config_disabled_tracepoints = getenv(DISABLE);
+ if (config_disabled_tracepoints == NULL) {
+ return;
+ }
+
+ if (strlen(config_disabled_tracepoints) >= MAX_DISABLED_TRACEPOINTS_STR_LENGTH) {
+ log(ALL, LOG_WARNING, "Failed to read disabled events config string, length exceeds %d characters.\n", MAX_DISABLED_TRACEPOINTS_STR_LENGTH);
+ return;
+ }
+ strcpy(disabled_tracepoints_str, config_disabled_tracepoints);
+
+ tracepoint_str = strtok(disabled_tracepoints_str, sep);
+ int index = 0;
+
+ while(tracepoint_str != NULL && index < MAX_DISABLED_TRACEPOINTS_NUM) {
+ if (strlen(tracepoint_str) >= MAX_TRACEPOINTS_STR_LENGTH) {
+ log(ALL, LOG_WARNING, "Failed to read disabled events config item %s string, length exceeds %d characters, skipped.\n", tracepoint_str, MAX_TRACEPOINTS_STR_LENGTH);
+ }
+ else {
+ strcpy(choices_disable[index++], tracepoint_str);
+ }
+ tracepoint_str = strtok(NULL, sep);
+ }
+ disabled_tracepoints_num = index;
+}
+
static error_t parse_opt(int k, char *arg, struct argp_state *state)
{
struct arguments *args = state->input;
@@ -102,6 +136,8 @@ int main(int argc, char *argv[])
return -1;
}
+ parse_disabled_choices();
+
if (args.enable_ras) {
int enable;
--
2.33.0
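
How the DISABLE list from the patch above is split and matched can be sketched as a standalone fragment. This is an assumption-laden illustration rather than the patch's code: struct disabled_list, parse_disabled() and is_disabled() are hypothetical names, the list is passed as a string instead of being read with getenv(), and the size limits are chosen only to mirror the header's constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_ITEMS 50      /* mirrors MAX_DISABLED_TRACEPOINTS_NUM */
#define MAX_ITEM_LEN 50   /* mirrors MAX_TRACEPOINTS_STR_LENGTH */

/* Hypothetical container for the parsed list; the patch uses globals instead. */
struct disabled_list {
	char items[MAX_ITEMS][MAX_ITEM_LEN];
	int count;
};

/* Split "group:event;group:event" on ';', skipping items that do not fit. */
static void parse_disabled(const char *value, struct disabled_list *out)
{
	char buf[256];
	char *tok;

	out->count = 0;
	if (!value || strlen(value) >= sizeof(buf))
		return;                          /* unset or too long: nothing disabled */
	snprintf(buf, sizeof(buf), "%s", value); /* strtok() modifies its input */

	for (tok = strtok(buf, ";");
	     tok && out->count < MAX_ITEMS;
	     tok = strtok(NULL, ";")) {
		if (strlen(tok) >= MAX_ITEM_LEN)
			continue;                /* oversized item is skipped */
		snprintf(out->items[out->count++], MAX_ITEM_LEN, "%s", tok);
	}
}

/* Return true if "group:event" appears in the parsed list. */
static bool is_disabled(const struct disabled_list *list,
			const char *group, const char *event)
{
	char name[2 * MAX_ITEM_LEN];
	int i;

	snprintf(name, sizeof(name), "%s:%s", group, event);
	for (i = 0; i < list->count; i++)
		if (strcmp(list->items[i], name) == 0)
			return true;
	return false;
}
```

Given the value "block:block_rq_complete;ras:aer_event", the sketch reports block:block_rq_complete and ras:aer_event as disabled and leaves ras:mc_event enabled. Matching is an exact string comparison, so stray spaces around an item would prevent a match.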

@ -0,0 +1,118 @@
From 0e823890cafdb7220bc916f77b21ca2bdf6cdadc Mon Sep 17 00:00:00 2001
From: zhuofeng <zhuofeng2@huawei.com>
Date: Thu, 7 Dec 2023 10:26:56 +0800
Subject: [PATCH] Fix potential overflow with some arrays at page-isolation
logic
Overflows may happen in the `threshold_string` and `cycle_string` arrays.
If the PAGE_CE_THRESHOLD value in page isolation is set to 50 bits,
there is a risk of array overflow. Because sprintf is an insecure
function, use snprintf instead.
An error is reported when the AddressSanitizer is used.
rasdaemon: Improper PAGE_CE_ACTION, set to default soft
rasdaemon: Page offline choice on Corrected Errors is soft
=================================================================
==221920==ERROR: AddressSanitizer: stack-buffer-overflow on address 0xffffdd91d932 at pc 0xffffa24071c4 bp 0xffffdd91d720 sp 0xffffdd91ced8
WRITE of size 55 at 0xffffdd91d932 thread T0
#0 0xffffa24071c0 in vsprintf (/usr/lib64/libasan.so.6+0x5c1c0)
#1 0xffffa24073cc in sprintf (/usr/lib64/libasan.so.6+0x5c3cc)
#2 0x459558 in parse_env_string /home/rasdaemon/ras-page-isolation.c:185
#3 0x4596f4 in page_isolation_init /home/rasdaemon/ras-page-isolation.c:202
#4 0x459934 in ras_page_account_init /home/rasdaemon/ras-page-isolation.c:211
#5 0x40f700 in handle_ras_events /home/rasdaemon/ras-events.c:902
#6 0x405b8c in main /home/rasdaemon/rasdaemon.c:211
#7 0xffffa20b6f38 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
#8 0xffffa20b7004 in __libc_start_main_impl ../csu/libc-start.c:409
#9 0x4038ec in _start (/home/rasdaemon/rasdaemon+0x4038ec)
Address 0xffffdd91d932 is located in stack of thread T0 at offset 82 in frame
#0 0x459574 in page_isolation_init /home/rasdaemon/ras-page-isolation.c:190
This frame has 2 object(s):
[32, 82) 'threshold_string' (line 191)
[128, 178) 'cycle_string' (line 192) <== Memory access at offset 82 partially underflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
(longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow (/usr/lib64/libasan.so.6+0x5c1c0) in vsprintf
Shadow bytes around the buggy address:
0x200ffbb23ad0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x200ffbb23ae0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x200ffbb23af0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x200ffbb23b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x200ffbb23b10: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
=>0x200ffbb23b20: 00 00 00 00 00 00[02]f2 f2 f2 f2 f2 00 00 00 00
0x200ffbb23b30: 00 00 02 f3 f3 f3 f3 f3 00 00 00 00 00 00 00 00
0x200ffbb23b40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x200ffbb23b50: f1 f1 f1 f1 f1 f1 04 f2 00 00 f2 f2 00 00 00 00
0x200ffbb23b60: 00 00 00 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 f2
0x200ffbb23b70: f2 f2 f2 f2 00 00 00 00 00 00 00 00 f2 f2 f2 f2
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
Shadow gap: cc
==221920==ABORTING
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
---
ras-page-isolation.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/ras-page-isolation.c b/ras-page-isolation.c
index fd7bd70..caa8c31 100644
--- a/ras-page-isolation.c
+++ b/ras-page-isolation.c
@@ -171,18 +171,18 @@ parse:
config->unit = no_unit ? config->unit : "";
}
-static void parse_env_string(struct isolation *config, char *str)
+static void parse_env_string(struct isolation *config, char *str, unsigned int size)
{
int i;
if (config->overflow) {
/* when overflow, use basic unit */
for (i = 0; config->units[i].name; i++) ;
- sprintf(str, "%lu%s", config->val, config->units[i-1].name);
+ snprintf(str, size, "%lu%s", config->val, config->units[i-1].name);
log(TERM, LOG_INFO, "%s is set overflow(%s), truncate it\n",
config->name, config->env);
} else {
- sprintf(str, "%s%s", config->env, config->unit);
+ snprintf(str, size, "%s%s", config->env, config->unit);
}
}
@@ -199,8 +199,8 @@ static void page_isolation_init(void)
parse_isolation_env(&threshold);
parse_isolation_env(&cycle);
- parse_env_string(&threshold, threshold_string);
- parse_env_string(&cycle, cycle_string);
+ parse_env_string(&threshold, threshold_string, sizeof(threshold_string));
+ parse_env_string(&cycle, cycle_string, sizeof(cycle_string));
log(TERM, LOG_INFO, "Threshold of memory Corrected Errors is %s / %s\n",
threshold_string, cycle_string);
}
--
2.33.0
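
The effect of switching from sprintf to snprintf in the patch above can be shown with a small standalone helper. format_bounded() is a hypothetical name used only for this sketch; it relies on the standard behaviour that snprintf writes at most size - 1 characters plus a terminating NUL and returns the length the full output would have had.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format value+unit into dst; returns 1 if the result fit, 0 if it was truncated. */
static int format_bounded(char *dst, size_t size,
			  const char *value, const char *unit)
{
	int needed = snprintf(dst, size, "%s%s", value, unit);

	/* snprintf reports the untruncated length, so compare it to the buffer size. */
	return needed >= 0 && (size_t)needed < size;
}
```

With an 8-byte buffer, "24" plus "h" fits and yields "24h", while a 10-digit value is cut to 7 characters and the helper returns 0 instead of writing past the end of the array.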

@ -0,0 +1,32 @@
From 77600e0cd71cd5c34126635b199e7b66f4d74874 Mon Sep 17 00:00:00 2001
From: Shengwei Luo <luoshengwei@huawei.com>
Date: Tue, 23 Apr 2024 17:09:10 +0800
Subject: [PATCH] rasdaemon: Fix cpu isolate errors when some cpus are offline
before the service started.
The upstream patch uses sysconf(_SC_NPROCESSORS_ONLN) instead of
sysconf(_SC_NPROCESSORS_CONF). However, ras_cpu_isolation_init()
needs information for all CPUs, so pass it the configured CPU count.
Fixes: f1ea76375281 ("rasdaemon: Check CPUs online, not configured")
Signed-off-by: Shengwei Luo <luoshengwei@huawei.com>
---
ras-events.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/ras-events.c b/ras-events.c
index ffac02b..1aa6db6 100644
--- a/ras-events.c
+++ b/ras-events.c
@@ -950,7 +950,7 @@ int handle_ras_events(int record_events)
cpus = get_num_cpus(ras);
#ifdef HAVE_CPU_FAULT_ISOLATION
- ras_cpu_isolation_init(cpus);
+ ras_cpu_isolation_init(sysconf(_SC_NPROCESSORS_CONF));
#endif
#ifdef HAVE_MCE
--
2.33.0
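
The difference between the two sysconf() queries that the patch above depends on can be checked with a short fragment. configured_cpus() and online_cpus() are hypothetical wrappers for this sketch; _SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN are the glibc names used in the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* CPUs the system is configured with, including ones that are currently offline. */
static long configured_cpus(void)
{
	return sysconf(_SC_NPROCESSORS_CONF);
}

/* CPUs that are online right now. */
static long online_cpus(void)
{
	return sysconf(_SC_NPROCESSORS_ONLN);
}
```

On a machine where some CPUs have been taken offline, the online count would be expected to be smaller than the configured count, which is why per-CPU state sized from the online count can miss CPUs that come online later.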

@ -0,0 +1,57 @@
From 83f7052a8d8c9641809611d9485256d8ed843c31 Mon Sep 17 00:00:00 2001
From: caixiaomeng 00662745 <caixiaomeng2@huawei.com>
Date: Wed, 6 Mar 2024 14:21:41 +0800
Subject: [PATCH] huawei-fix-rasdaemon-print-loading-config-logs-multi
---
rasdaemon.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/rasdaemon.c b/rasdaemon.c
index 0437662..7ece6c1 100644
--- a/rasdaemon.c
+++ b/rasdaemon.c
@@ -47,7 +47,7 @@ struct arguments {
int foreground;
};
-static void parse_disabled_choices() {
+static void parse_disabled_choices(int enable_ras) {
char disabled_tracepoints_str[MAX_DISABLED_TRACEPOINTS_STR_LENGTH];
const char* sep = ";";
char* tracepoint_str;
@@ -57,16 +57,18 @@ static void parse_disabled_choices() {
}
if (strlen(config_disabled_tracepoints) >= MAX_DISABLED_TRACEPOINTS_STR_LENGTH) {
- log(ALL, LOG_WARNING, "Failed to read disabled events config string, length exceeds %d characters.\n", MAX_DISABLED_TRACEPOINTS_STR_LENGTH);
+ if (enable_ras) {
+ log(ALL, LOG_WARNING, "Failed to read disabled events config string, length exceeds %d characters.\n", MAX_DISABLED_TRACEPOINTS_STR_LENGTH);
+ }
return;
}
strcpy(disabled_tracepoints_str, config_disabled_tracepoints);
-
+
tracepoint_str = strtok(disabled_tracepoints_str, sep);
int index = 0;
while(tracepoint_str != NULL && index < MAX_DISABLED_TRACEPOINTS_NUM) {
- if (strlen(tracepoint_str) >= MAX_TRACEPOINTS_STR_LENGTH) {
+ if (enable_ras && strlen(tracepoint_str) >= MAX_TRACEPOINTS_STR_LENGTH) {
log(ALL, LOG_WARNING, "Failed to read disabled events config item %s string, length exceeds %d characters, skipped.\n", tracepoint_str, MAX_TRACEPOINTS_STR_LENGTH);
}
else {
@@ -136,7 +138,7 @@ int main(int argc, char *argv[])
return -1;
}
- parse_disabled_choices();
+ parse_disabled_choices(args.enable_ras);
if (args.enable_ras) {
int enable;
--
2.33.0

@ -1,6 +1,6 @@
Name: rasdaemon
Version: 0.6.7
-Release: 16
+Release: 21
License: GPLv2
Summary: Utility to get Platform Reliability, Availability and Serviceability (RAS) reports via the Kernel tracing events
URL: https://github.com/mchehab/rasdaemon.git
@ -40,6 +40,8 @@ Patch18: 0010-rasdaemon-Fix-for-a-memory-out-of-bounds-issue-and-o.patch
Patch19: 0001-rasdaemon-use-standard-length-PATH_MAX-for-path-name.patch
Patch20: rasdaemon-diskerror-fix-incomplete-diskerror-log.patch
Patch21: 0001-Check-CPUs-online-not-configured.patch
+Patch22: backport-Fix-potential-overflow-with-some-arrays-at-page-isol.patch
+Patch23: bugfix-fix-cpu-isolate-errors-when-some-cpus-are-.patch
Patch6000: backport-rasdaemon-ras-mc-ctl-Fix-script-to-parse-dimm-sizes.patch
Patch6001: backport-rasdaemon-ras-memory-failure-handler-handle-localtim.patch
@ -55,6 +57,9 @@ Patch9005: 0003-rasdaemon-Add-support-for-creating-the-vendor-error-.patch
Patch9006: 0004-rasdaemon-Add-four-modules-supported-by-HiSilicon-co.patch
Patch9007: fix-ras-events-quit-loop-in-read_ras_event-when-kbuf-dat.patch
Patch9008: 0001-rasdaemon-ras-mc-ctl-Modify-check-for-HiSilicon-KunP.patch
+Patch9009: add-dynamic-switch-of-ras-events-support-and-disable-block-rq-complete.patch
+Patch9010: fix-rasdaemon-print-loading-config-logs-multiple-times.patch
+Patch9011: 0001-rasdaemon-Fix-for-vendor-errors-are-not-recorded-in-.patch
%description
The rasdaemon program is a daemon which monitors the platform
@ -103,9 +108,43 @@ if [ $1 -eq 2 ] ; then
fi
%preun
-/usr/bin/systemctl disable rasdaemon.service >/dev/null 2>&1 || :
+if [ $1 -eq 0 ] ; then
+    /usr/bin/systemctl disable rasdaemon.service >/dev/null 2>&1 || :
+fi
%changelog
+* Thu Apr 25 2024 yangjunshuo <yangjunshuo@huawei.com> - 0.6.7-21
+- Type:bugfix
+- ID:NA
+- SUG:NA
+- DESC:fix cpu isolate errors when some cpus are offline
+  before the service started
+* Tue Apr 23 2024 Bing Xia <xiabing12@h-partners.com> - 0.6.7-20
+- Type:bugfix
+- ID:NA
+- SUG:NA
+- DESC:Fix for vendor errors are not recorded in the SQLite database if
+  some cpus are offline at the system start
+* Mon Apr 8 2024 caixiaomeng <caixiaomeng2@huawei.com> - 0.6.7-19
+- Type:bugfix
+- ID:NA
+- SUG:NA
+- DESC:add-dynamic-switch-of-ras-events-support-and-disable-block-rq-complete
+* Mon Mar 25 2024 zhangruifang <zhangruifang@h-partners.com> - 0.6.7-18
+- Type:bugfix
+- ID:NA
+- SUG:NA
+- DESC:backport upstream patches
+* Thu Dec 28 2023 caixiaomeng <caixiaomeng2@huawei.com> - 0.6.7-17
+- Type:bugfix
+- ID:NA
+- SUG:NA
+- DESC: fix rasdaemon disable service after upgrade
* Wed Dec 20 2023 caixiaomeng <caixiaomeng2@huawei.com> - 0.6.7-16
- Type:bugfix
- ID:NA