ag1*_*112 8 java sql oracle algorithm plsql
我正在开发一个基于java-oracle的项目,我遇到了一个问题,在我看来这需要一个分析解决方案.我正在寻找基于SQL查询或任何算法的解决方案或任何我可以遵循以获得所需结果的免费分析工具.
问题陈述: 让我们说我在下面的表中有columnA-D和最后一列作为分数,我想找到每个列的值的标准,当在SQL where子句中合并时,它总是给我分数列的正值.所以基本上什么组合的columnA-D总能给我积极的分数?
columnA|columnB|columnC|columnD|Score
1 40 10 3 -20
0 40 2 3 10
0 10 3 3 20
1 15 3 3 -5
0 10 2 2 -15
0 15 6 3 -10
Run Code Online (Sandbox Code Playgroud)
上述数据集的预期结果: - 上述数据集的 可视解释给出条件:"ColumnA = 0且columnB> 10,columnC <5将确保得分始终> 0".(视觉上它的清晰列D没有效果).
请注意,上面的数据集是为了简单起见.实际上,我的项目包含大约40列,近2500行.有一件事是确保每列都具有有限的值范围.
这是我开始使用的算法(如果有人认为我在正确的方向,需要输入进一步改进它):
准备:创建所有可能表达式的列表,如A = 0,B> 10,C <5(对于40列,我最终总计约150个表达式)
我们称之为"表达式"变量.
第一轮算法:
set totalPositiveRows =从我的表中选择count(*)得分> 0;
set totalNegativeRows =从我的表中选择count(*)得分<0;
对于表达式中的每个expr,计算以下三个变量设置positivePercentage =查找满足此expr的totalPositiveRows的百分比; //如果得分> 0的总共100行中有60行满足expr,那么positivePercentage = 60%
set negativePercentage= find percentage of totalNegativeRows which satisfy this expr; //like if 40 rows out of total 100 rows having score<0 satisfy expr , then negativePercentage=40%
set diffPercentage=positivePercentage-negativePercentage;
Run Code Online (Sandbox Code Playgroud)设置initialexpr =选择具有最大值diffPercentage的expr设置initalPositivePercentage =选择相应的positivePercentage值; set initalNegativePercentage =选择相应的negativePercentage值; 我的想法是,我现在需要继续扩展initalexpr,直到initalNegativePercentage变为0.
随后运行的算法直到initalNegativePercentage变为0: -
对于表达式中的每个expr,计算以下三个变量
newexpr = initialexpr +"和"+ expr; set positivePercentage =查找满足newexpr的totalPositiveRows的百分比; set negativePercentage =查找满足newexpr的totalNegativeRows的百分比;
//计算它减少了多少负百分比?set positiveReduction = initalPositivePercentage-positivePercentage; set negativeReduction = initalNegativePercentage-negativePercentage; if(negativeReduction> = positiveReduction)//记下它//否则丢弃它
选择给出最大负减少的expr,这将成为新的初始expr.设置initialexpr =选择expr,最大值为negativeReduction set initalPositivePercentage =选择相应的值; set initalNegativePercentage =选择相应的值;
重复上面的算法.
请评论.
下面是一个蛮力的解决方案.对于理论计算机科学网站来说,这也许是一个很好的问题.我认为这是一个类似于布尔可满足性的NP完全问题,但这只是一个疯狂的猜测.可能有一种更聪明的方法来解决这个问题,但我不认为我会找到它.
基本思想是将每个表达式与列的每个不同值交叉连接,然后交叉连接所有列.将使用每个表达式列表查询该表,并为正分和负分数生成计数.如果表达式返回预期的正分数并且没有负分,则它是有效的.
这是假设你只使用表达式>,<和=.每个新列或表达式都会使此问题呈指数级变慢.
测试架构
drop table table1;
create table table1(a number, b number, c number, d number, score number);
insert into table1
select 1, 40, 10, 3, -20 from dual union all
select 0, 40, 2, 3, 10 from dual union all
select 0, 10, 3, 3, 20 from dual union all
select 1, 15, 3, 3, -5 from dual union all
select 0, 10, 2, 2, -15 from dual union all
select 0, 15, 6, 3, -10 from dual;
Run Code Online (Sandbox Code Playgroud)
代码墙
declare
v_inline_view varchar2(32767);
v_inline_views clob;
v_inline_view_counter number := 0;
v_and_expression varchar2(32767);
v_query clob;
v_sqls sys.odcivarchar2list;
v_dynamic_query_counter number := 0;
begin
--#1: Create inline views.
--One for every combination of expression and distinct value, per column.
for inline_views in
(
--Inline view for every possible expression for each column.
select
replace(q'[
(
select *
from
(
--Possible expressions.
select distinct
case
when operator is null then null
else ' AND %%COLUMN%% '||operator||' '||%%COLUMN%%
end %%COLUMN%%_expression
from
--All operators.
(
select '>' operator from dual union all
select '<' operator from dual union all
select '=' operator from dual union all
select null operator from dual
)
--All distinct values.
cross join
(
select distinct %%COLUMN%% from table1
)
)
--where %%COLUMN%%_expression is null or %%COLUMN%%_expression %%EXPRESSION_PERFORMANCE_EXCLUSIONS%%
)
]', '%%COLUMN%%', column_name) inline_view
from user_tab_columns
where table_name = 'TABLE1'
and column_name <> 'SCORE'
order by column_name
) loop
--Assign to temorary so it can be modified.
v_inline_view := inline_views.inline_view;
--#1A: Optimize inline view - throw out expressions if they don't return any positive results.
declare
v_expressions sys.odcivarchar2list;
v_expressions_to_ignore varchar2(32767);
v_has_score_gt_0 number;
begin
--Gather expressions for one column.
execute immediate v_inline_view bulk collect into v_expressions;
--Loop through and test each expression.
for i in 1 .. v_expressions.count loop
--Always keep the null expression.
if v_expressions(i) is not null then
--Count the number of rows with a positive score.
execute immediate 'select nvl(max(case when score > 0 then 1 else 0 end), 0) from table1 where '||replace(v_expressions(i), ' AND ', null)
into v_has_score_gt_0;
--If the expression returns nothing positive then add it to exclusion.
if v_has_score_gt_0 = 0 then
v_expressions_to_ignore := v_expressions_to_ignore||','''||v_expressions(i)||'''';
end if;
end if;
end loop;
--Convert it into an IN clause.
if v_expressions_to_ignore is not null then
--Remove comment, replace placeholder with expression exclusions.
v_inline_view := regexp_replace(v_inline_view, '(.*)(--where)( .* )(%%EXPRESSION_PERFORMANCE_EXCLUSIONS%%)(.*)', '\1where\3 not in ('||substr(v_expressions_to_ignore, 2)||')');
end if;
end;
--Aggregate and count inline views.
if v_inline_view_counter <> 0 then
v_inline_views := v_inline_views||'cross join';
end if;
v_inline_views := v_inline_views||v_inline_view;
v_inline_view_counter := v_inline_view_counter + 1;
end loop;
--#2: Create an AND expression to combine all column expressions.
select listagg(column_name||'_expression', '||') within group (order by column_name)
into v_and_expression
from user_tab_columns
where table_name = 'TABLE1'
and column_name <> 'SCORE';
--#3: Create a that will create all possible expression combinations.
v_query :=
replace(replace(q'[
--8281 combinations
select '
select
'''||expressions||''' expression,
nvl(sum(case when score > 0 then 1 else 0 end), 0) gt_0_score_count,
nvl(sum(case when score <= 0 then 1 else 0 end), 0) le_0_score_count
from table1
where 1=1 '||expressions v_sql
from
(
--Combine expressions
select %%AND_EXPRESSION%% expressions
from
%%INLINE_VIEWS%%
) combined_expressions
]', '%%AND_EXPRESSION%%', v_and_expression), '%%INLINE_VIEWS%%', v_inline_views);
--TEST: It might be useful to see the full query here.
--dbms_output.put_line(v_query);
--#4: Gather expressions.
--With larger input you'll want to use a LIMIT
execute immediate v_query
bulk collect into v_sqls;
--#5: Test each expression.
--Look for any queries that return the right number of rows.
for i in 1 .. v_sqls.count loop
declare
v_expression varchar2(4000);
v_gt_0_score_count number;
v_le_0_score_count number;
begin
execute immediate v_sqls(i) into v_expression, v_gt_0_score_count, v_le_0_score_count;
v_dynamic_query_counter := v_dynamic_query_counter + 1;
--TODO: Dynamically generate "2".
if v_gt_0_score_count = 2 and v_le_0_score_count = 0 then
dbms_output.put_line('Expression: '||v_expression);
end if;
exception when others then
dbms_output.put_line('Problem with: '||v_sqls(i));
end;
end loop;
dbms_output.put_line('Queries executed: '||v_dynamic_query_counter);
end;
/
Run Code Online (Sandbox Code Playgroud)
结果
结果看似正确.它们与您的略有不同,因为"columnB> 10"不正确.
Expression: AND A = 0 AND C < 6 AND D = 3
Expression: AND A = 0 AND C < 6 AND D > 2
Expression: AND A < 1 AND C < 6 AND D = 3
Expression: AND A < 1 AND C < 6 AND D > 2
Queries executed: 441
Run Code Online (Sandbox Code Playgroud)
问题
这种蛮力方法至少在两个方面效率极低.即使对于这个简单的例子,它也需要6370个查询.结果可能包括重复,这些重复是非常重要的.或者也许你会很幸运,而且很少有解决方案让你能够注意到它们.
您可以采取一些措施来提高查询性能.最简单的方法是单独检查每个条件,如果没有"获得"任何计数,则将其丢弃.
优化
排除不会返回任何积极结果的个别表达.使用示例数据,可以将查询执行次数从6370减少到441.
并行运行该过程还可以将性能提高一个数量级.它可能需要并行流水线功能.
但即使是100倍的性能提升也可能无法解决NP完全问题.您可能需要根据样本数据找到一些额外的"捷径".
通过取消注释其中一个dbms_output.put_line语句,打印出生成测试查询的查询可能会有所帮助.添加一个count(*)以查看将执行多少查询并使用较小的数据集运行以估计需要多长时间.
如果估计是十亿年,并且您无法想到使蛮力方法更快地运行的任何技巧,那么可能是时候在https://cstheory.stackexchange.com/上提出这个问题了.