Mar*_*ark 12 python algorithm r mathematical-optimization julia
令人遗憾的是,如此多的伟大国家(例如印度)和球员(例如莫·萨拉赫)可能永远不会参加国际足联(足球/足球)世界杯(同样的论点也适用于由少数统治者主导的其他体育赛事)队,例如国际板球和篮球锦标赛)。
尝试创建一个更加平衡的赛事,同时仍然保留基于位置的元素,每个团队具有(大致)相似的实力,并且所有玩家都来自连续区域(最好是同一块土地,但显然不可能在所有情况下都可行,并且超出了这个问题的范围)。
例如,在较弱的地区,可能他们的球队会跨越多个国家(例如,可能是南亚球队),而对于强国,可能会有多个球队(例如,一支只有来自郊区的球员的球队)巴黎)。
然后我可以绘制这些区域,也许使用 Voronois。
我设法通过(《足球经理》和《转会市场》的抓取)收集了相关信息,但我对如何设计选择球队的算法感到困惑。
有大量坐标,对应玩家的出生地。这些地方都有玩家,并且玩家都有评分(从 1 - 100)和位置。
任务是,给定一定的球队规模 (11) 和一定数量的球队(在下面的示例中为 10),以及每个位置所需的一定数量的球员(1,尽管替补球员会很方便),将将区域划分为连续的区域,其中由给定区域的玩家组成的最佳团队与其他区域的团队大致具有相同的技能。
我已经阅读了一些有关图论的内容,但我不确定从哪里开始创建解决这个问题的程序。
我正在寻找一些关于这方面的指导,并且(如果可能的话),如果你可以用玩具示例创建一些东西,那就太棒了!
另外,如果解决整体问题太困难,如果您能找到一种方法来缩小问题范围,并且可以创建一些解决较小问题的方法,然后可以将其推广到更大的问题,那就太好了。
set.seed(0)
library(tidyverse)
df <- tibble(
# generate random latitudes and longitudes, and the number of players for that location
lat = runif(100, 0, 90),
lon = runif(100, 0, 180),
players = rnorm(100, 0, 5) |> abs() |> ceiling() |> as.integer()
)
num_positions <- 11
position <- letters[1:num_positions]
df <- df |>
# generate position and skill data, and unnest
mutate(position = map(players, ~sample(position, .x, replace = TRUE)),
skill = map(players, ~sample(1:100, .x, replace = TRUE)))|>
unnest_longer(c(position, skill))
# plot the data
df |>
summarise(skill = mean(skill), players = first(players), .by = c(lat, lon)) |>
ggplot(aes(x = lon, y = lat)) +
geom_point(aes(size = players, color = skill)) +
scale_size_continuous(range = c(1, 10)) +
scale_color_viridis_c(option = "magma") + # similar to the Colombian shirt colours
theme_minimal() +
theme(legend.position = "none",
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
n_teams <- 10
Run Code Online (Sandbox Code Playgroud)
在我想做的事情中,它将涉及 64 支球队,因此至少有 704 名球员,并且可能是样本数据集大小的 3 倍左右。真实的数据集有很多行,但是通过过滤它,我应该能够将其减少到几千行。
在现实生活中,有些球员可以在多个位置上打得很好,但在我给出的示例代码中,每个球员只有一个位置。为每个球员添加多个位置可能会让这个问题更难解决,所以这超出了这个问题的范围。
如果你能在一个球体(比如地球)上做到这一点,那就太棒了,但矩形也可以。
Reinderien 很有帮助地指出,使用连续性而没有紧凑性可能会导致一些类似阿米加拉断层的杰瑞曼德怪物,类似于我们在美国选举地图中看到的情况。因此,尽管我仍在尝试找出一个更好的标题,但如果它对您来说更有效,我会说优先考虑紧凑性而不是连续性(尽管我猜测这会引发飞地问题?)
set.seed(0)
library(tidyverse)
df <- tibble(
# generate random latitudes and longitudes, and the number of players for that location
lat = runif(100, 0, 90),
lon = runif(100, 0, 180),
players = rnorm(100, 0, 5) |> abs() |> ceiling() |> as.integer()
)
num_positions <- 11
position <- letters[1:num_positions]
df <- df |>
# generate position and skill data, and unnest
mutate(position = map(players, ~sample(position, .x, replace = TRUE)),
skill = map(players, ~sample(1:100, .x, replace = TRUE)))|>
unnest_longer(c(position, skill))
# plot the data
df |>
summarise(skill = mean(skill), players = first(players), .by = c(lat, lon)) |>
ggplot(aes(x = lon, y = lat)) +
geom_point(aes(size = players, color = skill)) +
scale_size_continuous(range = c(1, 10)) +
scale_color_viridis_c(option = "magma") + # similar to the Colombian shirt colours
theme_minimal() +
theme(legend.position = "none",
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
n_teams <- 10
Run Code Online (Sandbox Code Playgroud)
using Random, DataFrames
Random.seed!(0)
df = DataFrame(lat = rand(100) * 90,
lon = rand(100) * 180,
players = Int64.(ceil.(abs.(5randn(100)))))
num_positions = 11
position = ['a'+ i for i in 0:num_positions-1]
df[!, :position] = [rand(position, players) for players in df.players]
df[!, :skill] = [rand(1:100, players) for players in df.players]
df = flatten(df, [:position, :skill])
Run Code Online (Sandbox Code Playgroud)
首先我们来谈谈连续性。我认为你应该重视紧凑性而不是连续性;毕竟,美国人已经证明,只要你足够相信,任何事物都可以是连续的。连续性是一个表示“困难的问题”,而紧凑性是一个表示“容易的问题”。可扩展性成为一个挑战,因为您没有执行简单的相似性聚类(例如k均值);您正在使用更复杂的标准执行聚类。这不是我的专长,因此在我的演示中,我展示了随机正确的简化:假设玩家技能和位置都是均匀分布的,那么基于也是均匀分布的人口密度随机选择的团队质心将代表现实(特别是如果您运行带有随机扰动的多个外部迭代)。具体来说,如果您有时间燃烧并想要完善解决方案,您可以将此方法包装在差分进化循环中,并以团队质心作为进化参数。
说到复杂的标准 - 该系统只有在构建正确的团队时才有用,而正确的团队需要职位标准。我提出了一个矩阵,您可以调整该矩阵来设置每个团队所需位置数量的下限和上限。
告诉优化器缩小允许的技能级别范围是可能的,但可能非常慢。您可能最好将其作为预解决步骤来执行,从而消除与您的目标技能水平相差一定数量的标准差的潜在玩家。或者,或者将公平性不是作为 LP 目标,而是作为 LP 约束,以每个团队的最小和最大技能总和为界限。我演示的是后者;但请注意,当我们将平均团队技能限制为总体平均技能(在本例中为 50)时,这是最可行的。定义范围太窄(小于 20%)或定义目标远离平均值会使解决方案变得困难。
全部一起,
import geopandas
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pulp
import shapely
from numpy.random import default_rng, Generator
n_teams = 64
players_per_team = 11
total_players = n_teams * players_per_team
max_skill = 100
def synthesize_world_data(rand: Generator) -> tuple[
geopandas.GeoDataFrame, # player and country data
pd.DataFrame, # field position data
]:
# Load the default world dataset from geopandas,
# dropping a couple of countries with degenerate bounds.
# There are better ways to fix this (not shown here).
world = (
geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
.rename(columns={'name': 'country'})
)
world.index.name = 'country'
world[['left', 'bottom', 'right', 'top']] = shapely.bounds(world['geometry'])
is_degenerate = np.isclose(world['right'] - world['left'], 360)
world = world[~is_degenerate]
# Assume that talent is uniformly distributed throughout the world,
# so players are spatially uniformly distributed. Draw from a pool
# of players ~2 times what is needed for final team formation.
pop_ratio = 2 * total_players / world['pop_est'].sum()
world['players_est'] = world['pop_est'] * pop_ratio
# Hack: perform an inefficient but statistically correct spatial random distribution of players
# within their countries by allocating random coordinates within rectilinear bounds and then doing
# geometric elimination of those players that have decided to live in the ocean
world['geodensity'] = shapely.area(world['geometry'])/(world['right'] - world['left'])/(world['top'] - world['bottom'])
n_soggy_players = (world['players_est'] / world['geodensity']).round().astype(int)
soggy_allocated = pd.DataFrame(
np.arange(n_soggy_players.max()) < n_soggy_players.values[:, np.newaxis],
index=n_soggy_players.index,
columns=pd.RangeIndex(name='country_player', start=0, stop=n_soggy_players.max()),
).replace(False, np.nan).stack()
world, _ = world.align(soggy_allocated, axis=0)
world['player_lon'] = rand.uniform(low=world['left'], high=world['right'])
world['player_lat'] = rand.uniform(low=world['bottom'], high=world['top'])
not_soggy = shapely.contains_xy(world['geometry'], world[['player_lon', 'player_lat']])
world = world[not_soggy]
# Assume that skill and position are uniformly distributed
# https://en.wikipedia.org/wiki/Association_football_positions
world['player_skill'] = rand.uniform(low=0, high=max_skill, size=len(world))
positions = pd.DataFrame(
{
'min_players': [1, 2, 2, 2],
'max_players': [1, 5, 5, 5],
},
index=pd.Index(name='name', data=['goalkeep', 'defender', 'midfield', 'forward']),
)
pos_mean = (positions['min_players'] + positions['max_players'])/2
world['position'] = rand.choice(
a=positions.index,
p=pos_mean/pos_mean.sum(),
size=len(world),
)
world.set_index(
pd.RangeIndex(name='player', start=0, stop=len(world)),
append=False, inplace=True,
)
return world, positions
def guess_team_centroids(rand: Generator, world: geopandas.GeoDataFrame) -> tuple[
pd.DataFrame, # centroids per team
pd.Series, # player-centroid distances
]:
coord_fields = ['player_lat', 'player_lon']
centroid_idx = rand.choice(a=world.index, size=n_teams)
centroids = (
world.loc[
centroid_idx,
coord_fields,
]
.rename(columns={'player_lat': 'team_lat', 'player_lon': 'team_lon'})
.set_index(pd.RangeIndex(name='team', start=0, stop=n_teams))
)
coords = np.deg2rad(
pd.merge(
how='cross', left=centroids.reset_index(),
right=world[coord_fields].reset_index(),
).set_index(['team', 'player'])
)
# Haversine norm (how is this not in geopandas?)
a = (
np.sin((coords['player_lat'] - coords['team_lat'])/2)**2 +
np.sin((coords['player_lon'] - coords['team_lon'])/2)**2
* np.cos(coords['team_lat']) * np.cos(coords['player_lat'])
)
c = np.arctan2(np.sqrt(a), np.sqrt(1 - a))
r = (world.crs.ellipsoid.semi_major_metre + world.crs.ellipsoid.semi_minor_metre) * 1e-3
distances = r*c
return centroids, distances
def make_vars(world: geopandas.GeoDataFrame) -> pd.Series:
team_idx = pd.RangeIndex(name='team', start=0, stop=n_teams)
assign_idx = pd.MultiIndex.from_product((team_idx, world.index))
names = assign_idx.to_frame().astype(str)
assigns = (
'asn_t' + names['team'] + '_p' + names['player']
).apply(pulp.LpVariable, cat=pulp.LpBinary)
return assigns
def make_objective(
assigns: pd.Series,
distances: pd.Series,
) -> pulp.LpAffineExpression:
# For compactness, minimize the distance from each player to their team's centroid
return pulp.lpDot(assigns, distances)/total_players/100
def add_constraints(
prob: pulp.LpProblem,
world: geopandas.GeoDataFrame,
positions: pd.DataFrame,
assigns: pd.Series,
) -> None:
for team, group in assigns.groupby('team'):
# There must be 11 players per team
prob.addConstraint(
name=f'teamsize_t{team}',
constraint=pulp.lpSum(group) == players_per_team,
)
# Enforce competitive team skill sum
skill_deviation = pulp.lpDot(
group,
world.loc[group.index.get_level_values('player'), 'player_skill'],
)/players_per_team - max_skill/2
prob.addConstraint(
name=f'skill_lo_t{team}',
constraint=skill_deviation >= -10,
)
prob.addConstraint(
name=f'skill_hi_t{team}',
constraint=skill_deviation <= 10,
)
# Each player may only be assigned up to one team
for player, group in assigns.groupby('player'):
prob.addConstraint(
name=f'playerexcl_p{player}',
constraint=pulp.lpSum(group) <= 1,
)
# Enforce the team position bounds
for position, (tmin, tmax) in positions.iterrows():
pos_players = world.index[world['position'] == position]
pos_assigns = assigns.loc[(slice(None), pos_players)]
for team, group in pos_assigns.groupby('team'):
total = pulp.lpSum(group)
prob.addConstraint(
name=f'poslo_t{team}_{position}',
constraint=total >= tmin,
)
prob.addConstraint(
name=f'poshi_t{team}_{position}',
constraint=total <= tmax,
)
def solve(prob: pulp.LpProblem, world: geopandas.GeoDataFrame, assigns: pd.Series) -> geopandas.GeoDataFrame:
prob.solve()
if prob.status != pulp.LpStatusOptimal:
raise ValueError('Solution status', prob.status)
assigns = (
assigns.apply(pulp.LpVariable.value)
.astype(int)
.unstack(level='team')
)
team_player_idx, team_idx = assigns.values.nonzero()
world.loc[team_player_idx, 'team'] = team_idx
return world
def dump_solution(
world: geopandas.GeoDataFrame,
) -> None:
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)
world = world.loc[
world['team'].notna(),
['continent', 'country', 'team', 'position', 'player_skill'],
]
print('Players by country:')
print(world.sort_values(['continent', 'country', 'team', 'position']), end='\n\n')
print('Players by team and position:')
print(world.sort_values(['team', 'position']), end='\n\n')
print('Team summary:')
grouped = world.groupby('team')
teams = world.groupby(['team', 'position']).country.count().unstack('position')
teams.insert(loc=0, column='continent', value=grouped['continent'].agg(pd.Series.mode).astype(str))
teams.insert(loc=1, column='country', value=grouped['country'].agg(pd.Series.mode).astype(str))
teams.insert(loc=2, column='skill', value=grouped['player_skill'].sum() / players_per_team)
teams.sort_values(['continent', 'country', 'skill'], inplace=True)
print(teams, end='\n\n')
def plot_solution(world: geopandas.GeoDataFrame) -> plt.Figure:
fig, ax = plt.subplots()
world['geometry'].boundary.plot(ax=ax)
for team, group in world.groupby('team'):
ax.scatter(group['player_lon'], group['player_lat'])
return fig
def main() -> None:
rand = default_rng(seed=0)
print('Synthesizing world data...')
world, positions = synthesize_world_data(rand)
centroids, distances = guess_team_centroids(rand, world)
print('Making assignment variables...')
assigns = make_vars(world)
prob = pulp.LpProblem(name='football_clustering', sense=pulp.LpMinimize)
print('Defining objective...')
prob.objective = make_objective(assigns, distances)
print('Adding constraints...')
add_constraints(prob, world, positions, assigns)
# print(prob)
print('Solving...')
world = solve(prob, world, assigns)
dump_solution(world)
plot_solution(world)
plt.show()
if __name__ == '__main__':
main()
Run Code Online (Sandbox Code Playgroud)
...
Team summary:
position continent country skill defender forward goalkeep midfield
team
6.0 Africa Algeria 44.963361 3 2 1 5
2.0 Africa Burkina Faso 45.802411 5 3 1 2
56.0 Africa Dem. Rep. Congo 47.018698 3 2 1 5
10.0 Africa Ethiopia 40.294830 3 4 1 3
11.0 Africa Ethiopia 43.277617 2 5 1 3
38.0 Africa Madagascar 59.876627 3 2 1 5
54.0 Africa Morocco 55.331606 3 2 1 5
63.0 Africa Nigeria 40.088676 4 3 1 3
55.0 Africa South Africa 55.269027 3 4 1 3
62.0 Africa Uganda 47.396431 3 2 1 5
36.0 Africa Uganda 48.577035 2 5 1 3
35.0 Africa ['Burundi' 'Dem. Rep. Congo' 'Tanzania'] 53.517437 3 4 1 3
20.0 Asia Bangladesh 57.757983 4 2 1 4
22.0 Asia Bangladesh 58.492236 5 3 1 2
3.0 Asia China 40.099363 2 4 1 4
9.0 Asia China 40.716526 3 2 1 5
34.0 Asia China 41.488606 4 2 1 4
53.0 Asia China 42.986541 4 4 1 2
7.0 Asia China 43.353136 2 5 1 3
49.0 Asia China 44.045816 3 2 1 5
1.0 Asia China 45.908339 3 3 1 4
52.0 Asia China 49.910424 5 3 1 2
40.0 Asia China 50.852176 3 2 1 5
14.0 Asia China 53.105643 3 2 1 5
15.0 Asia China 53.245060 4 2 1 4
23.0 Asia China 57.018255 4 2 1 4
45.0 Asia China 58.395190 3 2 1 5
57.0 Asia China 59.197112 2 3 1 5
26.0 Asia China 59.516640 3 4 1 3
21.0 Asia China 59.535480 4 3 1 3
44.0 Asia India 41.307787 2 5 1 3
48.0 Asia India 41.448672 2 5 1 3
37.0 Asia India 41.713956 2 3 1 5
5.0 Asia India 44.313757 2 5 1 3
19.0 Asia India 47.688853 2 5 1 3
39.0 Asia India 53.535032 4 4 1 2
43.0 Asia India 54.314630 2 3 1 5
50.0 Asia India 56.103479 2 5 1 3
0.0 Asia India 56.155755 2 3 1 5
59.0 Asia India 56.180263 4 4 1 2
17.0 Asia India 58.247667 2 5 1 3
16.0 Asia India 58.393727 2 4 1 4
31.0 Asia India 58.884043 3 3 1 4
41.0 Asia India 59.004900 2 3 1 5
12.0 Asia India 59.838218 2 4 1 4
28.0 Asia Indonesia 44.747853 4 4 1 2
27.0 Asia Indonesia 53.665286 4 2 1 4
60.0 Asia Japan 44.237347 4 2 1 4
4.0 Asia Pakistan 41.385745 2 5 1 3
58.0 Asia Philippines 42.990101 2 3 1 5
42.0 Asia Saudi Arabia 42.746346 4 2 1 4
32.0 Asia Turkey 49.816575 3 4 1 3
51.0 Asia ['China' 'Taiwan'] 42.435033 2 4 1 4
47.0 Asia ['Pakistan' 'Tajikistan'] 45.447561 4 2 1 4
25.0 Europe France 53.116500 5 2 1 3
30.0 Europe Germany 40.177276 3 5 1 2
13.0 Europe Italy 46.469427 3 2 1 5
61.0 Europe Poland 46.816901 3 5 1 2
33.0 Europe Romania 58.854495 3 3 1 4
29.0 North America Mexico 49.474287 2 5 1 3
46.0 North America Mexico 55.503530 3 2 1 5
24.0 North America United States of America 40.777742 2 4 1 4
8.0 South America Brazil 55.562492 5 2 1 3
18.0 South America ['Brazil' 'Venezuela'] 53.190197 3 5 1 2
Run Code Online (Sandbox Code Playgroud)
抱歉,到目前为止,我还没有针对您的完整问题提供真正准确的解决方案(正如我们在评论中讨论的那样,该问题涉及许多因素/约束,并且似乎不容易一次性解决),但我想也许我们可以从一个最小的例子出发,看看能否找到一些线索。
如果我正确理解您的目标,您希望对玩家进行分组,使组建的团队具有接近的实力,并具有以下限制:
从优化的角度来看,添加更多约束确实可以在一定程度上加速问题的解决过程,因为可行变量集的大小缩小了,但同时也大大增加了制定优化问题的复杂性。(正如 @Reinderien 在评论中所纠正的那样:“只有当约束处于优化之前的预求解步骤中时,这才是真正正确的。否则,通常情况相反。优化器需要更加努力地找到可行的解决方案,因为它们占据参数空间的较小部分。”)
我的目的并不是在这个单一答案中完全解决您的问题,而是想从您的最小子问题中吸取一点点,并看看添加更多约束的可能性。
如果我们只关注目标加上约束1),则该问题可以转化为经典的整数规划问题:
给定一个数字向量
vec和组数,如何将组中grpNr的条目均匀分配(是一个总是整除的整数),使得组的总和给出最小的差异。vecgrpNrgrpNrlength(vec)
R 中有很多解决优化问题的选项,这里只是一种带有CVXR包的实现
library(CVXR)
# Input numeric vector where entries are to be grouped
set.seed(0)
vec <- runif(20, 0, 10)
# Given number of groups
grpNr <- 5
# Define optimization variables: Binary variables indicate
x <- Variable(length(vec), grpNr, boolean = TRUE)
# Define objective: minimize the difference in group sums
obj <- Minimize(norm(t(vec) %*% x - sum(vec) / grpNr))
# Constraint: each item must be in exactly one group
constrs <- list(
sum_entries(x, 1) == 1,
sum_entries(x, 2) == length(vec) / grpNr
)
# Formulate and solve the problem
prb <- Problem(obj, constrs)
res <- solve(prb)
# Info of resulting groups
grpInfo <- apply(round(res$getValue(x)) > 0, 2, which, simplify = FALSE)
Run Code Online (Sandbox Code Playgroud)
我们就可以得到分组信息
> grpInfo
[[1]]
[1] 1 3 8 11
[[2]]
[1] 7 12 13 19
[[3]]
[1] 6 10 14 18
[[4]]
[1] 4 16 17 20
[[5]]
[1] 2 5 9 15
Run Code Online (Sandbox Code Playgroud)
以及组总和的分布
> vec %*% round(res$getValue(x))
[,1] [,2] [,3] [,4] [,5]
[1,] 22.75283 22.72827 22.35437 22.20429 22.18618
Run Code Online (Sandbox Code Playgroud)
说实话,我的能力受到我的知识的限制,所以我不知道我是否走在正确的道路上,或者我们下一步该走向何方……
应该指出的是,上面只是我第一次尝试最小的子问题,不幸的是,可扩展性可能很糟糕x(因为如果我们有更长vec或更大的 ,大小会爆炸grpNr)。
我将继续思考潜在的改进,也希望看到其他天才答案的贡献,特别是针对整个问题。:)