在邻近区域组建实力相似的团队

Mar*_*ark 12 python algorithm r mathematical-optimization julia

主意

令人遗憾的是,如此多的伟大国家(例如印度)和球员(例如莫·萨拉赫)可能永远不会参加国际足联(足球/足球)世界杯(同样的论点也适用于由少数统治者主导的其他体育赛事)队,例如国际板球和篮球锦标赛)。

在此输入图像描述

尝试创建一个更加平衡的赛事,同时仍然保留基于位置的元素,每个团队具有(大致)相似的实力,并且所有玩家都来自连续区域(最好是同一块土地,但显然不可能在所有情况下都可行,并且超出了这个问题的范围)。

例如,在较弱的地区,可能他们的球队会跨越多个国家(例如,可能是南亚球队),而对于强国,可能会有多个球队(例如,一支只有来自郊区的球员的球队)巴黎)。

然后我可以绘制这些区域,也许使用 Voronois。

背景

我设法通过(《足球经理》和《转会市场》的抓取)收集了相关信息,但我对如何设计选择球队的算法感到困惑。

问题

有大量坐标,对应玩家的出生地。这些地方都有玩家,并且玩家都有评分(从 1 - 100)和位置。

任务是,给定一定的球队规模 (11) 和一定数量的球队(在下面的示例中为 10),以及每个位置所需的一定数量的球员(1,尽管替补球员会很方便),将将区域划分为连续的区域,其中由给定区域的玩家组成的最佳团队与其他区域的团队大致具有相同的技能。

问题

我已经阅读了一些有关图论的内容,但我不确定从哪里开始创建解决这个问题的程序。

我正在寻找一些关于这方面的指导,并且(如果可能的话),如果你可以用玩具示例创建一些东西,那就太棒了!

另外,如果解决整体问题太困难,如果您能找到一种方法来缩小问题范围,并且可以创建一些解决较小问题的方法,然后可以将其推广到更大的问题,那就太好了。

示例代码(在 R 中,但我在下面进一步包含了 Python 和 Julia 等效代码 - 使用其中之一(或类似)的答案会很棒)

set.seed(0)

library(tidyverse)

df <- tibble(
    # generate random latitudes and longitudes, and the number of players for that location
    lat = runif(100, 0, 90),
    lon = runif(100, 0, 180),
    players = rnorm(100, 0, 5) |> abs() |> ceiling() |> as.integer()
)

num_positions <- 11

position <- letters[1:num_positions]

df <- df |>
   # generate position and skill data, and unnest
  mutate(position = map(players, ~sample(position, .x, replace = TRUE)),
         skill = map(players, ~sample(1:100, .x, replace = TRUE)))|>
         unnest_longer(c(position, skill))

# plot the data
df |> 
  summarise(skill = mean(skill), players = first(players), .by = c(lat, lon)) |>
  ggplot(aes(x = lon, y = lat)) +
    geom_point(aes(size = players, color = skill)) +
    scale_size_continuous(range = c(1, 10)) +
    scale_color_viridis_c(option = "magma") + # similar to the Colombian shirt colours
    theme_minimal() +
    theme(legend.position = "none",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

n_teams <- 10
Run Code Online (Sandbox Code Playgroud)

样本数据的散点图(圆圈大小为玩家数量,颜色为平均技能) 样本数据的散点图

笔记

  1. 在我想做的事情中,它将涉及 64 支球队,因此至少有 704 名球员,并且可能是样本数据集大小的 3 倍左右。真实的数据集有很多行,但是通过过滤它,我应该能够将其减少到几千行。

  2. 在现实生活中,有些球员可以在多个位置上打得很好,但在我给出的示例代码中,每个球员只有一个位置。为每个球员添加多个位置可能会让这个问题更难解决,所以这超出了这个问题的范围。

  3. 如果你能在一个球体(比如地球)上做到这一点,那就太棒了,但矩形也可以。

  4. Reinderien 很有帮助地指出,使用连续性而没有紧凑性可能会导致一些类似阿米加拉断层的杰瑞曼德怪物,类似于我们在美国选举地图中看到的情况。因此,尽管我仍在尝试找出一个更好的标题,但如果它对您来说更有效,我会说优先考虑紧凑性而不是连续性(尽管我猜测这会引发飞地问题?)

更新:

Python代码:
set.seed(0)

library(tidyverse)

df <- tibble(
    # generate random latitudes and longitudes, and the number of players for that location
    lat = runif(100, 0, 90),
    lon = runif(100, 0, 180),
    players = rnorm(100, 0, 5) |> abs() |> ceiling() |> as.integer()
)

num_positions <- 11

position <- letters[1:num_positions]

df <- df |>
   # generate position and skill data, and unnest
  mutate(position = map(players, ~sample(position, .x, replace = TRUE)),
         skill = map(players, ~sample(1:100, .x, replace = TRUE)))|>
         unnest_longer(c(position, skill))

# plot the data
df |> 
  summarise(skill = mean(skill), players = first(players), .by = c(lat, lon)) |>
  ggplot(aes(x = lon, y = lat)) +
    geom_point(aes(size = players, color = skill)) +
    scale_size_continuous(range = c(1, 10)) +
    scale_color_viridis_c(option = "magma") + # similar to the Colombian shirt colours
    theme_minimal() +
    theme(legend.position = "none",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

n_teams <- 10
Run Code Online (Sandbox Code Playgroud)
朱莉娅代码:
using Random, DataFrames
Random.seed!(0)

df = DataFrame(lat = rand(100) * 90, 
               lon = rand(100) * 180, 
               players = Int64.(ceil.(abs.(5randn(100)))))

num_positions = 11

position = ['a'+ i for i in 0:num_positions-1]

df[!, :position] = [rand(position, players) for players in df.players]
df[!, :skill] = [rand(1:100, players) for players in df.players]

df = flatten(df, [:position, :skill])
Run Code Online (Sandbox Code Playgroud)

Rei*_*ien 6

首先我们来谈谈连续性。我认为你应该重视紧凑性而不是连续性;毕竟,美国人已经证明,只要你足够相信,任何事物都可以是连续的。连续性是一个表示“困难的问题”,而紧凑性是一个表示“容易的问题”。可扩展性成为一个挑战,因为您没有执行简单的相似性聚类(例如k均值);您正在使用更复杂的标准执行聚类。这不是我的专长,因此在我的演示中,我展示了随机正确的简化:假设玩家技能和位置都是均匀分布的,那么基于也是均匀分布的人口密度随机选择的团队质心将代表现实(特别是如果您运行带有随机扰动的多个外部迭代)。具体来说,如果您有时间燃烧并想要完善解决方案,您可以将此方法包装在差分进化循环中,并以团队质心作为进化参数。

说到复杂的标准 - 该系统只有在构建正确的团队时才有用,而正确的团队需要职位标准。我提出了一个矩阵,您可以调整该矩阵来设置每个团队所需位置数量的下限和上限。

告诉优化器缩小允许的技能级别范围是可能的,但可能非常慢。您可能最好将其作为预解决步骤来执行,从而消除与您的目标技能水平相差一定数量的标准差的潜在玩家。或者,或者将公平性不是作为 LP 目标,而是作为 LP 约束,以每个团队的最小和最大技能总和为界限。我演示的是后者;但请注意,当我们将平均团队技能限制为总体平均技能(在本例中为 50)时,这是最可行的。定义范围太窄(小于 20%)或定义目标远离平均值会使解决方案变得困难。

全部一起,

import geopandas
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pulp
import shapely
from numpy.random import default_rng, Generator

n_teams = 64
players_per_team = 11
total_players = n_teams * players_per_team
max_skill = 100


def synthesize_world_data(rand: Generator) -> tuple[
    geopandas.GeoDataFrame,  # player and country data
    pd.DataFrame,            # field position data
]:
    # Load the default world dataset from geopandas,
    # dropping a couple of countries with degenerate bounds.
    # There are better ways to fix this (not shown here).
    world = (
        geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
        .rename(columns={'name': 'country'})
    )
    world.index.name = 'country'
    world[['left', 'bottom', 'right', 'top']] = shapely.bounds(world['geometry'])
    is_degenerate = np.isclose(world['right'] - world['left'], 360)
    world = world[~is_degenerate]

    # Assume that talent is uniformly distributed throughout the world,
    # so players are spatially uniformly distributed. Draw from a pool
    # of players ~2 times what is needed for final team formation.
    pop_ratio = 2 * total_players / world['pop_est'].sum()
    world['players_est'] = world['pop_est'] * pop_ratio

    # Hack: perform an inefficient but statistically correct spatial random distribution of players
    # within their countries by allocating random coordinates within rectilinear bounds and then doing
    # geometric elimination of those players that have decided to live in the ocean
    world['geodensity'] = shapely.area(world['geometry'])/(world['right'] - world['left'])/(world['top'] - world['bottom'])
    n_soggy_players = (world['players_est'] / world['geodensity']).round().astype(int)
    soggy_allocated = pd.DataFrame(
        np.arange(n_soggy_players.max()) < n_soggy_players.values[:, np.newaxis],
        index=n_soggy_players.index,
        columns=pd.RangeIndex(name='country_player', start=0, stop=n_soggy_players.max()),
    ).replace(False, np.nan).stack()
    world, _ = world.align(soggy_allocated, axis=0)

    world['player_lon'] = rand.uniform(low=world['left'], high=world['right'])
    world['player_lat'] = rand.uniform(low=world['bottom'], high=world['top'])
    not_soggy = shapely.contains_xy(world['geometry'], world[['player_lon', 'player_lat']])
    world = world[not_soggy]

    # Assume that skill and position are uniformly distributed
    # https://en.wikipedia.org/wiki/Association_football_positions
    world['player_skill'] = rand.uniform(low=0, high=max_skill, size=len(world))
    positions = pd.DataFrame(
        {
            'min_players': [1, 2, 2, 2],
            'max_players': [1, 5, 5, 5],
        },
        index=pd.Index(name='name', data=['goalkeep', 'defender', 'midfield', 'forward']),
    )
    pos_mean = (positions['min_players'] + positions['max_players'])/2
    world['position'] = rand.choice(
        a=positions.index,
        p=pos_mean/pos_mean.sum(),
        size=len(world),
    )
    world.set_index(
        pd.RangeIndex(name='player', start=0, stop=len(world)),
        append=False, inplace=True,
    )

    return world, positions


def guess_team_centroids(rand: Generator, world: geopandas.GeoDataFrame) -> tuple[
    pd.DataFrame,  # centroids per team
    pd.Series,     # player-centroid distances
]:
    coord_fields = ['player_lat', 'player_lon']

    centroid_idx = rand.choice(a=world.index, size=n_teams)
    centroids = (
        world.loc[
            centroid_idx,
            coord_fields,
        ]
        .rename(columns={'player_lat': 'team_lat', 'player_lon': 'team_lon'})
        .set_index(pd.RangeIndex(name='team', start=0, stop=n_teams))
    )

    coords = np.deg2rad(
        pd.merge(
            how='cross', left=centroids.reset_index(),
            right=world[coord_fields].reset_index(),
        ).set_index(['team', 'player'])
    )

    # Haversine norm (how is this not in geopandas?)
    a = (
        np.sin((coords['player_lat'] - coords['team_lat'])/2)**2 +
        np.sin((coords['player_lon'] - coords['team_lon'])/2)**2
        * np.cos(coords['team_lat']) * np.cos(coords['player_lat'])
    )
    c = np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    r = (world.crs.ellipsoid.semi_major_metre + world.crs.ellipsoid.semi_minor_metre) * 1e-3

    distances = r*c
    return centroids, distances


def make_vars(world: geopandas.GeoDataFrame) -> pd.Series:
    team_idx = pd.RangeIndex(name='team', start=0, stop=n_teams)

    assign_idx = pd.MultiIndex.from_product((team_idx, world.index))
    names = assign_idx.to_frame().astype(str)
    assigns = (
        'asn_t' + names['team'] + '_p' + names['player']
    ).apply(pulp.LpVariable, cat=pulp.LpBinary)

    return assigns


def make_objective(
    assigns: pd.Series,
    distances: pd.Series,
) -> pulp.LpAffineExpression:
    # For compactness, minimize the distance from each player to their team's centroid
    return pulp.lpDot(assigns, distances)/total_players/100


def add_constraints(
    prob: pulp.LpProblem,
    world: geopandas.GeoDataFrame,
    positions: pd.DataFrame,
    assigns: pd.Series,
) -> None:
    for team, group in assigns.groupby('team'):
        # There must be 11 players per team
        prob.addConstraint(
            name=f'teamsize_t{team}',
            constraint=pulp.lpSum(group) == players_per_team,
        )

        # Enforce competitive team skill sum
        skill_deviation = pulp.lpDot(
            group,
            world.loc[group.index.get_level_values('player'), 'player_skill'],
        )/players_per_team - max_skill/2
        prob.addConstraint(
            name=f'skill_lo_t{team}',
            constraint=skill_deviation >= -10,
        )
        prob.addConstraint(
            name=f'skill_hi_t{team}',
            constraint=skill_deviation <= 10,
        )

    # Each player may only be assigned up to one team
    for player, group in assigns.groupby('player'):
        prob.addConstraint(
            name=f'playerexcl_p{player}',
            constraint=pulp.lpSum(group) <= 1,
        )

    # Enforce the team position bounds
    for position, (tmin, tmax) in positions.iterrows():
        pos_players = world.index[world['position'] == position]
        pos_assigns = assigns.loc[(slice(None), pos_players)]
        for team, group in pos_assigns.groupby('team'):
            total = pulp.lpSum(group)
            prob.addConstraint(
                name=f'poslo_t{team}_{position}',
                constraint=total >= tmin,
            )
            prob.addConstraint(
                name=f'poshi_t{team}_{position}',
                constraint=total <= tmax,
            )


def solve(prob: pulp.LpProblem, world: geopandas.GeoDataFrame, assigns: pd.Series) -> geopandas.GeoDataFrame:
    prob.solve()
    if prob.status != pulp.LpStatusOptimal:
        raise ValueError('Solution status', prob.status)

    assigns = (
        assigns.apply(pulp.LpVariable.value)
        .astype(int)
        .unstack(level='team')
    )

    team_player_idx, team_idx = assigns.values.nonzero()
    world.loc[team_player_idx, 'team'] = team_idx
    return world


def dump_solution(
    world: geopandas.GeoDataFrame,
) -> None:
    pd.set_option('display.max_rows', 1000)
    pd.set_option('display.max_columns', 1000)
    pd.set_option('display.width', 1000)
    world = world.loc[
        world['team'].notna(),
        ['continent', 'country', 'team', 'position', 'player_skill'],
    ]

    print('Players by country:')
    print(world.sort_values(['continent', 'country', 'team', 'position']), end='\n\n')

    print('Players by team and position:')
    print(world.sort_values(['team', 'position']), end='\n\n')

    print('Team summary:')
    grouped = world.groupby('team')
    teams = world.groupby(['team', 'position']).country.count().unstack('position')
    teams.insert(loc=0, column='continent', value=grouped['continent'].agg(pd.Series.mode).astype(str))
    teams.insert(loc=1, column='country', value=grouped['country'].agg(pd.Series.mode).astype(str))
    teams.insert(loc=2, column='skill', value=grouped['player_skill'].sum() / players_per_team)
    teams.sort_values(['continent', 'country', 'skill'], inplace=True)
    print(teams, end='\n\n')


def plot_solution(world: geopandas.GeoDataFrame) -> plt.Figure:
    fig, ax = plt.subplots()
    world['geometry'].boundary.plot(ax=ax)

    for team, group in world.groupby('team'):
        ax.scatter(group['player_lon'], group['player_lat'])

    return fig


def main() -> None:
    rand = default_rng(seed=0)

    print('Synthesizing world data...')
    world, positions = synthesize_world_data(rand)
    centroids, distances = guess_team_centroids(rand, world)

    print('Making assignment variables...')
    assigns = make_vars(world)
    prob = pulp.LpProblem(name='football_clustering', sense=pulp.LpMinimize)

    print('Defining objective...')
    prob.objective = make_objective(assigns, distances)

    print('Adding constraints...')
    add_constraints(prob, world, positions, assigns)

    # print(prob)
    print('Solving...')
    world = solve(prob, world, assigns)

    dump_solution(world)
    plot_solution(world)
    plt.show()


if __name__ == '__main__':
    main()
Run Code Online (Sandbox Code Playgroud)

全局输出示例

...
Team summary:
position      continent                                   country      skill  defender  forward  goalkeep  midfield
team                                                                                                               
6.0              Africa                                   Algeria  44.963361         3        2         1         5
2.0              Africa                              Burkina Faso  45.802411         5        3         1         2
56.0             Africa                           Dem. Rep. Congo  47.018698         3        2         1         5
10.0             Africa                                  Ethiopia  40.294830         3        4         1         3
11.0             Africa                                  Ethiopia  43.277617         2        5         1         3
38.0             Africa                                Madagascar  59.876627         3        2         1         5
54.0             Africa                                   Morocco  55.331606         3        2         1         5
63.0             Africa                                   Nigeria  40.088676         4        3         1         3
55.0             Africa                              South Africa  55.269027         3        4         1         3
62.0             Africa                                    Uganda  47.396431         3        2         1         5
36.0             Africa                                    Uganda  48.577035         2        5         1         3
35.0             Africa  ['Burundi' 'Dem. Rep. Congo' 'Tanzania']  53.517437         3        4         1         3
20.0               Asia                                Bangladesh  57.757983         4        2         1         4
22.0               Asia                                Bangladesh  58.492236         5        3         1         2
3.0                Asia                                     China  40.099363         2        4         1         4
9.0                Asia                                     China  40.716526         3        2         1         5
34.0               Asia                                     China  41.488606         4        2         1         4
53.0               Asia                                     China  42.986541         4        4         1         2
7.0                Asia                                     China  43.353136         2        5         1         3
49.0               Asia                                     China  44.045816         3        2         1         5
1.0                Asia                                     China  45.908339         3        3         1         4
52.0               Asia                                     China  49.910424         5        3         1         2
40.0               Asia                                     China  50.852176         3        2         1         5
14.0               Asia                                     China  53.105643         3        2         1         5
15.0               Asia                                     China  53.245060         4        2         1         4
23.0               Asia                                     China  57.018255         4        2         1         4
45.0               Asia                                     China  58.395190         3        2         1         5
57.0               Asia                                     China  59.197112         2        3         1         5
26.0               Asia                                     China  59.516640         3        4         1         3
21.0               Asia                                     China  59.535480         4        3         1         3
44.0               Asia                                     India  41.307787         2        5         1         3
48.0               Asia                                     India  41.448672         2        5         1         3
37.0               Asia                                     India  41.713956         2        3         1         5
5.0                Asia                                     India  44.313757         2        5         1         3
19.0               Asia                                     India  47.688853         2        5         1         3
39.0               Asia                                     India  53.535032         4        4         1         2
43.0               Asia                                     India  54.314630         2        3         1         5
50.0               Asia                                     India  56.103479         2        5         1         3
0.0                Asia                                     India  56.155755         2        3         1         5
59.0               Asia                                     India  56.180263         4        4         1         2
17.0               Asia                                     India  58.247667         2        5         1         3
16.0               Asia                                     India  58.393727         2        4         1         4
31.0               Asia                                     India  58.884043         3        3         1         4
41.0               Asia                                     India  59.004900         2        3         1         5
12.0               Asia                                     India  59.838218         2        4         1         4
28.0               Asia                                 Indonesia  44.747853         4        4         1         2
27.0               Asia                                 Indonesia  53.665286         4        2         1         4
60.0               Asia                                     Japan  44.237347         4        2         1         4
4.0                Asia                                  Pakistan  41.385745         2        5         1         3
58.0               Asia                               Philippines  42.990101         2        3         1         5
42.0               Asia                              Saudi Arabia  42.746346         4        2         1         4
32.0               Asia                                    Turkey  49.816575         3        4         1         3
51.0               Asia                        ['China' 'Taiwan']  42.435033         2        4         1         4
47.0               Asia                 ['Pakistan' 'Tajikistan']  45.447561         4        2         1         4
25.0             Europe                                    France  53.116500         5        2         1         3
30.0             Europe                                   Germany  40.177276         3        5         1         2
13.0             Europe                                     Italy  46.469427         3        2         1         5
61.0             Europe                                    Poland  46.816901         3        5         1         2
33.0             Europe                                   Romania  58.854495         3        3         1         4
29.0      North America                                    Mexico  49.474287         2        5         1         3
46.0      North America                                    Mexico  55.503530         3        2         1         5
24.0      North America                  United States of America  40.777742         2        4         1         4
8.0       South America                                    Brazil  55.562492         5        2         1         3
18.0      South America                    ['Brazil' 'Venezuela']  53.190197         3        5         1         2
Run Code Online (Sandbox Code Playgroud)


Tho*_*ing 3

免责声明

抱歉,到目前为止,我还没有针对您的完整问题提供真正准确的解决方案(正如我们在评论中讨论的那样,该问题涉及许多因素/约束,并且似乎不容易一次性解决),但我想也许我们可以从一个最小的例子出发,看看能否找到一些线索。

最小子问题(仅具有其中一个约束)

如果我正确理解您的目标,您希望对玩家进行分组,使组建的团队具有接近的实力,并具有以下限制:

  1. 每队人数应相等,即 11 人。
  2. 每个团队都有其所需的所有角色(或技能);
  3. 团队的区域应该是连续的。

从优化的角度来看,添加更多约束确实可以在一定程度上加速问题的解决过程,因为可行变量集的大小缩小了,但同时也大大增加了制定优化问题的复杂性。(正如 @Reinderien 在评论中所纠正的那样“只有当约束处于优化之前的预求解步骤中时,这才是真正正确的。否则,通常情况相反。优化器需要更加努力地找到可行的解决方案,因为它们占据参数空间的较小部分。”

我的目的并不是在这个单一答案中完全解决您的问题,而是想从您的最小子问题中吸取一点点,并看看添加更多约束的可能性。

如果我们只关注目标加上约束1),则该问题可以转化为经典的整数规划问题:

给定一个数字向量vec和组数,如何将组中grpNr的条目均匀分配(是一个总是整除的整数),使得组的总和给出最小的差异。vecgrpNrgrpNrlength(vec)

问题的提出和解决

R 中有很多解决优化问题的选项,这里只是一种带有CVXR包的实现

library(CVXR)

# Input numeric vector where entries are to be grouped
set.seed(0)
vec <- runif(20, 0, 10)

# Given number of groups
grpNr <- 5

# Define optimization variables: Binary variables indicate
x <- Variable(length(vec), grpNr, boolean = TRUE)

# Define objective: minimize the difference in group sums
obj <- Minimize(norm(t(vec) %*% x - sum(vec) / grpNr))

# Constraint: each item must be in exactly one group
constrs <- list(
    sum_entries(x, 1) == 1,
    sum_entries(x, 2) == length(vec) / grpNr
)

# Formulate and solve the problem
prb <- Problem(obj, constrs)
res <- solve(prb)

# Info of resulting groups
grpInfo <- apply(round(res$getValue(x)) > 0, 2, which, simplify = FALSE)
Run Code Online (Sandbox Code Playgroud)

我们就可以得到分组信息

> grpInfo
[[1]]
[1]  1  3  8 11

[[2]]
[1]  7 12 13 19

[[3]]
[1]  6 10 14 18

[[4]]
[1]  4 16 17 20

[[5]]
[1]  2  5  9 15
Run Code Online (Sandbox Code Playgroud)

以及组总和的分布

> vec %*% round(res$getValue(x))
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 22.75283 22.72827 22.35437 22.20429 22.18618
Run Code Online (Sandbox Code Playgroud)

前进之路

说实话,我的能力受到我的知识的限制,所以我不知道我是否走在正确的道路上,或者我们下一步该走向何方……

应该指出的是,上面只是我第一次尝试最小的子问题,不幸的是,可扩展性可能很糟糕x(因为如果我们有更长vec或更大的 ,大小会爆炸grpNr)。

我将继续思考潜在的改进,也希望看到其他天才答案的贡献,特别是针对整个问题。:)