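Predicting Football Match Results

A Python machine learning model, built with the Scikit-learn library, that predicts the results of Russian Premier League (RPL) football matches.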

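Introduction


The article "Machine Learning: Predicting 2018 EPL Matches" inspired me to write this one. Our machine learning model is trained on RPL match statistics, starting from the 2015/2016 season, to predict the results of upcoming matches. The data comes from the football statistics site wyscout.com.
The code and data are available on GitHub.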

Data


We import the necessary libraries:

    import pandas as pd
    import numpy as np
    import collections

The match data is on GitHub:

    data = pd.read_csv("RPL.csv", encoding='cp1251', delimiter=';')
    data.head()

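[Image]
What do xG and PPDA mean?
xG (expected goals) is a model of expected goals. It is based on shot metrics: taking into account all the shots a team took, we can estimate how many goals the team should really have scored.
PPDA (passes allowed per defensive action) is a football metric that measures pressing intensity in a match. The lower the PPDA, the more intense the pressing in defense.
PPDA = number of passes made by the attacking team / number of defensive actions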
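To make the two metrics concrete, here is a minimal sketch. It assumes that some model has already assigned a scoring probability to every shot, so that xG is simply the sum of those probabilities; the function names and all numbers are hypothetical and are not taken from the wyscout.com data.

```python
def expected_goals(shot_probabilities):
    """Sum per-shot scoring probabilities into a team xG value."""
    return sum(shot_probabilities)


def ppda(opponent_passes, defensive_actions):
    """Passes allowed per defensive action; lower means more intense pressing."""
    if defensive_actions == 0:
        raise ValueError("defensive_actions must be positive")
    return opponent_passes / defensive_actions


# Hypothetical numbers for a single match.
shots = [0.5, 0.25, 0.25]          # assumed scoring probability of each shot
match_xg = expected_goals(shots)   # three shots worth one expected goal in total
match_ppda = ppda(opponent_passes=300, defensive_actions=40)
print(match_xg, match_ppda)
```

In this made-up match the team allows 7.5 opponent passes per defensive action.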


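We will predict the results of matches in the second part of the 2018/2019 season (that is, the matches played in 2019). The list of teams taking part this season (excluding Arsenal, Orenburg, Dynamo, Krylia Sovetov and Yenisey, because they have no statistics for previous seasons or too few):

    RPL_2018_2019 = pd.read_csv('Team Name 2018 2019.csv', encoding='cp1251')
    teamList = RPL_2018_2019['Team Name'].tolist()
    teamList

[Image]

We remove the matches involving teams that do not take part in the 2018/2019 season:

    deleteTeam = [x for x in pd.unique(data['']) if x not in teamList]
    for name in deleteTeam:
        # Drop every match in which the excluded club appears on either side
        data = data[data[''] != name]
        data = data[data[''] != name]
    data = data.reset_index(drop=True)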
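The same filtering can also be written without a loop. The sketch below is only an illustration on toy data: the column names "team" and "opponent" are hypothetical stand-ins, not the names used in RPL.csv.

```python
import pandas as pd

# Toy data; the column names "team" and "opponent" are hypothetical.
matches = pd.DataFrame({
    "team": ["A", "B", "C", "A"],
    "opponent": ["B", "C", "A", "D"],
})
team_list = ["A", "B"]

# Keep only the matches where both sides are in team_list.
mask = matches["team"].isin(team_list) & matches["opponent"].isin(team_list)
filtered = matches[mask].reset_index(drop=True)
print(filtered)
```

Only the A vs B match survives, because every other row involves a club outside team_list.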

This function returns a team's statistics for one season:

    def GetSeasonTeamStat(team, season):
        goalScored = 0     # goals scored
        goalAllowed = 0    # goals conceded
        gameWin = 0        # wins
        gameDraw = 0       # draws
        gameLost = 0       # losses
        totalScore = 0     # points earned
        matches = 0        # matches played
        xG = 0             # expected goals
        shot = 0           # shots
        shotOnTarget = 0   # shots on target
        cross = 0          # crosses
        accurateCross = 0  # accurate crosses
        totalHandle = 0    # total ball possession
        averageHandle = 0  # average ball possession per match
        Pass = 0           # passes
        accuratePass = 0   # accurate passes
        PPDA = 0           # PPDA summed over the matches

        for i in range(len(data)):
            if (((data[''][i] == season) and (data[''][i] == team) and (data[''][i] == 2)) or
                    ((data[''][i] == season-1) and (data[''][i] == team) and (data[''][i] == 1))):
                matches += 1
                goalScored += data[''][i]
                goalAllowed += data[''][i]
                if (data[''][i] > data[''][i]):
                    totalScore += 3  # a win is worth 3 points
                    gameWin += 1
                elif (data[''][i] < data[''][i]):
                    gameLost += 1
                else:
                    totalScore += 1  # a draw is worth 1 point
                    gameDraw += 1
                xG += data['xG'][i]
                shot += data[''][i]
                shotOnTarget += data['  '][i]
                Pass += data[''][i]
                accuratePass += data[' '][i]
                totalHandle += data[''][i]
                cross += data[''][i]
                accurateCross += data[' '][i]
                PPDA += data['PPDA'][i]

        averageHandle = round(totalHandle/matches, 3)  # average over the matches played

        return [gameWin, gameDraw, gameLost,
                goalScored, goalAllowed, totalScore,
                round(xG, 3), round(PPDA, 3),
                shot, shotOnTarget,
                Pass, accuratePass,
                cross, accurateCross,
                round(averageHandle, 3)]
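The win, draw, loss and points bookkeeping inside the function can be checked on a tiny example. This is a sketch on invented scores with hypothetical column names "goals_for" and "goals_against"; it uses vectorized pandas comparisons instead of a row-by-row loop.

```python
import pandas as pd

# Four invented matches of one team; the column names are hypothetical.
games = pd.DataFrame({
    "goals_for": [2, 1, 0, 3],
    "goals_against": [1, 1, 2, 0],
})

wins = int((games["goals_for"] > games["goals_against"]).sum())
draws = int((games["goals_for"] == games["goals_against"]).sum())
losses = int((games["goals_for"] < games["goals_against"]).sum())

# 3 points for a win, 1 for a draw, 0 for a loss.
points = 3 * wins + draws
goals_scored = int(games["goals_for"].sum())
goals_conceded = int(games["goals_against"].sum())
print(wins, draws, losses, points, goals_scored, goals_conceded)
```

For these four matches the team has two wins, one draw and one loss, which gives 7 points.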

Example of using the function:

    GetSeasonTeamStat("", 2018)  # statistics for the 2017/2018 season

[Image]

To make the output easier to read, we can add the following code:

    returnNames = ["", "", "", "\n ", " ", "\n ", "\nxG ( )", "PPDA ( )", "\n", "  ", "\n", " ", "\n", " ", "\n (   )"]
    for i, n in zip(returnNames, GetSeasonTeamStat("", 2018)):
        print(i, n)

[Image]

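Why do our statistics differ from the real ones?
Spartak's actual statistics for the 2017/2018 season:

[Image]

The statistics differ because we leave out the matches against teams that do not play in the RPL in the 2018/2019 season. That is, matches such as Spartak vs SKA or Spartak vs Tosno are not counted.

This function returns the statistics of all teams for a season:

    def GetSeasonAllTeamStat(season):
        annual = collections.defaultdict(list)
        for team in teamList:
            team_vector = GetSeasonTeamStat(team, season)
            annual[team] = team_vector
        return annual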

Model training


We write a function that returns the training data. It builds a dictionary with the vectors of all teams for each season. For every match, the function computes the difference between the two teams' vectors for that season and writes it to xTrain. It then sets yTrain to 1 if the home team won and to 0 otherwise.

    def GetTrainingData(seasons):
        totalNumGames = 0
        for season in seasons:
            annual = data[data[''] == season]
            totalNumGames += len(annual.index)
        numFeatures = len(GetSeasonTeamStat('', 2016))  # number of features in a team vector
        xTrain = np.zeros((totalNumGames, numFeatures))
        yTrain = np.zeros((totalNumGames))
        indexCounter = 0
        for season in seasons:
            team_vectors = GetSeasonAllTeamStat(season)
            annual = data[data[''] == season]
            numGamesInYear = len(annual.index)
            xTrainAnnual = np.zeros((numGamesInYear, numFeatures))
            yTrainAnnual = np.zeros((numGamesInYear))
            counter = 0
            for index, row in annual.iterrows():
                team = row['']
                t_vector = team_vectors[team]
                rivals = row['']
                r_vector = team_vectors[rivals]
                # Feature vector of a match: team stats minus opponent stats
                diff = [a - b for a, b in zip(t_vector, r_vector)]
                if len(diff) != 0:
                    xTrainAnnual[counter] = diff
                if team == row['']:
                    yTrainAnnual[counter] = 1
                else:
                    yTrainAnnual[counter] = 0
                counter += 1
            xTrain[indexCounter:numGamesInYear+indexCounter] = xTrainAnnual
            yTrain[indexCounter:numGamesInYear+indexCounter] = yTrainAnnual
            indexCounter += numGamesInYear
        return xTrain, yTrain
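The core idea of the function, one row of feature differences per match and a 1/0 label, can be illustrated on toy data. In this sketch the team vectors, the three features and the match list are all invented; a draw is marked with None.

```python
import numpy as np

# Hypothetical season vectors: [wins, goals scored, goals conceded].
team_vectors = {
    "A": [10, 30, 12],
    "B": [6, 20, 18],
}
# Hypothetical matches: (team, opponent, winner); None marks a draw.
games = [("A", "B", "A"), ("B", "A", None)]

# One row per match: the team's vector minus the opponent's vector.
x_train = np.array(
    [np.subtract(team_vectors[team], team_vectors[opponent])
     for team, opponent, _ in games],
    dtype=float,
)
# Label is 1 if the first team won and 0 otherwise (loss or draw).
y_train = np.array([1.0 if winner == team else 0.0 for team, _, winner in games])
print(x_train)
print(y_train)
```

Note that a draw and a loss get the same label 0, which is worth keeping in mind when interpreting the model's output.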

We get the training data for all seasons from 2015/2016 to 2018/2019.

    years = range(2016, 2019)  # yields 2016, 2017 and 2018
    xTrain, yTrain = GetTrainingData(years)

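To predict the probability of a win, we use the LinearRegression algorithm from the Scikit-Learn library.

    from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    model.fit(xTrain, yTrain)

We write a function that returns a prediction. The returned value is usually between 0 and 1, where 0 means a loss and 1 means a win; since linear regression does not bound its output, values slightly outside this range are possible.

    def createGamePrediction(team1_vector, team2_vector):
        diff = [[a - b for a, b in zip(team1_vector, team2_vector)]]
        predictions = model.predict(diff)
        return predictions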
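If a value that is guaranteed to lie between 0 and 1 is needed, one possible alternative is LogisticRegression from the same library. This is a different technique from the one used in this article, and the sketch below runs on synthetic data only, so it says nothing about how such a model would perform on RPL matches.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic feature differences, not real match data.
x_demo = rng.normal(size=(200, 3))
# Synthetic labels: the first feature (plus noise) decides the outcome.
y_demo = (x_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression()
clf.fit(x_demo, y_demo)

# predict_proba returns one column per class; column 1 is P(label == 1).
win_probability = clf.predict_proba(x_demo[:5])[:, 1]
print(win_probability)
```

Unlike the output of LinearRegression, these values are proper probabilities by construction.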

Results


As an example, let's look at the algorithm's prediction for the Zenit vs Spartak match:

    team1_name = ""
    team2_name = ""
    team1_vector = GetSeasonTeamStat(team1_name, 2019)
    team2_vector = GetSeasonTeamStat(team2_name, 2019)
    print(',   ' + team1_name + ':', createGamePrediction(team1_vector, team2_vector))
    print(',   ' + team2_name + ':', createGamePrediction(team2_vector, team1_vector))

[Image]

It turns out that Zenit's probability of winning the Zenit vs Spartak match is 47% (March 17, 2019: Spartak 1-1 Zenit).

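I suggest reading the predictions as follows:
below 40%: the team will not win (loss or draw)
from 40% to 60%: a draw is highly likely
from 60%: the team will not lose (win or draw)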
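These thresholds are easy to wrap in a small helper. The function name is hypothetical, and the handling of the exact boundaries (0.4 counts as a likely draw, 0.6 as "will not lose") is my assumption, since the rule above does not spell it out.

```python
def interpret_prediction(p):
    """Map a model output to the three outcome bands described above."""
    if p < 0.4:
        return "will not win (loss or draw)"
    if p < 0.6:
        return "draw is highly likely"
    return "will not lose (win or draw)"


# Zenit's predicted value in the Zenit vs Spartak example was 0.47.
print(interpret_prediction(0.47))
```

For the Zenit vs Spartak example this lands in the draw band, which matches the actual 1-1 result.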

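Let's output the predictions for CSKA against all the other clubs:

    for team_name in teamList:
        team1_name = ""
        team2_name = team_name
        if team1_name != team2_name:
            team1_vector = GetSeasonTeamStat(team1_name, 2019)
            team2_vector = GetSeasonTeamStat(team2_name, 2019)
            print(team1_name, createGamePrediction(team1_vector, team2_vector),
                  " - ",
                  team2_name, createGamePrediction(team2_vector, team1_vector))

[Image]

The algorithm gave correct predictions for almost all the matches that did not end in a draw. The only inaccurate prediction was CSKA Moscow vs Zenit. CSKA's predicted value was higher by just 0.001, so one could assume that the teams are evenly matched and would draw, but in the end Zenit won 3-1.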

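Conclusion


Our algorithm is quite primitive. It only takes match statistics into account (and only 15 basic parameters at that), while the outcome of a football match depends on many factors. Even the condition of the pitch or the weather can affect the result.

Next, I want to increase the number of features, create a test set, try different algorithms, tune the model and get the most accurate predictions possible.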
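As a rough idea of what creating a test set could look like, here is a sketch with train_test_split from Scikit-learn. It uses synthetic data and a hypothetical 0.5 cut-off for calling a win, so the printed accuracy is not a result for RPL matches.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic feature differences and win/no-win labels, not real match data.
x_all = rng.normal(size=(300, 4))
y_all = (x_all[:, 0] - x_all[:, 1] + 0.5 * rng.normal(size=300) > 0).astype(float)

# Hold out 25% of the matches; the model never sees them during fitting.
x_fit, x_test, y_fit, y_test = train_test_split(
    x_all, y_all, test_size=0.25, random_state=0
)

demo_model = LinearRegression().fit(x_fit, y_fit)

# Assumed decision rule: call it a win when the output is at least 0.5.
predicted_win = demo_model.predict(x_test) >= 0.5
accuracy = float(np.mean(predicted_win == (y_test == 1.0)))
print(accuracy)
```

Measuring accuracy on held-out matches gives a fairer picture than checking predictions on the same data the model was fitted on.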

I would appreciate it if you shared your thoughts and comments.

Source: https://habr.com/ru/post/zh-CN456226/

