Clustering seasonal performances of soccer teams based on situational score line

In this research, the basic pattern of seasonal performances of soccer teams is investigated. We propose a clustering method to reveal the seasonal performance. In the proposed method, a new performance indicator called situational score line is used as a feature describing the seasonal performance. It consists of score line, opponent rating, and away rating. Using k-means, the features are clustered into four clusters. Cluster 1, which has a pattern of decreasing performance, is the basic pattern of Italian Serie A and German Bundesliga. Cluster 2 has a stable performance, which is mostly shown in English Premier League, Italian Serie A, and Spanish La Liga. Cluster 3 has the highest competitiveness and is one of the most common patterns in French Ligue 1 and Spanish La Liga. Finally, Cluster 4, which has a rising performance, is the basic pattern of the English Premier League.


Introduction
In soccer, many factors prevent the performance of a team from remaining stable during a season. The variation in performance is due to players' conditions, such as fatigue during some matches [1], endurance [2,3], or injuries [4]. The performance of numerous teams dropped because their star players could not play. This is a common issue, especially for teams that depend too much on particular players. Factors affecting those variations are considered seasonal. Thus, there is a pattern of team performances which might be repeated in other seasons.
Researchers have suggested several performance indicators for soccer teams. A performance indicator is a selection, or combination, of action variables that aims to define some or all aspects of a performance [5]. Some of these action variables are total shots and shots on target [6], passing patterns [7], ball possession [8,9], and ball recovery patterns [10]. Apart from that, the score line, i.e., whether a team wins, draws, or loses, is seen as the ultimate performance indicator [11]. However, the same score line can carry a different meaning in different matches, depending on the situation. For example, winning a match against a tough team is valued more highly than winning against a team in last place. Therefore, the quality of the opponent is regarded as a crucial variable determining the performance indicator [12]. Furthermore, home-field advantage is also reckoned as one of the situational variables [13,14]. It is well known that, when playing against the same opponent, winning the away match reflects a better result than winning the home match. Hence, a new performance indicator is proposed by incorporating the quality of the opponent and home-field advantage into the score line.
In this work, patterns are extracted from the seasonal variation of team performance by clustering the proposed indicator. The aims are to discover and analyze the basic patterns of seasonal team performance.

Materials and Methods
Soccer match results are collected from the five biggest leagues in Europe, namely La Liga (Spain), Premier League (England), Bundesliga (Germany), Serie A (Italy), and Ligue 1 (France), during 1993-2014 [15]. For each league, only the winning team in a season is selected as a representative. In total, there were 110 seasonal team performances. The winning teams here are selected based on their performance on the field. Hence, even though the Serie A title of 2004/2005 was awarded to none and that of 2005/2006 was awarded to Inter (due to the Calciopoli scandal [16]), Juventus is still picked in this work because they earned the most points at the end of those seasons.
This section includes two parts. In the first part, the calculation of the seasonal team performance indicator is explained. Using the indicator, the feature vectors can be defined. In the second part, the seasonal team performances are clustered to reveal the basic pattern.

Situational Score Line
Here, a new seasonal performance indicator is proposed. This indicator consists of three measurements, namely score line, away rating, and opponent rating.

Score line
The score line is the outcome of a match played by a particular team, scored with the standard soccer points rule. The score line $s(t_i)$ at matchday $t_i$ is defined as

$$s(t_i) = \begin{cases} 3, & \text{if the team won at matchday } t_i, \\ 1, & \text{if the team drew at matchday } t_i, \\ 0, & \text{if the team lost at matchday } t_i. \end{cases} \qquad (1)$$

Away Rating
Home-field advantage gives a huge benefit to a team. Therefore, in order to measure a team's performance, we should look at its performance in the away matches. The away rating measures the performance of a team in the away matches. It is defined as the total points earned in the away matches divided by the maximum total of away points. Thus, the away rating $a$ is calculated as

$$a = \frac{\sum_{i=1}^{N} h(t_i)\, s(t_i)}{3 \sum_{i=1}^{N} h(t_i)},$$

where $s(t_i)$ is the score line at matchday $t_i$,

$$h(t_i) = \begin{cases} 1, & \text{if away match at matchday } t_i, \\ 0, & \text{if home match at matchday } t_i, \end{cases}$$

and $N$ is the number of matches played by the team in a season.
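As a minimal sketch of the two definitions above, the following Python snippet maps match results to the score line and computes the away rating as away points divided by three times the number of away matches. The function names, the `'W'`/`'D'`/`'L'` encoding, and the toy season are illustrative assumptions, not part of the original method description.

```python
def score_line(result):
    """Map a match result ('W', 'D', 'L') to league points (3, 1, 0)."""
    return {"W": 3, "D": 1, "L": 0}[result]


def away_rating(results, is_away):
    """Points earned in away matches divided by the maximum away points.

    results: list of 'W'/'D'/'L', one entry per matchday (assumed encoding).
    is_away: list of 1 (away match) or 0 (home match), same length.
    """
    away_points = sum(score_line(r) for r, h in zip(results, is_away) if h)
    n_away = sum(1 for h in is_away if h)
    if n_away == 0:
        raise ValueError("no away matches in the season")
    return away_points / (3 * n_away)


# Hypothetical six-match season, for illustration only.
results = ["W", "D", "L", "W", "W", "D"]
is_away = [0, 1, 0, 1, 1, 0]
rating = away_rating(results, is_away)
print(rating)
```

In this toy season the team earns 7 of the 9 points available away from home, so the rating is 7/9.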

Opponent Rating
The opponent rating measures the quality of the opponent team in a particular match. It is calculated by dividing the total points earned by the opponent team at the end of the season by the maximum total points that can be earned. The opponent rating $o(t_i)$ at matchday $t_i$ is therefore

$$o(t_i) = \frac{P_{\mathrm{opp}}(t_i)}{3N},$$

where $P_{\mathrm{opp}}(t_i)$ is the end-of-season point total of the opponent faced at matchday $t_i$ and $N$ is the number of matches in the season.

The three measurements, score line, away rating, and opponent rating, are then combined into a single seasonal performance indicator called the situational score line, $\mathrm{ssl}(t_i)$. The comparison of the performance indicator based on the score line $s(t_i)$ and the situational score line $\mathrm{ssl}(t_i)$ is shown in Fig. 1. It can be seen that $\mathrm{ssl}(t_i)$ has more peaks and more distinct performance values than $s(t_i)$. It means that more changes in the performance can be observed by using $\mathrm{ssl}(t_i)$.
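The opponent rating described above can be sketched in a few lines of Python. The function name and the point totals below are hypothetical and serve only to show that a stronger opponent receives a rating closer to 1.

```python
def opponent_rating(opponent_final_points, n_matches):
    """End-of-season points of the opponent divided by the maximum (3 per match)."""
    return opponent_final_points / (3 * n_matches)


# Hypothetical opponents in a 38-match season.
rating_strong = opponent_rating(86, 38)   # a title contender
rating_weak = opponent_rating(30, 38)     # a relegation candidate
print(rating_strong, rating_weak)
```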
It is known that each league has a different performance scale (UEFA country coefficient). Thus, to bring the indicator onto the same scale, it was normalized with respect to this scale using the method discussed in [17]. Furthermore, the number of matches in a season is not the same for all leagues and seasons, e.g., every Bundesliga team had to play 42 matches during 1995-1996 while every Serie A team only had to play 34 matches in each season within 1993-2003. Thus, the number of matches also needs to be normalized. Here, length normalization was performed with the same method as [18], i.e., by connecting each pair of adjacent points with a straight line.
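Connecting adjacent points with straight lines amounts to piecewise-linear interpolation, so the length normalization can be sketched with `numpy.interp`. This is only an illustration under assumptions: the target length of 38 and the rescaling of every season to the interval [0, 1] are choices made here, and the details of the method in [18] may differ.

```python
import numpy as np


def normalize_length(values, target_length):
    """Resample a per-matchday series to target_length points.

    Adjacent points are joined by straight lines (piecewise-linear
    interpolation) and the result is read off on a new, evenly spaced grid.
    """
    values = np.asarray(values, dtype=float)
    old_t = np.linspace(0.0, 1.0, len(values))    # original matchdays on [0, 1]
    new_t = np.linspace(0.0, 1.0, target_length)  # common grid for all seasons
    return np.interp(new_t, old_t, values)


# Hypothetical 34-match season resampled to 38 points (assumed target length).
season_34 = np.arange(34, dtype=float)
season_38 = normalize_length(season_34, 38)
print(season_38.shape)
```

The first and last values of the season are preserved, and a series that is already linear stays linear after resampling.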
Let $f(t)$ be the result of the scale and length normalization of the situational score line. Then, in order to reveal the pattern of team performance as well as to reduce the fluctuation, it is assumed that $f(t)$ is an even periodic function of time. Afterwards, $f(t)$ is expanded into a Fourier series; because $f(t)$ is even, the expansion contains only cosine terms, with coefficients $a_k$.
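A sketch of how such cosine coefficients could be computed numerically is given below. It assumes, purely for illustration, that the season is rescaled to $0 \le t \le 1$ and extended as an even function of period 2, so that $f(t) \approx a_0/2 + \sum_{k \ge 1} a_k \cos(k \pi t)$ with $a_k = 2 \int_0^1 f(t) \cos(k \pi t)\, dt$. The function names, the number of retained coefficients, and the trapezoidal integration are assumptions, not details taken from the paper.

```python
import numpy as np


def cosine_coefficients(samples, n_coeffs):
    """Approximate a_0 .. a_{n_coeffs-1} of the half-range cosine series."""
    f = np.asarray(samples, dtype=float)
    t = np.linspace(0.0, 1.0, len(f))   # season rescaled to [0, 1] (assumption)
    dt = t[1] - t[0]
    coeffs = []
    for k in range(n_coeffs):
        g = f * np.cos(k * np.pi * t)
        integral = dt * (g.sum() - 0.5 * (g[0] + g[-1]))  # trapezoidal rule
        coeffs.append(2.0 * integral)
    return np.array(coeffs)


def reconstruct(coeffs, t):
    """Evaluate the truncated cosine series a_0/2 + sum_k a_k cos(k pi t)."""
    t = np.asarray(t, dtype=float)
    out = np.full_like(t, coeffs[0] / 2.0)
    for k in range(1, len(coeffs)):
        out += coeffs[k] * np.cos(k * np.pi * t)
    return out


# Synthetic "season": constant level 1 plus one slow oscillation.
t_demo = np.linspace(0.0, 1.0, 201)
f_demo = 1.0 + 0.5 * np.cos(2 * np.pi * t_demo)
coeffs_demo = cosine_coefficients(f_demo, 4)
print(np.round(coeffs_demo, 6))
```

Keeping only the first few coefficients smooths out match-to-match fluctuation, which is the stated purpose of the expansion.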

Clustering Seasonal Teams' Performance
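The basic pattern of the seasonal performance is revealed by clustering the features of performance. Let $X = \{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \ldots, \mathbf{x}_P\}$ be a set of feature vectors, where $P$ is the total number of data points. Each feature vector $\mathbf{x}_p$ is a column vector whose entries are the Fourier coefficients $a_k$ of one seasonal performance; these coefficients were considered as features reflecting the team performance in a season. Then, one of the most widely used clustering methods, k-means, is employed. The goal of k-means is to minimize the within-cluster distance $J$ defined as

$$J = \sum_{k=1}^{K} \sum_{\mathbf{x}_p \in C_k} d(\mathbf{x}_p, \mathbf{c}_k)^2 \qquad (8)$$

where $\mathbf{c}_k$, $K$, and $C_k$ are the cluster center, the number of clusters, and the set of data points that belong to cluster $k$, respectively, and $d(\mathbf{x}_p, \mathbf{c}_k)$ is the Euclidean distance between $\mathbf{x}_p$ and $\mathbf{c}_k$. To solve the clustering problem in (8), Lloyd's algorithm is used as follows.
1. Initialize the cluster centers $\mathbf{c}_1, \ldots, \mathbf{c}_K$.
2. Assign each data point $\mathbf{x}_p$ to the cluster $C_k$ whose center $\mathbf{c}_k$ is nearest.
3. Update each cluster center as the mean of its members, $\mathbf{c}_k = \frac{1}{|C_k|} \sum_{\mathbf{x}_p \in C_k} \mathbf{x}_p$, where $|C_k|$ is the number of elements in $C_k$.
4. Repeat steps 2 and 3 until convergence is obtained.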
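The assign-and-update loop of Lloyd's algorithm can be sketched as follows. This is a generic illustration, not the authors' implementation: the function name, the convergence test via `np.allclose`, the handling of empty clusters, and the toy data are all assumptions.

```python
import numpy as np


def lloyd(X, centers, max_iter=100):
    """Run Lloyd's algorithm from the given initial centers."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(max_iter):
        # Assignment step: squared Euclidean distance to every center.
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist2.argmin(axis=1)
        # Update step: each center becomes the mean of its members.
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = X[labels == k]
            if len(members) > 0:          # keep the old center if a cluster is empty
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break                          # centers stopped moving
        centers = new_centers
    return centers, labels


# Toy data: two well-separated groups of two points each.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
final_centers, final_labels = lloyd(X_toy, [[0.0, 0.0], [10.0, 10.0]])
print(final_centers)
print(final_labels)
```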
In Lloyd's algorithm, there are two parameters that have to be determined in advance. They are the initial cluster centers and the number of clusters. It is known that both of them affect the results of clustering. To set the initial cluster centers, instead of using random values, a method called k-means++ can be employed [19]. The algorithm is as follows.
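1. Choose a cluster center $\mathbf{c}_1$ at random from the set of features $X$.
2. Compute $D(\mathbf{x}_p)^2$, the squared distance between data point $\mathbf{x}_p$ and the nearest cluster center already chosen.
3. Choose the next cluster center from $X$, picking $\mathbf{x}_p$ with probability proportional to $D(\mathbf{x}_p)^2$.
4. Repeat steps 2 and 3 until $K$ cluster centers have been chosen.

Furthermore, in order to determine the number of clusters ($K$), a method involving cluster validity is used [20]. Cluster validity is described as the ratio between the average within-cluster distance, $\mathrm{intra}$ (the within-cluster distance in (8) divided by the number of data points), and the minimum distance between cluster centers, $\mathrm{inter}$:

$$\mathrm{validity} = \frac{\mathrm{intra}}{\mathrm{inter}} \qquad (9)$$

where

$$\mathrm{inter} = \min \left( \lVert \mathbf{c}_k - \mathbf{c}_l \rVert^2 \right), \quad k = 1, 2, \ldots, K-1, \quad l = k+1, \ldots, K.$$

The k-means clustering was computed repeatedly with different numbers of clusters. The appropriate number of clusters ($K$) is determined using the Elbow method [21], employing cluster validity as the cost function.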
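The seeding procedure and the validity ratio described above can be sketched as below. The code follows the standard k-means++ description and takes intra as the mean squared distance to the assigned center; the function names, the use of NumPy's `Generator`, and the toy data are assumptions made for illustration.

```python
import numpy as np


def kmeans_pp_init(X, n_clusters, rng):
    """Pick initial centers: the first uniformly, the rest with probability ~ D^2."""
    X = np.asarray(X, dtype=float)
    centers = [X[rng.integers(len(X))]]
    while len(centers) < n_clusters:
        diff = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diff ** 2).sum(axis=2).min(axis=1)   # squared distance to nearest center
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)


def validity(X, centers, labels):
    """Ratio intra / inter; smaller values mean compact, well-separated clusters."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    intra = ((X - centers[labels]) ** 2).sum() / len(X)
    n = len(centers)
    inter = min(((centers[k] - centers[l]) ** 2).sum()
                for k in range(n - 1) for l in range(k + 1, n))
    return intra / inter


# Toy data: two well-separated groups of two points each.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
seeds = kmeans_pp_init(X_demo, 2, np.random.default_rng(0))
centers_demo = np.array([[0.0, 0.5], [10.0, 10.5]])
labels_demo = np.array([0, 0, 1, 1])
v_demo = validity(X_demo, centers_demo, labels_demo)
print(seeds)
print(v_demo)
```

To apply the Elbow method, one would run the clustering for a range of candidate $K$ values, record the validity for each, and look for the point where further increases in $K$ stop improving it noticeably.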

Results and Discussion
The clustering process results in four clusters (the complete results are shown in Table 2). In this work, the cluster centers are considered as the basic patterns of seasonal performance. Furthermore, cluster results for each league are presented to show the difference in pattern between leagues.

Seasonal Performance Patterns
Fig. 3 shows the seasonal performance of each cluster, represented by its center. It can be observed that all patterns have up and down periods. No team could maintain top performance throughout the season. Moreover, all basic patterns dropped at the end of the season. This is to be expected, since most of the teams, 82 out of 110, were also competing in European competitions until the second half of the season. Dealing with more matches against many tough opponents from different countries made the teams struggle harder in the latter half.
The basic pattern of cluster 1 is shown in Fig. 3(a). The team performance in this cluster kept dropping until the end of the season. The teams only had a slight improvement in the middle of the season. Then, after the mid-season break, they managed to bring a different performance. As a result, at the beginning of the second half, the performance rose. However, in the end, it still dropped. Price et al. mentioned that rates of injury increase after the mid-season break [22]. Obviously, a team's performance will be influenced by its players' injuries. Apart from that, even though the teams had a significant decrease in performance, in the end they still managed to win the league. Therefore, this cluster shows the dominance of the winners.
Cluster 2, shown in Fig. 3(b), tends to have a stable performance. From the beginning until about 1.5 months before the season ended, there was no significant decrease in performance. It means that the teams started to decline only when they had approximately 5 matches left to play. Such performance is expected because 36% of the teams in this cluster reached at least the semifinals of a European competition (26%, 18%, and 30% in clusters 1, 3, and 4, respectively). Moreover, the average point margin over the second-placed team in the corresponding league shows dominance. The margin in this cluster is 9.36 points (6.6, 4.32, and 5.26 in clusters 1, 3, and 4, respectively). Therefore, teams in this cluster are considered dominant in their league.
In cluster 3, shown in Fig. 3(c), the performance dropped significantly during the first half of the season. However, after the mid-season break, it rose sharply until the last quarter of the season, and finally dropped again. Similar to the trend of cluster 1, the turning point was in the middle of the season. Furthermore, among all four clusters, teams in this cluster had the smallest winning margin, that is, 4.32 points. It can be concluded that this cluster has the highest level of competitiveness.
The basic pattern of cluster 4 is shown in Fig. 3(d). Generally, this cluster has a slightly rising performance, as the performance at the end of the season is higher than at the beginning. In this respect, it is distinct from the other three. This cluster also has a repeating pattern, which occurs three times. The first turning point, around September, coincides with the start of the European competition group stage, in which the teams were taking part. The second one is around February, when the knockout phase starts. Hence, the teams in this cluster were heavily affected by the European competition.

Cluster Results in Each League
The clustering result, displaying the corresponding cluster number in each league, is shown in Table 1. It is known that Serie A of Italy was once recognized as the best football league in the world. However, its quality started dropping from 2000 due to financial problems [23]. Teams could not afford to pay their star players. Thus, many star players left Italy and moved to clubs in other countries. As a result, it was tough for Italian teams to compete consistently in European competitions. Furthermore, it also affected the performance of the teams. In the Serie A column of Table 1, it is shown that after 2000 there were changes in the performance pattern. The winners tended to have performances similar to those of clusters 1 and 2. The winners that belong to cluster 1 became UEFA Champions League (UCL) finalists three times (Juventus 2002/2003, Inter 2009/2010, and Juventus 2014/2015). This means that for these teams the decreasing performance in the league did not affect the performance in the UCL. On the other hand, among the teams that belong to cluster 2, only Milan in 2003/2004 reached the quarter-finals of the UCL. Thus, the teams in cluster 2 were only dominating the league but not the European competition.
In the English Premier League, most of the team performances are grouped in clusters 2 and 4. However, after 2003/2004, the pattern of cluster 2 appeared more frequently. Even though cluster 2 was the major pattern, the teams' most successful period belonged to cluster 4, when Manchester United played in the UCL final two times in a row (2007/2008 and 2008/2009).
In the Spanish La Liga, Barcelona and Real Madrid are the two most successful teams. Since 1993, over the last 22 seasons, Barcelona has won the league 10 times and Real Madrid 7 times. Moreover, with the exceptions of 1995/1996 and 2001/2002, at least one of them finished the league in the top two. It can be inferred that both rule the league. The results in Table 1 align with these facts. Most performance patterns belong to cluster 2, as we expect from a league with such dominating teams. The second most common pattern is cluster 3. In this cluster, other teams also had the quality to oppose those two big teams, such as Deportivo La Coruna, Athletic Bilbao, Valencia, and Atletico Madrid. Thus, the league was more competitive when other teams became real contenders to Barcelona and Real Madrid.
In the German Bundesliga, we can see that before 2007/2008 the performance trends used to belong to cluster 1, which shows decreasing performance throughout the season. However, after that, the performance patterns became stable, with four seasons classified in cluster 2. In those four seasons, the winners were leading by large margins, on average 15.25 points ahead of second place. For the last three years, Bayern Munich has been winning by an average margin of 18 points. With these trends, we can expect that in the following seasons the winners of the Bundesliga will still follow the performance trends of cluster 2.
In the French Ligue 1, the performance patterns of the winning teams mostly belong to cluster 3. It means that the competition here was actually tight. From 2001/2002 until 2007/2008, Lyon made history by winning the league seven times consecutively. In that period, the competition was quite tight in 2001/2002, 2002/2003, and 2007/2008, while in the other four seasons Lyon dominated the league. After that, a single team dominated Ligue 1 again starting in 2012/2013, when PSG won the league 3 times in a row.

Conclusion
Four categories of soccer performance were derived from clustering the seasonal performances of teams in Europe's five biggest leagues. Cluster 1, which has a pattern of decreasing performance, is the basic pattern of the Italian Serie A and German Bundesliga. Cluster 2 has a stable performance. It is mostly shown in the English Premier League, Italian Serie A, and Spanish La Liga. Cluster 3 has the highest competitiveness and is one of the most common patterns in French Ligue 1 and Spanish La Liga. Finally, Cluster 4, which has a rising performance, is the basic pattern of the English Premier League.