Pollsposition
Bayesian statistics and French politics


In this document we focus on opinion polls in the context of electoral campaigns and the information we can extract from them.

Every pre-election poll tries to answer the question:

If the election were to happen now, who would people be voting for?

While the media and people want an answer to the question:

Who will people be voting for on election day?

These are two very different questions, as we will see in the following document. In fact, pollsters cannot even answer the first question with great accuracy! The following book is an exploration of how electoral polling works, the questions it tries to answer, and how statistical modeling can help us improve the answers, and even answer questions pollsters are not equipped to answer! We will unroll the whole machinery and go from a single poll at the beginning of the political campaign to the results on election day.

This book was born while working on a forecasting model for the French presidential elections. We noticed that even a technical audience had difficulties grasping the limitations of polls and the necessity of statistical modeling. With this book we wish to empower this technical audience with the concepts and the tools necessary to make better use of the polls than what is currently done in the mainstream media, with the hope that they will themselves develop tools and write books that reach a wider audience and improve the public debate around polls.

You will commonly hear the criticism that all polls are bad and that you should thus never trust polls. While it is true that polls are doomed to be wrong all the time, we will show that as long as they make uncorrelated mistakes we can still learn something thanks to the multitude of pollsters. In the same spirit, we will argue that our forecast is probably wrong too, but that one could get closer to the true uncertainty by pooling the results from different models.

In more technical terms, we note \(\boldsymbol{v}_t\) the vector that contains the number of people who would vote for each of the \(c\) candidates at time \(t\). The true value of \(\boldsymbol{v}_t\) is unknown and could only be obtained if we held the election. People (including the media) usually comment on poll results as if \(\boldsymbol{v}_t\) were the quantity reported by polling institutes. Worse, they sometimes comment as if the polls reported \(\boldsymbol{v}_{\tau}\), where \(\tau\) is election day. This is not the case.

In this book we will show how the voting intentions \(\boldsymbol{v}_t\) relate to the answers \(\boldsymbol{n}_t^h\) reported by polling institutes \(h \in \left[1, \mathcal{H}\right]\). In other words, we will try to infer the value of \(\boldsymbol{v}_t\) from the results \(\boldsymbol{n}_t^h\). First we will see what information we can extract from the answers to a single poll, and then move on to pooling information from several polls to better account for the uncertainty in \(\boldsymbol{v}_t\). We will then show how, using additional information, we can try to predict the value \(\boldsymbol{v}_\tau\), where \(\tau\) is the date of the election.

What can a single poll say?

In the following we are concerned only with the results of a single poll. We will thus drop the time index \(t\) and the pollster index \(h\). The published results from a single poll typically consist of:

  • The number of polled people who are registered to vote
  • An estimate of how many of these intend to vote
  • The number of people who expressed an intention
  • Voting intentions (adjusted and sometimes not adjusted)
  • Vote certainty

We will show in the following what information we can extract from each of them.

Voting intentions

Pollsters ask a series of questions to people who are registered to vote. Among other things, they ask respondents who they would vote for should the election happen the following weekend. As a result they generally report:

\begin{equation} \mathbf{n} = \left(n^1,\dots,n^c, n^{nr}\right),\: \sum_i n^i = N \end{equation}

a vector that contains the number of people who designated each candidate; \(n^{nr}\) is the number of people who chose not to respond.

The numbers of respondents \(n_i\) who answer candidate \(i\) are usually not given in the pollsters' released reports. We often get instead \(\tilde{r}_i\), the rounded proportion of respondents who would vote for \(i\). The rounding process is such that the true proportion \(r_i\) satisfies:

\begin{equation} r_i \in [\tilde{r}_i -\delta, \tilde{r}_i + \delta] \end{equation}

where \(\delta\) is the rounding half-width (0.005 when results are rounded to the nearest percentage point) and

\begin{equation} r_i = \frac{n_i}{N-n_{nr}} \end{equation}

We can directly model the vector of ratios \(\mathbf{r}\) with a reparametrized Dirichlet distribution, with mean \(\boldsymbol{p}\) and concentration \(N\) (equivalently, \(\boldsymbol{r} \sim \operatorname{Dirichlet}(N\,\boldsymbol{p})\)):

\begin{equation} \boldsymbol{r} \sim \operatorname{Dirichlet}(\boldsymbol{p}, N) \end{equation}

where \(\sum_i p_i = 1\) and each \(p_i\) represents the "true" probability of voting for candidate \(i\). A natural prior for \(\boldsymbol{p}\) is the Dirichlet distribution:

\begin{equation} \boldsymbol{p} \sim \operatorname{Dirichlet}(\boldsymbol{\alpha}) \end{equation}

Let us consider a somewhat controversial poll from ELABE in which Pécresse was given 20% of the voting intentions while she was polling around 10% a few weeks earlier. The data are reproduced below:

Table 1: This ELABE poll released on December 7th, 2021 sparked controversy as Valérie Pécresse seemed to be closing in on Emmanuel Macron. Base: 965 people who are sure to vote and expressed an opinion. The original results can be found on ELABE's website.
Candidate        Voting intentions (%)
Arthaud           1
Dupont-Aignan     2
Hidalgo           3
Jadot             7
Lassalle          2
Le Pen           15
Macron           23
Mélenchon         8
Montebourg        2
Pécresse         20
Poutou            2
Roussel           1
Zemmour          14

We fit the simplified model using a vague prior on the true intentions:

import numpy as np
import pymc3 as pm

# Rounded voting intentions from Table 1 (in %) and the poll's sample size
results = {
    "Arthaud": 1, "Dupont-Aignan": 2, "Hidalgo": 3, "Jadot": 7, "Lassalle": 2,
    "Le Pen": 15, "Macron": 23, "Mélenchon": 8, "Montebourg": 2,
    "Pécresse": 20, "Poutou": 2, "Roussel": 1, "Zemmour": 14,
}
sample_size = 965

# Prior weights, one per candidate, in the order of Table 1
prior_intentions = np.array(
    [0.02, 0.02, 0.02, 0.10, 0.05, 0.02, 0.10, 0.20, 0.20, 0.02, 0.02, 0.20, 0.15]
)

with pm.Model() as intentions:
    p = pm.Dirichlet("intentions", prior_intentions)
    r = pm.Dirichlet("real_ratios", sample_size * p,
                     observed=np.array(list(results.values())) / 100)  # % -> ratios

We sample from the posterior distribution without any incident, as attested by the trace.
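For reference, a minimal sampling call and trace check for the model above might look like this; the number of draws is an arbitrary choice, not necessarily the one used to produce the figures:

import arviz as az

with intentions:
    idata = pm.sample(2000, tune=1000, return_inferencedata=True)

az.plot_trace(idata)  # visual check that the chains mixed well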

We can now look at the raw voting intentions inferred from the poll's results:

  • TODO Fit Dirichlet-Dirichlet model on the ELABE data
  • TODO Plot the intentions with 95% uncertainty
  • TODO Interpretation of \(\boldsymbol{\alpha}\) in terms of sample size
  • TODO Try with different priors to show the influence of the prior value
  • TODO Set the prior to the previous result

Participation

Not everyone who answers polls plans on going to vote, and we would like to estimate the participation level at the election. Among the \(N\) people interrogated and registered to vote, we write \(N_v\) for the number of people certain to vote, \(N_n\) for the number certain not to vote, and \(N_i\) for the number of undecided people, so that

\begin{equation} N = N_v + N_n + N_i \end{equation}

The \(N_i\) undecided respondents can be divided into further subcategories \(c = 1, \dots, C\). Writing \(\boldsymbol{p} = \left(p^a, p^{i}_1, \dots, p^{i}_C, p^v\right)\) for the probabilities of belonging to each category, we can write in full generality:

\begin{equation} \boldsymbol{n} \sim \operatorname{Multinomial}(\boldsymbol{p}, N) \end{equation}

Each subcategory of undecided voters groups people whose probabilities of turning out are similar, with probabilities increasing from one subcategory to the next. Some institutes ask respondents to estimate their probability of voting (Elabe, Ipsos, Opinionway), while others ask them to qualify it (Ifop, Harris).


Figure 1: How to go from the pollsters' "Probability to abstain" to a real probability distribution.

We write \(\theta_i\) for an undecided respondent's probability of turning out to vote. We assume that \(\theta_i\) follows a Beta distribution:

\begin{equation} \theta_{i} \sim \operatorname{Beta}(\alpha, \beta) \end{equation}

so that, if there are \(C\) subcategories of undecided voters, the probability \(p_{c}^{i}\) of being undecided and belonging to subcategory \(c\) is given by:

\begin{equation} p_{c}^{i} = \int_{(c-1)/C}^{c/C} P(\theta_{i}=x|\alpha, \beta) \mathrm{d}x \end{equation}

We can then use the model to estimate the parameters \(p_a\), \(p_v\), \(\alpha\) and \(\beta\), which will in turn allow us to estimate the average participation rate and other quantities of interest.
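As a minimal numerical sketch of the binning step above, with hypothetical values for \(\alpha\) and \(\beta\) (a full model would of course infer them from the data):

import numpy as np
from scipy.stats import beta

# Hypothetical Beta parameters for an undecided respondent's turnout probability
a, b = 2.0, 3.0
C = 4  # number of undecided subcategories

# p_c^i: integral of the Beta density over [(c-1)/C, c/C] for each subcategory
edges = np.linspace(0.0, 1.0, C + 1)
p_i = beta.cdf(edges[1:], a, b) - beta.cdf(edges[:-1], a, b)
print(p_i, p_i.sum())  # the subcategory probabilities sum to 1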

Choice uncertainty

The previous curves thus rest on the underlying assumption that every unsure respondent will stick to their answer. But some people will change their mind, and they are all the more likely to do so the further we are from the election.

Luckily, some polls (including ELABE's) ask people whether they are certain of their choice:

Let us first look at the distributions when we only count the people who are absolutely sure to vote for the candidate they named:

This substantially changes the results, and Pécresse is no longer guaranteed to reach the runoff. Here is the probability that X scores higher than Y, with and without taking the uncertain respondents into account.

X           Y         Uncertains don't vote   Uncertains don't change their mind
Pécresse    Le Pen    99.7%                   44.5%
Le Pen      Zemmour   71.5%                   97.15%
Mélenchon   Jadot     71.5%                   99.8%

This is of course a very unrealistic scenario, so let us try to model the presence of uncertain people.

We write \(\tilde{\zeta}_i\) for the reported proportion of the \(n_i\) people who say they are sure to vote for \(i\); it is the rounded version of \(\zeta_i\), the true proportion of people who say they are going to vote for \(i\) and are certain of their choice.

The number \(v_i\) of people who would actually vote for \(i\) this weekend is given by

\begin{equation} v_{i} = n_{i} \zeta_{i} + \Omega_{i} \end{equation}

where

\begin{equation} \Omega_{i} = \sum_{j} \bar{\zeta}_{j,i} \end{equation}

where

\begin{equation} \bar{\zeta}_{j,i} = n_j (1-\zeta_j)\: \epsilon_{j,i} \end{equation}

is the number of people who originally said they intended to vote for \(j\) but will actually vote for \(i\).

Although it is needed for posterior predictive sampling, the distribution of \(\epsilon_{j,i}\) is unknown. In the absence of more information, we have no choice but to explore several assumptions.

Won't change their mind

Everyone who told the pollster they have the intention to vote for \(i\) will actually vote for \(i\).

\begin{equation} \epsilon_{i,j} = \delta_{i,j} \end{equation}

where \(\delta\) is the Kronecker symbol, \(\delta_{i,i} = 1\) and \(\delta_{i,j} = 0\) if \(i \neq j\). This hypothesis gives us a lower bound on the total uncertainty. This corresponds to the first figure with raw vote intentions we showed earlier.

Completely undecided

A perhaps extreme example: we pretend not to know anything at all about the undecided and assume they will choose uniformly at random among the remaining candidates:

\begin{equation} \boldsymbol{\epsilon}_{i} \sim \operatorname{Dirichlet}(\boldsymbol{\beta}) \end{equation}

where \(\boldsymbol{\beta} \propto \boldsymbol{1}\). This gives us an upper bound (given the information we have) on the total uncertainty.
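We can simulate this scenario directly from the equations above; the counts and certainty proportions below are hypothetical placeholders, not the ELABE figures:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts and certainty proportions for three candidates
n = np.array([230, 200, 150])
zeta = np.array([0.75, 0.60, 0.80])
c = len(n)

draws = 10_000
v = np.empty((draws, c))
for d in range(draws):
    # eps[j, i]: probability that someone unsure of j ends up voting for i,
    # drawn uniformly on the simplex (beta proportional to 1)
    eps = rng.dirichlet(np.ones(c), size=c)
    v[d] = n * zeta + ((n * (1 - zeta))[:, None] * eps).sum(axis=0)

shares = v / v.sum(axis=1, keepdims=True)  # distribution of the final scores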

In the following figure we go one step further: we also divide the people who are certain to vote but did not give any name uniformly among the candidates. We observe increased uncertainty and results that are less clear-cut:

Constant fraction of undecided

An intermediate assumption is that a constant fraction of the undecided will vote for the candidate they named, while the rest will vote for someone else uniformly at random, as formalized below.
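One way to write this assumption, with \(\kappa\) denoting this (unspecified) fraction:

\begin{equation*} \epsilon_{i,i} = \kappa, \qquad \epsilon_{i,j} = \frac{1-\kappa}{c-1} \quad \text{for } j \neq i \end{equation*}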

Let us now show the evolution of the voting intentions for different pairs of candidates depending on the fraction of uncertain people who will vote for whoever they said they would:

\(\text{Undecided}^2\)

We assume that \(\boldsymbol{\beta} \propto \boldsymbol{\zeta}\): the more people are certain to vote for candidate \(i\), the more likely uncertain people are to vote for them in the end.

Bandwagon effect

We now assume that undecided people are more likely to rally behind the candidates with the highest scores (a bandwagon effect, helped by the publication of polls).
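One possible formalization, stated here as an assumption rather than the only choice, is to make the Dirichlet weights proportional to the current scores:

\begin{equation*} \boldsymbol{\epsilon}_{i} \sim \operatorname{Dirichlet}(\boldsymbol{\beta}), \qquad \beta_j \propto n_j \end{equation*}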

Time evolution

Pollsters will poll their population samples several times over the course of the political campaign. We can estimate the latent intentions independently for every time step, and this is probably correct when the polling frequency is low. As the election approaches, however, most pollsters provide results almost on a daily basis. It is very unlikely that voting intentions exhibit large swings from one day to the next, as many people are almost certain of who they will vote for (TODO: model this).

We can use this remark to improve our intention estimates: we can assume that intentions evolve smoothly over time, so that very sharp evolutions can be excluded. We will thus model the current candidate support as a smooth function of the candidate support at previous time steps:

\begin{align*} \boldsymbol{y}_{t} &\sim \operatorname{Multinomial}\left(\boldsymbol{\pi}_{t}, S_{t}\right)\\ \boldsymbol{\pi}_t &\sim \operatorname{Dirichlet}\left(\boldsymbol{p}_t \right)\\ \boldsymbol{p}_t&= \operatorname{Softmax}(\boldsymbol{\mu}_t)\\ \boldsymbol{\mu}_t &= f\left(\boldsymbol{\mu}_0, \dots, \boldsymbol{\mu}_{t-1} \right) \end{align*}

We will explore several options for the time dependence.

Random walk

This is a Markov process: the value at one time step depends only on the value at the previous time step.

\begin{align*} \boldsymbol{\mu}_{t} &= \boldsymbol{\mu}_{t-1} + \boldsymbol{\epsilon} \\ \boldsymbol{\epsilon} &\sim \operatorname{MvNormal}(\boldsymbol{0}, \Sigma) \end{align*}

where \(\Sigma\) is the covariance matrix. We commonly assume that \(\Sigma\) has a diagonal structure, i.e. that there is no correlation in the time evolution of the score of different political families.
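As a short sketch, assuming hypothetical dimensions and a hypothetical prior on the step sizes, this random walk can be written in PyMC3 as:

import pymc3 as pm
import theano.tensor as tt

n_days, n_candidates = 60, 13  # hypothetical campaign length and field

with pm.Model() as random_walk:
    # Diagonal Sigma: one independent step size per candidate
    sigma = pm.HalfNormal("sigma", 0.05, shape=n_candidates)
    mu = pm.GaussianRandomWalk("mu", sigma=sigma, shape=(n_days, n_candidates))
    # Map the latent values to intentions on the simplex, day by day
    p = pm.Deterministic("p", tt.nnet.softmax(mu))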

  • TODO What is the interpretation for non-diagonal elements of \(\Sigma\)?

Gaussian Processes

We can also model the time evolution with Gaussian processes. Since this is the approach we adopt for the time-varying latent support in the poll aggregation model, we defer the details to the section on time-varying latent support below.

Polls and biases

Election polls also suffer from non-sampling errors. These errors can manifest as a bias in the estimator, as well as a higher variance than would be expected from pure sampling error.

In an ideal world pollsters could sample uniformly at random among people registered to vote, but this is of course not the case. In the case of a phone interview, a uniform sample would imply that everyone has a phone, is able to pick up the phone at all times, and is willing to, and has time to, answer the survey. This is of course unrealistic; every poll suffers from sampling bias.

Sampling bias

Two different pollsters \(\rho\) and \(\rho'\) will build their samples differently; there is thus a pollster-specific sampling bias. Some pollsters also use different methods to build their sample depending on the poll, so we can assume there is a method-specific bias as well, conditioned on the sample. We will model the pollster-method bias with a random variable \(\tilde{\alpha}_{\rho,m}\), and will specify its distribution later.

\(\tilde{\alpha}_{\rho,m}\) is a crude proxy for a complex situation, but it is the only one we can craft in the absence of more information about the composition of the samples.

Non-response bias

See where the authors show that the opinion swings visible in election polls are probably due to non-response bias: people who perceive that their candidate is not doing well are less likely to answer surveys and to report voting for them.

In our models we do not account for non-response bias, and I currently don't see how we could account for it without access to a fixed panel of individuals.

Sample adjustment

Pollsters, however, have more information about the composition of their sample, as they usually ask about the respondent's gender, age, occupation, and vote in previous elections. They then use poststratification methods to adjust the results obtained with this sample towards what would be measured with a perfect sample.

In the absence of this information, we assume that adjustment and sampling procedures remain the same for each pollster, and include both biases in the same random variable \(\alpha_{\rho,m}\).

Post-stratification


compares different weighting methods with model-based (MRP) methods.

Multilevel Regression and Post-stratification (MRP)


Who will really vote?

Talk about the false controversy around methods that try to keep only people who are likely to vote. There is no point taking into account the opinion of people who are absolutely certain not to vote; on the other hand, some could change their mind, so we cannot exclude everyone.

We describe the poll aggregation model that will be used in Pollsposition.

Poll aggregation

We model the results \(\boldsymbol{y}^{h}_t\) of a poll released by institute \(h \in \left\{1, \dots, N_h \right\}\) for candidates \(c \in \left\{1, \dots, N_c \right\}\) at time \(t \in \left\{1, \dots, T_e\right\}\) with an \(N_c\)-dimensional random variable that follows a multinomial distribution:

\begin{equation} \boldsymbol{y}_{t}^{h} \sim \operatorname{Multinomial}\left(\boldsymbol{\pi}_{t}^{h}, S_{t}^{h}\right) \end{equation}

where \(S_{t}^h\) is the sample size and \(\boldsymbol{\pi}_{t}^h\) is usually interpreted and reported as the voting intentions. The voting intentions must sum to one, so we place a Dirichlet prior on them:

\begin{equation*} \boldsymbol{\pi}_{t}^{h} \sim \operatorname{Dirichlet}(\alpha\;\boldsymbol{p}_{t}^{h}) \end{equation*}

where \(\alpha\) is a concentration parameter that accounts for overdispersion in the data:

\begin{equation*} \alpha \sim \operatorname{InverseGamma}(1000, 200) \end{equation*}

The first parameter of the \(\operatorname{InverseGamma}\) distribution can be interpreted as the polls' sample size and the second as their standard deviation. \(\boldsymbol{p}_{t}^{h}\) is a vector given by

\begin{equation*} \boldsymbol{p}_{t}^{h} = \operatorname{Softmax}(\boldsymbol{\mu}_{t}^{h}) \end{equation*}

where \(\boldsymbol{\mu}_{t}^{h}\) is a vector that contains the latent popularities of the different candidates. The poll results we observe are thus a function of the sample size and of this latent popularity.

This vector can be further decomposed into several components, and we will write

\begin{equation*} \boldsymbol{\mu}_{t}^{h} = \tilde{\boldsymbol{\lambda}} + \boldsymbol{\lambda}_{t} + \tilde{\boldsymbol{\eta}} + \boldsymbol{\eta}^h \end{equation*}

where \(\tilde{\boldsymbol{\lambda}}\) is the baseline support of each candidate, \(\boldsymbol{\lambda}_t\) the time-varying latent support of each candidate, \(\boldsymbol{\eta}^h\) the bias of polling institute \(h\) towards the candidates (the house effect), and \(\tilde{\boldsymbol{\eta}}\) the polling market's bias towards the different candidates.
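To make the structure concrete, here is a minimal end-to-end sketch of this observation model. It is a simplification under stated assumptions: one random-walk step per poll instead of the Gaussian process below, simulated placeholder data, and plain Normal priors standing in for the ZeroSumNormal priors introduced in the following sections:

import numpy as np
import pymc3 as pm
import theano.tensor as tt

rng = np.random.default_rng(0)

# Placeholder data standing in for the real poll database
n_polls, n_candidates, n_houses = 40, 13, 8
house = rng.integers(0, n_houses, size=n_polls)  # pollster of each poll
S = np.full(n_polls, 1000)                       # sample sizes
y = rng.multinomial(1000, np.ones(n_candidates) / n_candidates, size=n_polls)

with pm.Model() as aggregation:
    alpha = pm.InverseGamma("alpha", 1000, 200)  # overdispersion parameter
    # Baseline support, market bias and house effects; plain Normals are
    # used here as stand-ins for ZeroSumNormal, which pymc3 lacks
    lambda_tilde = pm.Normal("lambda_tilde", 0.0, 0.5, shape=n_candidates)
    eta_tilde = pm.Normal("eta_tilde", 0.0, 0.15, shape=n_candidates)
    eta = pm.Normal("eta", 0.0, 0.15, shape=(n_houses, n_candidates))
    # Time-varying support, one step per poll for simplicity
    lambda_t = pm.GaussianRandomWalk(
        "lambda_t", sigma=0.05, shape=(n_polls, n_candidates)
    )
    mu = lambda_tilde + lambda_t + eta_tilde + eta[house]
    p = tt.nnet.softmax(mu)
    pi = pm.Dirichlet("pi", alpha * p, shape=(n_polls, n_candidates))
    pm.Multinomial("y", n=S, p=pi, shape=(n_polls, n_candidates), observed=y)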

Time-varying latent support

A core quantity is the support \(\boldsymbol{\lambda}_t\) for each candidate in the election, and its evolution as new information (polls, special events, updates in the fundamental data, etc.) arrives.

We model the relation between consecutive values of the vector \(\boldsymbol{\lambda}_t\) with Gaussian processes, using the standard squared exponential kernel:

\begin{align*} \boldsymbol{\lambda}_t &\sim \mathcal{GP}(\boldsymbol{\theta}, K)\\ K\left(t, t'\right) &= \Sigma^2\; \exp\left(-\frac{1}{2} \sum_{i=1}^c \left(\frac{t'-t}{\tau_i}\right)^2 \right) \end{align*}

where \(\boldsymbol{\theta}\) is the mean support for each party, \(\Sigma\) is the covariance matrix between the different candidates, and \(\tau_i\) the typical timescale over which the support for candidate \(i\) changes. We currently assume that the covariance matrix is diagonal, i.e. that each party's support evolves independently:

\begin{equation*} \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_c) \end{equation*}
  • TODO What are we using for \(\boldsymbol{\theta}\) in this model?
  • TODO How does it work when we are using several timescales?
  • TODO What do timescales actually mean? Can we relate it to the amount of variation for the candidate support?
  • TODO How would you interpret the non diagonal elements of \(\Sigma\)? Can that be seen as support transfers?
  • TODO Try the other mean-reverting kernel, the Ornstein-Uhlenbeck kernel
  • TODO I don't see why the process should be stationary, try the Wiener kernel?
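To make this concrete, here is a minimal PyMC3 sketch of such a GP prior for a single candidate; the priors on the amplitude and timescale are hypothetical, not the values used in the actual model:

import numpy as np
import pymc3 as pm

t = np.arange(60)[:, None]  # daily time grid over the campaign

with pm.Model() as latent_support:
    # Hypothetical priors for the amplitude and timescale of one candidate
    sigma = pm.HalfNormal("sigma", 0.2)
    tau = pm.Gamma("tau", 2.0, 0.1)  # typical timescale, in days
    cov = sigma ** 2 * pm.gp.cov.ExpQuad(1, ls=tau)
    gp = pm.gp.Latent(cov_func=cov)
    lambda_c = gp.prior("lambda_c", X=t)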

House effects

House effect

The systematic differences between pollsters [538 pollster rating] need to enter the model. Pollsters differ in the way they constitute their panels and the way they adjust results. We can separate house effects into two elements.

Each pollster also has its idiosyncratic bias regarding the different families:

\begin{equation*} \eta^{hc} \sim \operatorname{ZeroSumNormal}(0, 0.15) \end{equation*}
  • TODO Should this enter the variance of the estimate instead?
  • TODO Why are we using a ZeroSumNormal distribution?

Poll bias

In an ideal world the institutes' biases would cancel each other out: one institute's overestimation of the far right's score would be balanced by another institute's underestimation. This is what would happen in the limit of a large number of polling institutes, should the biases be randomly distributed. But biases are not always randomly distributed, as shown in [Shirani-Mehr et al. 2018]: there is often a measurable polling market bias.

We can measure this bias on past elections, since we have the election results to compare the polls to, using the model introduced in [Shirani-Mehr et al. 2018], in which \(v_i\) is the final vote share and \(\alpha_i\) thus represents the market bias. The authors use a hierarchical structure on the parameters \(\alpha_i\), \(\beta_j\) and \(\tau_j\), which each share a prior distribution centered on 0.

Failure to account for correlated poll errors will lead to improper estimates (and, later on, predictions). We thus need to account for them in our models. One way (which appears to be what 538 does [ref needed]) is to "correct" the different candidates' scores for the correlated error observed in past elections.


  • TODO Measure this bias on past election data

Predict the result of the election

Poll aggregator

\begin{equation*} \mu^{hec}_t = \lambda^{c}_t + \tilde{\lambda}^{c} + \beta_U\;U_t + \eta^{hec} + \tilde{\eta}^f \end{equation*}

Fundamentals model

The second model tries to forecast the result of the election on election day. The election is a big poll, but one that is completely unbiased by definition. If we write \(t_0\) for election day, we can write:

\begin{align*} \boldsymbol{R}_{t_0} &\sim \operatorname{Multinomial}\left(\boldsymbol{\tilde{\pi}}, S\right) \\ \boldsymbol{\tilde{\pi}} &\sim \operatorname{Dirichlet}(\alpha^{F}\;\boldsymbol{p}_{t_0})\\ \alpha^{F} &\sim \operatorname{InverseGamma}(1000, 200)\\ \boldsymbol{p}_{t_0} &= \operatorname{Softmax}(\boldsymbol{\mu}_{t_0})\\ \mu_{t_0} &= \lambda^{f}_{t_0} + \tilde{\lambda}^{f} + \beta_U\,U_{t_0} \end{align*}
  • TODO Do we need overdispersion here?

Candidate support

Baseline

We first assume that each political family \(f\) has a baseline level of popular support, and that the variations we observe during the campaign are variations around that baseline \(\tilde{\lambda}^f\).

\begin{align*} \sigma_{\lambda} &\sim \operatorname{HalfNormal}(0.5)\\ \tilde{\lambda}^{f} &\sim \operatorname{ZeroSumNormal}(0, \sigma_{\lambda}) \end{align*}
  • TODO Why ZeroSumNormal here?
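As a rough illustration of the zero-sum constraint (a sketch only: the actual ZeroSumNormal distribution adjusts the covariance rather than simply centering draws), one can center a vector of Normals:

import pymc3 as pm

n_families = 13  # hypothetical number of political families

with pm.Model() as baseline:
    sigma_lambda = pm.HalfNormal("sigma_lambda", 0.5)
    raw = pm.Normal("raw", 0.0, sigma_lambda, shape=n_families)
    # Center the draws so the family effects sum to zero
    lambda_tilde = pm.Deterministic("lambda_tilde", raw - raw.mean())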

Evolution during the election

House effects

Pollsters differ in the way they constitute their panels and the way they adjust results. We can separate house effects into two elements.

The first element, the systemic poll bias \(\tilde{\zeta}_{f}\), is shared by every pollster for each political family \(f\):

\begin{equation} \tilde{\zeta}_{f} \sim \operatorname{ZeroSumNormal}(0, 0.15) \end{equation}

Each pollster also has its idiosyncratic bias regarding the different families:

\begin{equation} \zeta_{hf} \sim \operatorname{ZeroSumNormal}(0, 0.15) \end{equation}
  • TODO Why the ZeroSumNormal here too?

    Making sure that our different effects sum to zero. Think about the month effect. It only makes sense in a relative sense: some months are better than average, some others are worse, but you can't have only good months – they'd be good compared to what? So we want to make sure that the average month effect is 0, while allowing each month to be better or worse than average if needed. And the reasoning is the same for house effects for instance – can you see why?

Discussion