Learning video representations with limited supervision