Using Microsoft Cognitive Services Computer Vision and Face API – (OCR & Faces) on HoloLens

In the last few years there has been great advancement in the field of neural networks and artificial intelligence (AI). Microsoft has developed Cognitive Services, a set of machine-learning APIs that solve common problems in the field of AI. The Cognitive Services APIs are grouped into five categories:

  • Vision—analyze images and videos for content and other useful information.
  • Speech—tools to improve speech recognition and identify the speaker.
  • Language—understanding sentences and intent rather than just words.
  • Knowledge—tracks down research from scientific journals for you.
  • Search—applies machine learning to web searches.

In this tutorial, we will see how face detection and attribute identification are performed with the Face API. There is also an additional GitHub guide for character recognition through the Computer Vision API (see the end of the article).

It is important to set the projection of the scene's Camera to Perspective and to place it in the right position, as shown in the picture: Position (0, -0.02, 0), Rotation (0, 0, 0), Scale (1, 1, 1). Essentially, the point where you choose to place the camera is the origin relative to which the other objects in the scene are initialized.
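If you prefer to set these values from code instead of the Inspector, a minimal sketch could look like the following (the `CameraSetup` class is a hypothetical helper, not part of the tutorial's scripts; attach it to the scene's Camera GameObject):

```csharp
using UnityEngine;

// Hypothetical helper: applies the camera settings described above.
// Attach to the scene's Camera GameObject.
public class CameraSetup : MonoBehaviour
{
    void Awake()
    {
        // Perspective projection (i.e. not orthographic)
        GetComponent<Camera>().orthographic = false;

        transform.position = new Vector3(0f, -0.02f, 0f);
        transform.rotation = Quaternion.identity;
        transform.localScale = Vector3.one;
    }
}
```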

(Screenshot: Camera Inspector settings)

You will need two GameObjects. The first, named “PhotoCaptureManager”, will contain the scripts and will be responsible for capturing the photo on the HoloLens device and then sending it to the Azure Cognitive Services for processing. The second, named “AudioSource”, will contain an Audio Source component in order to play back the voice from the text-to-speech implementation, for example “He is pretty much 26 years old and looks happy”.

Before you start coding, you need to create an Azure account, choose a billing package, and save the service key in order to use it in your code. With the F0 free package you can make 20 calls per minute and 30,000 calls per month.

Add to the GameObject “PhotoCaptureManager” the scripts KeywordGlobalManager.cs, JSONObject.cs, PhotoCaptureManager.cs (take a look at the guide Take and display photos and videos on HoloLens) and MCSCognitiveServices.cs, which contains the web-service call to the Azure Cognitive Services Face API, parses the response from the server and maps it into a list of Face models.
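For reference, the Face API detect endpoint returns a JSON array with one entry per detected face. An abbreviated, illustrative example with the attributes requested in this tutorial might look like this (the values shown are made up; only the shape matters for the parsing code below):

```json
[
  {
    "faceId": "c5c24a82-6845-4031-9d5d-978df9175426",
    "faceRectangle": { "top": 131, "left": 177, "width": 162, "height": 162 },
    "faceAttributes": {
      "age": 26.0,
      "gender": "male",
      "emotion": {
        "anger": 0.0, "contempt": 0.001, "disgust": 0.0, "fear": 0.0,
        "happiness": 0.93, "neutral": 0.069, "sadness": 0.0, "surprise": 0.0
      }
    }
  }
]
```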

Face Model

public class MCSFaceDto {
    
    public List<Face> faces { get; set; }
}
public class Face
{
    public string faceId { get; set; }
    public FaceRectangle faceRectangle { get; set; }
    public FaceAttributes faceAttributes { get; set; }
    public EmotionAttributes emotionAttributes { get; set; }
}
public class FaceRectangle
{
    public int top { get; set; }
    public int left { get; set; }
    public int width { get; set; }
    public int height { get; set; }
}
public class FaceAttributes
{
    public int gender { get; set; } // male = 0 , female = 1
    public float age { get; set; }
    public FacialHair facialHair { get; set; }
}
public class EmotionAttributes
{
    public float anger { get; set; }
    public float contempt { get; set; }
    public float disgust { get; set; }
    public float fear { get; set; }
    public float happiness { get; set; }
    public float neutral { get; set; }
    public float sadness { get; set; }
    public float surprise { get; set; }
}
public class HeadPose
{
    public float pitch { get; set; }
    public float roll { get; set; }
    public float yaw { get; set; }
}
public class FacialHair
{
    public bool hasMoustache { get; set; }
    public bool hasBeard { get; set; }
    public bool hasSideburns { get; set; }
}

MCSCognitiveServices.cs

public class MCSCognitiveServices : MonoBehaviour {
    
    // MCS Face API: coroutine that POSTs the captured image to the Face service.
    // "type" is the endpoint name, e.g. "detect".
    public IEnumerator PostToFace(byte[] imageData, string type)
    {
        bool returnFaceId = true;
        string[] faceAttributes = new string[] { "age", "gender", "emotion" };

        var url = string.Format("https://westus.api.cognitive.microsoft.com/face/v1.0/{0}?returnFaceId={1}&returnFaceAttributes={2}", type, returnFaceId, Converters.ConvertStringArrayToString(faceAttributes));
        var headers = new Dictionary<string, string>() {
            { "Ocp-Apim-Subscription-Key", Constants.MCS_FACEKEY },
            { "Content-Type", "application/octet-stream" }
        };

        WWW www = new WWW(url, imageData, headers);
        yield return www;

        // Bail out on a failed request instead of parsing an empty body.
        if (!string.IsNullOrEmpty(www.error))
        {
            Debug.LogError(www.error);
            yield break;
        }

        JSONObject j = new JSONObject(www.text);
        if (j != null)
            SaveJsonToModel(j);
    }
    
    private void SaveJsonToModel(JSONObject j)
    {
        MCSFaceDto faceDto = new MCSFaceDto();
        List<Face> faces = new List<Face>();
        
        foreach (var faceItem in j.list)
        {
            Face face = new Face() { faceId = faceItem.GetField("faceId").ToString() };
            
            var faceRectangle = faceItem.GetField("faceRectangle");
            face.faceRectangle = new FaceRectangle()
            {
                left = int.Parse(faceRectangle.GetField("left").ToString()),
                top = int.Parse(faceRectangle.GetField("top").ToString()),
                width = int.Parse(faceRectangle.GetField("width").ToString()),
                height = int.Parse(faceRectangle.GetField("height").ToString())
            };
            
            var faceAttributes = faceItem.GetField("faceAttributes");
            face.faceAttributes = new FaceAttributes()
            {
                age = int.Parse(faceAttributes.GetField("age").ToString().Split('.')[0]),
                gender = faceAttributes.GetField("gender").ToString().Replace("\"", "") == "male" ? 0 : 1
            };
            
            var emotion = faceAttributes.GetField("emotion");
            face.emotionAttributes = new EmotionAttributes()
            {
                anger = float.Parse(emotion.GetField("anger").ToString()),
                contempt = float.Parse(emotion.GetField("contempt").ToString()),
                disgust = float.Parse(emotion.GetField("disgust").ToString()),
                fear = float.Parse(emotion.GetField("fear").ToString()),
                happiness = float.Parse(emotion.GetField("happiness").ToString()),
                neutral = float.Parse(emotion.GetField("neutral").ToString()),
                sadness = float.Parse(emotion.GetField("sadness").ToString()),
                surprise = float.Parse(emotion.GetField("surprise").ToString()),
            };
            faces.Add(face);
        }

        faceDto.faces = faces;

        PlayVoiceMessage.Instance.PlayTextToSpeechMessage(faceDto);
    }
}
public class Constants
{
	public static string MCS_FACEKEY = "---------Face-Key---------";
}

Finally, the spoken description is built in the PlayVoiceMessage.cs script, which composes a message from the detected attributes and reproduces it through the device's speaker via text-to-speech.

public class PlayVoiceMessage : MonoBehaviour {

    public static PlayVoiceMessage Instance { get; private set; }
        
    public GameObject photoCaptureManagerGmObj;
    
    void Awake()
    {
        Instance = this;
    }

    public void PlayTextToSpeechMessage(MCSFaceDto face)
    {
        string message = string.Empty;
        string emotionName = string.Empty;

        if (face.faces.Count > 0)
        {
            EmotionAttributes emotionAttributes = face.faces[0].emotionAttributes;

            Dictionary<string, float> emotions = new Dictionary<string, float>
            {
                { "anger", emotionAttributes.anger },
                { "contempt", emotionAttributes.contempt },
                { "disgust", emotionAttributes.disgust },
                { "fear", emotionAttributes.fear },
                { "happiness", emotionAttributes.happiness },
                { "neutral", emotionAttributes.neutral },
                { "sadness", emotionAttributes.sadness },
                { "surprise", emotionAttributes.surprise }
            };

            // Pick the emotion with the highest confidence score
            // (Keys.Max() would only return the alphabetically last key).
            emotionName = emotions.OrderByDescending(e => e.Value).First().Key;

            message = string.Format("{0} is pretty much {1} years old and looks {2}", face.faces[0].faceAttributes.gender == 0 ? "He" : "She", face.faces[0].faceAttributes.age, emotionName);         
        }
        else
            message = "I couldn't detect anyone.";

        // Try and get a TTS Manager
        TextToSpeechManager tts = null;

        if (photoCaptureManagerGmObj != null)
        {
            tts = photoCaptureManagerGmObj.GetComponent<TextToSpeechManager>();
        }

        if (tts != null)
        {
            //Play voice message
            tts.SpeakText(message);
        }
    }
}

Run the application and use the voice command “How does this person look”. After a few seconds you will hear the description from Cognitive Services, played back with text-to-speech.
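The KeywordGlobalManager.cs script is not listed in this article; a minimal sketch of how such a voice command could be wired up with Unity's KeywordRecognizer might look like the one below. The `TakePhoto()` call is an assumption about PhotoCaptureManager's entry point, not a confirmed API of the tutorial's scripts:

```csharp
using UnityEngine;
using UnityEngine.Windows.Speech;

// Sketch: listens for the voice command and triggers the photo capture,
// which in turn sends the image to MCSCognitiveServices.PostToFace.
public class KeywordGlobalManager : MonoBehaviour
{
    private KeywordRecognizer keywordRecognizer;

    void Start()
    {
        keywordRecognizer = new KeywordRecognizer(
            new string[] { "How does this person look" });
        keywordRecognizer.OnPhraseRecognized += OnPhraseRecognized;
        keywordRecognizer.Start();
    }

    private void OnPhraseRecognized(PhraseRecognizedEventArgs args)
    {
        // Assumed entry point: PhotoCaptureManager captures the photo
        // and forwards the bytes to the Face API.
        GetComponent<PhotoCaptureManager>().TakePhoto();
    }
}
```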

You can find and download the samples from GitHub at these links: https://github.com/gntakakis/Hololens-MSCognitiveServicesFace-Unity and https://github.com/gntakakis/Hololens-MSCognitiveServicesOCR-Unity
