Leveraging Core ML to generate personalized videos in iOS apps, similar to Apple Photos and Google Photos

If you have used the Photos app or Google Photos on an iPhone, you might have noticed that they generate videos for you automatically: they group similar photos and videos and turn them into a personalized video. How do they do it? Let's try building something similar using the Core ML framework in iOS.

Core ML is a framework provided by Apple for integrating machine learning models into your app. It underpins Vision for analyzing images, Natural Language for processing text, Speech for converting audio to text, and Sound Analysis for identifying sounds in audio. Core ML itself builds on low-level primitives such as Accelerate, BNNS, and Metal Performance Shaders. You can read more about Core ML here: https://developer.apple.com/documentation/coreml


To start using Core ML, you first need a Core ML model. Don't worry, this is easy: you can build and train one with the Create ML app bundled with Xcode. Models trained using Create ML are in the Core ML model format and are ready to use in your app. Alternatively, you can use a wide variety of other machine learning libraries and then use Core ML Tools to convert the model into the Core ML format. Once a model is on a user's device, you can also use Core ML to retrain or fine-tune it on-device with that user's data.

Creating the Machine Learning Model using CreateML app

Open the Create ML app from Xcode (Xcode > Open Developer Tool > Create ML).

Go ahead and create an Image Classifier project.

Give it an appropriate name. The next steps are pretty straightforward. For my example, I created a folder called boards. Inside the boards folder I added subfolders, and in each subfolder I added multiple photos of longboards.
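Create ML's image classifier uses the subfolder names as the class labels, so the training data ends up looking roughly like this (the subfolder names here are placeholders for whatever categories you trained on):

	boards/
		longboard-type-a/        <- each subfolder name becomes a class label
			IMG_0001.jpg
			IMG_0002.jpg
		longboard-type-b/
			IMG_0101.jpg
			IMG_0102.jpg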

Now go ahead and add the boards folder as the training data in the Create ML app.

Click on the Preview tab in Create ML. You can test the trained model there by dropping in your own images.

After testing, export the trained model from the Output tab.

Let’s name this model boardclassifier.

Take this model and drop it into the Xcode project you created. Xcode will automatically generate a model class for you. This is very important.
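The generated Swift class is named after the .mlmodel file, and it is how the rest of the code talks to the model. A minimal sketch of loading it (note that on newer SDKs Xcode marks the parameterless initializer as deprecated in favour of init(configuration:), which is why the configuration form is shown here):

	import CoreML
	import Vision

	do {
		let config = MLModelConfiguration()
		config.computeUnits = .all      // let Core ML choose CPU, GPU, or Neural Engine
		let classifier = try boardclassifier(configuration: config)
		// The underlying MLModel is exposed via .model, which is what Vision needs below.
		let visionModel = try VNCoreMLModel(for: classifier.model)
		print("Loaded model: \(visionModel)")
	} catch {
		print("Failed to load the Core ML model: \(error)")
	}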

	// MARK: COREML
	lazy var classificationRequest: VNCoreMLRequest = {
		do {
			/*
			 Use the `boardclassifier` Swift class Core ML generates from the model.
			 To use a different Core ML classifier model, add it to the project
			 and replace `boardclassifier` with that model's generated Swift class.
			 */
			let model = try VNCoreMLModel(for: boardclassifier().model)

			let request = VNCoreMLRequest(model: model, completionHandler: { [weak self] request, error in
				self?.processClassifications(for: request, error: error)
			})
			// Center-crop and scale each frame to the input size the model expects.
			request.imageCropAndScaleOption = .centerCrop
			return request
		} catch {
			fatalError("Failed to load Vision ML model: \(error)")
		}
	}()
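The request's completion handler funnels every result into processClassifications(for:error:). The exact implementation lives in the linked repo; the core idea is to collect each frame's observations into a shared static collection (AssetDataRequestHandler.classifications, which you'll see used again in the classification code later in this post). A minimal sketch:

	func processClassifications(for request: VNRequest, error: Error?) {
		guard let results = request.results as? [VNClassificationObservation] else {
			print("Unable to classify image: \(error?.localizedDescription ?? "unknown error")")
			return
		}
		// Keep every per-frame observation; they are aggregated per asset later.
		AssetDataRequestHandler.classifications.append(contentsOf: results)
	}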

Here is the plan: read through the media available on the phone, push each video through the Core ML model, and get a confidence score for each one. If a video scores high enough, we trim it, and finally we merge the trimmed clips into a single personalized video.

Architecture
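The flow is: sample frames from each video, run them through the classifier, keep and trim the clips that score well, then stitch the trimmed clips together. A deliberately simplified driver for that pipeline might look like the sketch below; the method names match the ones described in the rest of this post, but the confidence threshold and the DispatchGroup coordination are my own simplification (the real project coordinates these steps with notifications and UI updates).

	func generatePersonalizedVideo(for assets: [AssetData]) {
		var trimmedAssets: [AVAsset] = []
		let group = DispatchGroup()
		for assetData in assets {
			group.enter()
			// 1. Sample frames from the video.
			getImagesForAssetAsynchronously(assetData: assetData) { assetData, _ in
				// 2. Classify every sampled frame.
				self.updateClassificationsForAssetData(for: assetData) { assetData in
					// 3. Hypothetical cut-off: skip clips the model is not confident about.
					guard assetData.classificationTotalScore > 10.0 else { group.leave(); return }
					// 4. Trim the clip we want to keep.
					do {
						try self.trimVideos(assetData: assetData) { success, _, url in
							if success, let url = url {
								trimmedAssets.append(AVAsset(url: url))
							}
							group.leave()
						}
					} catch {
						group.leave()
					}
				}
			}
		}
		// 5. Merge everything into one personalized video.
		group.notify(queue: .main) {
			self.merge(mlmodelName: "boardclassifier", arrayVideos: trimmedAssets, filename: "personalized") { exporter, _ in
				print("Merged video written to \(exporter.outputURL?.path ?? "")")
			}
		}
	}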

Code / Implementation Details

This is the data structure I use to store data for processing in this app.

import AVFoundation
import UIKit
import Vision

class AssetData: Hashable {

	var name: String = ""
	var path: String = ""
	var avasset: AVAsset?
	// Aggregate classification score for this asset.
	var classificationTotalScore = 0.0
	var classificationIdentifier = ""
	// Frames sampled from the video.
	var images: [UIImage]?
	var documentDirectoryPath: URL?
	// Raw per-frame observations returned by Vision.
	var classifications: [VNClassificationObservation] = []
	// The label the classifier was most confident about.
	var maxLabel: String = ""
	var trimmedPath: URL?

	static func == (lhs: AssetData, rhs: AssetData) -> Bool {
		return lhs.name == rhs.name && lhs.path == rhs.path
	}

	func hash(into hasher: inout Hasher) {
		hasher.combine(path)
	}
}

Go through all the videos in the bundle folder and generate images from them asynchronously at regular intervals; here I generate one image per second. The method that does this has the following signature (the full implementation is in the linked repo):

func getImagesForAssetAsynchronously(assetData: AssetData, completionHandler: @escaping (AssetData, Bool)-> Void) {
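Filling in the body of that method could look something like the sketch below, assuming AVAssetImageGenerator is used to pull one frame per second (the repo's version may differ in the details):

	func getImagesForAssetAsynchronously(assetData: AssetData, completionHandler: @escaping (AssetData, Bool) -> Void) {
		guard let asset = assetData.avasset else {
			completionHandler(assetData, false)
			return
		}
		// Ask for one frame per second of video.
		let seconds = max(Int(CMTimeGetSeconds(asset.duration)), 1)
		let times = (0..<seconds).map { NSValue(time: CMTimeMake(value: Int64($0), timescale: 1)) }
		let generator = AVAssetImageGenerator(asset: asset)
		generator.appliesPreferredTrackTransform = true

		var images: [UIImage] = []
		var processed = 0
		generator.generateCGImagesAsynchronously(forTimes: times) { _, cgImage, _, result, _ in
			if result == .succeeded, let cgImage = cgImage {
				images.append(UIImage(cgImage: cgImage))
			}
			processed += 1
			// All requested times have been handled; hand the frames back.
			if processed == times.count {
				assetData.images = images
				completionHandler(assetData, true)
			}
		}
	}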

Once the images are generated, pass them through the classifier to get confidence scores.

	func updateClassificationsForAssetData(for assetData: AssetData, completionHandler: @escaping (AssetData) -> Void) {
		classificationText = "Classifying..."
		guard let images = assetData.images, !images.isEmpty else {
			completionHandler(assetData)
			return
		}
		// Note: .sync blocks the caller until every frame of this asset has been classified.
		DispatchQueue.global(qos: .userInitiated).sync {
			for (pos, image) in images.enumerated() {
				let orientation = CGImagePropertyOrientation(image.imageOrientation)
				guard let ciImage = CIImage(image: image) else { fatalError("Unable to create \(CIImage.self) from \(image).") }
				let handler = VNImageRequestHandler(ciImage: ciImage, orientation: orientation)
				do {
					// Each perform() call runs the classificationRequest defined earlier;
					// its completion handler appends the observations to AssetDataRequestHandler.classifications.
					try handler.perform([self.classificationRequest])
					if pos >= images.count - 1 {
						// Last frame processed: hand all collected observations back to the asset.
						assetData.classifications = AssetDataRequestHandler.classifications
						AssetDataRequestHandler.classifications.removeAll()
						completionHandler(assetData)
					}
				} catch {
					/*
					 This handler catches general image processing errors. The `classificationRequest`'s
					 completion handler `processClassifications(_:error:)` catches errors specific
					 to processing that request.
					 */
					print("Failed to perform classification.\n\(error.localizedDescription)")
				}
			}
		}
	}
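The raw observations stored on each AssetData still need to be reduced to a single score and label (the classificationTotalScore and maxLabel fields). One straightforward way to aggregate them, shown here as a sketch (the repo may weight frames differently):

	func aggregateClassifications(for assetData: AssetData) {
		var scoresByLabel: [String: Double] = [:]
		// Sum the confidence of every frame-level observation per label.
		for observation in assetData.classifications {
			scoresByLabel[observation.identifier, default: 0] += Double(observation.confidence)
		}
		// The label with the highest total becomes the asset's dominant label.
		if let best = scoresByLabel.max(by: { $0.value < $1.value }) {
			assetData.maxLabel = best.key
			assetData.classificationIdentifier = best.key
			assetData.classificationTotalScore = best.value
		}
	}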

Once we know the confidence level of each video for each label, we can trim the videos the classifier is most confident about. This is done as shown below.

	// MARK: VIDEO HANDLERS
	func trimVideos(assetData: AssetData, completionBlock: @escaping (Bool, AssetData?, URL?) -> ()) throws -> Void {
		let exportSession = AVAssetExportSession(asset: assetData.avasset!, presetName: AVAssetExportPresetHighestQuality)
		let outputURL = try FileManager.createDirectoryForTrimmedFiles()?.appendingPathComponent(assetData.name + ".mov")
		exportSession?.outputURL = outputURL
		exportSession?.shouldOptimizeForNetworkUse = true
		exportSession?.outputFileType = AVFileType.mov

		// Keep a fixed window of the clip, from second 5 to second 8.
		let startTime = CMTimeMake(value: Int64(5), timescale: 1)
		let stopTime = CMTimeMake(value: Int64(8), timescale: 1)
		let range = CMTimeRangeFromTimeToTime(start: startTime, end: stopTime)
		exportSession?.timeRange = range
		exportSession?.exportAsynchronously(completionHandler: {
			switch exportSession?.status {
			case .failed:
				print("Export failed: \(exportSession?.error?.localizedDescription ?? "no error info")")
				// Report failures too, so callers are not left waiting.
				completionBlock(false, assetData, nil)
			case .cancelled:
				print("Export canceled")
				completionBlock(false, assetData, nil)
			case .completed:
				assetData.trimmedPath = outputURL
				let notificationCenter = NotificationCenter.default
				notificationCenter.post(name: Notification.Name("SendUpdatesToUser"), object: nil, userInfo: ["text": "completed trimming video \(assetData.name)"])
				completionBlock(true, assetData, outputURL)
			default:
				break
			}
		})
	}
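One thing to watch out for: the 5-to-8-second window above is hard-coded, so clips shorter than eight seconds may export an empty or failed result. A small hypothetical helper that clamps the window to the asset's actual duration could be used instead when building the time range:

	// Hypothetical helper: clamp the trim window to the asset's duration.
	func clampedTrimRange(for asset: AVAsset, start: Double = 5, end: Double = 8) -> CMTimeRange {
		let duration = CMTimeGetSeconds(asset.duration)
		let clampedStart = min(start, max(duration - 1, 0))
		let clampedEnd = min(end, duration)
		return CMTimeRangeFromTimeToTime(
			start: CMTimeMakeWithSeconds(clampedStart, preferredTimescale: 600),
			end: CMTimeMakeWithSeconds(clampedEnd, preferredTimescale: 600))
	}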

Merge the trimmed videos into a single video with the code below.

	func merge(mlmodelName: String, arrayVideos: [AVAsset], filename: String, completion: @escaping (_ exporter: AVAssetExportSession, _ mlmodelName: String) -> ()) -> Void {
		let mainComposition = AVMutableComposition()
		let compositionVideoTrack = mainComposition.addMutableTrack(withMediaType: .video, preferredTrackID: kCMPersistentTrackID_Invalid)
		var insertTime = CMTime.zero
		// Append each trimmed clip to the video track, back to back.
		for videoAsset in arrayVideos {
			try! compositionVideoTrack?.insertTimeRange(CMTimeRangeMake(start: CMTime.zero, duration: videoAsset.duration), of: videoAsset.tracks(withMediaType: .video)[0], at: insertTime)
			insertTime = CMTimeAdd(insertTime, videoAsset.duration)
		}
		compositionVideoTrack?.preferredTransform = arrayVideos[0].preferredTransform

		// Add a background audio track (Toybox.m4a, bundled with the app) spanning the whole merged video.
		let soundtrackTrack = mainComposition.addMutableTrack(withMediaType: .audio, preferredTrackID: kCMPersistentTrackID_Invalid)
		do {
			if let audioURL = Bundle.main.url(forResource: "Toybox", withExtension: "m4a") {
				let audioAsset = AVAsset(url: audioURL)
				try soundtrackTrack?.insertTimeRange(
					CMTimeRangeMake(start: .zero, duration: insertTime),
					of: audioAsset.tracks(withMediaType: .audio)[0],
					at: .zero)
			}
		} catch {
			print("Failed to load audio track")
		}

		do {
			let outputFileURL = try FileManager.createDirectoryForTrimmedFiles()?.appendingPathComponent(filename + ".mp4")
			let fileManager = FileManager.default
			// Use .path (not .absoluteString) when talking to FileManager.
			if fileManager.fileExists(atPath: outputFileURL!.path) {
				try fileManager.removeItem(at: outputFileURL!)
			}

			let exporter = AVAssetExportSession(asset: mainComposition, presetName: AVAssetExportPresetHighestQuality)
			exporter?.outputURL = outputFileURL
			exporter?.outputFileType = AVFileType.mp4   // match the .mp4 extension above
			exporter?.shouldOptimizeForNetworkUse = true

			exporter?.exportAsynchronously {
				DispatchQueue.main.async {
					completion(exporter!, mlmodelName)
				}
			}
		} catch {
			print("error merging \(error)")
		}
	}
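The completion handler receives the AVAssetExportSession, so the caller can check its status and decide what to do with the file. For example, one option is to save the result to the user's photo library; this is a sketch of one possibility (not necessarily what the linked project does), and the trimmedAssets array is a placeholder for the clips you produced above. It also requires photo library add permission.

	import Photos

	merge(mlmodelName: "boardclassifier", arrayVideos: trimmedAssets, filename: "personalized") { exporter, _ in
		guard exporter.status == .completed, let url = exporter.outputURL else {
			print("Merge failed: \(exporter.error?.localizedDescription ?? "unknown error")")
			return
		}
		// Save the merged video to the Photos library (needs add-only authorization).
		PHPhotoLibrary.shared().performChanges({
			_ = PHAssetChangeRequest.creationRequestForAssetFromVideo(atFileURL: url)
		}) { saved, error in
			print(saved ? "Saved merged video to Photos" : "Save failed: \(String(describing: error))")
		}
	}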

Once you have the merged video, there is a bonus step: you can overlay labels and text on it using the method below.

	func addGraphicsToVideos() {
		let videoEditor = VideoEditor()
		var counter = 0
		let notificationCenter = NotificationCenter.default
		for assetData in personalizedAssets {
			videoEditor.makeBirthdayCard(fromVideoAt: URL(fileURLWithPath: assetData.path), forName: assetData.name) { [self] graphicsVideoURL in
				notificationCenter.post(name: Notification.Name("SendUpdatesToUser"), object: nil, userInfo: ["text":"Generated video using \(assetData.name)"])
				counter += 1
				print("video url \(graphicsVideoURL?.absoluteString ?? "")")
				assetData.path = graphicsVideoURL!.path
				if counter == personalizedAssets.count {
					generatePersonalizedVideos.isHidden = false
					activityIndicator.stopAnimating()
					activityIndicator.isHidden = true
					let alert = UIAlertController(title: "Information", message: "Finished generating personalized videos", preferredStyle: .alert)
					alert.addAction(UIAlertAction(title: "OK", style: .default, handler: { _ in
//						let player = AVPlayer(url: graphicsVideoURL!)
//						let vc = AVPlayerViewController()
//						vc.player = player
//						present(vc, animated: true) {
//							vc.player?.play()
//						}
					}))
					self.present(alert, animated: true)
				}
			}
		}
	}
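The commented-out lines inside the alert action hint at the natural last step: playing the finished video. Un-commenting them, or using a small helper like the sketch below (called from the view controller), presents the result with AVPlayerViewController.

	import AVKit

	// Present the finished video, e.g. from the alert's OK action,
	// passing in the graphicsVideoURL received in the callback above.
	func playPersonalizedVideo(at url: URL) {
		let player = AVPlayer(url: url)
		let playerController = AVPlayerViewController()
		playerController.player = player
		present(playerController, animated: true) {
			playerController.player?.play()
		}
	}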

I hope this explains how it all fits together. If you feel lazy, the entire codebase is here: https://github.com/kmdarshan/personalizedVideoCoreML
