How to extract metadata from images

In this article we explain how to extract metadata from images so that we can organise a folder (and its subfolders) full of pictures.

Understanding the problem

The problem I want to solve in this series of articles is to have an automated Python script that looks into a folder and its subfolders full of images and organises them by date, moving them into a tidy folder structure (i.e. Year/Month/*.jpg). That is the end goal… my vision. The real issue (or the use case) is that someone in my family (I’m not saying my wife here) has the tendency to shoot an incredible number of pictures per day. There is nothing wrong with that, but if you don’t move them soon into a cloud archive or somewhere else, after a few months (weeks in this case) it can be really hard to get organised. To achieve this we need to write an application that looks at all the images in a folder and organises them by date into freshly created folders. This is possible only by looking at the images’ metadata.

What is image metadata?

Metadata is text information related to the image and embedded into the file itself. It can include details such as when the picture was created, the geolocation, the originator, etc. But let’s look in more detail. We can see some of it just by looking at the properties of a picture on the computer (Command+I on Mac, or right click and Properties on Windows), and we should see something like the image below.

[Image: properties panel of a picture]

As we can see there is a lot of information already, but one important thing I notice here is that the created date is different from the modified one! Looking at this from a logical point of view, the only thing I can think of is that “Created” here means the date the file was actually created on this disk. But that’s just an assumption. Reading here and there to get more information, I found out (it is no great secret) that each image carries a lot of dates (created, edited, modified, accessed, etc.) and of course each one of them has a specific meaning.

To solve my problem I must have access to the date field that represents the true date of creation on the phone/camera, so that I can then move the picture to the proper folder. Do I need to understand what actually happens when we press the shutter button on the phone/camera? No! What I need to know is that all this information travels with the picture from system to system and, more importantly, that it is specified by the EXIF standard. Just in case you want to write your own software for extracting this information, these are great resources: https://www.media.mit.edu/pia/Research/deepview/exif.html and https://exiftool.org/TagNames/EXIF.html and https://web.archive.org/web/20190624045241if_/http://www.cipa.jp:80/std/documents/e/DC-008-Translation-2019-E.pdf
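Going back to the dates shown in the properties panel: here is a minimal sketch, using only the standard library, of how to read the timestamps the operating system keeps for a file. The function name `filesystem_dates` is hypothetical. Note that `st_ctime` is the metadata-change time on Unix and the creation time on Windows, and that none of these values is the moment the picture was taken, since they can change when the file is copied or edited.

```python
import os
from datetime import datetime


def filesystem_dates(path):
    """Return the timestamps the operating system tracks for a path."""
    info = os.stat(path)
    return {
        "modified": datetime.fromtimestamp(info.st_mtime),
        "accessed": datetime.fromtimestamp(info.st_atime),
        # Metadata-change time on Unix, creation time on Windows.
        "changed_or_created": datetime.fromtimestamp(info.st_ctime),
    }


# Inspect the current directory so the example runs anywhere.
for name, value in filesystem_dates(".").items():
    print(f"{name}: {value:%Y-%m-%d %H:%M:%S}")
```

Pointing the function at a picture that was copied from a phone will typically show dates from the day of the copy, which is why the EXIF fields are the ones to trust.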

Reading through the details… the metadata we really need is one of these two:

Tag ID   Tag Name           Writable   Group     Notes
0x9003   DateTimeOriginal   string     ExifIFD   (date/time when original image was taken)
0x9004   CreateDate         string     ExifIFD   (called DateTimeDigitized by the EXIF spec.)

There is a slight difference between these two in terms of definition. DateTimeOriginal represents the date and time the original image data was generated, while DateTimeDigitized is the date and time the image was stored as digital data. Let’s force an assumption here: since these pictures were all taken with a digital device, I expect the two values to be identical or, at most, a fraction of a second apart. Taking that as true (for now), what our application has to do is pretty clear:
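The assumption about the two dates can be checked rather than trusted. The sketch below parses two EXIF date strings, which use the `YYYY:MM:DD HH:MM:SS` layout, and returns the gap between them. The function name `seconds_between` is hypothetical, and the sample values are the ones from the test picture used later in this article.

```python
from datetime import datetime

EXIF_DATE_FORMAT = "%Y:%m:%d %H:%M:%S"


def seconds_between(original, digitized):
    """Return the absolute gap, in seconds, between two EXIF date strings."""
    taken = datetime.strptime(original, EXIF_DATE_FORMAT)
    stored = datetime.strptime(digitized, EXIF_DATE_FORMAT)
    return abs((stored - taken).total_seconds())


print(seconds_between("2020:09:02 12:38:53", "2020:09:02 12:38:53"))  # 0.0
```

These strings only carry whole seconds, so any sub-second difference would not show up here.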

  • Scan a folder and subfolders for images
  • for each one of them, look at the DateTimeOriginal metadata
  • create the Year/Month folder
  • move the picture there (copy for now, I had a really bad experience with automation around pictures that nearly cost me a divorce)
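The first and third steps of this plan can be sketched with the standard library alone. This is only an illustration under my own assumptions: the names `iter_files` and `target_folder` are hypothetical, and the month is written as a two-digit number so that the folders sort naturally.

```python
import os
from datetime import datetime


def iter_files(root):
    """Yield the path of every file under root, subfolders included."""
    for folder, _subfolders, file_names in os.walk(root):
        for file_name in file_names:
            yield os.path.join(folder, file_name)


def target_folder(output_root, taken_on):
    """Build the Year/Month destination folder for a capture date."""
    return os.path.join(output_root, f"{taken_on:%Y}", f"{taken_on:%m}")


# A picture taken on 2 September 2020 would land in <output>/2020/09.
print(target_folder("sorted", datetime(2020, 9, 2, 12, 38, 53)))
```

`os.walk` descends into every subfolder on its own, so no explicit recursion is needed.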

How to get the metadata

The big question is: how do we look at the metadata?

The big answer is: there are countless libraries that do that, but Pillow, the maintained fork of PIL (https://pillow.readthedocs.io/en/stable/), is the one we need for this basic application.

Assuming we have our test image in /tmp/some-picture.jpg, let’s run these few lines of code from the Python interpreter:

>>> from PIL import Image
>>> image_path="/tmp/some-picture.jpg"
>>> img = Image.open(image_path)
>>> meta_available=img._getexif().items()
>>> meta_available
dict_items([(256, 4032), (257, 2268), (296, 2), (34665, 237), (271, 'samsung'), (272, 'SM-G965F'), (305, 'G965FXXUAETG3'), (274, 6), (531, 1), (306, '2020:09:02 12:38:53'), (282, 72.0), (283, 72.0), (36864, b'0220'), (37377, 5.64), (37378, 2.52), (36867, '2020:09:02 12:38:53'), (36868, '2020:09:02 12:38:53'), (37379, 2.51), (37380, 0.0), (37381, 1.16), (37383, 2), (37385, 0), (40960, b'0100'), (37386, 4.3), (40961, 1), (37121, b'\x01\x02\x03\x00'), (40962, 4032), (37520, '0607'), (37521, '0607'), (37522, '0607'), (40963, 2268), (33434, 0.02), (40965, 827), (33437, 2.4), (41729, b'\x01\x00\x00\x00'), (42016, 'I12LLKF00SM'), (34850, 2), (41985, 0), (34855, 125), (41986, 0), (41987, 0), (41988, nan), (41989, 26), (41990, 0), (41992, 0), (41993, 0), (41994, 0)])

Please note the dict_items above was manually changed to remove the GPS coordinates 🙂

As we can see, we already have a lot of information. The entry we need is the one whose key is the decimal value of 0x9003 => DateTimeOriginal => 36867
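The conversion between the hexadecimal tag ID and the decimal key can be verified with plain Python. The constant name `DATE_TIME_ORIGINAL` is just an illustrative choice.

```python
# Python reads hexadecimal literals directly.
DATE_TIME_ORIGINAL = 0x9003

print(DATE_TIME_ORIGINAL)  # 36867
print(int("0x9004", 16))   # 36868
print(hex(36867))          # 0x9003
```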

How did I get to that? Easy: the EXIF specs give the Tag ID in the format 0x…, which means it is a hexadecimal number, and on the page the tags are listed one after the other. If you run the snippet below from the Python CLI, it will give you back all the metadata tags the library knows about with their ID in decimal format, e.g.

- 264 = CellWidth
- 36867 = DateTimeOriginal
- 36868 = DateTimeDigitized

from PIL import ExifTags
for item in ExifTags.TAGS: print(f"- {item} = {ExifTags.TAGS[item]}")

- 1 = InteropIndex
- 11 = ProcessingSoftware
- 254 = NewSubfileType
- 255 = SubfileType
- 256 = ImageWidth
- 257 = ImageLength
- 258 = BitsPerSample
- 259 = Compression
- 262 = PhotometricInterpretation
- 263 = Thresholding
- 264 = CellWidth
- 265 = CellLength
- 266 = FillOrder
- 269 = DocumentName
- 270 = ImageDescription
- 271 = Make
- 272 = Model
- 273 = StripOffsets
- 274 = Orientation
- 277 = SamplesPerPixel
- 278 = RowsPerStrip
- 279 = StripByteCounts
- 280 = MinSampleValue
- 281 = MaxSampleValue
- 282 = XResolution
- 283 = YResolution
- 284 = PlanarConfiguration
- 285 = PageName
- 288 = FreeOffsets
- 289 = FreeByteCounts
- 290 = GrayResponseUnit
- 291 = GrayResponseCurve
- 292 = T4Options
- 293 = T6Options
- 296 = ResolutionUnit
- 297 = PageNumber
- 301 = TransferFunction
- 305 = Software
- 306 = DateTime
- 315 = Artist
- 316 = HostComputer
- 317 = Predictor
- 318 = WhitePoint
- 319 = PrimaryChromaticities
- 320 = ColorMap
- 321 = HalftoneHints
- 322 = TileWidth
- 323 = TileLength
- 324 = TileOffsets
- 325 = TileByteCounts
- 330 = SubIFDs
- 332 = InkSet
- 333 = InkNames
- 334 = NumberOfInks
- 336 = DotRange
- 337 = TargetPrinter
- 338 = ExtraSamples
- 339 = SampleFormat
- 340 = SMinSampleValue
- 341 = SMaxSampleValue
- 342 = TransferRange
- 343 = ClipPath
- 344 = XClipPathUnits
- 345 = YClipPathUnits
- 346 = Indexed
- 347 = JPEGTables
- 351 = OPIProxy
- 512 = JPEGProc
- 513 = JpegIFOffset
- 514 = JpegIFByteCount
- 515 = JpegRestartInterval
- 517 = JpegLosslessPredictors
- 518 = JpegPointTransforms
- 519 = JpegQTables
- 520 = JpegDCTables
- 521 = JpegACTables
- 529 = YCbCrCoefficients
- 530 = YCbCrSubSampling
- 531 = YCbCrPositioning
- 532 = ReferenceBlackWhite
- 700 = XMLPacket
- 4096 = RelatedImageFileFormat
- 4097 = RelatedImageWidth
- 4098 = RelatedImageLength
- 18246 = Rating
- 18249 = RatingPercent
- 32781 = ImageID
- 33421 = CFARepeatPatternDim
- 33422 = CFAPattern
- 33423 = BatteryLevel
- 33432 = Copyright
- 33434 = ExposureTime
- 33437 = FNumber
- 33723 = IPTCNAA
- 34377 = ImageResources
- 34665 = ExifOffset
- 34675 = InterColorProfile
- 34850 = ExposureProgram
- 34852 = SpectralSensitivity
- 34853 = GPSInfo
- 34855 = ISOSpeedRatings
- 34856 = OECF
- 34857 = Interlace
- 34858 = TimeZoneOffset
- 34859 = SelfTimerMode
- 34864 = SensitivityType
- 34865 = StandardOutputSensitivity
- 34866 = RecommendedExposureIndex
- 34867 = ISOSpeed
- 34868 = ISOSpeedLatitudeyyy
- 34869 = ISOSpeedLatitudezzz
- 36864 = ExifVersion
- 36867 = DateTimeOriginal
- 36868 = DateTimeDigitized
- 36880 = OffsetTime
- 36881 = OffsetTimeOriginal
- 36882 = OffsetTimeDigitized
- 37121 = ComponentsConfiguration
- 37122 = CompressedBitsPerPixel
- 37377 = ShutterSpeedValue
- 37378 = ApertureValue
- 37379 = BrightnessValue
- 37380 = ExposureBiasValue
- 37381 = MaxApertureValue
- 37382 = SubjectDistance
- 37383 = MeteringMode
- 37384 = LightSource
- 37385 = Flash
- 37386 = FocalLength
- 37387 = FlashEnergy
- 37388 = SpatialFrequencyResponse
- 37389 = Noise
- 37393 = ImageNumber
- 37394 = SecurityClassification
- 37395 = ImageHistory
- 37396 = SubjectLocation
- 37397 = ExposureIndex
- 37398 = TIFF/EPStandardID
- 37500 = MakerNote
- 37510 = UserComment
- 37520 = SubsecTime
- 37521 = SubsecTimeOriginal
- 37522 = SubsecTimeDigitized
- 37888 = AmbientTemperature
- 37889 = Humidity
- 37890 = Pressure
- 37891 = WaterDepth
- 37892 = Acceleration
- 37893 = CameraElevationAngle
- 40091 = XPTitle
- 40092 = XPComment
- 40093 = XPAuthor
- 40094 = XPKeywords
- 40095 = XPSubject
- 40960 = FlashPixVersion
- 40961 = ColorSpace
- 40962 = ExifImageWidth
- 40963 = ExifImageHeight
- 40964 = RelatedSoundFile
- 40965 = ExifInteroperabilityOffset
- 41483 = FlashEnergy
- 41484 = SpatialFrequencyResponse
- 41486 = FocalPlaneXResolution
- 41487 = FocalPlaneYResolution
- 41488 = FocalPlaneResolutionUnit
- 41492 = SubjectLocation
- 41493 = ExposureIndex
- 41495 = SensingMethod
- 41728 = FileSource
- 41729 = SceneType
- 41730 = CFAPattern
- 41985 = CustomRendered
- 41986 = ExposureMode
- 41987 = WhiteBalance
- 41988 = DigitalZoomRatio
- 41989 = FocalLengthIn35mmFilm
- 41990 = SceneCaptureType
- 41991 = GainControl
- 41992 = Contrast
- 41993 = Saturation
- 41994 = Sharpness
- 41995 = DeviceSettingDescription
- 41996 = SubjectDistanceRange
- 42016 = ImageUniqueID
- 42032 = CameraOwnerName
- 42033 = BodySerialNumber
- 42034 = LensSpecification
- 42035 = LensMake
- 42036 = LensModel
- 42037 = LensSerialNumber
- 42080 = CompositeImage
- 42081 = CompositeImageCount
- 42082 = CompositeImageExposureTimes
- 42240 = Gamma
- 50341 = PrintImageMatching
- 50706 = DNGVersion
- 50707 = DNGBackwardVersion
- 50708 = UniqueCameraModel
- 50709 = LocalizedCameraModel
- 50710 = CFAPlaneColor
- 50711 = CFALayout
- 50712 = LinearizationTable
- 50713 = BlackLevelRepeatDim
- 50714 = BlackLevel
- 50715 = BlackLevelDeltaH
- 50716 = BlackLevelDeltaV
- 50717 = WhiteLevel
- 50718 = DefaultScale
- 50719 = DefaultCropOrigin
- 50720 = DefaultCropSize
- 50721 = ColorMatrix1
- 50722 = ColorMatrix2
- 50723 = CameraCalibration1
- 50724 = CameraCalibration2
- 50725 = ReductionMatrix1
- 50726 = ReductionMatrix2
- 50727 = AnalogBalance
- 50728 = AsShotNeutral
- 50729 = AsShotWhiteXY
- 50730 = BaselineExposure
- 50731 = BaselineNoise
- 50732 = BaselineSharpness
- 50733 = BayerGreenSplit
- 50734 = LinearResponseLimit
- 50735 = CameraSerialNumber
- 50736 = LensInfo
- 50737 = ChromaBlurRadius
- 50738 = AntiAliasStrength
- 50739 = ShadowScale
- 50740 = DNGPrivateData
- 50741 = MakerNoteSafety
- 50778 = CalibrationIlluminant1
- 50779 = CalibrationIlluminant2
- 50780 = BestQualityScale
- 50781 = RawDataUniqueID
- 50827 = OriginalRawFileName
- 50828 = OriginalRawFileData
- 50829 = ActiveArea
- 50830 = MaskedAreas
- 50831 = AsShotICCProfile
- 50832 = AsShotPreProfileMatrix
- 50833 = CurrentICCProfile
- 50834 = CurrentPreProfileMatrix
- 50879 = ColorimetricReference
- 50931 = CameraCalibrationSignature
- 50932 = ProfileCalibrationSignature
- 50934 = AsShotProfileName
- 50935 = NoiseReductionApplied
- 50936 = ProfileName
- 50937 = ProfileHueSatMapDims
- 50938 = ProfileHueSatMapData1
- 50939 = ProfileHueSatMapData2
- 50940 = ProfileToneCurve
- 50941 = ProfileEmbedPolicy
- 50942 = ProfileCopyright
- 50964 = ForwardMatrix1
- 50965 = ForwardMatrix2
- 50966 = PreviewApplicationName
- 50967 = PreviewApplicationVersion
- 50968 = PreviewSettingsName
- 50969 = PreviewSettingsDigest
- 50970 = PreviewColorSpace
- 50971 = PreviewDateTime
- 50972 = RawImageDigest
- 50973 = OriginalRawFileDigest
- 50974 = SubTileBlockSize
- 50975 = RowInterleaveFactor
- 50981 = ProfileLookTableDims
- 50982 = ProfileLookTableData
- 51008 = OpcodeList1
- 51009 = OpcodeList2
- 51022 = OpcodeList3
- 51041 = NoiseProfile

As you can see, all the metadata tags are there; we only need to map them back to the ones available in the picture.

First python script

This first Python script just checks whether we are able to retrieve the image tags:

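import sys
from PIL import Image, ExifTags

image = sys.argv[1]

img = Image.open(image)

# _getexif() returns None when the file carries no EXIF data
exif = img._getexif() or {}
for tag, value in exif.items():
    # fall back to the numeric ID for tags the library has no name for
    print(f"{tag} = {ExifTags.TAGS.get(tag, tag)} = {value}")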

This will produce something similar to

...
274 = Orientation = 6
531 = YCbCrPositioning = 1
306 = DateTime = 2020:09:02 12:38:53
282 = XResolution = 72.0
283 = YResolution = 72.0
36864 = ExifVersion = b'0220'
37377 = ShutterSpeedValue = 5.64
37378 = ApertureValue = 2.52
36867 = DateTimeOriginal = 2020:09:02 12:38:53
36868 = DateTimeDigitized = 2020:09:02 12:38:53
37379 = BrightnessValue = 2.51
37380 = ExposureBiasValue = 0.0
37381 = MaxApertureValue = 1.16
37383 = MeteringMode = 2
37385 = Flash = 0
40960 = FlashPixVersion = b'0100'
37386 = FocalLength = 4.3
37510 = UserComment = b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
...
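With the tags in hand, picking the capture date becomes a dictionary lookup. The sketch below works on a plain dict shaped like the one returned in the interpreter session earlier (decimal tag ID as key). The fallback order, DateTimeOriginal, then DateTimeDigitized, then DateTime, is my own assumption, and the function name `capture_date` is hypothetical.

```python
from datetime import datetime

# Decimal tag IDs in order of preference (the order is an assumption).
DATE_TAGS = (36867, 36868, 306)  # DateTimeOriginal, DateTimeDigitized, DateTime


def capture_date(exif):
    """Return the first parseable date found in an EXIF dict, or None."""
    for tag in DATE_TAGS:
        raw = exif.get(tag)
        if not isinstance(raw, str):
            continue
        try:
            return datetime.strptime(raw.strip(), "%Y:%m:%d %H:%M:%S")
        except ValueError:
            continue  # malformed value, try the next tag
    return None


sample = {306: "2020:09:02 12:38:53", 36867: "2020:09:02 12:38:53"}
print(capture_date(sample))  # 2020-09-02 12:38:53
```

Returning `None` instead of raising leaves the caller free to decide what to do with pictures that carry no usable date.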

At this stage we have all we need. Below is the first version of the algorithm, which just accepts a path and scans it for files.

Second python script

This is an improvement on the previous one: it scans a folder instead of a single file. The get_img_tags function is defined here but only put to use in the next version.

import os
import sys
from PIL import Image, ExifTags

folder_to_scan = sys.argv[1]

def list_folder(path):
    # return the names of the files sitting directly inside path
    if not os.path.isdir(path):
        print(f"ERROR: [{path}] is not a valid folder")
        sys.exit(1)
    folder_content = next(os.walk(path))[2]
    if len(folder_content):
        return folder_content
    print(f"WARNING: [{path}] is empty")
    sys.exit()

def get_img_tags(image_path):
    # map tag names to values; returns {} if the file cannot be read
    tags_obj = {}
    try:
        img = Image.open(image_path)
        for tag, value in (img._getexif() or {}).items():
            tags_obj[ExifTags.TAGS.get(tag, tag)] = value
    except Exception as e:
        print(e)
    return tags_obj

folder_elements = list_folder(folder_to_scan)
for i in folder_elements:
    print(i)
    print("======")

Third python script

This is the final version. It can still be vastly improved, but it does the job. It takes two parameters as input: the folder that holds all the images and the output folder. Starting from FOLDER_TO_SCAN, it builds a dictionary whose keys are dates (e.g. Apr_2021), each associated with a list of all the files to copy. Once this dictionary is built, copying the files only requires a loop through this object (done by the move_files function).

The dictionary structure will look like

{
    "month_year": [],
    ...
}

Example:

{
    "Apr_2021": [
        "file1.jpg",
        "file2.jpg",
        "file3.jpg",
    ]
}
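A dictionary of this shape can be built in a few lines with `collections.defaultdict`. This is a standalone sketch rather than the script's own code: the function name `group_by_month` is hypothetical, and it takes already parsed dates (or `None` for files without one) as input.

```python
from collections import defaultdict
from datetime import datetime


def group_by_month(dates_by_file):
    """Group file names under keys such as 'Apr_2021'; undated files go to 'No_date'."""
    groups = defaultdict(list)
    for file_name, taken_on in dates_by_file.items():
        key = taken_on.strftime("%b_%Y") if taken_on else "No_date"
        groups[key].append(file_name)
    return dict(groups)


example = {
    "file1.jpg": datetime(2021, 4, 6),
    "file2.jpg": datetime(2021, 4, 11),
    "clip.mp4": None,
}
print(group_by_month(example))
```

Keep in mind that `%b` follows the current locale, so the month abbreviation reads "Apr" only when the process uses an English (or the default C) locale.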


Please note that at the moment this script works only with image formats, only looks at the top level of the input folder, COPIES the files rather than moving them, and is not production ready.

import os
import shutil
import sys
from datetime import datetime
from PIL import Image, ExifTags

DTO = "DateTimeOriginal"

FOLDER_TO_WRITE = ""
FOLDER_TO_SCAN = ""

def list_folder():
    # return the names of the files sitting directly inside FOLDER_TO_SCAN
    if not os.path.isdir(FOLDER_TO_SCAN):
        print(f"ERROR: [{FOLDER_TO_SCAN}] is not a valid folder")
        sys.exit(1)
    folder_content = next(os.walk(FOLDER_TO_SCAN))[2]
    if len(folder_content):
        return folder_content
    print(f"WARNING: [{FOLDER_TO_SCAN}] is empty")
    sys.exit()

def get_img_tags(image_path):
    # map tag names to values; returns {} if the file cannot be read
    tags_obj = {}
    try:
        img = Image.open(image_path)
        for tag, value in (img._getexif() or {}).items():
            tags_obj[ExifTags.TAGS.get(tag, tag)] = value
    except Exception as e:
        print(e)
    return tags_obj

def get_exif_tags(elements, func_to_exec):
    # build {file_name: tags} for every file in the scanned folder
    obj = {}
    for item in elements:
        obj[item] = func_to_exec(os.path.join(FOLDER_TO_SCAN, item))
    return obj

def get_date_string(data_obj):
    # turn "2021:04:06 12:05:50" into "Apr_2021"; None if missing or malformed
    if not data_obj:
        return None
    try:
        date_value = datetime.strptime(data_obj.split()[0], "%Y:%m:%d")
    except (ValueError, IndexError, TypeError):
        return None
    return "{}_{}".format(date_value.strftime('%b'), date_value.year)

# create the dates groups of images
def get_grouped_images(images_list):
    data_obj = {}
    for item in images_list:
        date_string = get_date_string(images_list[item].get(DTO, None))
        if not date_string:
            date_string = "No_date"
        if date_string not in data_obj:
            print("Adding {} to the list for {}".format(date_string, item))
            data_obj[date_string] = []
            print("******")
        data_obj[date_string].append(item)
    return data_obj

# create the folders in folder_to_write/Month_Year
def create_folders(data_obj):
    for item in data_obj:
        folder = os.path.join(FOLDER_TO_WRITE, item)
        if not os.path.isdir(folder):
            print("Creating {}".format(folder))
            os.makedirs(folder)
        else:
            print("{} already exists".format(folder))

# copy (not move) every file into its Month_Year folder
def move_files(data_obj):
    for item in data_obj:
        new_path = os.path.join(FOLDER_TO_WRITE, item)
        for file_name in data_obj[item]:
            print("Copying .... {} in {}".format(file_name, new_path))
            # shutil.copy2 copes with spaces in names and keeps the timestamps
            shutil.copy2(os.path.join(FOLDER_TO_SCAN, file_name), new_path)

if __name__ == "__main__":
    FOLDER_TO_SCAN = sys.argv[1]
    FOLDER_TO_WRITE = sys.argv[2]
    folder_elements = list_folder()
    images_list = get_exif_tags(folder_elements, get_img_tags)
    data_obj = get_grouped_images(images_list)
    create_folders(data_obj)
    move_files(data_obj)

Here is an example run:

$> python test2.py test_script new-folder

cannot identify image file 'test_script/20180514064433.mp4'
cannot identify image file 'test_script/20200923_075122.mp4'
cannot identify image file 'test_script/20200904_200350.mp4'
cannot identify image file 'test_script/20210411_155452.mp4'
cannot identify image file 'test_script/20200912_095821.mp4'
cannot identify image file 'test_script/20210411_154427.mp4'
cannot identify image file 'test_script/20200921_201115.mp4'
Adding Apr_2021 to the list for 20210406_120550.jpg
******
Adding Oct_2020 to the list for 20201018_083107.jpg
******
Adding No_date to the list for 20210407_103853.mp4
******
Adding Sep_2020 to the list for 20200912_082208.jpg
******
Creating new-folder/Apr_2021
Creating new-folder/Oct_2020
Creating new-folder/No_date
Creating new-folder/Sep_2020
Copying .... 20200919_084224.jpg in new-folder/Sep_2020
Copying .... 20200910_163210.jpg in new-folder/Sep_2020
Copying .... 20200909_204355.jpg in new-folder/Sep_2020
Copying .... 20200921_075240.jpg in new-folder/Sep_2020
Copying .... 20200920_075614.jpg in new-folder/Sep_2020
Copying .... 20200906_130835.jpg in new-folder/Sep_2020
Copying .... 20200923_075047.jpg in new-folder/Sep_2020
Copying .... 20200902_123853.jpg in new-folder/Sep_2020
Copying .... 20200912_082200.jpg in new-folder/Sep_2020
Copying .... 20200923_075045.jpg in new-folder/Sep_2020
Copying .... 20200927_103155.jpg in new-folder/Sep_2020
Copying .... 20200910_163259.jpg in new-folder/Sep_2020
Copying .... 20200924_192522.jpg in new-folder/Sep_2020
Copying .... 20200922_162745.jpg in new-folder/Sep_2020
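The "cannot identify image file" lines in the run above come from videos being handed to the image library. One possible improvement, sketched here under my own assumptions, is to split the file names by extension before opening anything. The extension set and the name `split_by_type` are hypothetical and should be adjusted to the formats in your own library.

```python
import os

# Assumed set of extensions worth opening as images.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def split_by_type(file_names):
    """Separate likely images from everything else using the file extension."""
    images, others = [], []
    for file_name in file_names:
        extension = os.path.splitext(file_name)[1].lower()
        (images if extension in IMAGE_EXTENSIONS else others).append(file_name)
    return images, others


images, others = split_by_type(["20200902_123853.jpg", "20200904_200350.mp4"])
print(images)  # ['20200902_123853.jpg']
print(others)  # ['20200904_200350.mp4']
```

The extension is only a hint about the content, so keeping the try/except around the actual open is still a good idea.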

Hope you enjoyed it and, if so, please share and help us grow!
