How to extract metadata from images
In this article we explain how to extract metadata from images so that we can organise a folder (and its subfolders) full of images.
- Understanding the problem
- What are the images metadata
- How to get the metadata
- First python script
- Second python script
- Third python script
Understanding the problem
The problem I wanted to solve in this series of articles is to have an automated Python script that looks into a folder and its subfolders full of images and organises them by date, moving them into a well-organised folder structure (i.e. Year/Month/*.jpg). That is the end goal… my vision. The real issue (or the use case) is that someone in my family (I’m not saying my wife here) has the tendency to shoot an incredible number of pictures per day. There is nothing wrong with that, but if you don’t move them soon into a cloud archive or somewhere else, after a few months (weeks in this case) it can be really hard to get organised. To achieve this we need to understand how to write an application that looks at all the images in a folder and organises them by date into freshly created folders. This is possible only by looking at the images’ metadata.
What are the images metadata?
Metadata is text information related to the image and embedded into the file itself. It can include details such as when the image was created, its geolocation, its originator, etc. But let’s have a look in more detail. We can see some of it just by looking at the properties of a picture on the computer (Command+I on Mac, or right click and Properties on Windows), and we should see something like the image below.
As we can see, there is a lot of information already, but one important thing I notice here is that the created date is different from the modified one! Looking at this from a logical point of view, the only thing I can think of is that “Created” here means the date the image file was actually created on the disk. But that’s just an assumption. Reading here and there, trying to get more information, I found out that (it is no great secret) each image has a lot of dates (created, edited, modified, accessed, etc.) and of course each one of them has a specific meaning. To solve my problem I must have access to the date field that represents the true date of creation on the phone/camera, so that I can then move the picture to the proper folder. To get access to that, do I need to understand what actually happens when we press the “click” button on the phone/camera to take the picture? No! What I need to know is that all this information travels with the picture from system to system and, more importantly, that it is specified by the EXIF standard. Just in case you want to write your own software for extracting this information, these are great resources: https://www.media.mit.edu/pia/Research/deepview/exif.html and https://exiftool.org/TagNames/EXIF.html and https://web.archive.org/web/20190624045241if_/http://www.cipa.jp:80/std/documents/e/DC-008-Translation-2019-E.pdf
Reading through the details, the metadata we really need is one of these two:
Tag ID | Tag Name | Writable | Group | Notes |
0x9003 | DateTimeOriginal | string | ExifIFD | (date/time when original image was taken) |
0x9004 | CreateDate | string | ExifIFD | (called DateTimeDigitized by the EXIF spec.) |
There is a slight difference between these two in terms of definition: DateTimeOriginal represents the date and time the original image data was generated, while DateTimeDigitized (CreateDate in the table above) is the date and time the image was stored as digital data. Let’s make an assumption here: considering that these pictures were all taken with a digital device, I expect the two values to be identical, or to differ by a tiny amount at most. Taking that as true (for now), how our application works is pretty clear:
- Scan a folder and subfolders for images
- for each one of them, look at the DateTimeOriginal metadata
- create the Year/Month folder
- move the picture there (a copy for now: I had a really bad experience with automation around pictures that nearly cost me a divorce)
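The steps above hinge on turning the DateTimeOriginal string into a destination folder. Here is a minimal sketch of that conversion, assuming the value follows the EXIF date format `YYYY:MM:DD HH:MM:SS` and assuming a Year/Month layout; the helper name `target_folder` and the default root `sorted` are hypothetical, chosen only for illustration.

```python
from datetime import datetime
from pathlib import PurePosixPath

# EXIF stores dates with colons in the date part, e.g. "2020:09:02 12:38:53"
EXIF_DATE_FORMAT = "%Y:%m:%d %H:%M:%S"


def target_folder(date_time_original, root="sorted"):
    """Return root/Year/Month for an EXIF date string (hypothetical helper)."""
    taken = datetime.strptime(date_time_original, EXIF_DATE_FORMAT)
    return PurePosixPath(root) / f"{taken.year:04d}" / f"{taken.month:02d}"


print(target_folder("2020:09:02 12:38:53"))  # sorted/2020/09
```

Zero-padding the month keeps the folders sorted chronologically in a file browser.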
How to get the metadata
The big question is: how do we look at the metadata?
The big answer is: there are countless libraries that do that, but PIL, in its maintained fork Pillow (https://pillow.readthedocs.io/en/stable/), is the one we need for this basic application.
Assuming we have our test image in /tmp/some-picture.jpg, let’s run these few lines of code from the Python interpreter:
>>> from PIL import Image
>>> image_path="/tmp/some-picture.jpg"
>>> img = Image.open(image_path)
>>> meta_available=img._getexif().items()
>>> meta_available
dict_items([(256, 4032), (257, 2268), (296, 2), (34665, 237), (271, 'samsung'), (272, 'SM-G965F'), (305, 'G965FXXUAETG3'), (274, 6), (531, 1), (306, '2020:09:02 12:38:53'), (282, 72.0), (283, 72.0), (36864, b'0220'), (37377, 5.64), (37378, 2.52), (36867, '2020:09:02 12:38:53'), (36868, '2020:09:02 12:38:53'), (37379, 2.51), (37380, 0.0), (37381, 1.16), (37383, 2), (37385, 0), (40960, b'0100'), (37386, 4.3), (40961, 1), (37121, b'\x01\x02\x03\x00'), (40962, 4032), (37520, '0607'), (37521, '0607'), (37522, '0607'), (40963, 2268), (33434, 0.02), (40965, 827), (33437, 2.4), (41729, b'\x01\x00\x00\x00'), (42016, 'I12LLKF00SM'), (34850, 2), (41985, 0), (34855, 125), (41986, 0), (41987, 0), (41988, nan), (41989, 26), (41990, 0), (41992, 0), (41993, 0), (41994, 0)])
Please note the dict_items above was manually changed to remove the GPS coordinates 🙂
As we can see, we already have a lot of information. The entry we need to look at is the one whose key is the decimal value of 0x9003: 0x9003 => DateTimeOriginal => 36867
How did I get to that? Easy: the EXIF specs give the Tag ID in the format 0x…, which means it is a hexadecimal number, and on the page the tags are simply listed one after the other. Running the two lines below from the Python CLI gives back all the metadata tags we have access to, with their IDs in decimal format, for example:
– 264 = CellWidth
– 36867 = DateTimeOriginal
– 36868 = DateTimeDigitized
from PIL import ExifTags
for item in ExifTags.TAGS: print(f"- {item} = {ExifTags.TAGS[item]}")
- 1 = InteropIndex
- 11 = ProcessingSoftware
- 254 = NewSubfileType
- 255 = SubfileType
- 256 = ImageWidth
- 257 = ImageLength
- 258 = BitsPerSample
- 259 = Compression
- 262 = PhotometricInterpretation
- 263 = Thresholding
- 264 = CellWidth
- 265 = CellLength
- 266 = FillOrder
- 269 = DocumentName
- 270 = ImageDescription
- 271 = Make
- 272 = Model
- 273 = StripOffsets
- 274 = Orientation
- 277 = SamplesPerPixel
- 278 = RowsPerStrip
- 279 = StripByteCounts
- 280 = MinSampleValue
- 281 = MaxSampleValue
- 282 = XResolution
- 283 = YResolution
- 284 = PlanarConfiguration
- 285 = PageName
- 288 = FreeOffsets
- 289 = FreeByteCounts
- 290 = GrayResponseUnit
- 291 = GrayResponseCurve
- 292 = T4Options
- 293 = T6Options
- 296 = ResolutionUnit
- 297 = PageNumber
- 301 = TransferFunction
- 305 = Software
- 306 = DateTime
- 315 = Artist
- 316 = HostComputer
- 317 = Predictor
- 318 = WhitePoint
- 319 = PrimaryChromaticities
- 320 = ColorMap
- 321 = HalftoneHints
- 322 = TileWidth
- 323 = TileLength
- 324 = TileOffsets
- 325 = TileByteCounts
- 330 = SubIFDs
- 332 = InkSet
- 333 = InkNames
- 334 = NumberOfInks
- 336 = DotRange
- 337 = TargetPrinter
- 338 = ExtraSamples
- 339 = SampleFormat
- 340 = SMinSampleValue
- 341 = SMaxSampleValue
- 342 = TransferRange
- 343 = ClipPath
- 344 = XClipPathUnits
- 345 = YClipPathUnits
- 346 = Indexed
- 347 = JPEGTables
- 351 = OPIProxy
- 512 = JPEGProc
- 513 = JpegIFOffset
- 514 = JpegIFByteCount
- 515 = JpegRestartInterval
- 517 = JpegLosslessPredictors
- 518 = JpegPointTransforms
- 519 = JpegQTables
- 520 = JpegDCTables
- 521 = JpegACTables
- 529 = YCbCrCoefficients
- 530 = YCbCrSubSampling
- 531 = YCbCrPositioning
- 532 = ReferenceBlackWhite
- 700 = XMLPacket
- 4096 = RelatedImageFileFormat
- 4097 = RelatedImageWidth
- 4098 = RelatedImageLength
- 18246 = Rating
- 18249 = RatingPercent
- 32781 = ImageID
- 33421 = CFARepeatPatternDim
- 33422 = CFAPattern
- 33423 = BatteryLevel
- 33432 = Copyright
- 33434 = ExposureTime
- 33437 = FNumber
- 33723 = IPTCNAA
- 34377 = ImageResources
- 34665 = ExifOffset
- 34675 = InterColorProfile
- 34850 = ExposureProgram
- 34852 = SpectralSensitivity
- 34853 = GPSInfo
- 34855 = ISOSpeedRatings
- 34856 = OECF
- 34857 = Interlace
- 34858 = TimeZoneOffset
- 34859 = SelfTimerMode
- 34864 = SensitivityType
- 34865 = StandardOutputSensitivity
- 34866 = RecommendedExposureIndex
- 34867 = ISOSpeed
- 34868 = ISOSpeedLatitudeyyy
- 34869 = ISOSpeedLatitudezzz
- 36864 = ExifVersion
- 36867 = DateTimeOriginal
- 36868 = DateTimeDigitized
- 36880 = OffsetTime
- 36881 = OffsetTimeOriginal
- 36882 = OffsetTimeDigitized
- 37121 = ComponentsConfiguration
- 37122 = CompressedBitsPerPixel
- 37377 = ShutterSpeedValue
- 37378 = ApertureValue
- 37379 = BrightnessValue
- 37380 = ExposureBiasValue
- 37381 = MaxApertureValue
- 37382 = SubjectDistance
- 37383 = MeteringMode
- 37384 = LightSource
- 37385 = Flash
- 37386 = FocalLength
- 37387 = FlashEnergy
- 37388 = SpatialFrequencyResponse
- 37389 = Noise
- 37393 = ImageNumber
- 37394 = SecurityClassification
- 37395 = ImageHistory
- 37396 = SubjectLocation
- 37397 = ExposureIndex
- 37398 = TIFF/EPStandardID
- 37500 = MakerNote
- 37510 = UserComment
- 37520 = SubsecTime
- 37521 = SubsecTimeOriginal
- 37522 = SubsecTimeDigitized
- 37888 = AmbientTemperature
- 37889 = Humidity
- 37890 = Pressure
- 37891 = WaterDepth
- 37892 = Acceleration
- 37893 = CameraElevationAngle
- 40091 = XPTitle
- 40092 = XPComment
- 40093 = XPAuthor
- 40094 = XPKeywords
- 40095 = XPSubject
- 40960 = FlashPixVersion
- 40961 = ColorSpace
- 40962 = ExifImageWidth
- 40963 = ExifImageHeight
- 40964 = RelatedSoundFile
- 40965 = ExifInteroperabilityOffset
- 41483 = FlashEnergy
- 41484 = SpatialFrequencyResponse
- 41486 = FocalPlaneXResolution
- 41487 = FocalPlaneYResolution
- 41488 = FocalPlaneResolutionUnit
- 41492 = SubjectLocation
- 41493 = ExposureIndex
- 41495 = SensingMethod
- 41728 = FileSource
- 41729 = SceneType
- 41730 = CFAPattern
- 41985 = CustomRendered
- 41986 = ExposureMode
- 41987 = WhiteBalance
- 41988 = DigitalZoomRatio
- 41989 = FocalLengthIn35mmFilm
- 41990 = SceneCaptureType
- 41991 = GainControl
- 41992 = Contrast
- 41993 = Saturation
- 41994 = Sharpness
- 41995 = DeviceSettingDescription
- 41996 = SubjectDistanceRange
- 42016 = ImageUniqueID
- 42032 = CameraOwnerName
- 42033 = BodySerialNumber
- 42034 = LensSpecification
- 42035 = LensMake
- 42036 = LensModel
- 42037 = LensSerialNumber
- 42080 = CompositeImage
- 42081 = CompositeImageCount
- 42082 = CompositeImageExposureTimes
- 42240 = Gamma
- 50341 = PrintImageMatching
- 50706 = DNGVersion
- 50707 = DNGBackwardVersion
- 50708 = UniqueCameraModel
- 50709 = LocalizedCameraModel
- 50710 = CFAPlaneColor
- 50711 = CFALayout
- 50712 = LinearizationTable
- 50713 = BlackLevelRepeatDim
- 50714 = BlackLevel
- 50715 = BlackLevelDeltaH
- 50716 = BlackLevelDeltaV
- 50717 = WhiteLevel
- 50718 = DefaultScale
- 50719 = DefaultCropOrigin
- 50720 = DefaultCropSize
- 50721 = ColorMatrix1
- 50722 = ColorMatrix2
- 50723 = CameraCalibration1
- 50724 = CameraCalibration2
- 50725 = ReductionMatrix1
- 50726 = ReductionMatrix2
- 50727 = AnalogBalance
- 50728 = AsShotNeutral
- 50729 = AsShotWhiteXY
- 50730 = BaselineExposure
- 50731 = BaselineNoise
- 50732 = BaselineSharpness
- 50733 = BayerGreenSplit
- 50734 = LinearResponseLimit
- 50735 = CameraSerialNumber
- 50736 = LensInfo
- 50737 = ChromaBlurRadius
- 50738 = AntiAliasStrength
- 50739 = ShadowScale
- 50740 = DNGPrivateData
- 50741 = MakerNoteSafety
- 50778 = CalibrationIlluminant1
- 50779 = CalibrationIlluminant2
- 50780 = BestQualityScale
- 50781 = RawDataUniqueID
- 50827 = OriginalRawFileName
- 50828 = OriginalRawFileData
- 50829 = ActiveArea
- 50830 = MaskedAreas
- 50831 = AsShotICCProfile
- 50832 = AsShotPreProfileMatrix
- 50833 = CurrentICCProfile
- 50834 = CurrentPreProfileMatrix
- 50879 = ColorimetricReference
- 50931 = CameraCalibrationSignature
- 50932 = ProfileCalibrationSignature
- 50934 = AsShotProfileName
- 50935 = NoiseReductionApplied
- 50936 = ProfileName
- 50937 = ProfileHueSatMapDims
- 50938 = ProfileHueSatMapData1
- 50939 = ProfileHueSatMapData2
- 50940 = ProfileToneCurve
- 50941 = ProfileEmbedPolicy
- 50942 = ProfileCopyright
- 50964 = ForwardMatrix1
- 50965 = ForwardMatrix2
- 50966 = PreviewApplicationName
- 50967 = PreviewApplicationVersion
- 50968 = PreviewSettingsName
- 50969 = PreviewSettingsDigest
- 50970 = PreviewColorSpace
- 50971 = PreviewDateTime
- 50972 = RawImageDigest
- 50973 = OriginalRawFileDigest
- 50974 = SubTileBlockSize
- 50975 = RowInterleaveFactor
- 50981 = ProfileLookTableDims
- 50982 = ProfileLookTableData
- 51008 = OpcodeList1
- 51009 = OpcodeList2
- 51022 = OpcodeList3
- 51041 = NoiseProfile
As you can see, all the metadata tags are there; we only need to map them back to the ones available in the picture.
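To make the hexadecimal-to-decimal step concrete, here is a small sketch that uses only the standard library. The two dictionaries are hand-copied subsets of the sample output and of the tag list shown earlier, and `with_names` is a hypothetical helper, not part of Pillow.

```python
# Tag IDs are written in hexadecimal in the EXIF tables, but the
# dictionaries shown earlier are keyed by the same numbers as integers.
DATE_TIME_ORIGINAL = 0x9003
DATE_TIME_DIGITIZED = 0x9004

print(DATE_TIME_ORIGINAL)   # 36867
print(int("0x9004", 16))    # 36868
print(hex(36867))           # 0x9003

# A few entries hand-copied from the sample picture's metadata
sample_exif = {271: "samsung", 272: "SM-G965F", 36867: "2020:09:02 12:38:53"}
# The matching names, hand-copied from the tag list
tag_names = {271: "Make", 272: "Model", 36867: "DateTimeOriginal"}


def with_names(raw, names):
    """Replace numeric tag IDs with names, keeping unknown IDs as they are."""
    return {names.get(tag_id, tag_id): value for tag_id, value in raw.items()}


print(with_names(sample_exif, tag_names)["DateTimeOriginal"])  # 2020:09:02 12:38:53
```

Falling back to the numeric ID means a tag that is missing from the name table does not raise an error.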
First python script
This first python script is just to evaluate if we are able to retrieve the image tags
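import sys
from PIL import Image, ExifTags

# Path of the image to inspect, passed as the first command-line argument
image = sys.argv[1]
img = Image.open(image)
# _getexif() maps numeric tag IDs to values (it returns None when the image has no EXIF data)
for tag, value in img._getexif().items():
    # Fall back to the numeric ID for tags missing from ExifTags.TAGS
    print(f"{tag} = {ExifTags.TAGS.get(tag, tag)} = {value}")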
This will produce something similar to
...
274 = Orientation = 6
531 = YCbCrPositioning = 1
306 = DateTime = 2020:09:02 12:38:53
282 = XResolution = 72.0
283 = YResolution = 72.0
36864 = ExifVersion = b'0220'
37377 = ShutterSpeedValue = 5.64
37378 = ApertureValue = 2.52
36867 = DateTimeOriginal = 2020:09:02 12:38:53
36868 = DateTimeDigitized = 2020:09:02 12:38:53
37379 = BrightnessValue = 2.51
37380 = ExposureBiasValue = 0.0
37381 = MaxApertureValue = 1.16
37383 = MeteringMode = 2
37385 = Flash = 0
40960 = FlashPixVersion = b'0100'
37386 = FocalLength = 4.3
37510 = UserComment = b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
...
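Given a dictionary of tag names like the one printed above, picking the date to sort by could look like the sketch below. The fallback order is an assumption on my side: DateTimeOriginal first, then DateTimeDigitized, then DateTime, which usually reflects the last change to the file and is therefore the weakest candidate. `pick_capture_date` is a hypothetical helper name.

```python
# Candidate tags, from most to least trustworthy (assumed order)
DATE_TAGS_BY_PRIORITY = ("DateTimeOriginal", "DateTimeDigitized", "DateTime")


def pick_capture_date(tags):
    """Return the first non-empty date among the candidate tags, else None."""
    for name in DATE_TAGS_BY_PRIORITY:
        value = tags.get(name)
        if value:
            return value
    return None


tags = {"DateTime": "2020:09:03 08:00:00", "DateTimeOriginal": "2020:09:02 12:38:53"}
print(pick_capture_date(tags))  # 2020:09:02 12:38:53
```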
At this stage we have all we need. Below is the first version of the algorithm, which just accepts a path and scans it for files.
Second python script
This is an improvement on the previous one: it scans a folder instead of a single file.
import os
import sys
from PIL import Image, ExifTags

folder_to_scan = sys.argv[1]

def list_folder(path):
    """Return the names of the files directly inside path."""
    if not os.path.isdir(path):
        print(f"ERROR: [{path}] is not a valid folder")
        sys.exit(1)
    # next(os.walk(path)) yields (dirpath, dirnames, filenames) for the top level only
    folder_content = next(os.walk(path))[2]
    if len(folder_content):
        return folder_content
    print(f"WARNING: [{path}] is empty")
    sys.exit()

def get_img_tags(image_path):
    """Return a {tag name: value} dict for one image (empty on failure)."""
    tags_obj = {}
    try:
        img = Image.open(image_path)
        for tag, value in img._getexif().items():
            # Fall back to the numeric ID for tags missing from ExifTags.TAGS
            tags_obj[ExifTags.TAGS.get(tag, tag)] = value
    except Exception as e:
        print(e)
    return tags_obj

folder_elements = list_folder(folder_to_scan)
for i in folder_elements:
    print(i)
    print("======")
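Because `next(os.walk(path))[2]` returns only the files at the top level, subfolders are not visited yet. Below is a minimal sketch of a recursive scan, assuming we only care about a handful of extensions; the extension set and the helper name `find_images` are assumptions for illustration.

```python
import os

# Extensions treated as images (an assumption; extend as needed)
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def find_images(root):
    """Return the paths of all image files under root, subfolders included."""
    found = []
    for current_dir, _subdirs, file_names in os.walk(root):
        for file_name in sorted(file_names):
            extension = os.path.splitext(file_name)[1].lower()
            if extension in IMAGE_EXTENSIONS:
                found.append(os.path.join(current_dir, file_name))
    return found
```

Filtering by extension is only a cheap first pass: a file can still fail to open as an image, so the try/except around `Image.open` remains useful.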
Third python script
This is the final version. It can still be vastly improved, but it does the job. It takes 2 parameters as input: the folder in which we have all the images, and the output folder. Starting from FOLDER_TO_SCAN, it builds a dictionary whose keys are dates (i.e. Apr_2021), each associated with a list of all the files to copy. Once this dictionary is built, copying the files just requires a loop through this object (done by the move_files function).
The dictionary structure will look like
{
"month_year": [],
...
}
Example:
{
"Apr_2021": [
"file1.jpg",
"file2.jpg",
"file3.jpg",
]
}
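As a compact illustration of how such a dictionary can be built, here is a sketch using `collections.defaultdict`. It assumes the input maps each file name to its EXIF date string (or None when no date was found); `group_by_month` is a hypothetical name and this is not the script's own function.

```python
from collections import defaultdict
from datetime import datetime


def group_by_month(dates_by_file):
    """Group file names under keys such as 'Apr_2021', or 'No_date'."""
    groups = defaultdict(list)
    for file_name, date_string in dates_by_file.items():
        if date_string:
            # Keep only the date part, e.g. "2021:04:06"
            taken = datetime.strptime(date_string.split()[0], "%Y:%m:%d")
            key = taken.strftime("%b_%Y")
        else:
            key = "No_date"
        groups[key].append(file_name)
    return dict(groups)


print(group_by_month({
    "file1.jpg": "2021:04:06 12:05:50",
    "file2.jpg": "2021:04:07 10:38:53",
    "clip.mp4": None,
}))
```

With `defaultdict(list)` there is no need to check whether a key already exists before appending to it.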
Please note that at the moment this script works only with image formats, does not move the image files but only COPIES them, and is not production ready.
import os
import shutil
import sys
from datetime import datetime
from PIL import Image, ExifTags

DTO = "DateTimeOriginal"
FOLDER_TO_WRITE = ""
FOLDER_TO_SCAN = ""

def list_folder():
    """Return the names of the files directly inside FOLDER_TO_SCAN."""
    if not os.path.isdir(FOLDER_TO_SCAN):
        print(f"ERROR: [{FOLDER_TO_SCAN}] is not a valid folder")
        sys.exit(1)
    folder_content = next(os.walk(FOLDER_TO_SCAN))[2]
    if len(folder_content):
        return folder_content
    print(f"WARNING: [{FOLDER_TO_SCAN}] is empty")
    sys.exit()

def get_img_tags(image_path):
    """Return a {tag name: value} dict for one image (empty on failure)."""
    tags_obj = {}
    try:
        img = Image.open(image_path)
        for tag, value in img._getexif().items():
            # Fall back to the numeric ID for tags missing from ExifTags.TAGS
            tags_obj[ExifTags.TAGS.get(tag, tag)] = value
    except Exception as e:
        print(e)
    return tags_obj

def get_exif_tags(elements, func_to_exec):
    """Map each file name to the result of func_to_exec on its full path."""
    obj = {}
    for item in elements:
        obj[item] = func_to_exec(os.path.join(FOLDER_TO_SCAN, item))
    return obj

def get_date_string(data_obj):
    """Turn '2020:09:02 12:38:53' into 'Sep_2020' (None if there is no date)."""
    if not data_obj:
        return None
    data_string = datetime.strptime(data_obj.split()[0], "%Y:%m:%d")
    month = data_string.strftime('%b')
    year = data_string.year
    return "{}_{}".format(month, year)

# create the dates groups of images
def get_grouped_images(images_list):
    data_obj = {}
    for item in images_list:
        date_string = get_date_string(images_list[item].get(DTO, None))
        if not date_string:
            date_string = "No_date"
        if date_string not in data_obj:
            print("Adding {} to the list for {}".format(date_string, item))
            data_obj[date_string] = []
            print("******")
        data_obj[date_string].append(item)
    return data_obj

# create the folders in folder_to_write/Month_Year
def create_folders(data_obj):
    for item in data_obj:
        if not os.path.isdir(os.path.join(FOLDER_TO_WRITE, item)):
            print("Creating {}".format(os.path.join(FOLDER_TO_WRITE, item)))
            os.makedirs(os.path.join(FOLDER_TO_WRITE, item))
        else:
            print("{} already exists".format(os.path.join(FOLDER_TO_WRITE, item)))

def move_files(data_obj):
    for item in data_obj:
        new_path = os.path.join(FOLDER_TO_WRITE, item)
        files_to_copy = data_obj[item]
        for file_name in files_to_copy:
            print("Copying .... {} in {}".format(file_name, new_path))
            # shutil.copy2 replaces the shell "cp" call: it copes with spaces
            # in file names and with absolute source paths
            shutil.copy2(os.path.join(FOLDER_TO_SCAN, file_name), new_path)

if __name__ == "__main__":
    FOLDER_TO_SCAN = sys.argv[1]
    FOLDER_TO_WRITE = sys.argv[2]
    folder_elements = list_folder()
    images_list = get_exif_tags(folder_elements, get_img_tags)
    data_obj = get_grouped_images(images_list)
    create_folders(data_obj)
    move_files(data_obj)
Here is an example run:
$> python test2.py test_script new-folder
cannot identify image file 'test_script/20180514064433.mp4'
cannot identify image file 'test_script/20200923_075122.mp4'
cannot identify image file 'test_script/20200904_200350.mp4'
cannot identify image file 'test_script/20210411_155452.mp4'
cannot identify image file 'test_script/20200912_095821.mp4'
cannot identify image file 'test_script/20210411_154427.mp4'
cannot identify image file 'test_script/20200921_201115.mp4'
Adding Apr_2021 to the list for 20210406_120550.jpg
******
Adding Oct_2020 to the list for 20201018_083107.jpg
******
Adding No_date to the list for 20210407_103853.mp4
******
Adding Sep_2020 to the list for 20200912_082208.jpg
******
Creating new-folder/Apr_2021
Creating new-folder/Oct_2020
Creating new-folder/No_date
Creating new-folder/Sep_2020
Copying .... 20200919_084224.jpg in new-folder/Sep_2020
Copying .... 20200910_163210.jpg in new-folder/Sep_2020
Copying .... 20200909_204355.jpg in new-folder/Sep_2020
Copying .... 20200921_075240.jpg in new-folder/Sep_2020
Copying .... 20200920_075614.jpg in new-folder/Sep_2020
Copying .... 20200906_130835.jpg in new-folder/Sep_2020
Copying .... 20200923_075047.jpg in new-folder/Sep_2020
Copying .... 20200902_123853.jpg in new-folder/Sep_2020
Copying .... 20200912_082200.jpg in new-folder/Sep_2020
Copying .... 20200923_075045.jpg in new-folder/Sep_2020
Copying .... 20200927_103155.jpg in new-folder/Sep_2020
Copying .... 20200910_163259.jpg in new-folder/Sep_2020
Copying .... 20200924_192522.jpg in new-folder/Sep_2020
Copying .... 20200922_162745.jpg in new-folder/Sep_2020
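One thing the copy step does not handle is a destination folder that already contains a file with the same name (for example after an earlier run, or once subfolders are scanned too). Here is a minimal sketch of a copy that never overwrites, assuming a numeric suffix is an acceptable way to keep both files; `copy_without_overwrite` is a hypothetical helper.

```python
import os
import shutil


def copy_without_overwrite(source_path, dest_dir):
    """Copy source_path into dest_dir, adding _1, _2, ... if the name is taken."""
    os.makedirs(dest_dir, exist_ok=True)
    stem, extension = os.path.splitext(os.path.basename(source_path))
    candidate = os.path.join(dest_dir, stem + extension)
    counter = 1
    # Try name.jpg, then name_1.jpg, name_2.jpg, ... until one is free
    while os.path.exists(candidate):
        candidate = os.path.join(dest_dir, f"{stem}_{counter}{extension}")
        counter += 1
    # copy2 also preserves the file's modification time
    shutil.copy2(source_path, candidate)
    return candidate
```

The function returns the final path so the caller can log where each picture ended up.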
Hope you enjoyed and if so, please share and help us grow!