Unit 2.2 Data Compression, Images
This lab will perform alterations on images, manipulate RGB values, and reduce the number of pixels. College Board requires you to learn about Lossy and Lossless compression.
- Enumerate "Data" Big Idea from College Board
- Image Files and Size
- Python Libraries and Concepts used for Jupyter and Files/Directories
- Reading and Encoding Images (2 implementations follow)
- Data Structures, Imperative Programming Style, and working with Images
- Data Structures and OOP
- Additionally, review all the imports in these three demos. Create a definition of their purpose, specifically these ...
- Hacks
- 2.2 CB Video Notes
- 2.3 CB Video Notes
- CB Questions
- The code below, when run, will display the original photo and the blurred image
Enumerate "Data" Big Idea from College Board
Discuss some of the big ideas and vocabulary that you observe with a partner ...
- "Data compression is the reduction of the number of bits needed to represent data"
- "Data compression is used to save transmission time and storage space."
- "lossy data can reduce data but the original data is not recovered"
- "lossless data lets you restore and recover"
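The lossless idea above can be demonstrated with Python's standard `zlib` module (a minimal sketch using made-up sample data, not part of the lab code):

```python
import zlib

# a repetitive message, like repeated lyrics, compresses extremely well
data = b"twinkle twinkle little star " * 50
compressed = zlib.compress(data)
restored = zlib.decompress(compressed)

print(len(data), "->", len(compressed))  # far fewer bits needed to represent the data
assert restored == data                  # lossless: the original is fully recovered
```

Because the input has repeated patterns, the compressed form is much smaller, yet decompression restores every byte, which is exactly the College Board definition of lossless.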
The Image Lab Project contains a plethora of College Board Unit 2 data concepts. Working with Images provides many opportunities for compression and analyzing size.
Image Files and Size
Here are some image files. Download these files and load them into the
images
directory under _notebooks in your Blog.
- Clouds Impression
Describe some of the meta data and considerations when managing Image files. Describe how these relate to Data Compression ...
- File Type, PNG and JPG are two types used in this lab
- Size, height and width, number of pixels
- Visual perception, lossy compression
Python Libraries and Concepts used for Jupyter and Files/Directories
Introduction to displaying images in Jupyter notebook
IPython
Supports visualization of data in Jupyter notebooks. Visualization is specific to the View; for the web, the visualization needs to be converted to HTML.
pathlib
File paths are different on Windows versus Mac and Linux. This can cause problems in a project as you work and deploy on different Operating Systems (OS's), pathlib is a solution to this problem.
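A minimal sketch of why pathlib solves this problem (the filename matches one used later in this lab):

```python
from pathlib import Path

# the "/" operator joins path parts using the correct separator for the OS,
# so the same code works on Windows, Mac, and Linux
path = Path("images") / "clouds-impression.png"
print(path.name)    # clouds-impression.png
print(path.suffix)  # .png
```

Because the separator is inserted by `Path`, no hard-coded `/` or `\` appears in the code, which is what makes it portable across operating systems.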
- What are commands you use in terminal to access files?
- To access files, you use "cd" to change to the directory containing the files you want to work with, and "sudo nano" to edit those files.
- What are the command you use in Windows terminal to access files?
- In the Windows terminal, you use "dir" to list files and "cd" to change to a specific directory.
- What are some of the major differences?
- Both use "cd" for changing directories, but Linux/Mac use "ls" to list files while Windows uses "dir". The commands used for editing files are starkly contrasted.
Provide what you observed, struggled with, or learned while playing with this code.
- Why is path a big deal when working with images?
- It helps organize the images and gives each image a specific "tag" so that a user can correctly access it and utilize said "artifacts" for a project.
- How does the meta data source and label relate to Unit 5 topics?
- Meta data source and labels relate to unit 5 topics because, similar to licenses, you have certain attributes and information given to each image that can help identify either the source or info for further organization.
- Look up IPython, describe why this is interesting in Jupyter Notebooks for both Pandas and Images?
- IPython is an interactive computing environment that can combine code, rich text, math, and complex media. This allows you to manipulate images, alter displays, and apply filters and other rich effects.
from IPython.display import Image, display
from pathlib import Path # https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f
# prepares a series of images
def image_data(path=Path("images/"), images=None):  # path of static images is defaulted
    if images is None:  # default images
        images = [
            {'source': "Peter Carolin", 'label': "Clouds Impression", 'file': "clouds-impression.png"},
            {'source': "Peter Carolin", 'label': "Lassen Volcano", 'file': "lassen-volcano.jpg"}
        ]
    for image in images:
        # File to open
        image['filename'] = path / image['file']  # file with path
    return images

def image_display(images):
    for image in images:
        display(Image(filename=image['filename']))
# Run this as standalone tester to see sample data printed in Jupyter terminal
if __name__ == "__main__":
    # print parameter-supplied image
    green_square = image_data(images=[{'source': "Internet", 'label': "Green Square", 'file': "green-square-16.png"}])
    image_display(green_square)

    # display default images from image_data()
    default_images = image_data()
    image_display(default_images)
# These are raw images, therefore there is no data compression or pixel manipulation
It seems the image scaling is adjusted to display a larger image.
Reading and Encoding Images (2 implementations follow)
PIL (Python Image Library)
Pillow or PIL provides the ability to work with images in Python. Geeks for Geeks shows some ideas on working with images.
base64
Image formats (JPG, PNG) are often called *binary* file formats; it is difficult to pass these over HTTP. Thus, base64 converts binary-encoded data (8-bit bytes) into a text-encoding scheme (each group of 24 bits becomes four 6-bit Base64 digits). Base64 is therefore used to transport and embed binary images inside textual assets such as HTML and CSS.
- How is Base64 similar or different to Binary and Hexadecimal?
- Translate first 3 letters of your name to Base64.
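The translation exercise above can be checked with Python's standard `base64` module (the name "Abe" is just a placeholder; substitute your own):

```python
import base64

# 3 ASCII letters are 24 bits, which map exactly onto four 6-bit Base64 digits
name = "Abe"  # placeholder; use the first 3 letters of your name
encoded = base64.b64encode(name.encode("ascii")).decode("ascii")
print(encoded)  # QWJl

decoded = base64.b64decode(encoded).decode("ascii")
assert decoded == name  # encoding is fully reversible (lossless)
```

Note that base64 is an *encoding*, not compression: the output is actually about a third larger than the input, in exchange for being safe to embed in text.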
numpy
Numpy is described as "The fundamental package for scientific computing with Python". In the Image Lab, a Numpy array is created from the image data in order to simplify access and change to the RGB values of the pixels, converting pixels to grey scale.
io, BytesIO
Input and Output (I/O) is fundamental to all computer programming. Input/output (I/O) buffering is a technique used to optimize I/O operations. When handling large quantities of data, the buffer holds the frames of input the server currently has queued. In this example, there is a very large picture that lags.
- Where have you been a consumer of buffering?
- I experience a lot of buffering when playing very demanding games or using rendering software that tries to render images at extremely high quality.
- From your consumer experience, what effects have you experienced from buffering?
- My computer is relatively strong, so I do not experience the same lag that my peers do.
- How do these effects apply to images?
- Increasing the image quality increases the amount of data that must be buffered, which can cause lag.
Data Structures, Imperative Programming Style, and working with Images
Introduction to creating meta data and manipulating images. Look at each procedure and explain the purpose and results of this program. Add any insights or challenges as you explored this program.
- Does this code seem like a series of steps are being performed?
- Yes, it lists the attributes that each image has and then displays it.
- Describe Grey Scale algorithm in English or Pseudo code?
- For each pixel, average its red, green, and blue values, then set all three channels to that average; the code is shown below.
- Describe scale image? What is before and after on pixels in three images?
- The images start at their original sizes; scale_image resizes each to a width of 320 pixels, with the height scaled proportionally to keep the aspect ratio.
- Is scale image a type of compression? If so, line it up with College Board terms described?
- Yes, because you are taking multiple pixels and merging them into one, effectively averaging them. In College Board terms this is lossy: the original pixels cannot be recovered from the scaled image.
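The grey-scale averaging described above can be sketched in isolation with numpy (a minimal sketch on a synthetic array, not the lab's exact code):

```python
import numpy as np

# synthetic 2x2 RGB image (each pixel is [R, G, B])
pixels = np.array([[[255, 0, 0], [0, 255, 0]],
                   [[0, 0, 255], [90, 60, 30]]], dtype=np.uint8)

# integer-average the three channels, then copy the average back into R, G, and B
average = pixels.sum(axis=2) // 3
grey = np.stack([average, average, average], axis=2).astype(np.uint8)
print(grey[0, 0])  # [85 85 85]
```

A pure red pixel (255, 0, 0) averages to (85, 85, 85); since all three channels become equal, the pixel renders as a shade of grey.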
from IPython.display import HTML, display
from pathlib import Path # https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f
from PIL import Image as pilImage # as pilImage is used to avoid conflicts
from io import BytesIO
import base64
import numpy as np
# prepares a series of images
def image_data(path=Path("images/"), images=None):  # path of static images is defaulted
    if images is None:  # default images
        images = [
            {'source': "Internet", 'label': "Green Square", 'file': "green-square-16.png"},
            {'source': "Peter Carolin", 'label': "Clouds Impression", 'file': "clouds-impression.png"},
            {'source': "Peter Carolin", 'label': "Lassen Volcano", 'file': "lassen-volcano.jpg"},
            {'source': "Internet", 'label': "Happy Face", 'file': "happyfaces.png"}
        ]
    for image in images:
        # File to open
        image['filename'] = path / image['file']  # file with path
    return images
# Large image scaled to baseWidth of 320
def scale_image(img):
    baseWidth = 320
    scalePercent = (baseWidth / float(img.size[0]))
    scaleHeight = int((float(img.size[1]) * float(scalePercent)))
    scale = (baseWidth, scaleHeight)
    return img.resize(scale)

# PIL image converted to base64
def image_to_base64(img, format):
    with BytesIO() as buffer:
        img.save(buffer, format)
        return base64.b64encode(buffer.getvalue()).decode()
# Set properties of image, scale, and convert to Base64
def image_management(image):  # path of static images is defaulted
    # Image open returns PIL image object
    img = pilImage.open(image['filename'])

    # Python Image Library operations
    image['format'] = img.format
    image['mode'] = img.mode
    image['size'] = img.size

    # Scale the image
    img = scale_image(img)
    image['pil'] = img
    image['scaled_size'] = img.size

    # Scaled HTML
    image['html'] = '<img src="data:image/png;base64,%s">' % image_to_base64(image['pil'], image['format'])
# Create Grey Scale Base64 representation of Image
def image_management_add_html_grey(image):
    # Image open returns PIL image object
    img = image['pil']
    format = image['format']

    img_data = img.getdata()  # Reference https://www.geeksforgeeks.org/python-pil-image-getdata/
    image['data'] = np.array(img_data)  # PIL image to numpy array
    image['gray_data'] = []  # key/value for data converted to gray scale

    # 'data' is a list of RGB data; traverse it and average each pixel's channels
    for pixel in image['data']:
        # create gray scale of image, ref: https://www.geeksforgeeks.org/convert-a-numpy-array-to-an-image/
        average = (pixel[0] + pixel[1] + pixel[2]) // 3  # average pixel values; // is integer division
        if len(pixel) > 3:
            image['gray_data'].append((average, average, average, pixel[3]))  # RGBA (e.g. PNG with alpha)
        else:
            image['gray_data'].append((average, average, average))
    # end for loop for pixels

    img.putdata(image['gray_data'])
    image['html_grey'] = '<img src="data:image/png;base64,%s">' % image_to_base64(img, format)
# Jupyter Notebook visualization of images
if __name__ == "__main__":
    images = image_data()

    # Display meta data, scaled view, and grey scale for each image
    for image in images:
        image_management(image)
        print("---- meta data -----")
        print(image['label'])
        print(image['source'])
        print(image['format'])
        print(image['mode'])
        print("Original size: ", image['size'])
        print("Scaled size: ", image['scaled_size'])

        print("-- original image --")
        display(HTML(image['html']))

        print("--- grey image ----")
        image_management_add_html_grey(image)
        display(HTML(image['html_grey']))

        print()
# Everything is scaled now, also metadata is included
Data Structures and OOP
Most data structures classes require Object-Oriented Programming (OOP). Since this class is aligned with a college course, OOP will be discussed often. Functionality in the remainder of this blog is the same as the prior implementation. Highlight some of the key differences you see between imperative and OOP styles.
- Read imperative and object-oriented programming on Wikipedia
- Consider how data is organized in two examples, in relations to procedures
- Look at Parameters in Imperative and Self in OOP
Additionally, review all the imports in these three demos. Create a definition of their purpose, specifically these ...
- PIL
- numpy
- base64
from IPython.display import HTML, display
from pathlib import Path # https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f
from PIL import Image as pilImage # as pilImage is used to avoid conflicts
from io import BytesIO
import base64
import numpy as np
class Image_Data:

    def __init__(self, source, label, file, path, baseWidth=320):
        self._source = source  # variables with self prefix become part of the object
        self._label = label
        self._file = file
        self._filename = path / file  # file with path
        self._baseWidth = baseWidth

        # Open image and scale to needs
        self._img = pilImage.open(self._filename)
        self._format = self._img.format
        self._mode = self._img.mode
        self._originalSize = self._img.size
        self.scale_image()
        self._html = self.image_to_html(self._img)
        self._html_grey = self.image_to_html_grey()
    @property
    def source(self):
        return self._source

    @property
    def label(self):
        return self._label

    @property
    def file(self):
        return self._file

    @property
    def filename(self):
        return self._filename

    @property
    def img(self):
        return self._img

    @property
    def format(self):
        return self._format

    @property
    def mode(self):
        return self._mode

    @property
    def originalSize(self):
        return self._originalSize

    @property
    def size(self):
        return self._img.size

    @property
    def html(self):
        return self._html

    @property
    def html_grey(self):
        return self._html_grey
    # Large image scaled to baseWidth of 320
    def scale_image(self):
        scalePercent = (self._baseWidth / float(self._img.size[0]))
        scaleHeight = int((float(self._img.size[1]) * float(scalePercent)))
        scale = (self._baseWidth, scaleHeight)
        self._img = self._img.resize(scale)

    # PIL image converted to base64
    def image_to_html(self, img):
        with BytesIO() as buffer:
            img.save(buffer, self._format)
            return '<img src="data:image/png;base64,%s">' % base64.b64encode(buffer.getvalue()).decode()
    # Create Grey Scale Base64 representation of Image
    def image_to_html_grey(self):
        img_grey = self._img
        numpy = np.array(self._img.getdata())  # PIL image to numpy array
        grey_data = []  # data converted to gray scale

        # traverse the RGB data and average each pixel's channels
        for pixel in numpy:
            # create gray scale of image, ref: https://www.geeksforgeeks.org/convert-a-numpy-array-to-an-image/
            average = (pixel[0] + pixel[1] + pixel[2]) // 3  # average pixel values; // is integer division
            if len(pixel) > 3:
                grey_data.append((average, average, average, pixel[3]))  # RGBA (e.g. PNG with alpha)
            else:
                grey_data.append((average, average, average))
        # end for loop for pixels

        img_grey.putdata(grey_data)
        return self.image_to_html(img_grey)
# prepares a series of images, provides expectation for required contents
def image_data(path=Path("images/"), images=None):  # path of static images is defaulted
    if images is None:  # default images
        images = [
            {'source': "Internet", 'label': "Green Square", 'file': "green-square-16.png"},
            {'source': "Peter Carolin", 'label': "Clouds Impression", 'file': "clouds-impression.png"},
            {'source': "Peter Carolin", 'label': "Lassen Volcano", 'file': "lassen-volcano.jpg"}
        ]
    return path, images

# turns data into objects
def image_objects():
    id_Objects = []
    path, images = image_data()
    for image in images:
        id_Objects.append(Image_Data(source=image['source'],
                                     label=image['label'],
                                     file=image['file'],
                                     path=path,
                                     ))
    return id_Objects
# Jupyter Notebook visualization of images
if __name__ == "__main__":
    for ido in image_objects():  # ido is an Image Data Object
        print("---- meta data -----")
        print(ido.label)
        print(ido.source)
        print(ido.file)
        print(ido.format)
        print(ido.mode)
        print("Original size: ", ido.originalSize)
        print("Scaled size: ", ido.size)

        print("-- scaled image --")
        display(HTML(ido.html))

        print("--- grey image ---")
        display(HTML(ido.html_grey))

        print()
Hacks
Early Seed award
- Add this Blog to your own Blogging site.
- In the Blog add a Happy Face image.
- Have Happy Face Image open when Tech Talk starts, running on localhost. Don't tell anyone. Show to Teacher.
AP Prep
- In the Blog add notes and observations on each code cell that request an answer.
- In blog add College Board practice problems for 2.3
- Choose 2 images, one that will more likely result in lossy data compression and one that is more likely to result in lossless data compression. Explain.
Project Addition
- If your project has images in it, try to implement an image change that has a purpose. (Ex. An item that has been sold out could become gray scale)
Pick a programming paradigm and solve some of the following ...
- Numpy, manipulating pixels. As opposed to Grey Scale treatment, pick a couple of other types like red scale, green scale, or blue scale. We want you to be manipulating pixels in the image.
- Binary and Hexadecimal reports. Convert and produce pixels in binary and Hexadecimal and display.
- Compression and Sizing of images. Look for insights into compression Lossy and Lossless. Look at PIL library and see if there are other things that can be done.
- There are many effects you can do as well with PIL. Blur the image or write Meta Data on screen, aka Title, Author and Image size.
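For the pixel-manipulation hack, here is a minimal red-scale sketch using numpy and PIL (a synthetic image is used so the snippet is self-contained; for the real hack, open your own file with `Image.open(...)`):

```python
import numpy as np
from PIL import Image

# synthetic 2x2 image; substitute your own image for the hack
img = Image.new("RGB", (2, 2), (100, 150, 200))
data = np.array(img)

data[:, :, 1] = 0  # zero out the green channel
data[:, :, 2] = 0  # zero out the blue channel -> only red remains

red = Image.fromarray(data)
print(red.getpixel((0, 0)))  # (100, 0, 0)
```

Green scale and blue scale follow the same pattern: keep one channel and zero the other two.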
2.2 CB Video Notes
What is Data Compression?
Data compression is the reduction in the number of bits needed to represent data, and it is used to save transmission time and storage space.
When data is compressed, you are essentially looking for repeated patterns and predictability. The larger a data file, the more patterns can be pulled out of it.
Text Compression: Replacing repeated characters or phrases with a single character or symbol is a form of data compression. Ex: "Twinkle, twinkle, little star" can be replaced with, say, "S, S, little star."
Example: Unscramble. Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall, all the king's horses and all the king's men, couldn't put Humpty together again.
Image Compression: Photographs do have predictable patterns. Finding patterns between pixels can help compress the image by not sending every pixel individually.
Lossless: reduces the number of bits stored or transmitted while guaranteeing complete reconstruction of the original data. (The typical approach where the loss of words or numbers would change the information, ex: executable files, text.)
Lossy: significantly reduces the number of bits stored or transmitted but only allows reconstruction of an approximation of the original data. (The typical approach where the removal of some data has little or no discernible effect on the representation of the content, since the data removed is redundant, unimportant, or imperceptible, ex: graphics.)
Which is Better?
It depends on your needs.
- Lossy can reduce the data more, but the original data cannot be recovered
- Lossless data compression lets you restore and recover the original data, but it cannot compress as much as lossy algorithms can
- If quality is important, use lossless
- If the smallest transmission time is important, use lossy
- Compressing data doesn't necessarily mean we lose data!
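The tradeoff above can be seen by saving the same image in a lossless format (PNG) and a lossy one (JPEG), using the PIL library from this lab (a synthetic gradient stands in for a photograph):

```python
from io import BytesIO
from PIL import Image

# synthetic 64x64 gradient image, standing in for a photo
img = Image.new("RGB", (64, 64))
img.putdata([(x * 4, y * 4, (x + y) * 2) for y in range(64) for x in range(64)])

png_buf, jpg_buf = BytesIO(), BytesIO()
img.save(png_buf, "PNG")               # lossless format
img.save(jpg_buf, "JPEG", quality=50)  # lossy format

png_buf.seek(0)
restored = Image.open(png_buf)
# PNG round-trips every pixel exactly; JPEG generally would not
assert list(restored.getdata()) == list(img.getdata())
print("PNG bytes:", len(png_buf.getvalue()), "JPEG bytes:", len(jpg_buf.getvalue()))
```

The PNG version restores the original pixels exactly, while the JPEG version only approximates them; which file is smaller depends on the image content and quality setting.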
2.3 CB Video Notes
Where do we start with data?
- Collect data
- Consider the source
- sources on top of sources?
Processing data is affected by size
- How much info?
- Can one machine handle it?
- May require parallel processing
- Use two or more processors to handle different tasks
Is there potential Bias?
- Who collected the data
- Do they have an agenda?
- How is the data collected?
Data Cleaning:
- Identify incomplete, corrupt, duplicate, or inaccurate records
- Replacing, modifying, or deleting the data
Be careful about modify/delete
- Be sure there is a mistake
- Keep records of what data is modified/deleted and WHY
- Invalid data may need to be verified
Things that can be invalid:
- Missing Data
- Invalid Data
- Inaccurate Data
- Some decisions will be made with ease, while others may require some research to verify.
Extracting Information From Data
What is metadata?
- prefix meta: behind, among, between; metadata is data about data
Some info has info about itself:
- Author
- Date
- Length/Size
Why though?
- Identify
- Organization
- Process
How might metadata be used?
- Suggestions for related content
- By looking at metadata, algorithms can be used to make suggestions for things of interest
- Organize inventory + catalog
Data allows:
- identify trends
- knowledge
- potential insight
Be careful:
- Look out for misleading trends
- Correlation does not mean causation.
CB Questions
- Which of the following is an advantage of a lossless compression algorithm over a lossy compression algorithm?
- Lossless guarantees reconstruction of original data whereas lossy compression cannot
- A user wants to save data files for an online storage site. The user wants to reduce the size of the file, if possible, and wants to be able to completely restore the file to its original version. Which of the following actions best supports the user's needs?
- Use a lossless compression algorithm before uploading
- A programmer is developing software for a social media platform. The programmer is planning to use compression when users send attachments to other users. What is a true statement about the use of compression?
- Lossy compression will reduce the size of image attachments more than lossless compression.
- A researcher is analyzing data about students in a school district to determine whether there is a relationship between grade point average and number of absences. The researcher plans on compiling data from several sources to create a record for each student.Upon compiling the data, the researcher identifies a problem due to the fact that neither data source uses a unique ID number for each student. Which of the following best describes the problem caused by the lack of unique ID numbers?
- Students with the same name will be confused
- A team of researchers wants to create a program to analyze the amount of pollution reported in roughly 3,000 counties across the United States. The program is intended to combine county data sets and then process the data. Which of the following is most likely to be a challenge in creating the program?
- Different counties may use different ways of organizing data.
- A student is creating a Web site that is intended to display information about a city based on a city name that a user enters in a text field. Which of the following are likely to be challenges associated with processing city names that users might provide as input?
- Abbreviations and Misspelling
- Which of the following additional pieces of information would be most useful in determining the artist with the greatest attendance during a particular month?
- Avg. ticket price
- A camera mounted on the dashboard of a car captures an image of the view from the driver’s seat every second. Each image is stored as data. Along with each image, the camera also captures and stores the car’s speed, the date and time, and the car’s GPS location as metadata. Which of the following can best be determined using only the data and none of the metadata?
- The number of bicycles.
- A teacher sends students an anonymous survey in order to learn more about the students’ work habits. The survey contains the following questions. On average, how long does homework take you each night (in minutes) ? On average, how long do you study for each test (in minutes) ? Do you enjoy the subject material of this class (yes or no) ? Which of the following questions about the students who responded to the survey can the teacher answer by analyzing the survey results?
- 1 and 2
from IPython.display import HTML, display
from pathlib import Path # https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f
from PIL import Image as pilImage # as pilImage is used to avoid conflicts
from io import BytesIO
from PIL import Image, ImageFilter
import base64
import numpy as np
# Note: this path is specific to the author's machine; substitute your own image path
images = Image.open('/Users/ederickwong/vscode/Ederick-s-2022-2023-APCSP-Blog/images/group.JPG')
blurImage = images.filter(ImageFilter.BLUR)
images.show()
blurImage.show()
print(blurImage)
Choose 2 images, one that will more likely result in lossy data compression and one that is more likely to result in lossless data compression. Explain.
Two images, to make this easy, let's use the forest and the green square.
from IPython.display import Image, display
from pathlib import Path # https://medium.com/@ageitgey/python-3-quick-tip-the-easy-way-to-deal-with-file-paths-on-windows-mac-and-linux-11a072b58d5f
# prepares a series of images
def image_data(path=Path("images/"), images=None):  # path of static images is defaulted
    if images is None:  # default images
        images = [
            {'source': "Peter Carolin", 'label': "Clouds Impression", 'file': "clouds-impression.png"},
            {'source': "Peter Carolin", 'label': "Lassen Volcano", 'file': "lassen-volcano.jpg"}
        ]
    for image in images:
        # File to open
        image['filename'] = path / image['file']  # file with path
    return images

def image_display(images):
    for image in images:
        display(Image(filename=image['filename']))

# Run this as standalone tester to see sample data printed in Jupyter terminal
if __name__ == "__main__":
    # print parameter-supplied image
    green_square = image_data(images=[{'source': "Internet", 'label': "Green Square", 'file': "green-square-16.png"}])
    image_display(green_square)

    # display default images from image_data()
    default_images = image_data()
    image_display(default_images)
The green square is more likely to use lossless compression: because it is one consistent color, the repeated identical pixels can be compressed heavily and still be restored exactly. The forest, on the other hand, has many different shades and colors, so lossy compression is preferred; small, imperceptible details can be discarded to save far more space without the image looking noticeably worse.