DSP Blog

Vector Data and Where to Use It - Part 1

Written by Alastair Steele | 06-Nov-2025 10:20:44

In this blog, I explain in detail what vector data is, before covering how it can be put to good use and where. Keep an eye out for part 2, where we’ll walk through practical implementation steps.

 

What is Vector Data?


Vector data is a relatively new datatype in Oracle Database, introduced in 23ai (along with Boolean!). In short, it is the output of translating regular data into a computer-friendly format.

This is done so that we can find similarities between data that we’d normally be unable to compare directly, by getting the database to compare their vector representations. No person will ever create and input this type of information directly, and no sane application developer will attempt to display it to users once generated, either. In this regard, it’s like BLOB data, which is incredibly useful, but no one needs to see the actual 1s and 0s.

That all sounds a bit ‘out there’, so here are some examples of vector data in its most digestible format, an array.

[8, 2, 3, 4]

[4, 2, 9, 4, 3, 6]

[1, 2]

That’s it, not that scary really. But, as it turns out, very powerful when combined with AI. To explain further, let’s take a little detour with the shortest example above.

In mathematics and physics, a vector describes something that has a direction, and a distance in that direction, from a relative starting point.

A ball starting at point A rolls/travels to point B. The direction from point A to point B, combined with the distance between the two, is the ball's vector.

 

Imagine the path the ball took as a line on a graph, starting at 0, 0 and ending at some X, Y coordinate. That line is also a vector. The X, Y coordinate then becomes our numerical representation of the vector, something like [1, 2]. Now imagine introducing a Z-axis to represent something travelling in the 3D/real world [X, Y, Z]. Our vector is now represented as [1, 2, 3].

But we don’t need to stop there. We can keep adding directions to our coordinates (vector), e.g., [4, 2, 9, 4, 3, 6…]. These additional directions (or dimensions, as they’re known when talking about vectors) are mathematical only and cannot be visualised or translated to real-world examples.



However many dimensions make up your vector, ultimately, it can be thought of as a hypothetical ‘line’ on a hypothetical graph, that has a direction and a distance.
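To make ‘direction and distance’ a little more concrete, here is a minimal Python sketch. The function names and the [3, 4] example are my own, chosen only because the arithmetic comes out cleanly; the same two functions work for any number of dimensions.

```python
import math

def magnitude(vector):
    """Length (the 'distance') of a vector with any number of dimensions."""
    return math.sqrt(sum(component ** 2 for component in vector))

def direction(vector):
    """Unit vector: the same direction, with the length scaled to 1."""
    length = magnitude(vector)
    if length == 0:
        raise ValueError("the zero vector has no direction")
    return [component / length for component in vector]

ball_path = [3, 4]               # hypothetical 2D example
print(magnitude(ball_path))      # 5.0
print(direction(ball_path))      # [0.6, 0.8]

# Exactly the same code handles the six-dimensional example from above.
six_d = [4, 2, 9, 4, 3, 6]
print(magnitude(six_d))
print(direction(six_d))
```

Nothing changes as dimensions are added; there are simply more numbers to square and sum.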

Representing data in this way doesn’t immediately make sense, but now that we have our ‘hypothetical line on a hypothetical graph’ analogy, it’s only a short leap to see why this type of data is useful.

If we draw two vectors on the same impossible-to-visualise graph, how close they are to each other on this graph represents how comparable the actual data is. Vectors that go off in different directions? Not very similar at all.

As you can imagine, there are a few options when comparing how ‘close’ two vectors are to each other. I won’t go into the differences here, but I would encourage further reading on the topic as it’s very interesting.
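As a taster, two of the most common measures are Euclidean distance (how far apart the end points are) and cosine distance (how different the directions are, ignoring length). The sketch below is plain Python written for illustration, with function names of my own choosing; it is not how any particular database implements them.

```python
import math

def dot(a, b):
    """Sum of the pairwise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line distance between the end points of a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """0 when a and b point the same way, 1 at right angles, 2 when opposite."""
    return 1 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1, 2]
b = [2, 4]    # same direction as a, but twice as long
c = [-2, 1]   # at a right angle to a

print(euclidean_distance(a, b), cosine_distance(a, b))
print(euclidean_distance(a, c), cosine_distance(a, c))
```

Note how a and b are some distance apart by the Euclidean measure, yet have a cosine distance of (practically) zero because they point the same way. Which measure is appropriate depends on the model that produced the vectors.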

 

Dimensions & Format

So, if how ‘close’ vectors are to each other represents how similar they are, and the vectors themselves are determined by their dimensions, what goes into deciding the value for each dimension?

This is where AI comes in. Conceptually, each dimension represents an ‘attribute’ of the data in question. For single words, this could be attributes as simple as the number of characters, language, capitalisation, etc. For larger bodies of text, AI models can derive more complex things like context and sentiment. In images, it could be whether the background is blue and whether something that looks like a chair is featured. In practice, the attributes a model learns are abstract and rarely map onto anything this human-readable, but the principle is the same.

The point is, these dimensions can cover 1000s of attributes (up to 65,535 on Oracle), all of which contribute to a vector’s overall direction and distance and, by extension, its ‘closeness’ to other vectors.

Another concept to introduce at this point is the size/format of each dimension. Consider the following example.

[1, 0, 1, 1, 0, 1, 0…]

Each dimension is represented by 1 or 0, which would equate to true or false. Now think about comparing data and its individual attributes. The AI model’s evaluation of the search text ‘image includes a chair’ becomes very strict. Maybe an image includes a stool, which is close to a chair (some kind of seat), but not exactly what you’re after. Because the model can only pick 1 or 0 (true or false), it ‘decides’ 1, as a stool is closer to ‘a chair’ than ‘no chair’. The result: you’re searching for pictures that include a chair, and a bunch of pictures with a stool get mixed in as well. Not very accurate.

What we need are options beyond just 1 or 0, so our AI model can be more accurate when scoring each attribute. Oracle supports formats of up to 64 bits per dimension, which allows for roughly 18 quintillion distinct values. This lets our attributes be scored very precisely, meaning images that include a chair are attributed more closely to our search text, pushing those results to the top, and images that include a stool further down, but still above images without either.
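The chair and stool scenario can be sketched in a few lines of Python. Everything here is made up for illustration: the file names, the scores, and the idea of a single ‘contains a chair’ attribute are assumptions, not output from a real model.

```python
# Hypothetical graded scores for one attribute: "contains a chair".
graded = {"chair.jpg": 0.95, "stool.jpg": 0.60, "beach.jpg": 0.05}
query = 1.0  # the search text 'image includes a chair'

def to_binary(score):
    """Collapse a graded score to 1 or 0, as a 1/0-only format would force."""
    return 1 if score >= 0.5 else 0

def rank(scores, target):
    """Order image names by how close their score is to the target."""
    return sorted(scores, key=lambda name: abs(target - scores[name]))

binary = {name: to_binary(score) for name, score in graded.items()}

print(binary)               # {'chair.jpg': 1, 'stool.jpg': 1, 'beach.jpg': 0}
print(rank(graded, query))  # ['chair.jpg', 'stool.jpg', 'beach.jpg']
```

With the binary scores, the chair and the stool are indistinguishable (both are exactly 1). With the graded scores, the chair ranks first, the stool second, and the beach last.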

What each dimension represents in terms of an attribute, and the ‘scoring’ of each attribute, is very specific to the AI model used. This is why you must always use the same AI model for the data being compared. A change in AI provider, or sometimes even model version, and any stored/historical vector data will need to be regenerated to be of any use before being compared against newly vectorised data.
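One way to guard against accidentally mixing models is to store the model’s name alongside every embedding and check it before comparing. The sketch below is only an illustration of that idea; the Embedding class, the check function, and the model names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Embedding:
    model: str     # name/version of the model that produced the values
    values: tuple  # the vector itself

def assert_comparable(a, b):
    """Raise if two embeddings should not be compared with each other."""
    if a.model != b.model:
        raise ValueError(f"different models: {a.model!r} vs {b.model!r}")
    if len(a.values) != len(b.values):
        raise ValueError("different number of dimensions")

stored = Embedding(model="example-model-v1", values=(0.1, 0.7, 0.2))
fresh = Embedding(model="example-model-v2", values=(0.3, 0.6, 0.1))

try:
    assert_comparable(stored, fresh)
except ValueError as error:
    print(f"Regenerate the stored embeddings first: {error}")
```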

And there we have it, that’s what vector data is: large arrays of numbers, often stored at high precision, used by complex functions to determine the similarity between regular data. And while vector data (also called ‘Embeddings’ in this area of technology) can only be generated by AI models, everything after that is pure mathematics and computational grunt, which I find fascinating.

 

How Do I Store It?

As previously mentioned, vectors are essentially large arrays of numbers. So with that in mind, you have two choices to make… how large do you want your array to be, and how much precision do you want each number to have?

For Oracle, declaring a vector is done as VECTOR (dimensions, format).

The dimensions can be stipulated to be anything between 1 and 65,535, and the format as one of the following shortcodes:

BINARY, INT8, FLOAT32, FLOAT64

Which options you pick depends on a few considerations, such as total bytes (both on-disk for storage and in-memory for indexing), performance (how quickly you can perform vector comparisons), and accuracy (how relevant your results are), all of which are also affected by how large (or small) your dataset is.
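To get a feel for the ‘total bytes’ consideration, here is a rough back-of-the-envelope calculation in Python. It counts only the raw dimension values (1 bit for BINARY, 1 byte for INT8, 4 for FLOAT32, 8 for FLOAT64) and ignores any per-vector overhead the database adds, so treat the output as a lower bound. The dimension count and row count are hypothetical.

```python
# Bits used per dimension by each format named above.
BITS_PER_DIMENSION = {"BINARY": 1, "INT8": 8, "FLOAT32": 32, "FLOAT64": 64}

def raw_vector_bytes(dimensions, fmt):
    """Bytes needed for the dimension values alone (no database overhead)."""
    bits = dimensions * BITS_PER_DIMENSION[fmt]
    return (bits + 7) // 8  # round up to whole bytes

dimensions = 1024       # hypothetical model output size
row_count = 1_000_000   # hypothetical dataset size

for fmt in BITS_PER_DIMENSION:
    per_vector = raw_vector_bytes(dimensions, fmt)
    total_gb = per_vector * row_count / 1e9
    print(f"{fmt:8} {per_vector:5} bytes per vector, ~{total_gb:.2f} GB in total")
```

Under these assumptions, a million 1024-dimension vectors range from roughly 0.13 GB as BINARY to roughly 8.19 GB as FLOAT64, which is why the format choice matters as much as the dimension count.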

 

How Do I Generate / Use It?

Generating vector data (i.e., Embeddings) is done by AI models trained specifically for the type of data being targeted, e.g., text, image, or audio. Some models can only target one datatype. Others are ‘multimodal’ and can handle more than one, such as text and image data.

Appropriate models can be utilised in a few ways. Many are accessible as web service endpoints, others can be hosted locally on servers you administer, and since Oracle 23ai, some models can even be loaded directly into the database.

Many of these models have SDKs, so you can interface in your preferred language. Oracle provides database packages DBMS_VECTOR, DBMS_VECTOR_CHAIN, and APEX_AI, and Oracle APEX includes configuration options at the workspace and application level to make using your chosen model as easy as possible.

By using an Oracle Autonomous Database, loaded with the ONNX model of your choice and using Oracle APEX, getting and using vector embeddings has never been more straightforward and maintenance-free. And if you have the strictest of data policies to adhere to, what can be more compliant than your data never leaving your database? Check out one of our previous blogs to get started.

 

What Has Vector Data Ever Done For Us?

As embeddings can be generated for more than just text data, it’s worth taking a look at the different ways this comparison functionality can be employed.

 

Text

For text-only models, vector comparison can be used as a replacement for simple operators (e.g., LIKE) or the more advanced functions FUZZY_MATCH and PHONIC_ENCODE.

When used with a multi-lingual model, you can search in one language and get relevant results from another!

Large datasets that have been chunked (i.e., lengthy text that has been split into smaller, more manageable sizes) can be vectorised, allowing relevant data to be searched for and extracted before being passed to a Generative AI service as part of its prompt. This is also known as Retrieval-Augmented Generation (RAG).
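Chunking itself can be as simple as splitting on words with a little overlap, so that a sentence cut in half at a boundary still appears whole in a neighbouring chunk. The function below is a minimal sketch of that one strategy; the name, the parameters, and their defaults are my own assumptions, and real chunkers often split on sentences or tokens instead.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of up to chunk_size words, with consecutive
    chunks sharing `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final words are already covered by this chunk
    return chunks

sample = "one two three four five six seven"
print(chunk_words(sample, chunk_size=4, overlap=1))
# ['one two three four', 'four five six seven']
```

Each chunk would then be vectorised and stored, ready to be compared against the vector of a user’s question.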

 

Image

Embeddings generated from images can be used to find images that are similar. Useful for quality control, or the reverse, identifying anomalies.

Searching for similar-looking items across product ranges, suppliers, or marketplaces.

Face recognition for automated identity document checks.

 

Text + Image

Using CLIP models (Contrastive Language–Image Pretraining), text and image data share the same ‘space’. In other words, they use the same attributes to score their dimensions, which results in embeddings generated from images that can be compared directly to embeddings generated from text. The result? Searches made in text can be used to return image-only results (e.g., ‘a red car’ returns images of a red car). Something that previously could only be achieved with manually intensive image labelling.
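Because both kinds of embedding live in the same space, the search itself is just a ranking of image vectors by closeness to the text vector. The Python sketch below illustrates that with tiny, made-up three-dimensional vectors; real CLIP embeddings have hundreds of dimensions and come from the model, not from hand-typed numbers.

```python
import math

def cosine_similarity(a, b):
    """1.0 when a and b point the same way; lower as they diverge."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented purely for illustration.
text_query = [0.9, 0.8, 0.1]             # 'a red car'
image_embeddings = {
    "red_car.jpg":   [0.8, 0.9, 0.0],
    "blue_car.jpg":  [0.1, 0.9, 0.2],
    "red_apple.jpg": [0.9, 0.1, 0.1],
}

ranked = sorted(
    image_embeddings,
    key=lambda name: cosine_similarity(text_query, image_embeddings[name]),
    reverse=True,  # most similar first
)
print(ranked[0])   # red_car.jpg
```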

 

Summary

In this blog, we’ve considered what vector data is in detail, how it’s stored, how it’s generated, and some of its use cases.

There are plenty of resources covering the same subjects, but during my own learning journey, I found myself cross-referencing a few sources until it clicked, so hopefully this blog can aid in your journey, too.

Some advanced concepts like indexing, quantization, and reranking were beyond the scope of this blog, but they’re worth mentioning for further reading when you feel ready.

Databases have been cemented at the centre of digital solutions since the dawn of computing; it’s rare for new datatypes to appear at all, let alone make the impact we’ve seen with vectors. The era of AI-driven data has well and truly arrived.

Keep an eye on our blog page for part 2, where we’ll take practical steps to implement the concepts mentioned here.

If you’d like to speak to one of our experts about vector data, please don’t hesitate to get in touch.