Description
This session answers a question I recently received: "The data is over 5 TB, too large to fit on a hard disk. How should I read and analyze it?" It should be useful for anyone with similar concerns.
1️⃣ First, this session will introduce strategies for processing large amounts of data using Pandas and related libraries. Pandas has played an important role in data processing, but its limitations are becoming more apparent as we enter the era of the cloud and large amounts of data.
We will look at the timeline and major changes from the first release in 2008 to the release of Pandas 2.0 in April 2023.
2️⃣ Next, we will check whether the strategies presented in the 2019 session are still valid in 2023.
Through this process, we will take a close look at whether those strategies suit the current data environment and what has improved since then.
Before attending this session, it would help to review the strategies presented in 2019 using the links below.
• Presentation video: https://www.youtube.com/watch?v=0Vm9Yi_ig58
• Presentation materials: https://drive.google.com/file/d/12faqaslFIF-Sg_sU3jeGyauW5ClRqS8D/view
3️⃣ In particular, we will look at the changes brought by Copy-on-Write (CoW) and the integration with Apache Arrow in Pandas 2.0. We have also added new content on method chaining and string data types.
4️⃣ Next, we will look at a useful tool for streaming and processing large amounts of data,
Oh Seong-woo
I have been working in machine learning and artificial intelligence for nearly 10 years and have received a great deal of help from Python and numerous open source communities. I worked as an AI engineer in natural language processing at KB Kookmin Bank, where I mainly trained language models and developed chatbots. Recently, I have been applying SFT models in practice and working on various tasks related to developing and applying generative AI. As for open community activities, I am studying large language models (LLMs) at the Korea Institute of Financial Artificial Intelligence (KIFAI), and I am planning to develop and release a Financial AI Assistant for the general public.